# THE ORIGIN OF PLANT CHEMODIVERSITY - CONCEPTUAL AND EMPIRICAL INSIGHTS

EDITED BY: Kazuki Saito

PUBLISHED IN: Frontiers in Plant Science

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-923-6 DOI 10.3389/978-2-88963-923-6


# THE ORIGIN OF PLANT CHEMODIVERSITY - CONCEPTUAL AND EMPIRICAL INSIGHTS

Topic Editor: Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan

Citation: Saito, K., ed. (2020). The Origin of Plant Chemodiversity - Conceptual and Empirical Insights. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-923-6

# Table of Contents

*Editorial: The Origin of Plant Chemodiversity – Conceptual and Empirical Insights*

Kazuki Saito


Tsubasa Shoji

*Chemodiversity of the Glucosinolate-Myrosinase System at the Single Cell Type Resolution*

Shweta Chhajed, Biswapriya B. Misra, Nathalia Tello and Sixue Chen

*Evolutionary Developments in Plant Specialized Metabolism, Exemplified by Two Transferase Families*

Hiroaki Kusano, Hao Li, Hiroshi Minami, Yoshihiro Kato, Homare Tabata and Kazufumi Yazaki

*Acceleration of Mechanistic Investigation of Plant Secondary Metabolism Based on Computational Chemistry*

Hajime Sato, Kazuki Saito and Mami Yamazaki

*Large-Scale Profiling of Saponins in Different Ecotypes of* Medicago truncatula

Zhentian Lei, Bonnie S. Watson, David Huhman, Dong Sik Yang and Lloyd W. Sumner


*Molecular Origins of Functional Diversity in Benzylisoquinoline Alkaloid Methyltransferases*

Jeremy S. Morris and Peter J. Facchini


Sneha Gupta, Thusitha Rupasinghe, Damien L. Callahan, Siria H. A. Natera, Penelope M. C. Smith, Camilla B. Hill, Ute Roessner and Berin A. Boughton

*Terpene Synthases as Metabolic Gatekeepers in the Evolution of Plant Terpenoid Chemical Diversity*

Prema S. Karunanithi and Philipp Zerbe

*Taxonomically Informed Scoring Enhances Confidence in Natural Products Annotation*

Adriano Rutz, Miwa Dounoue-Kubo, Simon Ollivier, Jonathan Bisson, Mohsen Bagheri, Tongchai Saesong, Samad Nejad Ebrahimi, Kornkanok Ingkaninan, Jean-Luc Wolfender and Pierre-Marie Allard

*Untargeted Metabolomics of* Nicotiana tabacum *Grown in United States and India Characterizes the Association of Plant Metabolomes With Natural Climate and Geography*

Dong-Ming Ma, Saiprasad V. S. Gandra, Raman Manoharlal, Christophe La Hovary and De-Yu Xie


# Editorial: The Origin of Plant Chemodiversity – Conceptual and Empirical Insights

Kazuki Saito<sup>1,2</sup>\*

*<sup>1</sup> RIKEN Center for Sustainable Resource Science, Yokohama, Japan, <sup>2</sup> Plant Molecular Science Center, Chiba University, Chiba, Japan*

Keywords: phytochemicals, plant metabolism, metabolomics, biosynthesis, specialized metabolite, evolution, chemodiversity, plant chemicals

**Editorial on the Research Topic**

#### **The Origin of Plant Chemodiversity – Conceptual and Empirical Insights**

Anyone who looks at plant-made chemicals is fascinated by their huge chemodiversity, estimated at over one million metabolites (Afendi et al., 2012), and impressed by how much people's lives rely on those plant chemicals as food, drugs, flavors, cosmetics, and industrial raw materials. Although it is generally accepted that such vast plant chemodiversity is the consequence of the evolutionary history of plants as sessile organisms, questions regarding the origin of plant chemodiversity have not been fully addressed.

By taking advantage of recent advances in the theory and technology of genomics and associated fields (Rai et al., 2017), such as comparative genomics, metabolomics, bioinformatics, and molecular evolution, one may address these questions quite directly. Besides answering such fundamental questions, the knowledge gained will contribute to the development of a sustainable society, as formulated in the 2030 Agenda for Sustainable Development and its 17 Sustainable Development Goals (United Nations, 2018).

Only outstanding experts in this field were invited to contribute articles to this Research Topic. The collected papers cover the topic widely, from conceptual discernment to empirical insights. In total, 23 articles were published in several categories: 9 Reviews, 2 Mini Reviews, 2 Perspectives, 1 Hypothesis and Theory, 8 Original Research articles, and 1 Correction.

Metabolomics now holds an indispensable position in deciphering the chemodiversity of plants. However, peak annotation of data acquired by untargeted analysis is still challenging. Rutz et al. report an improved method for annotation of natural products using taxonomic information. Spatio-temporal metabolite and elemental profiling was applied to investigate the salt-stress response in barley (Gupta et al.). The impacts of natural climate and geography on the metabolome of tobacco are reported by Ma et al. The paper by Šamec et al. deals with the metabolome of the Psilotales, which exhibit unique anatomical characters in the fern lineage.

Several articles consider evolutionary aspects of the expansion of plant chemodiversity. Two Perspective articles discuss evolution: the report by Kusano et al. provides an evolutionary perspective on two transferase families often involved in plant secondary (recently often referred to as "specialized") metabolism, and the paper by Shirai and Hanada discusses the effects of copy number variations on functional divergence in between-species and within-species diversity. Shoji proposes an evolutionary model based on the recruitment of genes, focusing on jasmonate-responsive transcription factors. The Mini Review article by Maeda provides insight into the evolutionary diversification of plant primary metabolism, concerning the robustness of plant metabolism and further diversification in specialized metabolism.

#### Edited and reviewed by:

*Kirsi-Marja Oksman-Caldentey, VTT Technical Research Centre of Finland Ltd, Finland*

\*Correspondence:

*Kazuki Saito kazuki.saito@riken.jp*

#### Specialty section:

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

Received: *05 May 2020* Accepted: *29 May 2020* Published: *30 June 2020*

#### Citation:

*Saito K (2020) Editorial: The Origin of Plant Chemodiversity – Conceptual and Empirical Insights. Front. Plant Sci. 11:890. doi: 10.3389/fpls.2020.00890*

Two papers present novel concepts and approaches that extend our understanding of plant metabolism. The Hypothesis and Theory article by Schwachtje et al. discusses the imprinting and priming of plant metabolism against upcoming environmental challenges as a novel concept for the adaptation and diversification of plant metabolism in response to stresses. Computational chemistry may play a central role in obtaining further insights into the mechanisms of plant natural product biosynthesis: Sato et al. present an overview of cutting-edge applications of computational chemistry to the biosynthetic mechanisms of plant specialized metabolites.

The remaining 13 articles focus on relatively pathway-specific metabolism or metabolites, with some generalization across biosynthetic pathways. Four papers deal with flavonoids and phenylpropanoids. Yonekura-Sakakibara et al. discuss the origin of flavonoid biosynthetic enzymes and its functional genomics implications in their comprehensive Review Article. Another Review Article (Davies et al.) provides current information about the flavonoid pathway in the bryophytes (liverworts, hornworts, and mosses). The role of flavonoid metabolons formed by specific protein-protein interactions of the biosynthetic enzymes is discussed in the Review Article by Nakayama et al. The Original Research Article by Cheevarungnapakul et al. reports the genes for the biosynthesis of caffeoylquinic acids in sunflower.

Terpenoids form the largest sub-family of specialized metabolites, with immense potential for the expansion of structural diversity. The Review Article by Karunanithi and Zerbe provides up-to-date knowledge on the functional diversity and molecular evolution of plant terpene synthases across the plant kingdom. The Original Research Article contributed by Muchlinski et al. reports the biosynthesis and emission of mixtures of monoterpenes and sesquiterpenes triggered by a generalist herbivore in switchgrass. Three papers focus on triterpenoid saponins. Lei et al. performed large-scale profiling of saponins in 201 ecotypes of *Medicago truncatula*. The Original Research Article by Fanani et al. reports a detailed investigation of the molecular basis for C-30 oxidation of triterpenoids leading to the production of high-value saponins such as glycyrrhizin in the legume licorice. Cárdenas et al., in their Mini Review Article, discuss convergent and divergent evolution in triterpenoid biosynthesis and the mechanisms that increase structural diversity within and across plant species.

Three contributions dealing with nitrogen- and sulfur-containing specialized metabolites are included. Morris and Facchini (a); Morris and Facchini (b) provide an overview of the functional diversification of methyltransferases in the biosynthesis of benzylisoquinoline alkaloids. The Review by Chhajed et al. discusses the cellular and subcellular organization of the glucosinolate-myrosinase system, its chemodiversity, and its functions in different cell types, emphasizing single-cell-type studies. Sugiyama and Hirai provide up-to-date information on the diversity and role of atypical myrosinases beyond the classical model of the glucosinolate-myrosinase system.

Overall, this Research Topic has become an excellent anthology that exhibits the state of the art on its theme through contributions from world experts in this field. The Research Topic can serve as a sort of flagship topic of the Specialty Section of Plant Metabolism and Chemodiversity. Curiosity-driven fundamental research on the origin and evolution of the diversification of plant chemicals is of primary importance. In addition, the basic knowledge obtained could contribute to solving current global problems such as the climate crisis and pandemic diseases. I hope this Research Topic will be a landmark for future research in this field.

# AUTHOR CONTRIBUTIONS

KS contributes entirely to the paper.

# FUNDING

This work was supported by the JSPS KAKENHI program (grant number 19H05652 to KS) and by the Strategic Priority Research Promotion Program of Chiba University.

# ACKNOWLEDGMENTS

I thank all authors and reviewers who contributed to the success of this Research Topic. I also acknowledge the Frontiers Editorial Office for their technical support.

**Conflict of Interest:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Saito. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Induced, Imprinted, and Primed Responses to Changing Environments: Does Metabolism Store and Process Information?

*Jens Schwachtje\*, Sarah J. Whitcomb, Alexandre Augusto Pereira Firmino, Ellen Zuther, Dirk K. Hincha and Joachim Kopka*

*Department of Molecular Physiology, Applied Metabolome Analysis, Max-Planck-Institute of Molecular Plant Physiology, Potsdam, Germany*

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Alex Williams, University of Sheffield, United Kingdom Masami Yokota Hirai, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

*\*Correspondence:* 

*Jens Schwachtje schw8je@gmail.com*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 28 September 2018 Accepted: 23 January 2019 Published: 13 February 2019*

#### *Citation:*

*Schwachtje J, Whitcomb SJ, Firmino AAP, Zuther E, Hincha DK and Kopka J (2019) Induced, Imprinted, and Primed Responses to Changing Environments: Does Metabolism Store and Process Information? Front. Plant Sci. 10:106. doi: 10.3389/fpls.2019.00106*

Metabolism is the system layer that determines growth by the rate of matter uptake and conversion into biomass. The scaffold of enzymatic reaction rates drives the metabolic network in a given physico-chemical environment. In response to the diverse environmental stresses, plants have evolved the capability of integrating macro- and micro-environmental events to be prepared, i.e., to be primed for upcoming environmental challenges. The hierarchical view on stress signaling, where metabolites are seen as final downstream products, has recently been complemented by findings that metabolites themselves function as stress signals. We present a systematic concept of metabolic responses that are induced by environmental stresses and persist in the plant system. Such metabolic imprints may prime metabolic responses of plants for subsequent environmental stresses. We describe response types with examples of biotic and abiotic environmental stresses and suggest that plants use metabolic imprints, the metabolic changes that last beyond recovery from stress events, and priming, the imprints that function to prepare for upcoming stresses, to integrate diverse environmental stress histories. As a consequence, even genetically identical plants should be studied and understood as phenotypically plastic organisms that continuously adjust their metabolic state in response to their individually experienced local environment. To explore the occurrence and to unravel functions of metabolic imprints, we encourage researchers to extend stress studies by including detailed metabolic and stress response monitoring into extended recovery phases.

Keywords: priming, stress response, stress signaling, metabolism, metabolic imprint, plant physiology

# INTRODUCTION

Sessile plants are forced to respond to adverse biotic and abiotic conditions in their local environment. Depending on the nature and intensity of such conditions, a plant's physiology can change markedly, generally because of stress-activated signaling that leads to specific physiological responses. These responses protect against or mitigate deleterious effects of stress. In a top-to-bottom view, stress-related cues induce signaling cascades, followed by activities at genetic and protein levels. Metabolite changes are generally considered to be the last step in this event cascade. Typically, large parts of metabolism are affected. Stress-related cues may directly influence enzyme activities that modify the metabolic state, independent of the transcription/translation machinery. Finally, metabolic changes result from cues that do not induce classical stress-signaling. Such direct changes of metabolism are caused by the fluctuating physico-chemical environment of a plant, such as varying climate parameters, e.g., temperature, or soil properties (Hawkes and Sullivan, 2001; De Deyn et al., 2008; Sampaio et al., 2015; Onwuka and Mang, 2018).

Most environmental stresses are transient, such as temporally limited temperature extremes or drought phases, insect attacks or microbial infections. After a stress has ended, plants may recover from the stress and reset metabolism to growth and reproduction modes; however, recovery may not be complete. Even short-term environmental stress may have long-lasting effects on the plant system. A growing number of studies suggest that storing information about a past stress event benefits plants by preparing them for the same or a similar stress in the future. This phenomenon is called priming (Frost et al., 2008; Gamir et al., 2014a; Conrath et al., 2015; Hilker et al., 2015; Mauch-Mani et al., 2017). Several plant priming mechanisms are well investigated (**Table 1**).

Priming mechanisms are described at different levels: at the epigenetic level (e.g., by DNA and histone modification) and at the transcript or protein level (e.g., persisting changes in the abundance of transcripts, including transcription factors, and proteins or modulation of enzyme activities). However, the metabolic level as a mediator of priming has remained largely unexplored, even though large parts of metabolism are altered during stress (e.g., Schwachtje and Baldwin, 2006; Bolton, 2009; Krasensky and Jonak, 2012; Fraire-Velázquez and Balderas-Hernández, 2013). Here, we hypothesize that persistent stress-induced changes in metabolite concentrations, metabolite ratios, or metabolic fluxes represent a **metabolic imprint** of prior environmental impacts and that these imprints can prime responses to future environmental events. We present evidence that supports our hypothesis and suggest environmental shift experiments that not only monitor metabolic responses during a first stress exposure (the priming event) or during a second stress response (the primed response), but also monitor short- and long-term recovery phases after stress events. Such experimental designs may characterize and identify functions of **metabolic imprints** at the level of **metabolic priming.**

TABLE 1 | Examples of stress priming scenarios. Many abiotic and biotic stresses lead to imprints that improve the plant's response to a subsequent stress.

Metabolite-primed responses are only properly defined by the timing, nature, and dose of the preceding environmental change, the duration of the recovery period, and in addition by the nature and dose of the subsequent stress for which metabolism is primed. In agreement with generalized concepts of non-metabolic primed stress responses (Hilker et al., 2015), several scenarios of metabolite-primed stress responses are conceivable:


These scenarios are thought to be generally applicable and have been recently discussed (Hilker et al., 2015). In the following, we focus on the roles of metabolites during stress responses, recovery, and priming. We briefly highlight effects of metabolites at all system levels of plant physiology and subsequently review metabolic changes that are caused by stress and last during stress recovery as metabolic imprints. We link metabolic imprints to a wide range of abiotic and biotic stresses. Finally, we discuss experimental approaches that enable discovery of metabolic imprints and functional analyses of these imprints.

# INDUCED METABOLIC RESPONSES

Changes in the biotic and abiotic environment are reflected by the metabolic state of a plant. Plants have a multitude of plastic responses hardwired into their genomes (Sultan, 2000). These responses are induced concomitantly with the stress and function as defense, tolerance, or repair mechanisms (e.g., Dangl and Jones, 2001; Suzuki et al., 2011; Schuman and Baldwin, 2015). These mechanisms can be defined as **stress-signaling dependent metabolic responses.** Additionally, physical and chemical conditions such as temperature and soil nutrients influence metabolism, albeit mostly to a lesser degree than stresses. Temperature affects all reaction and transport rates. Soil nutrients influence physiology according to their availability. Consequently, changes in the physico-chemical environment of a plant will cause concomitant metabolic responses which influence single or multiple nutrient fluxes through the plant system. These changes may ultimately become apparent as changes in metabolic pool sizes or fluxes. Therefore, **induced metabolic responses** need to be interpreted as the synergistic effects of stress-signaling dependent plant responses and of external physico-chemical influences.

induction and primed responses without prior metabolic induction or imprinting are conceivable.

The study of induced metabolic responses implies an *initial metabolic state* that transitions into an *induced metabolic state* (**Figure 1**); however, these states cannot be viewed as steady states, because they are integrated into non-static physiological processes governed by diurnal environmental cycles, circadian rhythms, and developmental progression of specific tissues and of the whole plant system. Due to these interactions, any observed induced metabolic response may represent a direct modification of metabolism that is on top of the underlying physiological programs of the plant.

# METABOLIC IMPRINTS AND METABOLIC MEMORY

After an environmental stress has ceased, plant metabolism typically returns to a recovered state that is highly similar to the initial state (e.g., Hemme et al., 2014; Crisp et al., 2016; Pagter et al., 2017). A very basic metabolic perturbation and recovery process may resemble a hysteresis curve, where the metabolic transitions during perturbation and during recovery take characteristically different paths. The kinetics of metabolic remodeling are largely dependent on the nature of the environmental stress, transcriptional activities, the architecture of the affected metabolic networks, and the set of transport rates and enzyme activities that act on the induced metabolite levels. As an example of differential metabolic remodeling kinetics, glycolysis intermediates reached pre-stress levels more quickly than TCA cycle intermediates in *Arabidopsis* roots upon recovery from oxidative stress (Lehmann et al., 2012).

Some induced metabolic responses may persist after the global metabolic state of the plant has recovered to the initial state (**Figure 1**). A simple case would be delayed adjustment of a metabolite to the initial state. Such a delay causes a metabolite to be more abundant at the onset of a second stress event. Furthermore, some induced metabolic responses may be effectively permanent or even cumulative, e.g., due to the absence of a catabolic pathway or sequestration mechanism (Mackie et al., 2013). Two phases of metabolic response are evident: (1) during stress, metabolic changes are induced as immediate, specific responses, e.g., changes in sugar levels, the sucrose/hexose ratio, precursors for secondary metabolites, or energy-related metabolites; (2) during stress recovery, some of these changes may last as mid- or long-term imprints. Imprints may influence upcoming stress responses. Here, we define *metabolic imprints* to encompass all metabolic changes that persist after recovery and thus differentiate the imprinted from the initial, pre-stress state.
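The "delayed adjustment" case can be illustrated with a toy calculation. The sketch below is not taken from the article or its references: it assumes, purely for illustration, a single metabolite pool with a basal synthesis flux, an extra synthesis flux while the stress lasts, and first-order turnover. The function name `simulate_pool`, all parameter names, and all rate values are hypothetical.

```python
def simulate_pool(k_deg, basal=1.0, extra_synthesis=1.0,
                  stress_end=10.0, t_end=30.0, dt=0.01):
    """Return the pool size at t_end for dM/dt = synthesis(t) - k_deg * M.

    Hypothetical toy model: the pool starts at its pre-stress steady
    state `basal`; stress adds `extra_synthesis` until `stress_end`;
    afterwards the pool relaxes back with rate constant `k_deg`.
    """
    level = basal                       # pre-stress steady state
    n_steps = round(t_end / dt)
    for i in range(n_steps):
        t = i * dt
        synthesis = k_deg * basal       # basal flux that maintains `basal`
        if t < stress_end:              # stress is still ongoing
            synthesis += extra_synthesis
        level += dt * (synthesis - k_deg * level)   # explicit Euler step
    return level


# Same stress, same recovery time, different turnover rates (arbitrary units).
fast = simulate_pool(k_deg=2.0)    # rapidly turned-over metabolite
slow = simulate_pool(k_deg=0.05)   # slowly turned-over metabolite
print(f"fast turnover: {fast:.3f}, slow turnover: {slow:.3f}")
```

Under these assumptions the rapidly turned-over pool is back at its basal level long before the end of the recovery phase, whereas the slowly turned-over pool is still elevated, which is the kind of persistent difference the text calls a metabolic imprint. Real metabolic networks are of course far more complex than a single first-order pool.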

According to this definition, metabolic imprints are indicators of past environmental conditions and/or stress events. In this sense, metabolic imprints may store information. Imprinted information becomes *metabolic memory*, when it is maintained and used by the plant system to improve future stress responses, for example to enhance or accelerate metabolite-induced signaling (Thellier and Lüttge, 2013). Metabolic imprints may be caused by one or more environmental events in the individual history of a plant. Metabolic imprints have been postulated previously to act as a stress memory (Bruce et al., 2007). A metabolic memory may act alone or more likely in synergy with priming and memory mechanisms at other system levels that are highlighted in the following.

# METABOLITES INFLUENCE BIOLOGICAL SYSTEMS AT ALL LEVELS

If metabolites are involved in stress responses and represent stored information, metabolites should in turn influence metabolic or signaling pathways and other parts of plant physiology to modify a stress response. Generally, stress metabolism is seen as hierarchically organized, where external cues initiate signaling pathways that *via* transcription, translation, posttranslational modifications such as phosphorylation and further regulatory steps ultimately affect metabolism. Metabolites are generally thought to represent the downstream "end" products of this hierarchy. Interestingly, this view is currently complemented by findings that suggest bottom-to-top signaling mechanisms. Specific metabolites can exert regulatory influence or feedback on the stress-signaling network and physiology. Such mechanisms open possibilities for cross talk between stress-induced metabolites and other levels of physiological regulation (Bonawitz et al., 2012; Farre and Weise, 2012; Xiao et al., 2012; Gaudinier et al., 2015; Katz et al., 2015; Francisco et al., 2016; Malinovski et al., 2017). Further complexity is indicated by fluxes of central metabolism that are not necessarily explained by transcript abundances of the corresponding enzymes (Chubukov et al., 2013; Schwender et al., 2014). In addition to classical allosteric feedback responses, such as the suppression of enzyme activity by high levels of reaction products, metabolite ratios and possibly also metabolite fluxes may thus play important roles by directly affecting multiple levels of the stress response hierarchy.

For example, branched chain amino acids are required for phosphorylation of G proteins during osmotic stress signaling in yeast (Shellhammer et al., 2017). Other findings suggest that certain primary metabolites can influence physiology at the transcriptional level. In yeast, Pinson et al. (2009) found that a metabolic intermediate of purine metabolism influences the interaction of transcription factors and thereby modulates purine and phosphate metabolism. Amino acids and polyamines are suggested to directly modify translation because they can be sensed by translating ribosomes *via* interactions with nascent polypeptides, specifically with so-called arrest peptides (Seip and Innis, 2016). Furthermore, several metabolites are cofactors or co-substrates of chromatin-modifying enzymes and thus represent a potential regulatory interface between the metabolic and chromatin states of the cell (Shen et al., 2016; Van der Knaap and Verrijzer, 2016). For some primary metabolites, a role in chromatin modulation is suggested, e.g., fumarate, succinate, α-ketoglutarate, and acetyl-CoA. Fumarate, for example, competitively inhibits enzymes that use α-ketoglutarate as a co-substrate, such as histone demethylases and TET DNA demethylases. Changes in cellular fumarate levels or ratios of fumarate, e.g., to α-ketoglutarate, may therefore contribute to altered histone modification. Different methylated histone residues are sensitive to changes in the α-ketoglutarate/succinate ratio (Van der Knaap and Verrijzer, 2016). These effects may be specific for certain genetic regions.
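Why a metabolite ratio, rather than an absolute concentration, could carry the regulatory information can be made explicit with the textbook rate law for competitive inhibition. This is a generic illustration and not a model from the cited studies; treating α-ketoglutarate as the substrate $S$ and fumarate (or succinate) as the competitive inhibitor $I$ of an α-ketoglutarate-dependent enzyme, and the symbols $V_{\max}$, $K_m$, and $K_i$, are assumptions made here for the sake of the example.

$$
v = \frac{V_{\max}\,[S]}{K_m\left(1 + \dfrac{[I]}{K_i}\right) + [S]}
$$

In the regime where the inhibitor is abundant ($[I] \gg K_i$) and the substrate is not saturating ($[S] \ll K_m [I]/K_i$), this simplifies to

$$
v \approx \frac{V_{\max}\,K_i}{K_m}\cdot\frac{[S]}{[I]},
$$

so that, under these simplifying assumptions, the enzyme rate tracks the ratio $[S]/[I]$ rather than either concentration alone.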

The TOR and SnRK kinases are sensors of the cellular energy state and can regulate large parts of metabolism. Plants adapt to changes in energy requirements during stress using these sensors (Baena-González, 2010; Hey et al., 2010; Rexin et al., 2015). The kinases are suggested to respond to sugars and other metabolites, even though the molecular mechanisms are not yet unraveled in all details. TOR is known to be regulated by nutrient sensing of nitrogen and carbon metabolites in plants, yeast, and mammals (e.g., Dobrenel et al., 2016; González and Hall, 2017). But plant defense metabolites of the glucosinolate family, 3-hydroxypropylglucosinolate, and/or its derivatives, can also activate TOR kinases (Malinovski et al., 2017). A precursor of plastidial isoprenoids that is induced by abiotic stress can induce nuclear stress-responsive genes *via* retrograde signaling (Xiao et al., 2012). The mediator complex, which regulates gene transcription, is involved in phenylpropanoid metabolism and is suggested to respond in a feedback loop to changes in this defense-related class of compounds (Gaudinier et al., 2015).

# METABOLITES AND STRESS RESPONSES

Metabolic responses to stress are ubiquitous and well described for many plant systems. Time-series experiments revealed that metabolic activities can respond to stress more quickly than transcriptional activities (Kaplan et al., 2007; Guy et al., 2008; Caldana et al., 2011; Fraire-Velázquez and Balderas-Hernández, 2013), thus making metabolic changes an important part of early stress responses. Several metabolites can directly influence plant stress responses (Rojas et al., 2014). A few important examples are discussed below.

# Carbohydrate Metabolism

During freezing and drought, soluble sugars, such as sucrose, trehalose, fructans (fructose-based oligo- and polysaccharides), and the raffinose family of oligosaccharides can stabilize phospholipid membrane vesicles (Hincha et al., 2006; Livingston et al., 2009; Tarkowski and Van den Ende, 2015). Several sugars emerged as important factors also during biotic stress signaling (Baena-González and Sheen, 2008; Figueroa and Lunn, 2016; Li and Sheen, 2016). Sucrose has been shown to regulate various stress-related responses including circadian clock genes, phytohormones, energy metabolism, cell wall and anthocyanin synthesis (Thibaud et al., 2004; Gomez-Ariza et al., 2007; Tognetti et al., 2013; Tauzin and Giardina, 2014). Glucose induces the pathogen defense proteins PR-1 and PR-5 in *Arabidopsis via* hexokinase1 (HXK1) signaling (Xiao et al., 2000; Moore et al., 2003; Cho et al., 2009). For fructose, a specific pathway has been proposed that involves abscisic acid and ethylene signaling (Cho and Yoo, 2011; Li et al., 2011). Engelsdorf et al. (2013) showed that carbohydrate availability influences defense against a hemibiotrophic fungus during its necrotrophic phase. The more carbohydrate is available, the better the plant defends. Similarly, increased relative fructose content enhances defense against the pathogen *Botrytis cinerea* in tomato (Lecompte et al., 2017). Besides the importance of sugars for stress signaling, plants organize sugar distribution in a way that pathogens have reduced access to carbohydrates. Local depletion of nutrients appears to cause a starvation effect that reduces pathogen propagation (Bezruczyk et al., 2018).

# Amino Acid Pathways

In a 2010 paper, Liu and colleagues knocked out the amino acid transporter lht1 and correlated cellular depletion of the amino acid glutamine with altered redox status and more effective defense against several pathogens (Liu et al., 2010). These authors proposed a yet unknown negative effect of glutamine on defense signaling and a reduction in the pathogen's access to essential nutrients, similar to recent findings on *Pseudomonas*-primed systemic responses of *Arabidopsis* (Schwachtje et al., 2018). Stuttmann et al. (2011) described threonine as a potential growth inhibitor of the biotrophic oomycete *Hyaloperonospora arabidopsidis*, even though the underlying mechanism is yet unknown. It is suggested that indole-3-carboxylic acid activates defenses against *Plectosphaerella cucumerina* in *Arabidopsis* by inducing papillae deposition and H2O2 production, independently of salicylic acid and jasmonic acid (Gamir et al., 2014b). γ-Aminobutyric acid (GABA) interacts with quorum sensing of *Agrobacterium tumefaciens*, thus reducing pathogen virulence in tobacco (Chevrot et al., 2006). GABA also functions as a direct anti-herbivore defense in *Arabidopsis* (Scholz et al., 2015). Furthermore, the proline and the pyrroline-5-carboxylate (P5C) cycle are crucial for defense responses against pathogens and abiotic stresses (Liang et al., 2013; Qamar et al., 2015). Proline is involved in redox balance, osmoprotection, and stress signaling (Szabados and Savouré, 2009).

# Polyamine Metabolism

Polyamines (e.g., spermine, spermidine, and putrescine) are aliphatic compounds that are synthesized from amino acids (e.g., arginine and ornithine). Polyamines are involved in many crucial processes of cell metabolism and the translation/transcription machinery, and induce ROS, Ca2+, and NO signaling (Alcázar et al., 2010). Salt, heat, and drought stress induce genes for polyamine synthesis (Tiburcio et al., 2014; Liu et al., 2015; Miller-Fleming et al., 2015) and enhanced tolerance of abiotic stresses is correlated with elevated levels of polyamines (Alcázar et al., 2010). Putrescine induces abscisic acid synthesis at the transcriptional level during cold stress (Cuevas et al., 2008). Spermine appears to protect *Arabidopsis* from heat stress by increasing the expression of genes encoding heat shock proteins (Sagor et al., 2013). The pretreatment of tomato fruits with spermine before heat shock promoted an increase in expression of signal transduction genes (e.g., calmodulin, serine/threonine protein kinase) along with genes related to phytohormone pathways. Moreover, polyamines can modulate chromatin structure (Pasini et al., 2014). It is also suggested that the connected putrescine, GABA, and proline pathways play an important role during abiotic stresses (Shelp et al., 2012; Legocka et al., 2017).

# External Application of Natural Compounds

A few metabolites have been shown to prime pathogen-induced stress responses when externally applied to plants, e.g., thiamine (Ahn et al., 2007), riboflavin (Zhang et al., 2009), quercetin (Jia et al., 2010), and hexanoic acid (Aranega-Bou et al., 2014). Supposedly, all have in common an activation of the redox system that supports stress signaling. In a recent study, fumarate and citrate applications were shown to induce priming against *Pseudomonas syringae* in *Arabidopsis*, in the case of fumarate without changes in classical defense-related genes and hormones (Balmer et al., 2018). Thereby, Balmer et al. (2018) confirmed an earlier observation of systemic fumarate priming upon first exposure to the bacterial pathogen (Schwachtje et al., 2018). Further, a broad induction of plant defense systems in *Arabidopsis* was demonstrated after the application of melatonin (Weeda et al., 2014).

# ARE STRESS-RELATED METABOLITES STORED IN THE VACUOLE?

To exert a long-term effect that primes future stress responses, relevant metabolites must be stored in a way that prevents them from negatively interfering with metabolism during the recovery phase and that inhibits degradation. Accumulation of metabolites in chloroplasts, mitochondria, or in the cytosol would likely disturb core metabolic processes that are necessary for recovery (Plaxton, 2005). Possible circumventions would be a reversible conjugation that alters the chemical property of the metabolite, or storage in a membrane-enclosed cellular compartment, such as the vacuole. The vacuole can occupy more than 80% of the cell's volume and is involved in multiple critical cellular functions including storage of metabolites and modification of cytosolic metabolism according to physiological requirements (Martinoia et al., 2012). The tonoplast enclosing the vacuole contains many identified membrane proteins that are responsible for loading and unloading a diverse set of metabolites (Martinoia et al., 2012). These transporters are integrated in a larger cellular network that responds to physiological requirements and stress responses (Martinoia et al., 2007; Pommerrenig et al., 2018). The carboxylic acids fumarate, malate, and citrate represent major components of the vacuolar metabolome. For example, malate is transported by two proteins, a specific anion channel (Hafke et al., 2003) and a solute carrier (Emmerlich et al., 2003). Many other primary metabolites, such as amino acids and sugars, are stored in the vacuole (Neuhaus, 2007; Tohge et al., 2011; Szekowka et al., 2013; Hedrich et al., 2015). Vacuoles are also involved in plant defenses against herbivores and pathogens by storing and sequestering toxic metabolites. Thus, specific analyses of vacuolar metabolite compositions, e.g., by non-aqueous fractionation (Klie et al., 2011), are required to unravel a possible long-term storage of stress-induced metabolites that represent metabolic imprints and may function as a metabolic memory.

# EXAMPLES OF STRESS-INDUCED METABOLIC IMPRINTS

The functional analysis of stress-induced metabolic responses has been a long-standing focus of plant physiology. In contrast, metabolic imprints following stress have rarely received attention but can now be described and analyzed in detail by large-scale experiments that combine metabolic phenotyping with global screening of other system levels and physiological analyses (e.g., Hemme et al., 2014). The fragmented knowledge on recovery processes may result from the simplifying assumption that plants will revert to the identical initial endogenous state after a transient environmental perturbation. The resetting of the initial state is thought to alleviate the need to expend energy for the maintenance of the stress-adapted state. While this assumption may be correct for the majority of metabolites, the past perturbations may leave an imprint on metabolism that lasts longer than may be expected from a system level that is notorious for its extremely rapid fluctuations (Urbanczyk-Wochniak et al., 2005; Kim et al., 2011). In the following, we will review evidence of metabolic imprinting and functions of imprints for priming of systems in the context of abiotic and biotic stresses.

# Abiotic Stress

Abiotic stresses are known to prime plant systems for an enhanced stress response to a recurrent stress. Recent reviews highlight abiotic stress priming of temperature, drought, and other factors (Bruce et al., 2007; Hincha and Zuther, 2014; Hilker et al., 2015). In the following, we will first highlight proline imprints that were observed in the context of various stresses before we address more stress-specific metabolic imprints.

# Proline Imprints Are Caused by Various Abiotic Stresses

Proline accumulation is one of the most studied metabolic stress responses. Upon environmental stress, proline is mainly generated from glutamic acid in chloroplasts and increases up to 100-fold in plants (Liang et al., 2013). Proline has several functions during stress responses, e.g., as an osmoprotectant, antioxidant, molecular chaperone to protect protein integrity, pH buffer, or in some cases it may serve as a carbon and nitrogen source during stress recovery. Proline also enhances enzyme activities, triggers gene expression, and modulates mitochondrial functions (Szabados and Savouré, 2009). By increasing ROS production in mitochondria *via* the electron transport chain, proline regulates processes that support cell survival or induce apoptosis (Liang et al., 2013). Suppression of proline catabolism, for example *via* reduction of proline dehydrogenase gene expression, enhances tolerance toward salt and drought stress (Ibragimova et al., 2012). In *Arabidopsis*, proline accumulates strongly during a 4-day drought phase and declines to initial levels during a subsequent 4-day recovery phase (Sharma and Verslues, 2010). In contrast, the drought-resistant *Periploca sepium* increases proline levels continuously during a similar 4-day drought stress but maintains a proline imprint during a 4-day recovery phase (An et al., 2013). Even after an 8-day recovery, the newly developed buds of *Periploca sepium* still contained twice as much proline as control plants. Proline is apparently also important for recovery of tobacco plants from drought stress by suppressing a senescence-related promoter (Vancova et al., 2012). In addition, the proline concentration in salt stress-resistant salt cress (*Thellungiella halophila*, renamed to *Thellungiella salsuginea* and *Eutrema salsugineum*) is significantly higher than in *Arabidopsis* already under control conditions (Taji et al., 2004; Benina et al., 2013; Lee et al., 2016). In this case, high proline levels may serve as a constitutive stress adaptation of an extremophile plant. Also, in *T. halophila*, proline levels increased during a 3-day recovery from cold stress but not during the 3-day stress phase itself. The imprint of the proline pool is accompanied by other metabolites, such as 5-hydroxyproline and sucrose (Benina et al., 2013). In *Arabidopsis*, 3 days after de-acclimation from cold acclimation, proline levels were still elevated in leaves. More freeze-tolerant *Arabidopsis* accessions showed higher levels than susceptible accessions after 3 days of recovery (Zuther et al., 2015).

# Drought

Several studies describe imprints of metabolite pools other than proline after exposure to drought. Primary metabolites, e.g., sugars and organic acids, as well as several secondary metabolites maintain a characteristic imprint in the resurrection plant *Haberlea rhodopensis* after 2 days of recovery from an 8-day drought period (Moyankova et al., 2014). A similar duration of drought stress and recovery, 8–10-day stress and 2-day recovery, causes a different metabolic imprint of *Medicago sativa* nodules (Naya et al., 2007). In this symbiotic system, pools of several primary metabolites remain reduced during drought recovery. A recent study describes the metabolic recovery of drought-stressed sugar beets (Wedeking et al., 2018). The authors found a transient normalization of most of the measured metabolites after 8 days of recovery from a 13-day drought stress period. Interestingly, during the following 4 days, several amino acids (e.g., phenylalanine, tyrosine, and leucine) again accumulated in leaves, indicating a metabolic stress imprint that may be beneficial for a subsequent second drought phase.

# Low Temperature

Temperature stress can revert more rapidly than drought stress. Temperature changes cause metabolic responses that are particularly well characterized after heat shock or extended cold exposure (e.g., Kaplan et al., 2004, 2007; Guy et al., 2008). The temperature-induced metabolic responses comprise strong changes in a wide range of metabolite pools that indicate global reprogramming of primary metabolism in *Arabidopsis* rosettes. Data on metabolomic and transcriptomic cold recovery describe a 24-h metabolic imprint after 4 days of exposure to 4°C (Kaplan et al., 2004, 2007) and reveal several interesting aspects. Firstly, most of the cold-induced transcript changes returned to the pre-stress state after 24 h. In contrast, primary metabolism was only partially recovered. These differences between transcriptional and metabolic recovery from cold stress were recently confirmed in greater detail by Pagter et al. (2017). Secondly, the metabolite profile of the recovery phase differed significantly from all measured time points during cold exposure, supporting the observation that the metabolic reorganization after stress exhibits different kinetics than the stress response. Cold de-acclimating metabolism was associated with partially maintained enhanced freeze tolerance, which apparently remains active at least 3 days into cold recovery (Kaplan et al., 2004; Zuther et al., 2015). Zuther et al. (2018) reported genetic differences in the transcriptomic and metabolic patterns during cold memory of the *Arabidopsis* ecotypes Col-0 and N14.

A complex picture of metabolic reorganization during recovery was described after freezing stress of crown tissue of oat (*Avena sativa* L.) by Henson et al. (2014). After 3 weeks of cold acclimation and 1 day of freezing, plants were monitored during 14 days of recovery. At the end of recovery, several amino acids were largely increased compared to non-stressed plants, and several sugars and organic acids were reduced. Moreover, the metabolic profile differed markedly from what is observed after cold stress recovery, indicating that this overwintering species relies on specific regulations for freezing resistance.

Analyses of *Hordeum vulgare* also show stress-imprinted metabolites that are linked to frost tolerance (Mazucotelli et al., 2006). For example, 8-day-old barley seedlings were freeze-stressed at −3°C for 16 h and allowed to recover for 48 h at 22°C. This treatment resulted in 16-fold higher GABA levels at the end of the recovery phase. GABA and its precursor glutamate are part of the GABA-shunt that is linked to the tricarboxylic acid cycle where it bypasses two reaction steps from α-ketoglutarate to succinate. Besides glutamate, putrescine and proline can also be catabolized *via* GABA (Shelp et al., 2012; Signorelli et al., 2015). The GABA-shunt has a central role in carbon/nitrogen metabolism and stress signaling, for example for cell death promotion in response to pathogens or for cold tolerance (Mazucotelli et al., 2006; Fait et al., 2007; Michaeli and Fromm, 2015). However, the role of GABA during cold or frost stress and possible GABA pool imprints are still not fully understood.

# High Temperature

Elevated temperatures leave metabolic imprints in photosynthetic microorganisms. A recent large-scale study describes the temporal succession of heat stress responses of *Chlamydomonas reinhardtii* during a 24-h induction phase after shift from 25 to 42°C and the fate of system imprints during 8-h recovery at 25°C (Hemme et al., 2014). In this experiment, cell division stopped during heat treatment and remained so during the 8 h of recovery, resulting in measurements that represent the metabolic state of cells that all individually experienced the heat stress. Similar to the example of 4°C cold stress in *Arabidopsis* (Kaplan et al., 2004; Pagter et al., 2017), the metabolome, as well as the proteome, recovered only in part and retained imprints, regarding, e.g., TCA intermediates and sugar phosphates. Importantly, the pattern of metabolic induction again differed from the pattern of metabolic recovery.

Besides temperature perturbations, other abiotic stresses have been shown to generate lasting imprints. A 6-h oxidative stress that was induced by menadione generated an imprint on primary metabolism in *Arabidopsis* roots that lasted at least 30 h into recovery (Lehmann et al., 2012). GABA was part of this imprint, like proline and other amino acids which remained at a high level, as well as several sugars and sugar phosphates.

# Biotic Stress

Biotic stresses are perhaps the best understood stresses regarding primed plant systems. Recent reviews highlight the importance of biotic stress priming for enhanced responses toward a broad range of insects and pathogens that negatively influence plant performance and crop production (Frost et al., 2008; Conrath et al., 2015; Hilker et al., 2015; Mauch-Mani et al., 2017). At the metabolite level, biotic priming is mainly studied with respect to volatile organic compounds, oviposition, and beneficial or pathogenic microorganisms that are associated with a plant and may prime systemic tissue. Even though it has often been shown that during biotic stresses, metabolism in local and systemic plant parts is severely affected (Schwachtje and Baldwin, 2006; Lemoine et al., 2013; Zhou et al., 2015), studies on persistent metabolic changes during and after recovery from pathogen or insect stress are rare. This applies specifically for interactions of plants with microorganisms, since these are continuously associated with the plant, either as beneficial root colonizers or as leaf pathogens, thus making a clearly defined recovery phase after a time-limited stress or induction phase unfeasible. Nevertheless, several metabolites have so far been associated with priming against biotic stresses.

Plant amino acid metabolism is well known to contribute to the priming of defense responses (Gamir et al., 2014a). For example, the lysine catabolite pipecolic acid can act as a key regulator of SAR (Návarová et al., 2012; Zeier, 2013). Several amino acids and intermediates of the TCA cycle are regulated during priming with pathogenic *Pseudomonas syringae* or the chemical β-aminobutyric acid, i.e., BABA (Pastor et al., 2014). The content of most amino acids was reduced in these experiments, but cysteine, methionine, tryptophan, and tyrosine were specifically induced by bacteria or BABA during 48 h. Fumarate and malate were induced by BABA. These two organic acids were also induced in another study that investigated systemic priming-related effects of infection of *Arabidopsis* with *Pseudomonas syringae* (Schwachtje et al., 2018). A systemic increase of fumarate and malate was observed for 4 days, whereas the transcriptional profile did not explain the altered metabolite levels. This study suggested a lasting metabolic priming effect in systemic tissue that includes storage of metabolites, e.g., fumarate and malate. These metabolites may be readily available to support energy and carbon demands during a subsequent pathogen (*Pseudomonas syringae*) infection.

Altered amino acid levels after contact with a pathogen can have multiple and possibly conflicting functions. Several amino acids are precursors of important defense metabolites, e.g., alkaloids, phenylpropanoids, and glucosinolates. On the other hand, invasive pathogens like *Pseudomonas syringae* propagate in the apoplast and are exclusively dependent on extracellular plant metabolites. Reduction of sugars and amino acids in the apoplast and, as indicated by transcript changes, likely also other nitrogen resources should be an effective defense strategy that attenuates pathogen propagation and thereby increases the efficiency of other defense mechanisms (Seifi et al., 2013; Bezruczyk et al., 2018; Schwachtje et al., 2018). Several imprinted metabolic signals have been identified that may contribute to the modulation of SAR, including a glycerol-3-phosphate-dependent yet unidentified signal, azelaic acid, dehydroabietinal, jasmonic acid, and methyl salicylate (Dempsey and Klessig, 2012). The metabolic signals that are linked to SAR or other primed responses will yield intriguing novel insights into imprinted primary metabolism/energy status and the function of such imprints for efficiently primed plant responses.

In their natural environment, plants usually face more than one type of stress. The physiological responses toward various stress combinations, simultaneous or successive, have been addressed by recent studies (reviewed in Suzuki et al., 2014). The effects on plant performance can be synergistic, neutral, or conflicting (Crisp et al., 2016; Lawas et al., 2018) and it will be a demanding task to unravel how metabolite-based priming and priming in general by a certain stress may influence plant responses to other types of stress.

# EXPERIMENTAL APPROACHES

The successful search for priming-related metabolites relies on the timing of experiments. Mostly, stress studies focus on the immediate response of the plant system toward an applied stress, but rarely focus on the long-term effects on plant metabolism. The recovery phase after a stress event is crucial for the establishment of priming and should thus be studied more extensively. The history of plants prior to stress experiments is rarely controlled and comparable between experiments. These methodology details must be described in detail to enhance reproducibility of stress experiments.

Several factors interfere with the metabolic state of a plant during the recovery phase and must be experimentally addressed. As described above, the metabolic composition of plant tissues is an integral of perceived environmental stresses (Gratani, 2014) and may lead to variation among individual plants even under standardized conditions (e.g., Sanchez et al., 2010). Metabolism is continuously regulated by the circadian clock (Farre and Weise, 2012) and this regulation also affects stress responses themselves on genetic and metabolic levels (Lu et al., 2017). For example, glucosinolate accumulation follows the circadian rhythm in *Arabidopsis* and jasmonic acid-based defenses are synchronized with the likelihood of herbivore attack (Goodspeed et al., 2012). In return, several biotic stressors have recently been shown to influence the cycle length of the circadian clock, e.g., pathogens and insects (Sharma and Bhatt, 2015; Joo et al., 2018; Li et al., 2018). This requires experimental setups with extended sampling time points during the day. Also, the influence of the ontogenetic stage, i.e., the effects of endogenous physiological aging mechanisms, on induced metabolic responses should be addressed. Furthermore, as described above, priming-related metabolites may be stored in certain cell compartments (e.g., the vacuole). Subcellular localization of metabolites is difficult to assess but can be addressed by non-aqueous fractionation (Klie et al., 2011). Because temporal effects are essential for the assessment of induced, imprinted, and primed responses, care should be taken to design time-series experiments with extended and high temporal resolution including the coverage of diurnal changes. High replication is advised due to varying histories and developmental variation of individual plants (e.g., Peters et al., 2018); this applies particularly to field experiments. Further, the high chemodiversity of plant metabolites entails many different chemical properties. To find new candidates for priming, the application of multiple chromatography systems for untargeted metabolic profiling should be taken into account (e.g., Nakabayashi and Saito, 2015; Vasilev et al., 2016).
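One way to make the notion of a metabolic imprint operational in such time-series data is sketched below in Python. The sketch is only illustrative: the function names, the two-fold threshold, and all numeric values are hypothetical and not taken from the cited studies, and a real analysis would add replication-aware statistics. It compares metabolite levels at the end of recovery with time-matched controls (sampled at the same time of day, to account for diurnal changes) and reports the metabolites that still deviate.

```python
import math
from statistics import mean


def log2_fold_change(recovered, control):
    """log2 ratio of mean metabolite level in recovered vs. control plants."""
    return math.log2(mean(recovered) / mean(control))


def find_imprints(recovery, control, threshold=1.0):
    """Return {metabolite: log2FC} for pools that still differ from
    time-matched controls at the end of the recovery phase.

    threshold is an assumed cut-off (|log2FC| >= 1, i.e., two-fold).
    """
    imprints = {}
    for metabolite, values in recovery.items():
        lfc = log2_fold_change(values, control[metabolite])
        if abs(lfc) >= threshold:
            imprints[metabolite] = lfc
    return imprints


# Hypothetical relative pool sizes (arbitrary units, three replicates each).
recovery = {
    "proline": [5.0, 6.0, 7.0],
    "sucrose": [1.0, 1.1, 0.9],
    "malate": [0.4, 0.5, 0.6],
}
control = {
    "proline": [2.0, 3.0, 1.0],
    "sucrose": [1.0, 1.0, 1.0],
    "malate": [2.0, 2.0, 2.0],
}

imprints = find_imprints(recovery, control)
for name, lfc in sorted(imprints.items()):
    print(f"{name}: log2FC = {lfc:+.2f}")
```

In this toy data set, proline remains elevated and malate remains depleted after recovery, whereas sucrose has returned to control levels and is not reported.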

# CONCLUSION

Recent publications tackle the study of metabolic imprints by analyses of recurrent perturbations or recovering plant systems and discover functions of novel primed metabolites and metabolic pathways (Gamir et al., 2014a; Balmer et al., 2015). Intensified research on the potential functions of metabolic imprints should be highly fruitful and yield novel insights into priming phenomena. This view is supported by recent findings that demonstrate surprisingly diverse effects of metabolites on stress metabolism, signaling, and transcription. The vast chemical diversity of plants will likely yield new candidates of metabolic regulation or priming.

Currently, the knowledge of the short- to long-term kinetics of metabolic imprints is fragmented. This fact renders vague the link between observed metabolic imprints and their potential function as priming signals or memory of past stress events. Except for the known signaling metabolites that are involved in primed plant responses, the nature, characteristics, and role of metabolic imprinting or priming remain mostly unknown not least because stress recovery is rarely investigated in depth by studies employing modern large-scale metabolomic, proteomic, transcriptomic, or epigenetic tools. From advanced analyses of metabolic imprints, we expect to discover new priming mechanisms and to gain insight into the major contributions of metabolism to priming and potentially short-lived or even longer-lasting non-neural, cellular memory.

# AUTHOR CONTRIBUTIONS

JS and JK developed the concepts and wrote the manuscript with contributions of all other co-authors.

## REFERENCES


## FUNDING

We acknowledge the Max-Planck Society and the German Research Foundation (DFG) for funding the Collaborative Research Centre 973 "Priming and Memory of Organismic Responses to Stress" (www.sfb973.de). We thank the National Council for the Improvement of Higher Education—CAPES of Brazil—for the scholarship provided to AF.

## ACKNOWLEDGMENTS

We acknowledge the long-standing support by Prof. Dr. L. Willmitzer, Prof. Dr. M. Stitt, and Prof. Dr. R. Bock (Max-Planck-Institute of Molecular Plant Physiology, Potsdam, Germany).


is the vacuolar malate carrier. *PNAS* 100, 11122–11126. doi: 10.1073/pnas.1832002100


heat stress and recovery in the photosynthetic model organism *Chlamydomonas reinhardtii*. *Plant Cell* 26, 4270–4297. doi: 10.1105/tpc.114.130997


environmental metabolomics of plants. *Sci. Rep.* 6:29265. doi: 10.1038/srep29265


Zuther, E., Schaarschmidt, S., Fischer, A., Erban, A., Pagter, M., Mubeen, U., et al. (2018). Molecular determinants of increased freezing tolerance due to low temperature memory in Arabidopsis. *Plant Cell Environ.* doi: 10.1111/pce.13502

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Schwachtje, Whitcomb, Firmino, Zuther, Hincha and Kopka. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# The Recruitment Model of Metabolic Evolution: Jasmonate-Responsive Transcription Factors and a Conceptual Model for the Evolution of Metabolic Pathways

### Tsubasa Shoji\*

*Department of Biological Science, Nara Institute of Science and Technology (NAIST), Ikoma, Japan*

#### *Edited by:*

*Philipp Zerbe, University of California, Davis, United States*

#### *Reviewed by:*

*Justin William Walley, Iowa State University, United States; Nobutaka Mitsuda, National Institute of Advanced Industrial Science and Technology (AIST), Japan; Jacob Pollier, Flanders Institute for Biotechnology, Belgium*

> *\*Correspondence: Tsubasa Shoji t-shouji@bs.naist.jp*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

> *Received: 21 November 2018; Accepted: 12 April 2019; Published: 14 May 2019*

#### *Citation:*

*Shoji T (2019) The Recruitment Model of Metabolic Evolution: Jasmonate-Responsive Transcription Factors and a Conceptual Model for the Evolution of Metabolic Pathways. Front. Plant Sci. 10:560. doi: 10.3389/fpls.2019.00560*

Plants produce a vast array of structurally diverse specialized metabolites with various biological activities, including medicinal alkaloids and terpenoids, from relatively simple precursors through a series of enzymatic steps. Massive metabolic flow through these pathways usually depends on the transcriptional coordination of a large set of metabolic, transport, and regulatory genes known as a regulon. The coexpression of genes involved in certain metabolic pathways in a wide range of developmental and environmental contexts has been investigated through transcriptomic analysis, which has been successfully exploited to mine the genes involved in various metabolic processes. Transcription factors are DNA-binding proteins that recognize relatively short sequences known as *cis*-regulatory elements residing in the promoter regions of target genes. Transcription factors have positive or negative effects on gene transcription mediated by RNA polymerase II. Evolutionarily conserved transcription factors of the APETALA2/ETHYLENE RESPONSE FACTOR (AP2/ERF) and basic helix-loop-helix (bHLH) families have been identified as jasmonate (JA)-responsive transcriptional regulators of unrelated specialized pathways in distinct plant lineages. Here, I review the current knowledge and propose a conceptual model for the evolution of metabolic pathways, termed "recruitment model of metabolic evolution." According to this model, structural genes are repeatedly recruited into regulons under the control of conserved transcription factors through the generation of cognate *cis*-regulatory elements in the promoters of these genes. This leads to the adjustment of catalytic activities that improve metabolic flow through newly established passages.

Keywords: alkaloids, *cis*-regulatory element, jasmonates, recruitment model of metabolic evolution, regulon, specialized metabolism, terpenoids, transcription factor

# INTRODUCTION

Biological processes generally depend on the coordinated expression of multiple genes. Transcription factors play a central role in controlling the RNA polymerase-mediated transcription of downstream genes. These genes form gene networks, or regulons, with transcription factors recognizing specific cis-regulatory elements in the promoter regions of these groups of target genes. Complex, often long, metabolic pathways rely on the proper functioning of a large series of metabolic enzymes, membrane transporters, and other proteins. The activity of transcription factors often underlies the coordination of gene expression during various metabolic processes (Patra et al., 2013; Chezem and Clay, 2016; Zhou and Memelink, 2016).

A diverse range of specialized metabolites, such as bioactive alkaloids and terpenoids, are produced and accumulate in various plant species. These metabolites contribute to plant defense and reproduction in a changing environment. Due to their useful attributes, many natural products derived from plants or phytochemicals are utilized as medicines, drugs, dyes, perfumes, or other industrial materials. In contrast to universally present primary metabolites, the occurrence of specialized (or so-called secondary) metabolites is usually restricted to certain taxonomic groups. Metabolite levels are often highly variable, even within a single species or individual plant, reflecting the temporal and spatial dynamics of their production. Genomic and molecular approaches, often involving coexpression network analysis to select candidate genes (Yonekura-Sakakibara and Saito, 2013), have greatly facilitated the identification of structural genes involved in metabolic pathways. By contrast, the regulatory aspects of these genes, such as regulatory mechanisms at the transcriptional and other levels, have remained unexplored, representing a promising area of research in the coming years.
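The coexpression-based selection of candidate genes mentioned above can be illustrated with a minimal sketch. It assumes that expression profiles are available as equal-length lists of values across samples and ranks genes by their Pearson correlation with a known pathway gene (the "bait"). The gene names, expression values, and the 0.8 threshold are hypothetical and are not taken from the cited studies, which use more elaborate network methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rank_coexpressed(bait, expression, threshold=0.8):
    """Return (gene, r) pairs with r >= threshold, most strongly correlated first."""
    scores = [
        (gene, pearson(expression[bait], profile))
        for gene, profile in expression.items()
        if gene != bait
    ]
    return sorted((s for s in scores if s[1] >= threshold), key=lambda s: -s[1])


# Hypothetical toy data: expression of four genes across four samples.
expression = {
    "bait_enzyme": [1, 2, 8, 9],
    "candidate_A": [2, 4, 16, 18],   # rises and falls with the bait
    "candidate_B": [9, 8, 2, 1],     # anti-correlated with the bait
    "candidate_C": [5, 5, 5, 6],     # nearly flat
}
candidates = rank_coexpressed("bait_enzyme", expression)
```

In this toy case only `candidate_A` passes the threshold. With real data, the threshold and the choice of correlation measure would need to be justified for the dataset at hand.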

The central roles of MYB and basic helix-loop-helix (bHLH) family transcription factors in regulating the anthocyanin and related flavonoid pathways of many plant species have been a cornerstone example of the persistence of metabolic regulons comprising master transcriptional regulators with their downstream structural genes (Patra et al., 2013; Chezem and Clay, 2016). Other well-studied instances include the regulation of the glucosinolate pathway by MYB family factors in Arabidopsis (Chezem and Clay, 2016). In notable contrast to the regulators that target a specific metabolic pathway (or set of related pathways), jasmonate (JA)-responsive factors in certain subgroups of the APETALA2/ETHYLENE RESPONSE FACTOR (AP2/ERF) and bHLH families have been found to be master regulators for a diverse range of specialized pathways, mostly for important alkaloids and terpenoids, in distinct plant lineages (Zhou and Memelink, 2016).

JAs are phytohormones derived from the octadecanoid pathway that play central roles as signaling molecules during biotic and abiotic stress responses in plants (Goossens et al., 2016a). Many specialized pathways are readily elicited by JA treatment (Goossens et al., 2016a; Zhou and Memelink, 2016). Indeed, many phytochemicals are thought to be involved in plant defense responses against pathogens and herbivores based on the JA-dependent elicitation of their biosynthetic pathways, along with their toxicity to biological agents. The perception of JA signals and the resulting cascades leading to gene regulation primarily occur via proteasome-dependent degradation of JAZ repressor proteins and the subsequent liberation of a few key transcription factors, including bHLH family member MYC2, from JAZ-mediated repression (Thines et al., 2007; Sheard et al., 2010; Zhang et al., 2015; Goossens et al., 2016a). It is important to address how the upstream JA signaling circuit is anchored to downstream defense metabolism. A handful of the JA-responsive transcription factors of the AP2/ERF and bHLH families have been identified as missing links between the highly conserved JA signaling module and more divergent downstream pathways (Zhou and Memelink, 2016).

In this article, I provide an overview of the JA-responsive factors and their target metabolic pathways, which encompass a substantial portion of the specialized pathways for which transcriptional regulators have been defined (Patra et al., 2013; Zhou and Memelink, 2016). The identification of such evolutionarily conserved regulators targeting divergent pathways prompts me to contemplate how these metabolic regulons have been established during evolution. This evolutionary issue is discussed and a conceptual model is proposed, mainly focusing on the JA-responsive factors and their regulons.

# CLADE II, SUBGROUP IXa ERF TRANSCRIPTION FACTORS

Transcription factors of the AP2/ERF family are widespread in plants. The GCC-box (5′-AGCCGCC-3′) element is a typical sequence recognized by AP2/ERF transcription factors. The DNA-binding AP2/ERF domain contains a three-stranded β-sheet followed by an α-helix, which form a unique interface required for DNA binding (Allen et al., 1998).

A group of AP2/ERF family transcription factors, including Octadecanoid-derivative Responsive Catharanthus AP2-domain (ORCA) proteins from Catharanthus roseus (Van der Fits and Memelink, 2000; Paul et al., 2017), OpERF2 from Ophiorrhiza pumila (Udomsom et al., 2016), ERF189 and ORC1 from tobacco (Nicotiana tabacum; Shoji et al., 2010; De Boer et al., 2011), JASMONATE RESPONSIVE ERF4 (JRE4)/GLYCOALKALOID METABOLISM9 (GAME9) from tomato and potato (Cárdenas et al., 2016; Thagun et al., 2016; Nakayasu et al., 2018), and AaORA from Artemisia annua (Lu et al., 2013), are classified into clade II of subgroup IXa (Nakano et al., 2006; Shoji et al., 2010, 2013). These transcription factors are involved in regulating JA-mediated defense metabolism in various plants. The JA-responsive ERF genes are present in a wide range of eudicots, usually as multicopy genes (**Figure 1A**). Multiple ERF genes are tandemly clustered on chromosomes in some plant genomes (**Figure 1A**). The phylogenetic relationships of ERFs from different species (**Figure 1A**) imply that these gene clusters were generated independently in distinct plant families through tandem gene duplication.

# ORCAs in *C. roseus*

Terpenoid indole alkaloids (TIAs) are a large group of specialized products, including the valuable chemotherapy drugs vinblastine and vincristine. A variety of TIAs are derived from the key intermediate strictosidine, which is formed by condensation between tryptamine (a product of the shikimate pathway) and the seco-iridoid compound secologanin. TIA biosynthesis and its regulation have been intensively studied in the medicinally important species C. roseus (Apocynaceae) (Zhu et al., 2014). In C. roseus, ORCA2 and ORCA3 function as transcriptional regulators that induce the expression of TIA biosynthesis genes, including strictosidine synthase and tryptophan decarboxylase, encoding key enzymes in this pathway (Menke et al., 1999; Van der Fits and Memelink, 2000; Li et al., 2013). ORCA3 is physically linked to ORCA4 and ORCA5, forming a gene cluster in the genome (**Figure 1A**, Paul et al., 2017). ORCA4 shares an overlapping function with ORCA2 and ORCA3, but ORCA4 also targets additional TIA genes (Paul et al., 2017). Unlike ORCA2 (Li et al., 2013) and ORCA3 (Van der Fits and Memelink, 2000), the overexpression of ORCA4 causes a drastic increase in TIA accumulation (Paul et al., 2017). C. roseus MYC2 (CrMYC2) directly upregulates the expression of ORCA3 by recognizing a G-box-like element in its promoter (Zhang et al., 2011), and it also coregulates TIA structural genes with ORCA3 (Paul et al., 2017). In addition to their role in transcriptional regulation, ORCAs and CrMYC2 are activated by phosphorylation by a kinase involved in a JA-activated MAP kinase cascade (Paul et al., 2017).

**Figure 1** (caption fragment): … tobacco cultivar (Kajikawa et al., 2017) is highlighted in light green.

# OpERF2 in *Ophiorrhiza pumila*

Camptothecin is an antitumor TIA that functions by inhibiting DNA topoisomerase I activity. This clinically important TIA is produced by various angiosperms from taxonomically distant families, including Ophiorrhiza pumila (Rubiaceae) (Sirikantaramas et al., 2007). OpERF2 was originally isolated from O. pumila hairy roots. The suppressed expression of this gene resulted in the reduced expression of genes involved in seco-iridoid and upstream methylerythritol phosphate (MEP) pathways, which supply secologanin for downstream camptothecin production, although this did not have a significant impact on TIA accumulation (Udomsom et al., 2016).

# ERF189 and ORC1 in Tobacco

Nicotine is composed of two heterocyclic rings, the ornithine-derived pyrrolidine ring and the nicotinate-derived pyridine ring. In tobacco (Nicotiana tabacum, Solanaceae), this toxic alkaloid is produced in roots and primarily accumulates in leaves, functioning as a defense compound against herbivores (Dewey and Xie, 2013). Tobacco ERF189, ORC1, and related ERF genes are clustered together in the tobacco genome (**Figure 1A**, Shoji et al., 2010; De Boer et al., 2011; Kajikawa et al., 2017). A cluster of ERFs including ERF189 and ORC1 was found to be deleted to a large extent in a tobacco cultivar with low nicotine content (**Figure 1A**, Shoji et al., 2010; Kajikawa et al., 2017). Although not yet proven, ERF189 is considered to be a primary transcriptional regulator of nicotine biosynthesis, given that its expression profiles are similar to those of the downstream biosynthesis genes: strong expression in roots (Kajikawa et al., 2017), no induction in response to NaCl (Shoji and Hashimoto, 2015; Kajikawa et al., 2017), and the suppression of JA-dependent induction by ethylene (Shoji et al., 2000, 2010). A large series of metabolic and transport genes in this pathway are upregulated by ERF189, which recognizes P-box elements, but not the typical GCC-box elements, in their promoters (Shoji et al., 2010, 2013; Shoji and Hashimoto, 2011a). Tobacco MYC2 regulates the expression of ERF189 and directly activates the transcription of nicotine biosynthesis genes together with ERF189 (Shoji and Hashimoto, 2011b; Zhang et al., 2012).

# JRE4/GAME9 in Tomato and Potato

Steroidal glycoalkaloids (SGAs) are cholesterol-derived, nitrogen-containing metabolites found in the inedible parts of Solanaceae plants such as tomato (Solanum lycopersicum) and potato (S. tuberosum) (Cárdenas et al., 2015). In the tomato and potato genomes, the JRE4/GAME9 gene is present in a cluster with related ERF genes (Cárdenas et al., 2016; Thagun et al., 2016). JRE4/GAME9 regulates nearly an entire series of SGA metabolic steps, including those in the upstream isoprenoid-producing mevalonate (MVA) pathway (Cárdenas et al., 2016; Thagun et al., 2016; Nakayasu et al., 2018). A loss of JRE4/GAME9 function drastically reduced SGA accumulation and resistance to chewing insects in tomato, demonstrating the major role of this transcription factor in defense-related SGA formation (Nakayasu et al., 2018). Tomato MYC2 and JRE4/GAME9 synergistically activated the promoters of SGA genes in tobacco protoplasts (Cárdenas et al., 2016). In agreement with the results of promoter binding studies, cognate cis-regulatory elements are significantly enriched in the proximal promoter regions of SGA biosynthesis genes, supporting the direct regulation of these genes by JRE4/GAME9 (Thagun et al., 2016). A comparison of the genomes of ancestral and cultivated species of the Solanum genus pointed to the possible selection of certain alleles of JRE4/GAME9 during domestication, which might have contributed to the decrease in antinutritional SGA levels in cultivated Solanum species (Hardigan et al., 2017; Zhu et al., 2018).
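One common way to quantify the enrichment of a cis-regulatory element among pathway genes is a one-sided hypergeometric test. The sketch below illustrates that general idea only; it is not necessarily the statistic used in the cited study, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, genes_with_motif,
                           pathway_genes, pathway_with_motif):
    """P(X >= pathway_with_motif) when pathway_genes are drawn at random,
    without replacement, from a genome in which genes_with_motif of the
    total_genes carry the motif in their proximal promoter."""
    upper = min(genes_with_motif, pathway_genes)
    tail = sum(
        comb(genes_with_motif, k)
        * comb(total_genes - genes_with_motif, pathway_genes - k)
        for k in range(pathway_with_motif, upper + 1)
    )
    return tail / comb(total_genes, pathway_genes)


# Hypothetical example: 10 genes in total, 3 carry the motif, and a
# 2-gene pathway contains at least 1 motif-bearing gene.
p_value = hypergeom_enrichment_p(10, 3, 2, 1)
```

A small p-value indicates that the motif occurs in the pathway promoters more often than expected by chance. On a real genome the background set (for example, all annotated promoters of a fixed length) has to be chosen with care.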

# AaORA in *Artemisia annua*

Artemisinin, a sesquiterpene lactone produced by the traditional Chinese herb Artemisia annua (Asteraceae), has been exploited as an effective anti-malaria agent (Tang et al., 2014). A. annua ORA (AaORA) is a transcriptional regulator of artemisinin biosynthesis that upregulates the expression of genes involved in this pathway, including amorpha-4,11-diene synthase, CYP71AV1, and double bond reductase 2 (Lu et al., 2013). AaORA is specifically expressed in the trichomes of aerial organs, including artemisinin-producing glandular trichomes (Olofsson et al., 2011; Lu et al., 2013). Since numerous transcription factors from various families (e.g., bHLH, ERF, bZIP, and WRKY), in addition to AaORA, were shown to regulate artemisinin biosynthesis (Tang et al., 2014; Lv et al., 2017), the relative importance of each transcription factor and their functional relationships in this process should be addressed.

# *Arabidopsis thaliana* ERF13

AtERF13 is the only member of clade II, subgroup IXa in Arabidopsis thaliana (Brassicaceae). In contrast to the ERFs mentioned above, AtERF13 has not yet been shown to be involved in a specific metabolic pathway. AtERF13 is induced in response to a range of biotic and abiotic stresses, such as JA, wounding, insect feeding, colonization of beneficial bacteria, high osmolality, and NaCl (Lee et al., 2010; Sogabe et al., 2011; Srivastava et al., 2012; Schweizer et al., 2013). AtERF13 binds to COUPLING ELEMENT1 (CE1), a cis-regulatory element required for abscisic acid (ABA)-responsive gene expression, and the overexpression of AtERF13 confers increased sensitivity to ABA in Arabidopsis, suggesting this gene functions in abiotic stress resistance (Lee et al., 2010). AtERF13 is also involved in resistance to insect herbivores, acting downstream of MYC2 (a central player in JA signaling) and mediating the expression of a subset of MYC2-regulated defense genes (Schweizer et al., 2013). AtERF13 is phosphorylated at its tyrosine residues, as revealed by phosphoproteomic analysis, suggesting that its activity is regulated via post-translational modification (Nemoto et al., 2015).

# SUBGROUP IVa bHLH TRANSCRIPTION FACTORS

Another group of JA-responsive transcription factors is attracting attention as regulators of metabolic pathways in diverse plants. These transcription factors include bHLH IRIDOID SYNTHESIS1 (BIS1) and BIS2 from C. roseus (Van Moerkercke et al., 2015, 2016), TRITERPENE SAPONIN BIOSYNTHESIS ACTIVATING REGULATOR1 (TSAR1) and TSAR2 from Medicago truncatula (Mertens et al., 2016a), TSAR-LIKE1 (TSARL1) from Chenopodium quinoa (Jarvis et al., 2017), GubHLH3 from Glycyrrhiza uralensis (Tamura et al., 2018), and bHLH18, bHLH19, bHLH20/NAI1, and bHLH25 from Arabidopsis (Matsushima et al., 2004), which all belong to subgroup IVa of the bHLH family (**Figure 1B**, Heim et al., 2003; Goossens et al., 2016b).

Unlike AP2/ERF family members, which are specific to plants, the bHLH transcription factor family is widely present in eukaryotic organisms and has expanded, especially in land plants (Feller et al., 2011). The signature bHLH domain is composed of an N-terminal basic region that binds to negatively charged DNA and a helix-loop-helix motif responsible for protein dimerization. bHLH transcription factors form homo- or heterodimers that typically bind to E-box (5′-CANNTG-3′) elements, such as G-box (5′-CACGTG-3′) and N-box (5′-CACGAG-3′) elements, in the promoter regions of their target genes.
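As a minimal sketch of how such consensus elements can be located in a promoter, the following code scans a DNA sequence on both strands for exact matches to the GCC-box, E-box, G-box, and N-box sequences quoted in the text. The function names and the example sequence are hypothetical. Real binding-site prediction usually relies on position weight matrices and experimental validation rather than exact consensus matching.

```python
import re

# Consensus sequences as quoted in the text (N = any base).
MOTIFS = {
    "GCC-box": "AGCCGCC",
    "E-box": "CANNTG",
    "G-box": "CACGTG",
    "N-box": "CACGAG",
}

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a sequence written with A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_promoter(promoter, motifs=MOTIFS):
    """Map each motif name to a sorted list of (0-based start, strand) hits."""
    promoter = promoter.upper()
    hits = {}
    for name, consensus in motifs.items():
        patterns = [("+", consensus)]
        rc = reverse_complement(consensus)
        if rc != consensus:  # palindromic motifs need only one scan
            patterns.append(("-", rc))
        found = []
        for strand, pattern in patterns:
            # Lookahead so that overlapping occurrences are all reported.
            regex = re.compile("(?=" + "".join(IUPAC[b] for b in pattern) + ")")
            found.extend((m.start(), strand) for m in regex.finditer(promoter))
        hits[name] = sorted(found)
    return hits


# Hypothetical promoter fragment, for illustration only.
example_hits = scan_promoter("TTAGCCGCCAACACGTGTTCTCGTGAA")
```

In this made-up fragment the scan reports one GCC-box on the forward strand, one G-box (which is also counted as an E-box), and one N-box on the reverse strand.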

# BISs in *C. roseus*

In addition to ORCAs and CrMYC2, BIS1 and BIS2, a pair of homologous JA-responsive bHLH transcription factors, are involved in regulating TIA formation in C. roseus. BISs specifically act on a branch of the TIA pathway that supplies the seco-iridoid intermediate, secologanin, for incorporation into TIAs. Overexpression of BIS1 or BIS2 results in the upregulation of genes involved in seco-iridoid and upstream MEP pathways, thereby increasing the accumulation of downstream TIAs (Van Moerkercke et al., 2015, 2016). The finding that BIS2 is induced by BIS1 or BIS2 overexpression points to the existence of a positive feedback loop (Van Moerkercke et al., 2016). In contrast to subgroup IIIe MYC2 transcription factors, BISs cannot interact with JAZ proteins and are thus not direct targets of the repressors integrated into the JA signaling module (Van Moerkercke et al., 2016).

# TSARs in *Medicago truncatula*

The model legume plant Medicago truncatula (Fabaceae) produces oleanane-type triterpenoid saponins. These amphipathic glycosides, containing triterpenoid aglycones, exhibit a diverse range of biological activities (Osbourn et al., 2011). In M. truncatula, TSAR1 and TSAR2, two homologous bHLH transcription factors, are JA-responsive transcriptional regulators of triterpenoid saponin biosynthesis (Mertens et al., 2016a). While the isoprenoid-producing MVA pathway is commonly targeted by both TSARs, TSAR1 and TSAR2 specifically regulate two distinct downstream branches of this pathway, producing nonhemolytic and hemolytic saponins, respectively (Mertens et al., 2016a). TSARs activate the gene encoding 3-Hydroxy-3-Methylglutaryl-CoA Reductase, a rate-limiting enzyme in the MVA pathway, by directly recognizing the N-box element in its promoter (Mertens et al., 2016a).

# TSARL1 in *Chenopodium quinoa*

Chenopodium quinoa (Chenopodiaceae), or quinoa, is a staple food crop in Andean countries. Quinoa seeds have high nutritional value, but the bitterness of the seeds due to the accumulation of triterpenoid saponins (oleanane-type) is disadvantageous (Kuljanabhagavad et al., 2008). In C. quinoa, TSARL1 and TSARL2 are clustered together (**Figure 1B**) and are expressed in seeds and roots, respectively (Jarvis et al., 2017). In sweet quinoa strains, loss-of-function mutations of TSARL1, including one that appears to cause alternative splicing, allowed the down-regulation of genes involved in the production of the antinutritional saponins (Jarvis et al., 2017).

# GubHLH3 in *Glycyrrhiza uralensis*

The medicinal legume Glycyrrhiza uralensis (Fabaceae) is rich in oleanane-type triterpenoid saponins, such as glycyrrhizin, which is used as a pharmaceutical compound and sweetener, as well as soyasaponins (Hayashi and Sudo, 2009). G. uralensis bHLH3 (GubHLH3), a JA-responsive bHLH transcription factor, upregulates the expression of soyasaponin biosynthesis genes, such as those encoding CYP93E3 and CYP72A566, which are involved in oxidative modifications of the triterpenoid backbone (Tamura et al., 2018). Consistently, the overexpression of GubHLH3 increased the levels of soyasapogenol B and other intermediates of the soyasaponin pathway in G. uralensis hairy roots (Tamura et al., 2018).

# bHLH18, bHLH19, bHLH20/NAI1, and bHLH25 in *Arabidopsis*

In Arabidopsis, four genes, bHLH18, bHLH19, bHLH20/NAI1, and bHLH25, encode subgroup IVa bHLH transcription factors; three of them (all except bHLH25) form a gene cluster (**Figure 1B**). NAI1, which resides in this three-gene cluster, is indispensable for the formation of the ER body, an ER-derived rod-shaped organelle found in plants of the Brassicales order (Matsushima et al., 2004). ER bodies are constitutively present in Arabidopsis seedlings and roots. By contrast, in rosette leaves, wounding and JA treatment induce the formation of this defense-related organelle, which accumulates large amounts of β-glucosidases, whose activities increase when the compartment is disrupted (Nakano et al., 2014). NAI1 regulates the expression of genes encoding proteins required for ER body formation and activity, including PYK10, a major β-glucosidase in this organelle. PYK10 functions as a myrosinase that hydrolyzes indole glucosinolates, a group of important defense compounds in Arabidopsis and related species (Nakano et al., 2017). The phylogenetic co-occurrence of ER bodies and indole glucosinolates and the co-expression of the associated genes also support the functional coordination between this organelle and glucosinolate metabolism (Nakano et al., 2017).

A previous study suggested the involvement of bHLH18, bHLH19, bHLH20/NAI1, and bHLH25 in JA-mediated inhibition of iron uptake in Arabidopsis (Cui et al., 2018). JA represses iron uptake by promoting the degradation of FIT/bHLH29, a central transcriptional regulator of iron-uptake genes critical to metal homeostasis (Cui et al., 2018). The four subgroup IVa bHLHs interact with FIT protein and promote its JA-stimulated removal through proteasome-dependent degradation (Cui et al., 2018).

# THE GAIN OF *cis*-REGULATORY ELEMENTS

The recruitment of metabolic genes into regulons likely requires the gain of transcription factor-binding cis-regulatory elements in the appropriate promoter regions. Such a process is fairly likely, considering the relatively frequent, simple generation of short sequence elements in noncoding promoter regions that can have a degree of redundancy and acquire functions through mutational changes, such as point mutations and transpositions (Wray, 2007; Swinnen et al., 2016).
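The plausibility of this process can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a random promoter sequence with equal, independent base frequencies, which is a simplification not made in the original text, and the 1,000-bp promoter length is an arbitrary illustrative value. It estimates how many exact matches to a short element are expected by chance, and how many sites lie a single point mutation away from becoming a match.

```python
def specified_positions(consensus):
    """Number of positions that require one particular base (N = any base)."""
    return sum(1 for base in consensus if base != "N")


def expected_sites(consensus, promoter_length, both_strands=True):
    """Expected number of exact matches in a random sequence.

    Pass both_strands=False for palindromic motifs, whose two strands
    would otherwise be counted twice.
    """
    windows = promoter_length - len(consensus) + 1
    strands = 2 if both_strands else 1
    return strands * windows * 0.25 ** specified_positions(consensus)


def expected_near_sites(consensus, promoter_length, both_strands=True):
    """Expected number of sites exactly one substitution away from a match.

    Each specified position can carry any of 3 wrong bases, giving
    3 * k one-mismatch variants, each as likely as the consensus itself.
    """
    k = specified_positions(consensus)
    return 3 * k * expected_sites(consensus, promoter_length, both_strands)


# GCC-box (5'-AGCCGCC-3') in a hypothetical 1,000-bp promoter.
exact = expected_sites("AGCCGCC", 1000)
near = expected_near_sites("AGCCGCC", 1000)
```

Under these assumptions an exact 7-bp match is uncommon in any single promoter, but sites one substitution away are expected in most promoters, which is consistent with the idea that new elements can arise through simple point mutations.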

A case study of a tobacco gene involved in nicotine biosynthesis supports such a scenario. Quinolinate phosphoribosyltransferase (QPT) is a primary metabolic enzyme involved in NAD biosynthesis in all organisms. However, in tobacco, QPT also supplies a significant amount of intermediates required for downstream nicotine biosynthesis (**Figure 2A**). To satisfy such a metabolic demand, tandem duplication of QPT has occurred in the Nicotiana lineage, generating a cluster of QPT1 and QPT2 genes (**Figure 2**, Shoji and Hashimoto, 2019). These genes are thought to be involved in NAD and nicotine biosynthesis, respectively, based on their distinct expression patterns (Shoji and Hashimoto, 2011a). QPT2 harbors multiple ERF189-binding P-box elements in its promoter required for its transcriptional activation by ERF189 (Shoji and Hashimoto, 2011a). The progressive acquisition of these elements after gene duplication has ensured the involvement of QPT2 in the ERF189-controlled regulon of the nicotine pathway (**Figure 2B**). The frequent occurrence of JRE4/GAME9 binding elements in the proximal promoter regions of SGA biosynthesis genes in tomato implies that such a notion is also applicable to these genes (Thagun et al., 2016).

**Figure 2** (caption, beginning missing): … (blue) and *QPT2* (red) genes, which are thought to contribute to NAD and nicotine formation, respectively (Shoji and Hashimoto, 2011a). *QPT2* and downstream steps specific to nicotine formation (red arrows) are regulated by ERF189 in tobacco. Steps including multiple enzymes and undefined reactions are represented by broken arrows. (B) Schematic depiction of the evolution of *QPT* genes in the tomato and tobacco lineages. In the tobacco genome, *QPT1* and *QPT2*, which are thought to have arisen through tandem duplication, are located ∼75 kb apart on the chromosome. Tomato contains one *QPT* gene copy in a genomic region syntenic to the tobacco cluster (Shoji and Hashimoto, 2019). One of the duplicates, *QPT2*, has become regulated by an evolutionarily conserved ERF transcription factor by gaining *ERF*-binding *cis*-regulatory elements in its promoter. Three functional P-box elements bound by ERF189 are present in the proximal promoter region of *QPT2* in extant tobacco (Shoji and Hashimoto, 2011a).

# EVOLUTIONARY CHANGES IN TRANSCRIPTION FACTORS

In contrast to the gains (and losses) of cis-regulatory elements that strongly contribute to the rewiring of gene regulatory networks, mutational changes in transcription factors, which have profound, pleiotropic effects on numerous downstream genes, are relatively constrained. Nevertheless, there are also examples of the modification of the functionalities and expression patterns of trans-acting factors (Maerkl and Quake, 2009).

A series of subgroup IXa ERFs have divergent DNA-binding specificities to GCC-box elements and to related but distinct P-box and CS1-box elements. Such distinct binding specificities can be accounted for by a few amino acid differences in a small stretch of the DNA-binding domain (**Figure 3**, Shoji et al., 2013). It appears that a progressive evolutionary trajectory has led from transcription factors that recognize only a canonical GCC-box to Nicotiana-specific ERF189-type transcription factors that bind to P-box but not GCC-box elements via functional intermediates, such as ORCA3-type transcription factors, which bind to multiple elements, including both GCC-box and P-box elements (**Figure 3**, Shoji et al., 2013). The development of unique combinations of cis-elements and trans-factors may have been indispensable for avoiding missed connections among unrelated regulatory circuits and, thus, the establishment of lineage-specific specialized pathways. This process appears to have occurred independently of the development of a broad range of ERFs targeting GCC-box elements involved in general defense responses.

Nicotine and SGA biosynthesis pathways in distinct lineages of the same Solanaceae family, which are regulated by orthologous ERFs, share many properties, such as JA-dependent induction and suppression by ethylene (Shoji et al., 2010; Nakayasu et al., 2018). By contrast, the site of their biosynthesis differs between the two lineages: nicotine is synthesized exclusively in tobacco roots, whereas SGAs are produced in nearly all inedible parts of tomato and potato, including leaves and roots. This difference depends on the differential expression patterns of the transcriptional regulators ERF189 and JRE4/GAME9 (Cárdenas et al., 2016; Thagun et al., 2016; Kajikawa et al., 2017). To guarantee the function of each group of metabolites in plant defense, the tissue-specific expression patterns of these regulators may have developed independently after the separation of the two lineages, whereas their responses to JA and other features have been conserved between lineages. These ideas point to the elastic evolution of sets of a particular transcription factor and its downstream metabolic genes as independent units with specialized roles in chemical defense in a lineage-specific manner.

Despite the functional differences noted above, the JA-responsive transcription factors are considered components of conserved regulatory mechanisms present in various species. For instance, in transgenic tomato plants, a promoter reporter of tobacco QPT2 regulated by ERF189 was expressed in a JA-responsive and cell type-specific manner, as in tobacco, and this expression was mediated by JRE4 (Shoji and Hashimoto, 2019). TSARs and BISs were shown to be functionally exchangeable, as the orthologous bHLH factors regulate each other's target genes in C. roseus and M. truncatula (Mertens et al., 2016b). Both of these examples clearly point to the interchangeable nature of these factors among species with entirely different pathways (e.g., ornithine-derived nicotine vs. MVA pathway-derived SGAs for ERF189 and JRE4, and MVA pathway-derived saponins vs. MEP pathway-derived TIAs for TSARs and BISs), supporting the functional conservation of these factors.

# RECRUITMENT MODEL OF METABOLIC EVOLUTION

Metabolism is a fundamental requirement of all living organisms. Primeval metabolism, or the simple conversion of substances, is thought to rely on a small number of proteinaceous or other catalysts with low reaction specificities and efficiencies (**Figure 4**i). Metabolic systems have evolved toward increasing order and efficiency (Weng et al., 2012). Contemporary primary metabolism, which was established early and has been maintained, is carried out by robust systems mediated by enzymes with high specificities and efficiencies (**Figure 4**ii).

Enzymes involved in specialized metabolism are thought to have emerged through duplication, beginning with sophisticated primary enzymes as progenitors, followed by mutational changes in the duplicates and leading to neofunctionalization of the enzymes (**Figure 4**iii). Relaxed constraints on the specificities and efficiencies of newly generated duplicates allow them to explore a wide range of catalytic possibilities. The promiscuity of multifunctional enzymes, with broader specificities emerging through this process, contributes to the expansion of the metabolic web (**Figure 4**iii). This web even includes the virtual activities of hidden enzymes (dotted lines in **Figure 4**) that do not contribute to actual metabolic flow due to limited substrate availability or marginal enzymatic activity; such hidden activities are not readily eliminated by (and are more tolerant to) selection.

If these changes are neutral or at least not deleterious, and are thus not eliminated by purifying selection, the metabolic grids continue to build through neutral evolutionary processes such as genetic drift. It seems reasonable that autotrophic plants accumulate low-molecular-weight metabolites derived from photosynthetic assimilates, which usually have antioxidant properties to some extent and are often sequestered in cellular compartments such as vacuoles. These natural products, including those that accumulate in trace amounts, do not necessarily have adaptive significance (Koonin, 2016).

The emergence of specialized pathways allowing for the efficient production and accumulation of substantial amounts of metabolites requires the selection of specific flows from the expanded metabolic web, again increasing order and efficiency (**Figure 4**iv). This process largely relies on positive natural selection rather than neutral evolution, which is dependent on randomness. I propose a conceptual model, the recruitment model of metabolic evolution, to describe this process. According to this model, structural genes are repeatedly recruited into regulons under the control of evolutionarily conserved transcription factors (which should be activators rather than repressors), such as the JA-responsive ERFs and bHLHs (**Figure 5**). When a gene in the metabolic web becomes regulated by a transcription factor by obtaining cognate cis-regulatory elements, metabolic flows are generated or altered accordingly. Although such events readily occur at high frequency, most of these mutational changes are immediately eliminated and are not maintained in the population. On the other hand, when the newly generated flows result in the accumulation of beneficial products, such as defense compounds, conferring adaptive advantages to the plant, the probability that such changes will be maintained and eventually fixed in the population increases tremendously. Once the beneficial flows occur, the likelihood that mutational events enhancing these flows (such as the transcriptional activation of other metabolic genes and the optimization of catalytic specificities and efficiencies associated with the flows) will be retained is expected to rise markedly as well. An initial, mostly accidental, event creating new metabolic flows may trigger cascading mutational changes associated with and improving the flows, eventually leading to the establishment of metabolic regulons and pathways, perhaps within a relatively short evolutionary timescale. The bioinformatic and mathematical bases of the model remain to be explored.
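The contrast between neutral and positively selected changes can be made concrete with a standard population-genetic result that is not part of the model as proposed here: Kimura's diffusion approximation for the fixation probability of a new mutation. The sketch below assumes an idealized diploid population of constant size N (with effective size equal to N), a single new copy of the mutation, and additive selection with coefficient s. The parameter values in the example are arbitrary.

```python
from math import expm1


def fixation_probability(s, population_size):
    """Kimura's approximation for a new mutation at frequency 1/(2N).

    u = (1 - exp(-2s)) / (1 - exp(-4Ns)); the neutral limit (s = 0)
    is 1/(2N).
    """
    n = population_size
    if s == 0:
        return 1.0 / (2 * n)
    # expm1(x) = exp(x) - 1 keeps precision for small |x|; the two minus
    # signs from numerator and denominator cancel.
    return expm1(-2 * s) / expm1(-4 * n * s)


# Arbitrary illustrative values: N = 1,000 and a 1% selective advantage.
neutral = fixation_probability(0.0, 1000)
beneficial = fixation_probability(0.01, 1000)
fold_increase = beneficial / neutral
```

With these illustrative values, a 1% advantage raises the chance of fixation by roughly 40-fold over a neutral change. This is one simple way to read the statement that the probability of fixation "increases tremendously" once a new flow is beneficial.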

The extensive rewiring of transcriptional circuits alters metabolic regulons that were once established under the original transcription factors (**Figure 6**). During such processes, takeover of the regulons by new transcription factors (including those derived from the original transcription factors by duplication) and the associated changes in the connections in the circuits occur frequently through changes in cis-regulatory elements and transcription factors (**Figure 6**, Johnson, 2017). Contemporary regulatory networks are often complex and include multiple transcription factors, which act as either activators or repressors, and in some cases regulate only specific parts of pathways (e.g., ORCA3 regulates some but not all TIA genes). Extensive rewiring of circuits may contribute to the advent of the complicated regulatory organization found in extant metabolic pathways, which also could account for the fact that few metabolic pathways have simple regulons comprising only a single master regulator and its downstream structural genes.

Analyses of metabolic evolution have emphasized mutational changes to catalytic enzymes (Weng et al., 2012; Moghe and Last, 2015). If metabolic genes are not functionally expressed and no flows are associated with these genes, they are not subjected to positive or purifying selection. The recruitment model, which presumes that transcription factors activate genes prior to changes in catalytic activity, adequately addresses this point as well.

Metabolic evolution is driven by functional changes in catalytic enzymes and changes in the expression patterns of metabolic genes. There appear to be limits to the changes made to the catalytic functionalities of enzymes belonging to a limited number of protein families without hampering the structural and functional stability of protein frames. Therefore, changes in the combinations of metabolic genes with specific temporal and spatial expression patterns might also have significantly contributed to the rise in chemodiversity found in plants. Plants have exploited their limited repertoire of enzymes in a combinatorial manner to produce these diverse compounds.

## PERSPECTIVES OF "EVO-META" BIOLOGY

"Nothing in biology makes sense except in the light of evolution" is a famous quotation by Dr. Theodosius Dobzhansky, a Ukrainian-American evolutionary geneticist (Dobzhansky, 1973). The discovery of homeotic genes encoding a group of transcription factors that direct the organization of the body plans of vertebrates and invertebrates has helped elucidate the evolutionarily conserved mechanisms governing development, leading to the rise of Evolutionary Developmental (Evo-Devo) biology. Classic anatomy and embryology, as well as modern developmental biology, share some affinity with evolutionary biology. Paleontology based on fossil evidence is one of the main areas of focus in evolutionary biology.

Chemodiversity of specialized products in plants has been shaped through (and is a product of) evolution. Unfortunately, it is difficult to predict the hues and fragrances of ancient flowers from extinct plants. Nevertheless, the chemodiversity found in extant species and the diverse series of plant genomes, including those yet to be explored, is highly informative (Nakamura et al., 2014). A long period of collection of natural products with divergent chemical structures and biological activities, along with a better understanding of the grouping of biosynthetic pathways and associated enzymes, has led to an awareness of some sort of order behind this chemodiversity. Elucidating the molecular biology behind regulatory factors, such as master transcription factors, that orchestrate these metabolic processes is expected to reveal the universal principles ruling the metabolic processes that produce a diverse range of specialized products. The challenges of "Evo-Meta" (Evolutionary Metabolic) biology aimed at uncovering the origins of this chemodiversity are just beginning.

# AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

# FUNDING

This work was supported in part by the Japan Society for the Promotion of Science (Grants-in-Aids for Scientific Research number 17K07447 to TS).

# ACKNOWLEDGMENTS

I thank Drs. Takashi Hashimoto and Yasuyuki Yamada (NAIST) for their long-term collaboration and continuous encouragement. I appreciate the encouraging and constructive remarks by Dr. Shigetada Nakanishi (SUNBOR) on our research related to this article.

# REFERENCES

specialized metabolism in Ophiorrhiza pumila revealed by transcriptomics and metabolomics. Front Plant Sci. 7:1861. doi: 10.3389/fpls.2016.01861


alkaloid biosynthesis in Catharanthus roseus. Plant J. 67, 61–71. doi: 10.1111/j.1365-313X.2011.04575.x


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Shoji. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Chemodiversity of the Glucosinolate-Myrosinase System at the Single Cell Type Resolution

*Shweta Chhajed<sup>1,2†</sup>, Biswapriya B. Misra<sup>1,3†</sup>, Nathalia Tello<sup>1,2</sup> and Sixue Chen<sup>1,2,4,5</sup>\**

*<sup>1</sup>Department of Biology, University of Florida, Gainesville, FL, United States, <sup>2</sup>Genetics Institute, University of Florida, Gainesville, FL, United States, <sup>3</sup>Section on Molecular Medicine, Department of Internal Medicine, Center for Precision Medicine, Wake Forest School of Medicine, Winston-Salem, NC, United States, <sup>4</sup>Plant Molecular and Cellular Biology, University of Florida, Gainesville, FL, United States, <sup>5</sup>Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL, United States*

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Ryosuke Sugiyama, RIKEN Center for Sustainable Resource Science (CSRS), Japan Franziska S. Hanschen, Leibniz-Institut für Gemüse-und Zierpflanzenbau (IGZ), Germany*

*\*Correspondence:*

*Sixue Chen schen@ufl.edu*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 28 February 2019 Accepted: 25 April 2019 Published: 21 May 2019*

#### *Citation:*

*Chhajed S, Misra BB, Tello N and Chen S (2019) Chemodiversity of the Glucosinolate-Myrosinase System at the Single Cell Type Resolution. Front. Plant Sci. 10:618. doi: 10.3389/fpls.2019.00618*

Glucosinolates (GLSs) are a well-defined group of specialized metabolites, and like any other plant specialized metabolites, their presence does not directly affect plant survival in terms of growth and development. However, specialized metabolites are essential to combat environmental stresses, such as pathogens and herbivores. GLSs naturally occur in many pungent plants in the order Brassicales. To date, more than 200 different GLS structures have been characterized and their distribution differs from species to species. GLSs co-exist with classical and atypical myrosinases, which can hydrolyze GLSs into an unstable aglycone, thiohydroximate-O-sulfonate, which rearranges to produce different degradation products. GLSs, myrosinases, myrosinase interacting proteins, and GLS degradation products constitute the GLS-myrosinase (GM) system ("mustard oil bomb"). This review discusses the cellular and subcellular organization of the GM system, its chemodiversity, and functions in different cell types. Although there are many studies on the functions of GLSs and/or myrosinases at the tissue and whole plant levels, very few studies have focused on different single cell types. Single cell type studies will help to reveal specific functions that are missed at the tissue and organismal level. This review aims to highlight (1) recent progress in cellular and subcellular compartmentation of GLSs, myrosinases, and myrosinase interacting proteins; (2) molecular and biochemical diversity of GLSs and myrosinases; and (3) myrosinase interaction with its interacting proteins, and how it regulates the degradation of GLSs and thus the biological functions (e.g., plant defense against pathogens). Future prospects may include targeted approaches for engineering/breeding of plants and crops in a cell type-specific manner toward enhanced plant defense and nutrition.

Keywords: glucosinolate, myrosinase, cell type, metabolism, protein-protein interaction

# INTRODUCTION

One of the most extensively studied classes of anti-herbivore chemical defenses in plants is glucosinolates (GLSs), a group of sulfur-rich, amino acid-derived metabolites combining a β-d-glucopyranose residue linked *via* a sulfur atom to an *N*-hydroxyimino sulfate ester, which are plant-derived natural products (Halkier and Gershenzon, 2006; Halkier, 2016). GLSs are widely distributed in the order Brassicales, which includes vegetables (cabbage, cauliflower, and broccoli), spice plants supplying condiments (mustard, horseradish, and wasabi), and reference species, *Arabidopsis thaliana* (Fahey et al., 2001; Reichelt et al., 2002). Upon insect feeding or mechanical disruption, GLSs are hydrolyzed by myrosinases (thioglucoside glucohydrolase, TGG, EC 3.2.1.147) into unstable thiohydroximate-O-sulfonates, which rearrange to form different hydrolytic products such as isothiocyanates (ITCs), nitriles, and other by-products depending on the nature of the GLS side chain and the reaction conditions, such as iron, pH, and presence of myrosinase interacting proteins (Chen and Andreasson, 2001; Wittstock et al., 2016a). This GLS-myrosinase (GM) system is popularly known as "mustard oil bomb" (Lüthy and Matile, 1984; Ratzka et al., 2002). Myrosin cells (an idioblast cell type accumulating TGGs) are involved in plant defense by hydrolyzing GLSs into toxic volatiles such as ITCs or nitriles (Wittstock et al., 2003). TGGs are known to be present in all *A. thaliana* organs and were reported in *A. thaliana* and *B. napus* phloem parenchyma as well as in guard cells (Andréasson et al., 2001; Thangstad et al., 2004). In general, GLSs are enriched in "S-cells" that are found in *Arabidopsis* flower stalks and occur close to myrosin cells (Koroleva et al., 2000; Andréasson et al., 2001).

The spatial distribution of GLSs was demonstrated in *A. thaliana* leaves by constructing ion intensity maps from matrix-assisted laser desorption/ionization-time of flight (MALDI-TOF) mass spectra, where major GLSs were found to be more abundant in tissues of the midvein and the periphery of the leaf than the inner lamina (Shroff et al., 2008). Although this study concluded that GLSs are not abundant on *A. thaliana* leaf surfaces, the authors could not obtain information on the cell type distribution of GLSs in leaves. Moreover, all the genes in the GLS biosynthetic pathways have been identified, and it is partly known where GLSs are stored (Koroleva et al., 2000; Andréasson et al., 2001), but it has remained elusive where GLSs are specifically produced at the subcellular, cellular, and tissue levels (Rask et al., 2000; Nintemann et al., 2017). Nor is the cellular and subcellular compartmentation of different myrosinases and their interacting proteins clear; these include myrosinase-binding proteins (MBPs), myrosinase-associated proteins (MyAPs), and different specifier proteins.

In the following sections, we discuss various aspects of the GM system based on current knowledge, starting from the cellular control of enzymes, cell type, and subcellular organization, to uniqueness of myrosinases and myrosinase interacting proteins covering a range of small molecule and macromolecular interactions of the "mustard oil bomb."

# THE GLUCOSINOLATE-MYROSINASE SYSTEM AND CELLULAR CONTROL OF ENZYME REACTIONS

As found in the order of Brassicales, including important crops (e.g., mustard, oilseed rape, radish, broccoli, and cabbage), GLSs co-exist with myrosinases. When tissue damage occurs, the "mustard oil bomb" is detonated and GLSs are hydrolyzed and converted to different degradation products with a variety of biological activities (Rask et al., 2000; Halkier and Gershenzon, 2006; Yan and Chen, 2007; Bednarek et al., 2009; Clay et al., 2009; Halkier, 2016; Wittstock et al., 2016a). For example, these degradation products play important roles in plant defense against pathogens and herbivores, as well as serve as attractants to specialists (Rask et al., 2000; Barth and Jander, 2006; Clay et al., 2009; Wittstock et al., 2016a). Several of these degradation products are involved in plant nutrition (Holmes, 1980; Armengaud et al., 2004) and growth regulation (Hasegawa et al., 2000; Hull et al., 2000; Mikkelsen et al., 2000). In plant metabolism, it is important that enzymes and substrates are under tight regulation, which is more relevant for toxic compounds, as these chemical defenses are derived from specialized metabolites. There are several ways of regulation: (1) coarse control through biosynthesis; (2) fine control of enzyme activity through protein interaction and allosteric regulation; and (3) substrate and enzyme compartmentalization (Sweetlove and Fernie, 2013). While the regulation is well studied in primary metabolism (e.g., photosynthesis and respiration), it is not clear in many of the specialized metabolic processes such as GLS metabolism. Furthermore, protein-protein interactions are intrinsic to virtually every cellular process and have been extensively studied in animals and yeast (Uetz et al., 2000; Gavin et al., 2002; Ho et al., 2002; Li et al., 2004; Huttlin et al., 2017). In plants, this area has lagged behind in spite of recent progress (Hosseinpour et al., 2012; Zhang et al., 2016; Jiang et al., 2018). 
The vast majority of these studies did not go beyond identifying physical interactions to the point of functional analysis. **Figure 1** shows the potential molecular interactions of the GM system in the context of cell type-specific metabolisms.

# CELL TYPE-SPECIFIC CELLULAR AND SUBCELLULAR ORGANIZATION OF THE "MUSTARD OIL BOMB"

Myrosinase is located in myrosin cells, which are scattered cells in radicles, stems, leaves, petioles, seeds, and seedlings of several species (Husebye et al., 2002). A cell-specific localization was found in radicles and cotyledons of the maturing embryo resembling the pattern of the myrosin cells (Bones et al., 1991). Most GLSs are constitutively present in all *Arabidopsis* tissues (Petersen et al., 2002; Brown et al., 2003). The key steps in the biosynthesis of the different types of GLSs are localized in distinct cells in separate as well as overlapping vascular tissues (Nintemann et al., 2018). The presence of GLS biosynthetic enzymes in parenchyma cells of the vasculature may assign new defense-related functions to these cell types (Nintemann et al., 2018). To date, the cellular and subcellular compartmentation of the "mustard oil bomb" (Lüthy and Matile, 1984) is not completely clear, and the reports are rather contradictory. For instance, in *Arabidopsis* flower stalks, GLSs were found in the elongated sulfur-rich "S-cells" situated between phloem and endodermis (Koroleva et al., 2000; Husebye et al., 2002). However, the myrosinase TGG1 was found to be abundant in guard cells, whereas TGG1 and TGG2 were localized to the phloem-associated cells close to the "S-cells" (Husebye et al., 2002; Thangstad et al., 2004; Barth and Jander, 2006). Thus, it appears that myrosinases and their substrates were physically separated in the plant tissues. However, such an arrangement may not be the case as a recent proteomics study located the myrosinases in "S-cells" (Koroleva and Cramer, 2011). In *Brassica juncea* seedlings, myrosinase was found to co-localize with GLSs in aleurone-type cells (Kelly et al., 1998). In *Arabidopsis* suspension cells, both myrosinases and GLSs were present (Alvarez et al., 2008). Such diverse co-localization results may indicate that myrosinases and GLSs are spatially separated at the subcellular levels. Alternatively, they could be in the same compartment with tight control of myrosinase activities. GLSs were found in vacuoles rich in ascorbic acid (Grob and Matile, 1979), which inhibits myrosinase at high concentrations and activates it at low concentrations. This dual regulation supports the potential co-localization of GLSs and myrosinases in the same subcellular compartment.

and ESP were found in the S-cells, and the presence of ESM and NSP is indicative of other cell types.

Recent metabolomics data have confirmed the presence of GLSs in guard cells (Geng et al., 2016; Zhu and Assmann, 2017) and revealed the changes in GLS metabolism in guard cells upon treatment with CO2 (Geng et al., 2016) and ABA (Zhu and Assmann, 2017). The first indication of the role of GLS metabolism in stomatal movement was obtained through analysis of the effect of ABA on stomatal movement of the *Arabidopsis* myrosinase mutant *tgg1* (Zhao et al., 2008). Subsequently, additional reverse genetics studies corroborated the role of GLS metabolism in stomatal movement (Islam et al., 2009; Zhu et al., 2014). Furthermore, stomatal closure was induced by pharmacological treatments with different GLS hydrolysis products (Khokon et al., 2011; Sobahan et al., 2015). However, these products and the amounts used are of synthetic origin and abundance. It is not known what degradation products are produced and how much *in vivo*, which GLSs and myrosinases [TGGs and/or Penetration 2 (PEN2)] are involved, and how protein interactions regulate the GLS breakdown in guard cells.

The *Arabidopsis cyp79b2*/*cyp79b3* mutants are known to produce mostly aliphatic GLSs (Zhao et al., 2002; Chen et al., 2003; Grubb and Abel, 2006; Khokon et al., 2011; Sobahan et al., 2015), while the *myb28/myb29* mutants are known to produce mostly indolic GLSs (Hirai et al., 2007; Beekwilder et al., 2008). Furthermore, the *tgg1/tgg2* double mutant showed undetectable myrosinase activity, and damage-induced breakdown of endogenous GLSs was not detected for aliphatic GLSs and was greatly slowed for indole GLSs (Barth and Jander, 2006). Moreover, the *tgg1/tgg2* mutant lacking the foliar myrosinases was compromised in activation of its GLS defense. Another mutant, *atvam3*, showed abnormal distribution of myrosin cells and overproduction of TGG1 and TGG2 (Ueda et al., 2006). Thus, beyond TGGs, MYB28, MYB29, AtVAM, CYP79s, and other biosynthetic genes all affect GLS deposition levels and possibly cell type specificity of the GM system. To understand the regulation and correlation of these proteins, we used GeneMANIA software (Warde-Farley et al., 2010) and predicted the association of the known genes involved in GLS metabolism (from our selected gene list in **Supplementary Table S1**). This software further added putative proteins with similar functions and potential involvement in the GM system (**Figure 2**).

In *B. napus* leaves, myrosinases are localized in mesophyll cells and phloem cells (Chen and Andreasson, 2001) and were mainly stored in protein-rich vacuolar structures of myrosin cells (Rask et al., 2000; Ueda et al., 2006). There is also a report of the presence of myrosinase as cytosolic enzymes bound to intracellular membranes (Lüthy and Matile, 1984). The knowledge of the localization of myrosinases and interacting proteins was advanced by vacuolar proteomics. Myrosinases, TGG1 and TGG2, and myrosinase-associated protein (MyAP) 1 were identified in the vacuoles. In the early leaf developmental stages, TGG1 is more abundant than TGG2, whereas in fully expanded leaves, both TGG1 and TGG2 levels show increased accumulation. Concurrently, MyAP1 levels are increasingly abundant. We have previously observed such regulation of myrosinase expression, which correlated with GLS turnover (Petersen et al., 2002). The co-localization of myrosinase and MyAP1 and the concurrent expression during development lead to the hypothesis that the vacuolar myrosinases may be active and MyAPs may interact with myrosinase to play a role in GLS hydrolysis. For example, MyAPs may facilitate ITC production (Zhang et al., 2006). Indeed, immunogold analysis of leaf sections showed the presence of TGG1 and TGG2 in the same vacuoles (Ueda et al., 2006). An independent vacuolar proteomics study also identified these proteins (Carter et al., 2004). In addition, two more MyAPs (At1g54000 and At1g54010) and three myrosinase-binding proteins (MBPs) (At1g52040, At3g16470, and At2g39330) were localized in the vacuoles (Carter et al., 2004). TGG1 and TGG2 were also found in the endoplasmic reticulum (ER), ER bodies, and transvacuolar strands, and this localization is dependent on MyAP1 (MVP1). Mutation of the MyAP1 clearly altered the subcellular localization profiles of the green fluorescent protein (GFP)-tagged TGG1 and TGG2 (Agee et al., 2010). Interestingly, the myrosinase PEN2 (hydrolyzing indole GLSs and shown to function in plant defense (Bednarek et al., 2009; Clay et al., 2009; Millet et al., 2010; Fan et al., 2011; Johansson et al., 2014; Frerigmann et al., 2016; Luti et al., 2016; Xu et al., 2016; Vilakazi et al., 2017)) is targeted to peroxisomes and the outer mitochondrial membrane (Fuchs et al., 2016). In addition to MyAPs and MBPs, specifier proteins including epithiospecifier modifier (ESM, MyAP-like), epithiospecifier protein (ESP), nitrile specifier protein (NSP), and thiocyanate forming protein (TFP) may affect the outcome of GLS degradation (Lambrix et al., 2001; Burow et al., 2006; Zhang et al., 2006; Wittstock et al., 2016a,b; Backenköhler et al., 2018). ESP was found to be in "S-cells" and in guard cells with NSP1 and NSP5 (Burow et al., 2007; Zhao et al., 2008). The functions of these MyAPs, MBPs, and specifier proteins in "S-cells" and guard cells and their interactions with myrosinases in different cell types are not known.

co-localization (blue), and genetic interactions (green). As to functions associated with each protein, the color code inside the nodes indicates GLS metabolic process (red), sulfur compound metabolic process (blue), defense response to bacterium (yellow), defense response to insect (purple), response to oxidative stress (green), defense response to fungus (pink), and stomatal movement (light blue).

To understand the subcellular organization of the GM system, we compared the proteins and pathways involved in GLS biosynthesis, degradation, and transport using available and/or predicted subcellular localization information. **Figure 3** and **Supplementary Table S1** provide an overview of the GM system at subcellular level based on available literature and analysis using different protein localization tools: (1) Plant-mPLoc<sup>1</sup> (Chou and Shen, 2007, 2008, 2010); (2) TAIR<sup>2</sup> (with annotation based on literature); (3) Eplant<sup>3</sup> using SUBA (Subcellular Localisation Database for *Arabidopsis*) with annotation based on subcellular proteomics and/or protein fluorescence microscopy; (4) TargetP<sup>4</sup> based on the N-terminal targeting sequences [chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP), or secretory pathway signal peptide (SP)] (Emanuelsson et al., 2000), with a reliability score of 1–5 (1 being most reliable and 5 least reliable); (5) LocTree<sup>5</sup> using support vector machines for localization prediction (in the form of expected accuracy); and (6) ngLOC<sup>6</sup> using a Bayesian method for prediction of localization. As shown in **Figure 3**, most GM system proteins were found to be in the cytoplasm followed by nucleus, where the transcriptional regulators were localized. All the cytochrome P450s involved in GLS biosynthesis and modification were localized to endoplasmic reticulum, and other GLS biosynthesis-related proteins were in the chloroplast and cytoplasm. Glucosinolate transporters (GTR1 and GTR2) and nitrate transporters (NRT1.6, NRT1.7, and NRT1.9) were found to be in plasma membrane. It is not known how glucosinolates are transported into vacuoles. PEN2 and BZO1 were localized in peroxisomes. No GM system proteins were found in the Golgi apparatus. 
Out of the 114 GM system proteins used in this study (**Supplementary Table S1**), 65 proteins had experimental evidence of localization, 26 were predicted using the software tools (at least three tools with consistent result), and 23 proteins could not be conclusively localized.
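The three-way split described above (experimental evidence, consistent prediction by at least three tools, or unresolved) can be sketched as a simple voting rule. The following is a minimal illustration only, not the procedure actually used to build **Supplementary Table S1**: the function names, the tie-handling rule, and the example tool outputs are all hypothetical.

```python
from collections import Counter
from typing import Optional


def consensus_call(predictions: dict[str, str], min_agree: int = 3) -> Optional[str]:
    """Return the compartment predicted by at least `min_agree` tools.

    Returns None when no compartment reaches the threshold or when two
    compartments tie for the most votes (an assumed, conservative rule).
    """
    ranked = Counter(predictions.values()).most_common(2)
    if not ranked or ranked[0][1] < min_agree:
        return None
    if len(ranked) == 2 and ranked[1][1] == ranked[0][1]:
        return None
    return ranked[0][0]


def localize(experimental: Optional[str],
             predictions: dict[str, str]) -> tuple[Optional[str], str]:
    """Prefer experimental evidence, then tool consensus, else unresolved."""
    if experimental:
        return experimental, "experimental"
    call = consensus_call(predictions)
    if call is not None:
        return call, "predicted"
    return None, "unresolved"


# Hypothetical tool outputs for one protein (not taken from Supplementary Table S1).
example = {
    "Plant-mPLoc": "chloroplast",
    "TargetP": "chloroplast",
    "LocTree": "chloroplast",
    "ngLOC": "cytoplasm",
}
print(localize(None, example))
```

Treating ties as unresolved is one possible design choice; it keeps the "predicted" category conservative at the cost of leaving more proteins without an assigned compartment.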

# DISTINCT MOLECULAR AND BIOCHEMICAL PROPERTIES OF MYROSINASES

Myrosinases are classified into two types, typical (classical) and atypical myrosinases. The crystal structure of a classical myrosinase shows that the protein folds into an (*β*/*α*)8 barrel structure (Burmeister et al., 1997). In the active site, a Glu (E) residue is involved in nucleophilic attack to initiate the release of an aglycone (thiohydroximate-O-sulfonate) and form a glucosyl-enzyme intermediate. A second residue, Gln (Q), enables the hydrolysis of this intermediate with assistance from water and ascorbate. Classical myrosinases (with QE catalytic residues) use ascorbate as a cofactor and proton donor to facilitate the release of bound glucose (Burmeister et al., 1997; Wittstock and Burow, 2010; Bhat and Vyas, 2019). In contrast, atypical myrosinases have two catalytic Glu residues (EE), which function as acid/base catalysts in the active site. They do not require ascorbate. In addition, atypical myrosinases have two basic amino acid residues at different positions (+6 and +7) for glucosinolate binding compared to +0 position arginine residue of classical myrosinases (Wittstock and Burow, 2010; Nakano et al., 2017; Shirakawa and Hara-Nishimura, 2018; Bhat and Vyas, 2019). Classical myrosinases are glycosylated, activated by low concentrations of ascorbate, and accept GLSs as their only substrates (Chen and Halkier, 1999; Chen and Andreasson, 2001). In contrast, atypical myrosinases, such as PEN2 and PYK10, can hydrolyze indole GLSs and also use O-glucosides as substrates (Bednarek et al., 2009; Nakano et al., 2017). Although myrosinase does not use acylated GLSs and desulpho-GLSs as substrates, it may accept a wide range of GLS substrates (Chen and Halkier, 1999; Rask et al., 2000; Barth and Jander, 2006). Myrosinases from *B. napus* and *Crambe abyssinica* degrade different GLSs at different rates (James and Rossiter, 1991; Finiguerra et al., 2001). However, the mechanism underlying this substrate specificity is not established. Myrosinases in *B. napus* are encoded by >29 genes in three subfamilies, denoted as MA, MB, and MC. The MA myrosinases occur as dimers, while MB and MC myrosinases exist in complexes with MBPs and/or MyAPs (Lenman et al., 1990; Rask et al., 2000). By heterologous expression in yeast, we have previously produced a functional free form myrosinase Myr1 from the MB subfamily (Chen and Halkier, 1999). The activity of this Myr1 suggests that MBPs and MyAPs are not absolutely necessary for myrosinase activity, but raises questions on the functions of MBPs and MyAPs and their interactions with myrosinases.

Bioinformatic analysis of the *Arabidopsis* genome revealed the presence of six myrosinase genes *TGG1*-*TGG6* (Xu et al., 2004). *TGG1* and *TGG2* are expressed in leaves (Xue et al., 1995; Husebye et al., 2002; Thangstad et al., 2004; Barth and Jander, 2006; Ueda et al., 2006) and flowers (Ruan et al., 1998; Barth and Jander, 2006), while *TGG4* and *TGG5* are specifically expressed in roots (Zimmermann et al., 2004). *TGG3* and *TGG6* are pseudogenes (Husebye et al., 2002; Zhang et al., 2002). Although TGG1 and TGG2 appear to display a low degree of substrate specificity, the activities of TGG1 and TGG2 have been correlated with the feeding preference and growth of different generalist and specialist insects (Barth and Jander, 2006). Interestingly, overexpression of TGG1 and TGG2 leads to accumulation of several GLS degradation products, including 5-methylhexanenitrile, heptanenitrile, 1-isothiocyanato-3-methylbutane, 1-isothiocyanato-4-methylpentane, and 1-isothiocyanato-3-methylhexane. Based on the degradation product profile, possible endogenous substrates for the two TGGs include 4-methylthiobutylglucosinolate, 4-methylpentylglucosinolate and 3-methylbutylglucosinolates (Ueda et al., 2006). Investigating endogenous substrates of different classical and atypical myrosinases is an important future direction.

<sup>1</sup> http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/

<sup>2</sup> https://www.arabidopsis.org/tools/bulk/protein/index.jsp

<sup>3</sup> https://bar.utoronto.ca/eplant/

<sup>4</sup> http://www.cbs.dtu.dk/services/TargetP/

<sup>5</sup> https://rostlab.org/services/loctree3/

<sup>6</sup> http://genome.unmc.edu/ngLOC/index.html

In leaves, TGG1 was found to be abundant in guard cells, while TGG2 appeared to be present only in phloem-associated cells (Barth and Jander, 2006; Zhao et al., 2008). Considering the presence of GLSs in guard cells (Geng et al., 2016; Zhu and Assmann, 2017), how the GM system plays a role in guard cell functions (e.g., stomatal immunity) is an interesting question. Clearly, mutation of the TGG1 and/or TGG2 genes affected the guard cell size, stomatal aperture, and leaf metabolites, such as fatty acids, glucosinolates, and indole compounds (Ahuja et al., 2016). Another proteomic study of trichome and epidermal pavement cells did not identify the TGG1 protein in the samples (Wienkoop et al., 2004). However, a single cell type study in trichomes found the presence of genes encoding transcription factors of aliphatic GLS (MYB28, MYB29 and MYB76) and indole GLS (MYB34, MYB51 and MYB122), indicating that trichomes have biosynthetic genes for the GM system (Frerigmann et al., 2012), but nothing was suggested about myrosinase activity or expression. Given the defense roles of guard cells and trichomes, characterization of the GM systems in these special cell types is of great importance to understand the molecular mechanisms underlying the cell type-specific functions, e.g., defense against pathogen invasion.

# COMPLEX FORMATION BETWEEN MYROSINASE AND ITS INTERACTING PROTEINS

As described earlier, myrosinase interacting proteins include MBP, MyAP, and specifier proteins (ESM, ESP, NSP, and TFP). The first six MBPs identified in *B. napus* range in size from 30 to 110 kDa (Taipalensuu et al., 1996; Chisholm et al., 2000). All MBPs contain jacalin-like repeats (Chisholm et al., 2000; Andréasson et al., 2001). Jacalin-related proteins share the domain structure of plant lectins and are upregulated by phytohormones (e.g., jasmonic acid, salicylic acid, and ethylene) and pathogens (Taipalensuu et al., 1996; Geshi and Brandt, 1998; Xiang et al., 2011; Vilakazi et al., 2017). Recently, identification of bacterial lipopolysaccharide interacting proteins in *Arabidopsis* revealed myrosinases, TGG1 and TGG2, and a MBP (Vilakazi et al., 2017). It remains unclear how the MBP levels are regulated and whether MBPs directly interact and affect myrosinase activity and specificity. In *B. napus* seeds, MBPs are present in most cells but not in the myrosin cells (Rask et al., 2000; Ueda et al., 2006). During germination, MBPs are co-localized with myrosinases in cotyledons, suggesting that preformed myrosinase complexes do exist (Geshi and Brandt, 1998; Eriksson et al., 2002). Using basic local alignment search tool (BLAST) to interrogate the *Arabidopsis* genome reveals >30 putative MBPs. MBP1 and MBP2 are like lectin jacalins and plant aggregating factors. MBP1 and MBP2 are abundantly expressed in immature flowers, and the pattern is similar to that of myrosinase TGG1 (Capella et al., 2001). MBP expression and myrosinase activity are affected in the *coi1* mutant, which is insensitive to jasmonate (Capella et al., 2001). Depletion of MBPs does not alter the cellular distribution of myrosinases but prevents myrosinases from forming complexes (Eriksson et al., 2002). Thus, the functions of MBPs are not fully understood. Interestingly, most NSPs possess jacalin-like domains and are MBP-like (Kuchernig et al., 2012). 
The jacalinlike domain may interact with the glycans of myrosinases to potentially affect GLS degradation. However, experimental evidence is lacking. NSPs were shown to enhance simple nitrile formation (He et al., 2009; Kissen and Bones, 2009; Chen et al., 2015; Wittstock et al., 2016b). Recently, iron was shown to be a centrally bound cofactor of ESP, TFP, and NSP involved in glucosinolate breakdown. In addition, NSP active site has fewer restrictions to the aglycone conformation than ESP and TFP. This may explain why NSP facilitates simple nitrile production, but not production of epithionitrile and thiocyanate that may need exact positioning of the aglycone thiolate relative to the side chain (Backenköhler et al., 2018). In addition to MBPs, MyAPs form complexes with myrosinases in *B. napus* (Taipalensuu et al., 1996). In *Arabidopsis*, TGG2 was pulled down with MyAP1 in leaf extracts (Agee et al., 2010). MyAPs display high similarity to GDSL lipases, which have a motif of Gly, Asp, Ser, and Leu residues in the active site. The *Arabidopsis* genome contains >80 genes encoding GDSL lipases, typically with a GDSL-like motif, a catalytic triad of Ser, Asp and His residues, and a lipase signature sequence GxSxxxxG (Brick et al., 1995). The possible lipase activity of MyAP suggests a potential role of MyAP in releasing acyl groups from acylated GLSs, thereby making them available for myrosinase hydrolysis. *Arabidopsis* contains acylated GLSs in seeds, but *B. napus* does not contain acylated GLSs; thus, MyAP in *B. napus* may have other functions. A recent study shows that overexpression of *B. napus MyAP1* led to enhanced plant defense against a fungal pathogen *Sclerotinia sclerotiorum* (Wu et al., 2017). A MyAP-like ESM was found to favor ITC production and protect *Arabidopsis* from herbivory (Zhang et al., 2006). However, whether this system involves myrosinase complex formation is still not known. 
In some plants, ESPs are involved in GLS hydrolysis (Foo et al., 2000; Burow et al., 2006). Hydrolysis of alkenyl GLSs in the presence of ESP leads to the formation of nitriles or epithionitriles, instead of isothiocyanates (Zabala Mde et al., 2005; Burow et al., 2006). Because ESPs can alter the course of hydrolysis, they are important in determining plant herbivore choice and host resistance (Lambrix et al., 2001). Furthermore, this suggests that ESP is situated close to the active site so that it could promptly convert the unstable aglycone to nitriles. Although kinetic studies have shown that ESP acts as a non-competitive inhibitor of myrosinase (MacLeod and Rossiter, 1985), no stable interaction between ESP and myrosinase has been reported (Burow et al., 2006). Like nitrile formation, the production of thiocyanate was found to be associated with a specifier protein, in this case TFP. For a detailed description of myrosinase specifier proteins, please refer to a recent review (Wittstock et al., 2016a). In summary, several other groups of proteins may interact with myrosinases and affect how GLSs are degraded, leading to the formation of different metabolic products. Systematic studies to characterize the interaction of these proteins with myrosinases are needed to elucidate their specific functions.
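
The lipase signature sequence GxSxxxxG mentioned above (Brick et al., 1995) can be written as a simple pattern and used to screen candidate protein sequences. The following is a minimal sketch and not the procedure used in the cited studies; the function name `find_lipase_signature` and the example peptide are hypothetical and serve only as illustration.

```python
import re

# GxSxxxxG: Gly, any residue, Ser, any four residues, Gly.
# A lookahead is used so that overlapping matches are also reported.
LIPASE_SIGNATURE = re.compile(r"(?=(G.S.{4}G))")


def find_lipase_signature(protein_seq):
    """Return (1-based position, matched segment) for each GxSxxxxG hit."""
    seq = protein_seq.upper()
    return [(m.start() + 1, m.group(1)) for m in LIPASE_SIGNATURE.finditer(seq)]


# Hypothetical peptide, not a real MyAP or GDSL lipase sequence.
example = "MKTLVFGDSLVDAGNNGASAAAAGKL"
hits = find_lipase_signature(example)
print(hits)
```

A match to this pattern alone would not establish lipase activity; it would only flag sequences worth inspecting for the GDSL-like motif and the Ser, Asp, and His catalytic triad described in the text.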

# DIRECTIONS FOR FUTURE RESEARCH AND CONCLUSIONS

Currently, the cellular and subcellular locations of myrosinases, GLSs, and their interacting proteins, i.e., the GM system, are far from established. Given >100 cell types in plants and >5,500 species of GLS producers, it would be a challenge to capture all the species-specific and cell type-specific information of the "mustard oil bomb." In addition, because the accumulation and expression of the metabolites and enzymes involved vary over time, as is typical of specialized metabolism, the eventual picture could be very complex. Using cell type-specific genetic manipulations (e.g., GFP fusion and CRISPR), one can envision capturing the cell type-specific expression patterns and functional roles of the glucosinolate biosynthetic proteins, myrosinases, and myrosinase interacting proteins. Large gaps remain in the knowledge base of the GM system, e.g., regarding the interactions, co-localization, regulation, and functions of myrosinase interacting proteins in specific cell types. Furthermore, the developmental stage-specific appearance and regulation of the proteins and metabolites are not clear. Moreover, the reported interactions of myrosinases with other proteins could be highly cell type-specific or subcellularly localized, which has not been well studied to date. Without resolving the cell type specificity of the proteins and metabolites, it would be very challenging to draw mechanistic conclusions on the specific roles of the enzymes, interactors, transporters, and metabolites from tissue- and whole plant-based data, where the information is averaged out (Dai and Chen, 2012; Misra et al., 2014).

In the future, efforts need to focus on large-scale, rapid preparation of organelles and subcellular fractions (e.g., vacuoles, peroxisomes, and chloroplasts) in a time-dependent manner to capture the dynamics of protein interactions and GLS metabolism. It is obviously challenging to prepare and enrich plant cell types (e.g., the "S-cells") in copious amounts for system-wide experiments such as transcriptomics, proteomics, and metabolomics and to obtain preparations at a given time and for a specific treatment. With the recent development of single-cell omics tools (Misra et al., 2014; Efroni and Birnbaum, 2016; Doerr, 2019), such large-scale molecular characterization of different single cells is within sight and will greatly enhance the understanding of the chemodiversity of the GM system at single-cell resolution.


# AUTHOR CONTRIBUTIONS

SChh compiled the list of protein localizations, made **Figures 2, 3**, generated the reference list, and edited the manuscript. BM wrote one-third of the manuscript draft, drafted **Figure 1**, and edited the manuscript. NT assisted with the **Supplementary Table S1** analysis and the reference list, and edited the manuscript. SChe designed the manuscript, wrote two-thirds of the text, provided guidance to students, and finalized the manuscript for submission.

# FUNDING

The authors would like to acknowledge funding from the National Science Foundation MCB 1758820 to SChe. An NIH-supported SF2UF Bridge Program (to Dr. David Julian) and an NSF REU grant 1560049 (to Dr. Ramesh Katam) have provided support to NT.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.00618/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Chhajed, Misra, Tello and Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Evolutionary Developments in Plant Specialized Metabolism, Exemplified by Two Transferase Families

*Hiroaki Kusano1 , Hao Li1 , Hiroshi Minami2 , Yoshihiro Kato2 , Homare Tabata2 and Kazufumi Yazaki1 \**

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Toru Nakayama, Tohoku University, Japan Suvi Tuulikki Häkkinen, VTT Technical Research Centre of Finland Ltd, Finland*

> *\*Correspondence: Kazufumi Yazaki yazaki@rish.kyoto-u.ac.jp*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 01 March 2019 Accepted: 31 May 2019 Published: 25 June 2019*

#### *Citation:*

*Kusano H, Li H, Minami H, Kato Y, Tabata H and Yazaki K (2019) Evolutionary Developments in Plant Specialized Metabolism, Exemplified by Two Transferase Families. Front. Plant Sci. 10:794. doi: 10.3389/fpls.2019.00794*

*1 Laboratory of Plant Gene Expression, Research Institute for Sustainable Humanosphere, Kyoto University, Kyoto, Japan, 2 Life Science Center, Hokkaido Mitsui Chemicals, Sunagawa, Japan*

Plant specialized metabolism emerged from the land colonization by ancient plants, becoming diversified along with plant evolution. To date, more than 1 million metabolites have been predicted to exist in the plant kingdom, and their metabolic processes have been revealed on the molecular level. Previous studies have reported that rates of evolution are greater for genes involved in plant specialized metabolism than in primary metabolism. This perspective introduces topics on the enigmatic molecular evolution of some plant specialized metabolic processes. Two transferase families, BAHD acyltransferases and aromatic prenyltransferases, which are involved in the biosynthesis of paclitaxel and meroterpenes, respectively, have shown apparent expansion. The latter family has been shown to be involved in the biosynthesis of a variety of aromatic substances, including prenylated coumarins in citrus plants and shikonin in *Lithospermum erythrorhizon*. These genes have evolved in the development of each special subfamily within the plant lineage. The broadness of substrate specificity and the exon-intron structure of their genes may provide hints to explain the evolutionary process underlying chemodiversity in plants.

Keywords: prenyltransferase, acyltransferase, BAHD, *Citrus*, gene family, molecular evolution, specialized metabolism, *Lithospermum*, *Taxus*

# INTRODUCTION

Since land plant colonization 500 million years ago, plant specialized metabolic processes have expanded considerably, resulting in the development of diverse traits within the plant kingdom (Weng et al., 2012). The chemical diversity of those natural products provides various metabolites beneficial for human life, including compounds associated with flavor, color, taste, and medicine. A comparative genome analysis strongly suggested that gene duplications played a major role in the evolution of divergent metabolic pathways (Fani and Fondi, 2009). The increase in the number of gene copies may have allowed promiscuous diversity of the encoded enzymes, resulting in the synthesis of new metabolites and providing organismal fitness that enhances the establishment of biosynthetic pathways in the plant lineage. The expansion of plant specialized metabolism has been observed in the genome of *Selaginella moellendorffii*, a plant that diverged shortly after the establishment of vascular tissues in plant evolution (Banks et al., 2011). A representative example of these expanded gene families is cytochrome P450-dependent monooxygenases, which constitute 1% of the predicted proteome in *Selaginella*. The genome of the liverwort *Marchantia polymorpha* also encodes many terpenoid biosynthetic enzymes sharing a common isoprenoid pathway for the synthesis of plant hormones like gibberellin (Bowman et al., 2017). In *Physcomitrella patens,* a diterpene *ent*-kaurene is converted to gibberellin-type diterpenes, which act as regulators of protonema differentiation (Hayashi et al., 2010).

Species of the gymnosperm *Taxus* synthesize unique diterpene compounds called "taxoids," which include an important anticancer drug, paclitaxel, a derivative of the diterpene taxadiene (Guerra-Bubb et al., 2012). Over 350 taxoid compounds had been identified by 1999, with these compounds having variable side residues at the C1, C2, C4, C5, C7, C9, C10, C13, and C14 positions of the core taxadiene skeleton (Baloglu and Kingston, 1999). Except for a partial biosynthetic route (Croteau et al., 2006), knowledge about the biosynthetic pathway of taxoids that contribute to the chemodiversity in *Taxus* is limited.

Because of their well-developed genome data resources, angiosperm species provide good model systems to study molecular mechanisms underlying the chemodiversity of plant metabolites (Kroymann, 2011). For example, meroterpenes, including furanocoumarin derivatives (Bourgaud et al., 2006) and shikonin derivatives, which are lipophilic red naphthoquinones (Yazaki, 2017), are specialized metabolites synthesized through routes that branch off from the general phenylpropanoid and isoprenoid biosynthetic pathways (Yazaki et al., 2017). The term "primary metabolism" indicates processes required to sustain life, such as energy acquisition from glucose. These processes include, for example, the biosynthesis of ubiquinone, a component of the respiratory chain in mitochondria. The biosynthesis of shikonin derivatives involves steps common to those involved in ubiquinone biosynthesis. To avoid confusion in distinguishing between primary and specialized (secondary) metabolism, this article uses the term "common metabolism" rather than "primary metabolism" to indicate biosynthetic pathways conserved in a broad variety of organisms.

This perspective focuses on two enzyme families as examples of molecular evolutionary events: the aromatic substrate prenyltransferase family, which plays a key role in the diversification of phenolics, and the BAHD (BEAT-AHCT-HCBT-DAT; initials of representative members) acyltransferase family, which is responsible for the derivatization of a core metabolite.

# EVOLUTION OF THE *CITRUS* PRENYLTRANSFERASE GENE FAMILY

Within the prenyltransferase superfamily, which includes prenyl chain elongation enzymes, aromatic prenyltransferases represent a family responsible for the prenylation of aromatic substances. An aromatic prenyltransferase of *Citrus limon*, ClPT1, is responsible for the biosynthesis of 8-geranylumbelliferone, a coumarin derivative that is a plant specialized metabolite (Munakata et al., 2014). The chemical diversity of coumarin derivatives is greatly increased by the involvement of aromatic prenyltransferases, which have been identified in many plant lineages during the last decade (Karamat et al., 2014; Munakata et al., 2014). Phylogenetic analysis has suggested that the diverse prenyltransferases developed independently in each plant family rather than developing from a common ancestor within the prenyltransferase gene family (Munakata et al., 2016). The plant prenyltransferase gene family contains conserved subfamilies responsible for the ubiquinone, plastoquinone, and vitamin E biosynthesis pathways (Li, 2016).

An outline of the evolutionary development of plant aromatic prenyltransferases in *Citrus* species was revealed by a phylogenetic analysis of previously characterized prenyltransferases and prenyltransferases of the model species *P. patens, S. moellendorffii, Arabidopsis thaliana, Glycine max,* and *Lithospermum erythrorhizon* (see below), in addition to *Citrus sinensis* (**Figure 1A**). Phylogenetically, these intrinsic membrane proteins can be grouped into three major subfamilies, i.e., those involved in the biosynthesis of vitamin E, plastoquinone, and ubiquinone (shown as yellow and gray backgrounds and as the black triangle, respectively in **Figure 1A**, with the black triangle expanded in **Figure 1B**). The biochemical functions of AtVTE2-1 (Savidge et al., 2002), AtVTE2–2 (Venkatesh et al., 2006), and OsPPT1 (Ohara et al., 2006) have been described. As expected from their fundamental roles, all model plant species had one or more proteins in each subfamily. In contrast, a search of the *C. sinensis* database revealed nine prenyltransferase-like proteins, forming a *Citrus*-specific subfamily within the vitamin E clade (shown in red in **Figure 1A**). A similar result was obtained by searching *Citrus clementina* genome sequences. These results suggest that *Citrus* species have developed a unique, expanded gene subfamily for specialized metabolism, with ClPT1 being biochemically characterized. This analysis also identified a similar unique subfamily expansion in *G. max* (shown in blue in **Figure 1A**). The first flavonoid-specific prenyltransferase SfN8DT1 from a legume species *Sophora flavescens* (Sasaki et al., 2008) is in this group, suggesting that flavonoid prenyltransferases in soybeans were derived from a vitamin E biosynthetic enzyme. Other, later detected flavonoid prenyltransferases were all classified in this subgroup (Akashi et al., 2009; Yoneyama et al., 2016). 
Most of these enzymes involved in specialized metabolism show strict substrate specificity in relation to a particular prenyl diphosphate.

Prenyltransferases involved in common metabolism show broad specificity in relation to substrates of different side chain lengths; i.e., they accept various prenyl diphosphates of different chain lengths (Sadre et al., 2010). For example, the ubiquinone biosynthesis pathway in rice can be modified by introducing a decaprenyl diphosphate synthase, resulting in the production of non-native UQ10 rather than native UQ9 (Ohara et al., 2006; Takahashi et al., 2006). These expanded gene families and the broad substrate specificity of prenyltransferases may provide the opportunity for neo-functionalization of new enzymes in plant evolutionary history.

a gray background), and a clade represented by the rice polyprenyltransferase OsPPT1 for ubiquinone biosynthesis (indicated by "Ubiquinone" and a compressed black triangle). Biochemically characterized proteins are indicated by a white background. The *Citrus* and legume proteins are shown in red and blue letters, respectively, and the lineage-specific clades are indicated by brackets with the same colors. (B) Details of the phylogenetic tree of polyprenyltransferases for ubiquinone in panel (A). *L. erythrorhizon* proteins are shown in magenta letters. The brackets indicate subclades of polyprenyltransferases involved in ubiquinone biosynthesis (PPT subfamily), LePGT-like proteins (LePGT subfamily), and unclassified subclade proteins (unclassified). The proteins from other organisms are shown in black letters. The asterisk indicates the putative PPT-like protein of *L. erythrorhizon*. The phylogenetic tree was drawn using the MEGA7 neighbor-joining method with 1,000 bootstrap replicates for alignment of polyprenyltransferase-related proteins, which were calculated with the MUSCLE algorithm. The accession numbers are shown next to the name of the organism. Biochemically characterized proteins are indicated by a yellow background. The scale bar represents 0.1 amino acid substitutions per site. (C) LePGT gene is encoded by a single exon gene in the *L. erythrorhizon* genome, whereas LePPT-like proteins are encoded by genes with inserted introns, at positions similar to those of the authorized OsPPT gene and the closest tobacco homolog, NtPPT-like gene (gene = LOC107804153). The first intron insertion site into the coding region is shown. Scale bar, 1 kb of DNA sequence. Blue boxes represent coding exons.

# EVOLUTION OF THE *P*-HYDROXYBENZOIC ACID GERANYLTRANSFERASE GENE FOR SHIKONIN BIOSYNTHESIS

A boraginaceous medicinal plant, *L. erythrorhizon,* possesses a unique subfamily of *p*-hydroxybenzoic acid geranyltransferases (PGTs) (**Figure 1B**) that are specifically involved in shikonin biosynthesis (Yazaki et al., 2002). An overview of the evolutionary history of PGT was obtained by assessing genome sequences and transcriptomes of *L. erythrorhizon* from the GenBank datasets SRP108575 and SRP141330, respectively, as well as by reassembling our original data (Takanashi et al., 2019). The hypothetical PGT-like proteins were found to be closely related to the ubiquinone prenyltransferase subfamily involved in common metabolism (magenta in **Figure 1B**), which was closer to these hypothetical PGT-like proteins than the specialized citrus prenyltransferases were (**Figure 1A**). Most PGT-like proteins are encoded by genes with a single exon, whereas general ubiquinone biosynthetic polyprenyltransferases (PPTs) are encoded by genes containing multiple exons (**Figure 1C**). It is of interest to determine how the single exon structure was generated during the evolution of plant specialized metabolism.

# MISSING UBIQUINONE PRENYLTRANSFERASE IN *L. ERYTHRORHIZON*

Although ubiquinone is a common metabolite in all eukaryotes, and the genes encoding PPTs are essential for the survival of a broad range of organisms, no orthologous ubiquinone PPT gene was found in the *L. erythrorhizon* transcriptome. Experiments in yeast showed that LePGT cannot synthesize ubiquinone (Yazaki et al., 2002), and LePGT heterologously expressed in *E. coli* was found to inhibit ubiquinone biosynthesis (Wu et al., 2015). Genomic sequencing identified a contig fragment that could code for a PPT-like protein (asterisk in **Figure 1B**), the amino acid sequence of which was moderately similar to that of OsPPT1, which is responsible for ubiquinone biosynthesis in rice. In addition, there were three contigs that we could not classify, which are labeled "unclassified genes" ("unclassified" in **Figure 1B**). In contrast to the particular PGT that catalyzes shikonin biosynthesis, an intron insertion was found in the hypothetical gene, at the same position as in the PPTs of *Nicotiana tabacum* and *Oryza sativa* (**Figure 1C**). This conserved exon-intron organization was also observed in the PPT genes from *Arabidopsis* and rice (Ohara et al., 2006). This gene product is a strong candidate for a ubiquinone prenyltransferase in *L. erythrorhizon*, and its biochemical characterization is expected in the future.

# EVOLUTION OF THE *TAXUS* ACYLTRANSFERASE GENE FAMILY

Acyltransferases also contribute substantially to the diversification of specialized metabolites, with the BAHD and SCPL (serine carboxypeptidase-like) families as representatives. Taxoids such as paclitaxel, present in *Taxus* species, are highly acylated specialized metabolites. Five known taxoid acyltransferases are closely related to each other, with all grouped in clade V of the BAHD acyltransferase family (D'Auria, 2006). These *Taxus* proteins differ in substrate specificities for both acyl donors and acceptors; i.e., they can utilize acetyl-CoA, benzoyl-CoA or phenylalanoyl-CoA for *O*- and *N*-acylation of various taxoid molecules (D'Auria, 2006).

To understand the evolutionary development of the *Taxus* BAHD acyltransferase family, BAHD clade V was analyzed phylogenetically in detail (yellow background in **Figure 2A**). The amino acid sequences of *Taxus* BAHD members were obtained from the transcriptome data of *Taxus* × *media* cultured cells (Yukimune et al., 1996). Phylogenetic analysis showed that the *Taxus* BAHD proteins form a *Taxus*-specific clade (red bracket in **Figure 2A**), containing all five characterized acyltransferases (white background in the *Taxus*-specific clade), as well as other *Taxus* proteins of unknown function (asterisk in **Figure 2A**). Within this clade of the BAHD family, *O. sativa* and *A. thaliana* each form a unique clade, suggesting that lineage-specific subfamily expansion of the BAHD acyltransferases plays a major role in plant evolution (Fani and Fondi, 2009). In addition to this *Taxus*-specific subgroup, other *Taxus* BAHD proteins have been identified, with these classified with other model plant BAHD members (**Supplementary Figure S1**), suggesting that *Taxus* species possess genes encoding general BAHD clade V proteins that are conserved among a broad range of plant species.

It can be hypothesized that neo-functionalization is induced by the acquisition of promiscuous enzymatic activity during plant evolution. We have examined the enzymatic activity of recombinant proteins prepared from seven isolated cDNAs encoding BAHD members of the *Taxus*-specific subfamily (dagger in **Figure 2A**). Each crude recombinant enzyme was prepared using pET22a and OrigamiB as a host-vector system (Novagen), without a periplasmic signal sequence, according to the conventional method. Each enzyme was reacted with acetyl-CoA and 10-deacetyl baccatin III (10-DAB) as substrates, and the reaction products were analyzed using a UPLC–MS/MS system equipped with a BEH C18 column (Waters). The clone encoding 5-hydroxytaxadiene 5-*O*-acetyltransferase (TAT) had 10-DAB:10-*O*-acetyltransferase (DBAT) activity (Walker et al., 2000), as did the canonical enzyme DBAT (**Figures 2B,C;** Walker and Croteau, 2000). The amount of the product formed from the substrate was 1.4 mol% for TAT and 10.4 mol% for DBAT, suggesting that the activity of TAT was 13.2% that of DBAT. This promiscuity of enzymatic activity may represent the evolutionary footprint of a biosynthetic enzyme that acquires a new functionality through the alteration of substrate and product specificities, resulting in the production of a unique specialized metabolite.
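
The relative activity quoted above is the ratio of the two conversion values. The short sketch below repeats that arithmetic using only the rounded figures given in the text; the function name `relative_activity` is hypothetical. With the rounded inputs the ratio comes out at about 13.5%, so the reported 13.2% was presumably calculated from unrounded measurements. That reading is an assumption and is not stated in the text.

```python
def relative_activity(product_noncanonical, product_canonical):
    """Return conversion by one enzyme as a percentage of a reference enzyme."""
    if product_canonical <= 0:
        raise ValueError("reference conversion must be positive")
    return 100.0 * product_noncanonical / product_canonical


tat_conversion = 1.4    # mol% product formed by TAT (rounded value from the text)
dbat_conversion = 10.4  # mol% product formed by DBAT (rounded value from the text)

ratio = relative_activity(tat_conversion, dbat_conversion)
print(f"TAT activity relative to DBAT: {ratio:.1f}%")
```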

# CONCLUSIONS AND PERSPECTIVES

Using two transferase subfamilies as examples, we have shown the "heritage" of expansion of a gene family, which is relevant for the development of plant specialized metabolic pathways. A protein in the specific BAHD subfamily of *Taxus* species showed promiscuous enzymatic activity for noncanonical substrates containing side chains at a noncanonical carbon position. These observations fit the general context of developmental molecular evolution that explains the development and establishment of new canonical enzymatic activity (Weng et al., 2012). The generation in *L. erythrorhizon* of a PGT gene subfamily, each containing a single exon and involved in shikonin biosynthesis, suggests the putative involvement of the reverse transcription of mature mRNA. If this surmise is valid for other enzyme families, single exon genes may provide clues to identifying missing proteins responsible for biosynthetic pathways for other valuable plant specialized metabolites.

Many missing links remain, even in the actively studied shikonin and taxoid biosynthetic pathways. The single exon hypothesis may apply not only to biosynthetic enzymes but also to regulatory factors. The identification of regulatory factors and of other components, including membrane transporters, will be essential to understanding the production of plant specialized metabolites. Comparative genomics will enable the assessment of the evolutionary footprint of these genes, e.g., the expansion of specific subfamilies and the proliferation of single exon genes. Further biochemical and molecular genetics studies may provide experimental evidence for the involvement of hypothetical proteins in plant specialized metabolism.
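
The single exon hypothesis suggests a simple screen: given a genome annotation, count the exons of each gene and list the genes that have exactly one. The sketch below assumes a simplified, hypothetical record format (a gene ID paired with exon start and end coordinates) rather than any particular annotation file format, and the gene names are placeholders, not real *L. erythrorhizon* identifiers. It is not the analysis performed in this study.

```python
from collections import defaultdict


def single_exon_genes(exon_records):
    """Return sorted IDs of genes that have exactly one distinct exon.

    exon_records: iterable of (gene_id, exon_start, exon_end) tuples.
    """
    exons_by_gene = defaultdict(set)
    for gene_id, start, end in exon_records:
        exons_by_gene[gene_id].add((start, end))  # a set ignores duplicate rows
    return sorted(g for g, exons in exons_by_gene.items() if len(exons) == 1)


# Placeholder annotation: one intronless gene and one gene with three exons.
records = [
    ("geneA_PGT_like", 100, 1000),
    ("geneB_PPT_like", 200, 350),
    ("geneB_PPT_like", 900, 1100),
    ("geneB_PPT_like", 1500, 1800),
]
print(single_exon_genes(records))
```

In practice, candidates found this way would still need phylogenetic placement and biochemical characterization, as the text notes for the hypothetical proteins.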

# DATA AVAILABILITY

The datasets generated for this study can be found in GenBank.

FIGURE 2 | Phylogenetic analysis of BAHD acyltransferase proteins from *Taxus* species and LC-MS/MS analysis of the reaction products of the noncanonical enzyme, taxadienol 5-acyltransferase. (A) Performance of phylogenetic analysis with hypothetical *Taxus* BAHD acyltransferase-like proteins and related proteins from model plant species. The BAHD family was classified into five clades (D'Auria, 2006), with clade V indicated by a yellow background, and representatives of clade I–IV (Vh3MAT1, CER2, BEAT, and ACT, respectively) placed outside the yellow background. Proteins of *Taxus*, rice, *Arabidopsis* are shown in red, magenta, and blue letters, respectively, and the lineage-specific subclades are indicated by the same colors. The bracket "Taxus specific clade" indicates the *Taxus* lineage-specific subclade containing the five characterized proteins, TAT, DBAT, DBTNBT, DBBT, and BAPT, indicated by a white background. Asterisks indicate *Taxus* proteins of unknown function, and daggers indicate proteins biochemically analyzed in the present study. A representative widely conserved clade in land plants from *Physcomitrella* to *Arabidopsis* is indicated by brackets, with four other subclades compressed (expanded in Supplementary Figure S1), in addition to the clade conserved in seed plants containing the *Taxus* specific clade. The accession numbers are given next to the organism names. The phylogenetic tree was drawn using the MEGA7 neighbor-joining method with 1,000 bootstrap replicates for alignment calculated with the MUSCLE algorithm. Scale bar, 0.1 amino acid substitutions per site. (B) LC-MS/MS chromatograms of the enzyme reaction products of *Taxus* acyltransferases DBAT and TAT using acetyl-CoA and 10-DAB as substrates. The red arrow indicates the peak of the noncanonical reaction product. The bottom panel shows the chromatogram of standard specimens, 10-DAB and baccatin III. 
The chromatograms show traces of the representative ions m/z = 545.5 [M + H]⁺ and 604.5 [M + NH₄]⁺ for the substrate 10-DAB (blue) and the product baccatin III (red), respectively. The vertical axis indicates the value relative to 5 million ion counts. (C) Mass spectrum of the *in vitro* reaction product peak found at a retention time of 6.951 min in the chromatogram. The vertical axis indicates the ion count relative to the maximum signal at m/z = 604.5. The molecular formulas of 10-DAB and baccatin III are shown in panel (B).

# AUTHOR CONTRIBUTIONS

HK and KY wrote the manuscript and performed the phylogenetic and biochemical analyses. HL was involved in the assembly of genomic contigs and the analysis of the exon-intron structure of *Lithospermum erythrorhizon* genes. HM, YK, and HT were responsible for transcriptome analysis of *Taxus* spp.

# FUNDING

This work was supported in part by the New Energy and Industrial Technology Development Organization (NEDO, No. 16100890 to KY). Additional support was provided by the Mission Research of RISH, Kyoto University.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.00794/full#supplementary-material

FIGURE S1 | Expanded phylogenetic tree of Figure 2A. Phylogenetic analysis with hypothetical Taxus BAHD acyltransferase-like proteins and related proteins from model plant species. Asterisks indicate Taxus proteins found in this study. Functionally identified BAHD proteins are highlighted in yellow background.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Kusano, Li, Minami, Kato, Tabata and Yazaki. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Acceleration of Mechanistic Investigation of Plant Secondary Metabolism Based on Computational Chemistry

*Hajime Sato1,2 , Kazuki Saito1,3 and Mami Yamazaki1 \**

*1 Graduate School of Pharmaceutical Sciences, Chiba University, Chiba, Japan, 2 Center for Sustainable Resource Science, Advanced Elements Chemistry Laboratory, Cluster for Pioneering Research (CPR), RIKEN, Saitama, Japan, 3 RIKEN Center for Sustainable Resource Science, Yokohama, Japan*

This review describes the application of computational chemistry to plant secondary metabolism, focusing on the biosynthetic mechanisms of terpenes/terpenoids, alkaloids, flavonoids, and lignin as representative examples. Through these biosynthetic studies, we present several computational methods, including density functional theory (DFT) calculations, theozyme calculation, docking simulation, molecular dynamics (MD) simulation, and quantum mechanics/molecular mechanics (QM/MM) calculation. This review demonstrates how modern computational chemistry can be employed as an effective tool for revealing biosynthetic mechanisms and illustrates its potential, for example, in elucidating how enzymes regulate regio- and stereoselectivity, finding the key catalytic residue of an enzyme, and assessing the viability of hypothetical pathways. Furthermore, insights into future research objectives involving the application of computational chemistry to plant secondary metabolism are provided. This review will be helpful for plant scientists who are not well versed in computational chemistry.

Keywords: biosynthesis, computational chemistry, density functional theory, molecular dynamics simulation, quantum mechanics/molecular mechanics, plant, secondary metabolite

# INTRODUCTION

Recent advances in life science have revealed many plant secondary biosynthetic pathways and have also contributed to establishing efficient microbial-based manufacturing systems that can potentially afford greater amounts of important plant secondary metabolites, including artemisinin (Ro et al., 2006; Paddon et al., 2013), opioids (Galanie et al., 2015), and cannabinoids (Luo et al., 2019), by introducing all of the biosynthetic genes into the host. These outstanding achievements are obviously based on long-standing biosynthetic studies. In heterologous expression systems for opioids, enzymes from other organisms were used alongside plant enzymes, a choice that bears on expression levels and substrate specificity. Considering the progress of biosynthetic studies and synthetic biology, rational engineering of existing biosynthetic pathways or the design of novel pathways to obtain desired products appears to be the next challenge. To achieve this objective, detailed mechanistic investigations of biosynthetic reactions are critical for the rational modification of biosynthetic pathways or enzymes.

*Edited by:* 

*Hiroyuki Morita, University of Toyama, Takaoka, Japan*

#### *Reviewed by:*

*Takashi Matsui, Kitasato University, Japan Takahiro Mori, The University of Tokyo, Japan*

> *\*Correspondence: Mami Yamazaki mamiy@faculty.chiba-u.jp*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 09 May 2019 Accepted: 04 June 2019 Published: 26 June 2019*

#### *Citation:*

*Sato H, Saito K and Yamazaki M (2019) Acceleration of Mechanistic Investigation of Plant Secondary Metabolism Based on Computational Chemistry. Front. Plant Sci. 10:802. doi: 10.3389/fpls.2019.00802*

Although various types of experimental methods, such as labeling experiments, X-ray crystallography, site-directed mutagenesis, omics studies (Rai et al., 2017, 2019), and genome editing, have been applied, clarifying entire biosynthetic pathways and mechanisms remains a challenge in plant science. Unlike in microorganisms, biosynthetic genes do not form clusters in plants, which makes biosynthetic studies of plants difficult. Even finding a biosynthetic gene (enzyme) is a challenge. For example, in quinolizidine alkaloid biosynthesis, only two enzymes involved in skeletal formation have been characterized thus far, although studies on this biosynthesis have been carried out for over half a century (Bunsupa et al., 2012; Yang et al., 2017).

Recently, an increasing number of computational chemistry studies on natural product biosynthesis have been published, providing new insights that cannot be obtained by experimental approaches alone. However, there is still a significant gap between plant biologists and computational chemists in understanding how computational chemistry can serve as an effective tool. Through several representative biosynthetic studies on terpenes/terpenoids, alkaloids, flavonoids, and lignin, this review describes what can be clarified using computational chemistry and also describes the computational methods that are widely used for the mechanistic investigation of biosynthetic reactions. An attempt to be comprehensive was made; however, apologies are offered in advance for any accidental oversights. Moreover, there are many computational predictions of NMR, UV, and CD spectra, and mechanistic investigations on fungal or bacterial secondary metabolism; however, these studies are not covered here since this review focuses on the "mechanistic investigations" of "plant" secondary metabolism.

# THEORETICAL METHODS FOR NATURAL PRODUCT BIOSYNTHESIS RESEARCH

In plant science, cheminformatics resources such as KEGG (Kanehisa and Goto, 2000; Kanehisa et al., 2017, 2019) and KNApSAcK (Afendi et al., 2011) are widely utilized to predict biosynthetic pathways. However, computational chemistry is completely different from these database search approaches. In computational chemistry, the properties of compounds, such as free energy, spectra, reactivity, and molecular orbitals, are obtained by *ab initio* calculations based on the Schrödinger equation in molecular orbital (MO) calculations, the Kohn-Sham equation in density functional theory (DFT) calculations, and the Newton equation in molecular dynamics (MD) simulations (Jensen, 2017).

Although a variety of computational methods are available for the mechanistic investigation of chemical reactions, carefully choosing appropriate methods is important in terms of accuracy and computational expense. Today, quantum mechanics (QM) methods, i.e., MO or DFT, are mainly used for the evaluation of the reactivity or properties of small molecules. For mechanistic investigations using QM calculations, a transition state (TS) search is initially performed, after which a frequency calculation is performed to ensure that the TS has a single imaginary frequency. Finally, an intrinsic reaction coordinate (IRC) calculation is carried out to obtain the reactant and product (Ishida et al., 1977; Fukui, 1981; Gonzalez and Schlegel, 1989; Page et al., 1990; Schlegel and Gonzalez, 1990). Today, hundreds of levels of theory are available. The choice of the level of theory is critical for computational accuracy in QM, and the best choice depends on the type of reaction or chemical structure. Many publications on benchmark tests exist, in which several combinations of density functional and basis set are tested against a particular reaction or molecule to find the most accurate level of theory. For example, mPW1PW91/6-31+G(d,p)//B3LYP/6-31+G(d,p) has been used for the terpene-forming reaction based on the benchmark test reported by Matsuda et al. (2006). In that study, many combinations of basis sets and functionals were tested and compared with the experimental data. The results indicated that B3LYP/6-31+G(d,p) is the best for geometry optimization, whereas mPW1PW91/6-31+G(d,p) is the best for computing the free energy for the terpene-forming reaction.
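The frequency check described above can be illustrated with a minimal Python sketch. It assumes that the vibrational frequencies have already been extracted from a QM output as a list of wavenumbers in cm⁻¹, with imaginary modes written as negative numbers (a common reporting convention). The function names and the example values are hypothetical and do not come from any of the cited studies.

```python
from typing import Sequence


def count_imaginary(frequencies_cm1: Sequence[float], tol: float = 0.0) -> int:
    """Count imaginary modes, assumed here to be reported as negative wavenumbers.

    `tol` (hypothetical parameter) lets the caller ignore tiny negative values
    that are often numerical noise rather than true imaginary modes.
    """
    return sum(1 for f in frequencies_cm1 if f < -tol)


def classify_stationary_point(frequencies_cm1: Sequence[float], tol: float = 0.0) -> str:
    """Classify an optimized structure from its vibrational frequencies."""
    n_imag = count_imaginary(frequencies_cm1, tol)
    if n_imag == 0:
        return "minimum"
    if n_imag == 1:
        return "transition state"
    return "higher-order saddle point"


# Invented example: one imaginary mode, so the structure is a TS candidate.
example_frequencies = [-432.1, 55.0, 130.2, 1450.7]
print(classify_stationary_point(example_frequencies))
```

A structure classified as a transition state in this way would still need an IRC calculation to confirm which reactant and product it connects.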

In plant secondary metabolism, most of the biosynthetic reactions are thought to be catalyzed by enzymes; however, the system size, which QM calculations can treat, is usually up to a few hundred atoms. This means that biological macromolecules, i.e., enzymes, are too large to be calculated with QM. Theozyme calculation (Tantillo et al., 1998; Ujaque et al., 2002; Tantillo, 2010) is one way to estimate the enzymatic assistance toward the chemical conversion, in which the catalytic center, substrate, and several residues are picked up and subsequently subjected to DFT calculations (see Section "Theozyme Calculation Identified the Key Residue for Sesterfisherol Biosynthesis" for more detail). However, QM calculation is applied only for the isolated model, and the regions that could affect the chemical conversion are ignored. Thus, MD simulation is used for large macromolecule systems, which can simulate the time-dependent structure changes of enzymes; however, the free energy value is less accurate than that obtained using QM. For relatively high accuracy, quantum mechanics/molecular mechanics (QM/MM) is used for large macromolecule systems, in which the reaction system is divided into two regions: QM and MM. Generally, the catalytic center is calculated using QM, and the other parts of the enzyme are calculated using MM. Moreover, a state-of-the-art QM/MM MD is also utilized for mechanistic investigations.
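As a point of reference, the partitioning described above is often written, in the so-called additive scheme, as

$$E_{\mathrm{QM/MM}} = E_{\mathrm{QM}}(\text{QM region}) + E_{\mathrm{MM}}(\text{MM region}) + E_{\mathrm{QM\text{-}MM}}(\text{coupling})$$

This is a generic textbook form rather than the expression used in any specific study cited here. The coupling term collects the electrostatic, van der Waals, and bonded interactions between the two regions, and its exact treatment depends on the implementation.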

# TERPENOIDS

Terpene/terpenoids are the largest natural product group. At least 80,000 terpene/terpenoids have been reported to date (Quin et al., 2014; Dickschat, 2016; Christianson, 2017). Most theoretical studies on terpene cyclization have been carried out by Tantillo and Hong. Although they have provided valuable insight into carbocation chemistry, only some of their works are presented in this review due to page limitations. One of the reasons for the considerable success of this compound group in computational chemistry is that terpenes consist of only carbons and hydrogens; therefore, strong interactions between the substrate and enzyme, such as hydrogen bonding, are not necessarily considered for its cyclization. In fact, the computed inherent reactivity shows good agreement with the experimental data for the terpene-forming reaction. Moreover, as was described above, a detailed benchmark test has been reported, which also supports the computation accuracy. In this section, several examples ranging from small systems to large systems will be discussed.

# Two Possible Cyclization Mechanisms Were Assessed Using Density Functional Theory Calculations

In comparison to mono-, sesqui-, di-, and triterpenes, sesterterpenes are considered relatively minor terpenes, and only a few biosynthetic enzymes have been characterized to date. In 2017, a genome mining approach to discovering the biosynthetic genes of sesterterpenes from *Arabidopsis thaliana*, *Capsella rubella*, *Brassica oleracea*, and *Brassica rapa* was reported by Huang et al. (2017), which strongly promoted the biosynthetic research of sesterterpenoids. They isolated (+)-arathanatriene, (−)-retigeranin B, (+)-astellatene, (−)-ent-quiannulatene, (−)-variculatriene A, (−)-caprutriene, (+)-boleracene, (−)-aleurodiscalene A, (+)-caprutriene C, (−)-caprudiene A, (+)-brarapadiene A, (−)-brarapadiene B, (−)-arathanadiene A, and (−)-arathanadiene B, and also characterized the corresponding biosynthetic enzymes (**Figure 1**). Interestingly, some of these compounds were originally found in microorganisms, suggesting the convergent evolution of biosynthetic genes.

Since sesterterpenes have more carbons than sesqui- and diterpenes do, their structure is more flexible and complicated, which makes the mechanistic investigation challenging. For example, two reasonable pathways can be proposed for (+)-astellatene formation (**Figure 2**). In path I, a 5/6/11 tricyclic skeleton is formed first, whereas path II involves the initial formation of a 5/12/5 tricyclic skeleton. Prior to this research, a labeling experiment was performed by Ye et al. in the biosynthesis of sesterfisherol, which is a similar sesterterpenoid. They could not experimentally distinguish between the 5/6/11 and 5/12/5 tricyclic skeleton formations because both paths are consistent with the labeling experiment results (Ye et al., 2015). Moreover, predicting a reasonable pathway based on presumable intermediates is difficult. For example, (+)-arathanatriene and (−)-caprutriene could be formed *via* path I, whereas (−)-variculatriene A and (−)-variculatriene B could be formed *via* path II, indicating that the experimental approach alone is not sufficient to answer this question. Accordingly, Huang et al. performed DFT calculations to assess the viability of the proposed biosynthetic pathways of (+)-astellatene and brarapadiene A (Huang et al., 2018).

Based on their computational results, both pathways are energetically viable; however, path I is inherently preferred since the highest energy barrier in path I is lower than that in path II by 6 kcal/mol. This study shows that DFT calculations can clarify the inherent reactivity of the carbocation intermediate in terpene biosynthesis, which could be useful for distinguishing the favored pathway. This combined experimental and computational approach is very useful, particularly when the reaction mechanism cannot be accessed solely using experiments.
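To give a rough sense of what a 6 kcal/mol difference in barrier height means, the following Python sketch applies the Eyring equation from transition state theory. It assumes a transmission coefficient of 1, a temperature of 298.15 K, and that the two highest barriers can be compared directly as free energy barriers. These are illustrative assumptions, not values or procedures taken from Huang et al., and the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3    # gas constant, kcal mol^-1 K^-1
K_B = 1.380649e-23      # Boltzmann constant, J K^-1
H_PLANCK = 6.62607015e-34  # Planck constant, J s


def eyring_rate(dg_act_kcal: float, temperature: float = 298.15) -> float:
    """First-order rate constant (s^-1) from a free energy barrier in kcal/mol.

    Assumes a transmission coefficient of 1 (simple transition state theory).
    """
    prefactor = K_B * temperature / H_PLANCK
    return prefactor * math.exp(-dg_act_kcal / (R_KCAL * temperature))


def rate_ratio(ddg_act_kcal: float, temperature: float = 298.15) -> float:
    """Ratio k_low / k_high for two barriers that differ by ddg_act_kcal."""
    return math.exp(ddg_act_kcal / (R_KCAL * temperature))


# Barrier difference between path II and path I, as quoted in the text.
print(f"k(path I) / k(path II) ~ {rate_ratio(6.0):.1e}")
```

Under these assumptions, a 6 kcal/mol difference corresponds to a rate ratio on the order of 10^4 at room temperature, which is why path I can be described as inherently preferred even though both pathways are energetically viable.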

# Density Functional Theory Calculation Revealed That the Initial Conformation Affects the Regio- and Stereoselectivity

In terpene cyclization cascades, the computed inherent reactivity is consistent with the experimental results, indicating that enzymes do not precisely manipulate the reactions step by step. However, various kinds of terpenoids are synthesized from common isoprenoid substrates. Therefore, some may wonder how terpene cyclases can control the regio- and stereoselectivity without interacting with the intermediate during the cyclization.

To investigate the origin of the terpenoid structural diversity, Sato et al. carried out a detailed comparative study of two structurally similar sesterterpenes: sesterfisherol and quiannulatene, which are synthesized by NfSS and EvQS, respectively (Sato et al., 2018a). Based on their calculations, these sesterterpenoids are formed *via* the 5/12/5 tricyclic intermediate, unlike astellatene (**Figure 2**, path II). In addition, a careful comparison clarified the conformational difference between the eight-membered rings in these two 5/12/5 intermediates. Here, we refer to the conformation of the quiannulatene's 5/12/5 tricyclic intermediate (IM\_Q) as conformation A, and that of sesterfisherol (IM\_S) as conformation B.

They carried out a "conformation swapping analysis" between these two sesterterpenoid intermediates (**Figure 3**). When IM\_Q is in conformation B, 5/5/5-type triquinane formation proceeds, which was observed in sesterfisherol biosynthesis (**Figure 3B**). However, this triquinane formation reaction does not proceed inside EvQS, because conformation A is more stable than conformation B by 2.3 kcal/mol, and the activation energy of the 5/5/6-type condensation reaction is much lower. When IM\_S is in conformation A, a 5/5/6-type condensation reaction proceeds, which is observed in quiannulatene biosynthesis (**Figure 3C**). However, this reaction cannot proceed inside NfSS because the energy barrier is too high (∆*G*‡ = 26.6 kcal/mol). This analysis suggests that the destination of the cyclization cascade is determined by the conformation of the intermediates. For each intermediate, the preferred conformation is different (A is preferred in quiannulatene biosynthesis, and B in sesterfisherol biosynthesis), which could be attributed to the stereochemistry of the 5/11 condensation positions (*trans*-fused in IM\_Q and *cis*-fused in IM\_S).
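The 2.3 kcal/mol preference for conformation A can be translated into approximate equilibrium populations with a Boltzmann weighting. The sketch below treats the reported value as a free energy difference at 298.15 K, which is an assumption made here for illustration; the function name is hypothetical.

```python
import math
from typing import List, Sequence

R_KCAL = 1.987204e-3  # gas constant, kcal mol^-1 K^-1


def boltzmann_populations(rel_energies_kcal: Sequence[float],
                          temperature: float = 298.15) -> List[float]:
    """Return equilibrium fractions for states with the given relative energies."""
    rt = R_KCAL * temperature
    weights = [math.exp(-energy / rt) for energy in rel_energies_kcal]
    total = sum(weights)
    return [w / total for w in weights]


# Conformation A at 0.0 and conformation B at +2.3 kcal/mol (value from the text).
pop_a, pop_b = boltzmann_populations([0.0, 2.3])
print(f"A: {pop_a:.1%}, B: {pop_b:.1%}")
```

With these assumptions, conformation A accounts for roughly 98% of the equilibrium population, which is consistent with the statement that the triquinane-forming route from conformation B is disfavored inside EvQS.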

Moreover, a further detailed comparison was carried out for the whole cyclization cascade, from which the initial conformation was found to be critical for the selectivity (**Figure 4**). The orientation of the methyl group of each double bond in the initial conformation determines the stereochemistry of the intermediate; accordingly, the conformation of the intermediate is automatically fixed, and the destination of the cascade is automatically set. In conclusion, terpene synthase appears to regulate the regio- and stereoselectivity by fixing the initial conformation.

# Theozyme Calculation Identified the Key Residue for Sesterfisherol Biosynthesis

Generally, the main roles of terpene cyclases are (1) abstraction of pyrophosphate, (2) fixing the initial conformation (as was described above), (3) protecting the reactive intermediates from water, and (4) termination of the cyclization. The terpene-forming reaction is initiated by the elimination of the pyrophosphate group; afterward, the cyclization is driven by the inherent reactivity of the carbocation. However, the residue in the active site occasionally affects the carbocation intermediate in the cyclization cascade. Considering computational expenses, theozyme calculation is a great method for estimating the involvement of residues in the active site. Here, we show one example in which theozyme calculation and experimental validation worked successfully.

A detailed theoretical study of sesterfisherol biosynthesis was published by Sato et al. (Narita et al., 2017; Sato et al., 2018b). While they were exploring the cyclization mechanisms of sesterfisherol, a deep minimum was found, escape from which required an extraordinarily high energy compared with typical activation energies in terpene biosynthesis. Subsequently, they examined the possibility of C–H π interactions (Hong and Tantillo, 2015), since terpene cyclase active sites are generally formed by aromatic and aliphatic residues. To estimate the C–H π interaction, a benzene ring was placed around the carbocation center of the intermediate, and the transition state structure search was subsequently carried out. Consequently, an alternative pathway was found, in which the C–H π interaction lowered the activation energy and also avoided the deep minimum (**Figure 5**).

Notably, they also experimentally examined this C–H π interaction by site-directed mutagenesis. First, they constructed a homology model of NfSS and carried out the docking simulation. Afterward, several aromatic residue candidates that could form the C–H π interaction were found in NfSS's active site (**Figure 6**). In their experiment, NfSS F191A produced new compounds instead of sesterfisherol (**Figure 5**). This result suggests that F191 is the key catalytic residue in sesterfisherol biosynthesis.

Interestingly, two of the products produced by NfSS F191A are thought to be derivatives of computationally predicted intermediates (Byproduct 1 and Byproduct 2), which could be deprotonated by the pyrophosphate adjacent to the active site.

FIGURE 4 | Initial conformations of GFPP in quiannulatene and sesterfisherol biosyntheses. (A) 3D representations of IM1 in the quiannulatene biosynthesis pathway and the sesterfisherol biosynthesis pathway. (B) The orientation of methyl groups attached to the double bonds I, II, III, IV, and V.

As we have shown in this section, the combination of the theozyme calculation, homology modeling, docking simulation, and site-directed mutagenesis can be a powerful tool for finding the key catalytic residues. We think this approach can accelerate the mechanistic studies of biosynthetic enzymes in plant secondary metabolism.

# High-Resolution All-Atom Model of Terpene Synthase

As was described above, terpene synthases play an important role in pre-organizing the initial conformation of the substrate. However, building molecular models of entire terpene-forming reactions within an active site remains a challenge. Major et al. mentioned the following in their review, "A crucial question in any study of terpene synthases is that of the correct binding mode. Indeed, crystal structures of terpene synthases often contain substrates bound in unreactive conformations, partly due to the stickiness of the hydrocarbon moiety of the substrate and its lack of hydrogen bond potential. Thus, there is often great uncertainty regarding the correct binding mode when commencing multi-scale simulation projects of terpene cyclases" (Major et al., 2014). To tackle this problem, O'Brien et al. combined QM calculation and computational docking with the Rosetta molecular modeling suite and reported high-resolution all-atom models of epi-aristolochene synthase (TEAS) from tobacco (O'Brien et al., 2016) and (+)-bornyl diphosphate synthase from *Salvia officinalis* (O'Brien et al., 2018).

FIGURE 6 | A carbocation intermediate and NfSS complex obtained by the docking simulation.

Due to page limitations, we briefly explain their methodologies here (please refer to the original paper for more details). In the study, they initially carried out DFT calculations along the generally accepted biosynthetic pathway of epi-aristolochene formation (**Figure 7**). All computed carbocation intermediates were subsequently subjected to conformational search using molecular mechanics force field (MMFF). These conformation libraries, containing over a hundred conformers for each carbocation intermediate, were subsequently optimized using DFT calculations, and low energy structures within 5 kcal/mol were used for the docking simulation. For the preparation of the protein structure, the X-ray crystal structure was minimized using a constrained FastRelax (Conway et al., 2014) procedure from the Rosetta modeling suite (Meiler and Baker, 2006; Richter et al., 2011). The diphosphate/magnesium complex extracted from the TEAS crystal structure was docked along with previously generated conformer libraries into the relaxed crystal structure using chemically meaningful constraints. To ensure that the sampling was sufficient, 2,500 docking runs per catalytic orientation (motif) were performed. The resulting structures were combined and subsequently subjected to filtering based on three explicit constraints, involving (1) the departing diphosphate oxygen that results in the carbocation; (2) deprotonation of carbocation intermediate 2; and (3) protonation of carbocation intermediate 3.
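The energy-window step in this workflow, in which only conformers within 5 kcal/mol of the lowest-energy structure are kept for docking, can be sketched as follows. The conformer names and energies are made up for illustration, and the function is a hypothetical helper, not part of Rosetta or of the authors' workflow; only the 5 kcal/mol window is taken from the text.

```python
from typing import Dict


def filter_low_energy(energies_kcal: Dict[str, float],
                      window_kcal: float = 5.0) -> Dict[str, float]:
    """Keep conformers within `window_kcal` of the lowest-energy conformer.

    Returns a mapping from conformer name to energy relative to the minimum.
    """
    if not energies_kcal:
        return {}
    e_min = min(energies_kcal.values())
    return {
        name: energy - e_min
        for name, energy in energies_kcal.items()
        if energy - e_min <= window_kcal
    }


# Invented DFT energies (kcal/mol) for four conformers of one intermediate.
conformers = {"conf_01": 0.0, "conf_02": 1.8, "conf_03": 4.9, "conf_04": 7.3}
selected = filter_low_energy(conformers)
print(sorted(selected))  # conf_04 lies outside the 5 kcal/mol window
```

Filtering on relative rather than absolute energies keeps the helper independent of the reference energy used by the QM program.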

As shown in **Figure 8**, motif 1 is abundant in all intermediates, suggesting that this orientation is the most reasonable docking mode in TEAS. Furthermore, they carried out the RMSD calculation on the motif 1 structures in all intermediates and revealed the least movable docking mode during the cyclization reaction. Interestingly, only a few orientations are enriched in the early stage of this biosynthesis, whereas several motifs are enriched in the late stage of this enzyme reaction. In addition, the RMSD value increased in the late stage of this cyclization reaction. These observations are consistent with the generally accepted concept that the substrate affinity decreases as the reaction proceeds. We think this method is applicable to future efforts to rationally redesign the reaction specificity of this class of enzymes.
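For readers unfamiliar with the RMSD metric mentioned above, a minimal Python sketch follows. It computes the RMSD between two poses with identical atom ordering and without any superposition, on the assumption that docked poses already share the coordinate frame of the protein. How the RMSD was actually computed in the original study is not specified here, and the coordinates below are invented.

```python
import math
from typing import Sequence, Tuple

Coordinate = Tuple[float, float, float]


def rmsd(pose_a: Sequence[Coordinate], pose_b: Sequence[Coordinate]) -> float:
    """Root-mean-square deviation between two poses, without superposition.

    Both poses must list the same atoms in the same order.
    """
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must contain the same number of atoms")
    if not pose_a:
        raise ValueError("poses must contain at least one atom")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(pose_a))


# Invented two-atom example: pose_b is pose_a shifted by 1 along z.
pose_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(pose_a, pose_b))
```

A larger RMSD between the docked poses of successive intermediates indicates that the ligand moves more within the active site, which is the sense in which the text describes the late-stage intermediates as more mobile.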

# Quantum Mechanics/Molecular Mechanics Molecular Dynamics Calculation Revealed the Substrate Specificity of Geranyl Diphosphate Synthase

Although the inherent reactivities of the carbocation intermediates, computed by DFT calculations, are consistent with the experimental results, terpene cyclases occasionally affect the carbocation intermediates, as shown in Section "Theozyme Calculation Identified the Key Residue for Sesterfisherol Biosynthesis". Therefore, a calculation including both the substrate and the whole protein structure might be required to estimate the significance of the enzymatic support. For this purpose, the QM/MM method is widely used because of its manageable computational cost. Here, we introduce a detailed study of geranyl diphosphate synthase (GPPS) from *Mentha piperita*, reported by Wu and Xu (Liu et al., 2014).

GPPS accepts isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DPP) as substrates and yields geranyl diphosphate (GPP), which is the first step of the chain elongation in isoprenoid biosynthesis. In that study, they carried out MD simulations to reveal the mechanism of an "open-closed" conformation change of the catalytic pocket in the GPPS active site and identified a critical salt bridge between **Asp91** (in loop 1) and **Lys239** (in loop 2), which is responsible for opening or closing the catalytic pocket. In addition, the small subunit regulates the size and shape of the hydrophobic pocket to flexibly host substrates with different shapes and sizes (DPP/GPP/FPP, C5/C10/C15). Furthermore, QM/MM MD simulations were carried out to explore the binding modes for the different substrates catalyzed by GPPS. GPPS is known to be a bifunctional enzyme and can catalyze the formation of GPP, GGPP, and a negligible amount of FPP. QM/MM MD simulation revealed that the distances and angles between two substrates are critical for the reaction. These parameters are similar when GPP or GGPP is produced, whereas the reverse is the case when FPP is produced. This study shows how QM/MM MD simulation is effective for clarifying the enzymatic effect on the reaction in plant metabolism. Moreover, the key residues Asp91, Lys239, and Gln156, which could be good candidates for site-directed mutagenesis, were found based on the computation.

FIGURE 8 | Docking results using Rosetta. The structures of the intermediates are shown on the left. The darker the green in each box, the higher the percentage of low energy structures that are found in that catalytic orientation. If no low energy solutions were found for a particular intermediate, then no value was given. The number is the percentage of total low energy structures found for the catalytic motif when docking a particular intermediate.

As was described in Section "High-Resolution All-Atom Model of Terpene Synthase," building the high-resolution all-atom model of terpene synthase has been challenging for a long time, which is one reason there are only a few examples of QM/MM calculations of terpene synthase. We expect an increased number of reports on the all-atom modeling, using QM and Rosetta modeling suite, in the near future, which would promote QM/MM calculations. Moreover, we hope this kind of research will facilitate the mechanistic investigation and rational engineering of the biosynthetic pathways in plants.

# ALKALOIDS

Unlike terpenoids, alkaloids have polar moieties that can form strong interactions with the enzyme. Therefore, more careful model construction is necessary, such as theozyme, MD simulation, or QM/MM. Accordingly, few theoretical studies on alkaloid biosynthesis have been reported to date. Here, we introduce four theoretical studies on the inherent reactivity in alkaloid biosynthesis.

# Density Functional Theory Calculation Revealed a Favorable Mechanism for **β**-Carboline Formation Catalyzed by Strictosidine Synthase

Strictosidine is an important common intermediate that can be converted to many kinds of plant indole monoterpenoid alkaloids, such as ajmalicine, quinine, vinblastine, reserpine, camptothecin, and vincamine. The Pictet-Spengler condensation reaction of tryptamine with secologanin, catalyzed by strictosidine synthase, has been intensively studied, and several X-ray crystal structures have been reported. The Pictet-Spengler reaction consists essentially of two steps. First, an electron-rich aromatic amine attacks the aldehyde of secologanin to form an iminium intermediate. Second, an aryl amine attacks the electrophilic iminium to yield a positively charged intermediate that is then deprotonated to yield a β-carboline product. Although the ligand-bound crystal structure is available, the detailed mechanism of β-carboline formation remains unclear. Notably, carbons 2 and 3 of the indole moiety are both nucleophilic; therefore, two possible mechanisms can be written (**Figure 9**). In path I, carbon 2 attacks the iminium moiety, after which the direct six-membered ring formation proceeds. In path II, carbon 3 attacks the iminium moiety, thereby forming spiroindolenine, which is subsequently converted to β-carboline by a 1,2-alkyl shift. To clarify which pathway is more reasonable, Maresh et al. carried out DFT calculations (Maresh et al., 2008).

As shown in **Figure 10**, the formation of the six-membered ring is several orders of magnitude faster than spiroindolenine formation, which is consistent with the empirical rule that 6-endo-trig ring closures are favored over 5-endo-trig cyclizations. However, the formation of the spiroindolenine intermediate has been observed by isotopic scrambling. In addition, spiroindolenine formation requires only 8 kcal/mol, and it appears to occur under ambient conditions. Therefore, if the deprotonation is slow enough, spiroindolenine can be formed during the course of the reaction. Moreover, this calculation also suggests that the 1,2-alkyl shift that connects the iminium spiroindolenine and the six-membered ring intermediate requires significantly high energy and therefore does not contribute to the mechanism.

Even when X-ray crystal structures are available, revealing the detailed reaction mechanisms is often challenging. This study indicates that the combined strategy of computations and kinetic isotope experiments is quite effective in distinguishing two possible pathways. We hope this approach will be widely utilized in the study of plant secondary metabolism.

# Camptothecin E Ring Opening Reaction

Camptothecin (CPT) is a plant alkaloid that was originally isolated from the Chinese tree, *Camptotheca acuminata*, in 1966 (Wall et al., 1966). It features a planar pentacyclic ring structure with a pyrrolo[3,4-β]quinoline moiety (rings A, B, and C), a conjugated pyridone moiety (ring D), and one α-hydroxy lactone ring (ring E), and it has attracted interest from scientists due to its remarkable anticancer activity in preliminary clinical trials. Although two CPT analogs, topotecan and irinotecan, have been approved and are currently used in cancer chemotherapy, the pharmacological investigation into CPT was at one point suspended due to the poor solubility of CPT in water and most organic solvents (Gupta et al., 1995; Wu and Liu, 1997; Wall, 1998).

The stabilities of CPT and its derivatives in solution were reported to be highly pH-dependent by several experimental analyses. In addition, CPT and its derivatives tend to aggregate in solution, particularly in dimer formation. One of the key factors is the E-ring opening reaction that can proceed at neutral pH. Zou et al. carried out DFT calculations to obtain the theoretical basis for this E-ring opening reaction (Zou et al., 2013). In their study, they revealed (1) that the E-ring opening reaction can proceed under physiological pH (∆G‡ = 12.94 in aqueous solution), (2) the solvation effect, and (3) the substitution effect. This study is a good example of how computational chemistry is an effective tool for revealing the degradation process of plant alkaloids and for examining their instability.

# Biosynthetic Dipolar Cycloaddition: Daphniphyllum Alkaloids, Flueggine A, and Virosaines

Cycloaddition is one of the most important reactions for skeletal formation in both total synthesis and biosynthesis of natural products. There have been many theoretical studies on Diels-Alderase which provide insightful mechanistic approaches that cannot be easily addressed by experimental approaches. These reports also show that the inherent reactivity of a molecule is relevant to many cyclization reactions promoted by enzymes. Here, we show two theoretical studies of plant alkaloids from *Daphniphyllum* (Tantillo, 2016) and *Flueggea* (Painter et al., 2013).

The development of cascade polycyclizations by Heathcock and co-workers, to construct *Daphniphyllum* alkaloids, is a milestone in biomimetic total synthesis, in which bicyclic intermediate A is thought to be converted into tetracyclic intermediate D *via* intermediates B and C (Heathcock et al., 1992; Heathcock, 1996). Based on the calculations reported by Tantillo (2016), A is converted to B with an activation energy of only 7.1 kcal/mol *via* concerted [4 + 2] cycloaddition (aza-Diels-Alder reaction). This result represents a potential biological Diels-Alder reaction for which enzymatic barrier lowering would not be required. However, enzymatic preorganization of the substrate may still be required for a successful reaction, presumably because most conformers are unreactive. Interestingly, B is directly converted to D *via* an ene reaction, in which C is not a minimum but a transient structure (**Figure 11**).

While there have been many studies on [4 + 2] cycloaddition reactions, other types of cycloadditions received less attention. Flueggine A and virosaine alkaloids, isolated from *Flueggea virosa*, are thought to be synthesized from a common precursor that could be derived from norsecurinine or tyrosine (**Figure 12**). Interestingly, this proposed biosynthetic pathway involves a nitrone-alkene (3 + 2) cycloaddition reaction.

In the referenced study, they examined all eight possible stereoisomers for flueggine A formation. The predicted energy barrier for the cycloaddition in water *via* the transition state structure leading to flueggine A is 10.3 kcal/mol, which is the lowest among all the transition state structures for this reaction, implying that selectivity control by the enzyme is not required. Based on the distortion/interaction model, this transition state has both considerably small distortion energy and considerably large interaction energy. They also examined virosaine formation, and the activation energy was only ~1 kcal/mol, indicating that enzymatic assistance is unnecessary.
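For reference, the distortion/interaction model (also known as the activation strain model) decomposes the activation energy into two terms. In its generic form, which is given here as background rather than as the specific expression used in the cited study, it reads

$$\Delta E^{\ddagger} = \Delta E^{\ddagger}_{\mathrm{dist}} + \Delta E^{\ddagger}_{\mathrm{int}}$$

where $\Delta E^{\ddagger}_{\mathrm{dist}}$ is the energy needed to distort the reacting fragments into their transition state geometries and $\Delta E^{\ddagger}_{\mathrm{int}}$ is the interaction energy between the distorted fragments, which is usually stabilizing. A small distortion term and a large stabilizing interaction term therefore both act to lower the barrier.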

It is always difficult to ascertain if the Diels-Alder reaction involved in the proposed biosynthetic pathway is enzymatic or not. However, computational chemistry provides us with the energy landscape of the reaction, which can assist in narrowing down the candidate biosynthetic genes in plant science.

# Elucidation of ∆¹-Piperideine Dimer Formation in Quinolizidine Alkaloid Biosynthesis

Unlike for fungi and bacteria, revealing whole biosynthetic genes relevant to a plant secondary metabolite is still quite challenging. Quinolizidine alkaloids (QAs), a subclass of Lys-derived plant alkaloids widely distributed in *Leguminosae*, have been intensively studied for over half a century. However, most of their biosynthetic genes, enzymes, and intermediates remain unknown. Particularly, in the quinolizidine skeletal formation process, which is thought to be common in all QA biosynthesis, only two enzymes, L-lysine decarboxylase (LDC) (Bunsupa et al., 2012) and copper amine oxidase (CAO) (Yang et al., 2017), have been identified to date (**Figure 13A**).

Therefore, there is no valid information on how many genes, enzymes, and reactions are required for production of QAs after ∆¹-piperideine formation. In a plausible biosynthetic pathway of QAs, 5-aminopentanal is spontaneously converted to ∆¹-piperideine (**Figure 13A**). The dimerization of ∆¹-piperideine has been proposed; however, it is unclear whether or not it requires enzymatic assistance. Therefore, Sato et al. carried out theoretical investigations to uncover the biosynthetic mechanism of ∆¹-piperideine dimerization (Sato et al., 2018c). There are four possible stereoisomers for the piperideine dimer, i.e., *(R,R)*, *(S,S)*, *(R,S)*, and *(S,R)* (**Figure 13B**). In the mentioned study, they constructed a model that has two molecules of ∆¹-piperideine and two to four molecules of water for each isomer.

The results of the DFT calculations indicate that piperideine dimerization spontaneously proceeds under neutral conditions and yields only (R,R) or (S,S) dimers. (R,S) and (S,R) dimer formations require considerably higher energy barriers; therefore, they cannot be formed under neutral conditions. Based on the previous literature, tetrahydroanabasin, which could be formed from the (R,R) piperideine dimer through isomerization, was isolated from a QA producing plant. Therefore, (R,R) piperideine formation could be favored in a QA producing plant. This stereochemistry is also consistent with the other QA derivatives. For example, (−)-lupinine and (+)-epilupinine could be synthesized from (R,R)-piperideine and (R,S)-piperideine dimers, respectively. However, we cannot rule out the possibility that the enzyme assists this dimerization reaction since the (S,S)-piperideine dimer has not yet been isolated. Moreover, nature occasionally uses an enzyme for the reaction that can spontaneously proceed under aqueous conditions (Chen et al., 2018).

In the study of QA biosynthesis, differential expression analysis was performed to search for LDC genes, which also provided dozens of other candidate genes. Combining computational chemistry with an expression-based approach could be a powerful tool for narrowing down the candidate genes.

# OTHER SECONDARY METABOLITES

Unlike other natural product groups, flavonoid and lignin biosynthesis have been the subject of only a few mechanistic investigations, although there are many studies on spectrum prediction for flavonoids. Here, we introduce one study on anthocyanin biosynthesis and three studies on lignin biosynthesis.

# A Long-Standing Issue in Anthocyanin Biosynthesis Was Solved Using Density Functional Theory Calculations

The biosynthetic pathways of flavonoids have been intensively studied, and related genes, enzymes, and intermediates have been characterized. In the late stage of anthocyanin biosynthesis, dihydroflavonol is converted to anthocyanidin, which is catalyzed by two enzymes, dihydroflavonol reductase (DFR) (Gong et al., 1997) and anthocyanidin synthase (ANS) (**Figure 14**, Route A; Saito et al., 1999; Nakajima et al., 2001; Turnbull et al., 2004). Interestingly, reduction proceeds just after the oxidation in this conversion, which appears to be an energy-consuming detour, because the oxidation state does not change after these two reactions. An alternative non-enzymatic pathway, a simple spontaneous tautomerization, was proposed (**Figure 14**, Route B). However, this spontaneous pathway does not proceed inside the plant because suppression of these genes leads to a decrease in the amount of anthocyanidin. Therefore, the question is why this tautomerization pathway is not feasible. To answer this question, Sato et al. carried out DFT calculations (Sato et al., 2018d).

Based on their calculations, the first tautomerization *via* the non-enzymatic pathway requires ca. 30 kcal/mol, which is too high to be achieved under ambient conditions. This is due to the instability of the transition state structure, which has a highly electron-rich en-diol moiety adjacent to the electron-rich aromatic ring. Interestingly, dihydroflavonol without hydroxyl groups at the 5′ and 7′ positions requires a lower activation energy than the original one does, indicating that these hydroxyl groups act as natural controllers of the anthocyanidin yield. Moreover, the first tautomerization requires less activation energy under acidic conditions, suggesting that this tautomerization requires enzymatic support. They searched for enzymes that could catalyze this type of tautomerization using the KEGG database, but no enzyme was found.
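To give a feel for why a barrier of ca. 30 kcal/mol is regarded as too high under ambient conditions, the Eyring equation can be used to convert a barrier into an approximate rate constant and half-life. The following is a minimal sketch, not part of the cited study: it assumes the quoted barriers can be treated as free energies of activation, a transmission coefficient of 1, first-order kinetics, and a temperature of 298.15 K. All function names are hypothetical.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
H_PLANCK = 6.62607015e-34  # Planck constant, J*s
R_KCAL = 1.987204e-3    # gas constant, kcal/(mol*K)


def eyring_rate(barrier_kcal_mol, temperature_k=298.15):
    """First-order rate constant (s^-1) from the Eyring equation.

    Assumes the barrier is a free energy of activation and that the
    transmission coefficient equals 1.
    """
    prefactor = K_B * temperature_k / H_PLANCK
    return prefactor * math.exp(-barrier_kcal_mol / (R_KCAL * temperature_k))


def half_life_seconds(barrier_kcal_mol, temperature_k=298.15):
    """Half-life of a first-order process with the given barrier."""
    return math.log(2) / eyring_rate(barrier_kcal_mol, temperature_k)


if __name__ == "__main__":
    # Barriers quoted in this review (kcal/mol).
    for barrier in (1.0, 10.3, 30.0):
        print(f"{barrier:5.1f} kcal/mol -> k = {eyring_rate(barrier):.3e} s^-1, "
              f"t1/2 = {half_life_seconds(barrier):.3e} s")
```

Under these assumptions, a 30 kcal/mol barrier corresponds to a half-life on the order of years at room temperature, whereas a barrier near 10 kcal/mol is crossed within a fraction of a second.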

Furthermore, they investigated the later part of the anthocyanin biosynthetic pathway. Interestingly, their DFT calculations suggest that 2-flaven-3,4-diol, which was originally thought to be converted *via* 3-flaven-2,3-diol formation, is directly converted to anthocyanidin under acidic conditions (**Figure 14**). This new finding is consistent with an experimental result in the previous literature, in which hydrochloride was used to terminate the ANS reaction and 3-flaven-2,3-diol was not detected.

As we described here, computational chemistry can simulate a reaction that cannot actually occur and can also easily delete the functional groups to assess their effect on the reaction, which is quite difficult in organic synthesis. This study emphasizes the usefulness of the computational approach to test the hypothetical pathways.

# Lignin

Unlike the other types of natural products mentioned above, lignin biosynthesis has been the subject of only a few mechanistic investigations. The theoretical investigation of lignin biosynthesis is challenging for several reasons, e.g., the radical nature of the reactions, the branched polymer structure, and the various kinds of coupling products. To reveal the mechanism of lignin biosynthesis, a series of theoretical studies has been reported by Sangha et al. (2012, 2014, 2016).

Lignin biosynthesis is initiated by the radical coupling reaction of monolignols. Besides the reactive oxygen, the monolignol-derived radicals are reactive at the C1, C3, C5, and β positions due to delocalization *via* the conjugated π system. Conventionally, lignin polymerization, catalyzed by peroxidases and laccases, takes place in three steps: (1) monolignol binding to the enzyme active site, (2) H2O2-mediated oxidation at the active site to form radicals, and (3) radical coupling reaction to form lignin polymers.

To reveal the inherent reactivity of monolignol, Sangha et al. initially carried out DFT calculation for the six possible radical coupling products, as shown in **Figure 15A**. The result indicates that the formation of β-O4, β-β, and β-5 type monolignol radical couplings is enthalpically favored over those of others, which is consistent with the experimental data suggesting that these three bonds are most common in natural lignin.

Next, they focused on horseradish peroxidase C (HRPC), which triggers the radical polymerization cascade in lignin biosynthesis. Lignin subunits can be classified into three groups: guaiacyl (G), syringyl (S), and *p*-hydroxyphenyl (H) (**Figure 15B**). The ratio of the subunits is relevant to the efficiency of biomass deconstruction. Therefore, they examined the binding affinity of three monolignols, *p*-coumaryl, coniferyl, and sinapyl alcohol, toward HRPC using MD simulations. The results indicated that the binding affinity of the monolignols toward HRPC decreases in the order of *p*-coumaryl > sinapyl > coniferyl alcohol.

Since lignin biosynthesis is a combination of simple radical coupling reactions, many possible reaction pathways can be proposed. As we have shown in this section, computational chemistry can reveal the inherent reactivity and can provide a rational explanation for substrate specificity, which is helpful for narrowing down the candidate structures. We believe that computational chemistry can eliminate the hindrances to precise mechanistic investigations of radical coupling reactions in lignin biosynthesis.

# SUMMARY AND PERSPECTIVES

As we have shown in this review, computational chemistry can be a powerful tool for revealing the biosynthesis of secondary metabolites in plants. Unlike cheminformatics, computational chemistry provides not only reasonable reaction pathways and energy barriers but also many new insights, as described above. Examples include the ring cyclization order in astellatene biosynthesis, key catalytic residues of biosynthetic enzymes, the controlling mechanism of regio- and stereoselectivity in quiannulatene biosynthesis, the docking mode in the active site of terpene synthase, and detailed reaction mechanisms of piperideine dimerization, none of which can be obtained solely by traditional experimental methods.

Although computational approaches using QM calculation, MD simulation, and QM/MM are well established in terpene-forming reactions and cycloadditions, only a few preliminary studies on other types of plant secondary metabolism have been reported until now. Many more studies on other types of natural products are required to refine this powerful computational approach. In particular, more reports on oxidation reactions, mainly catalyzed by P450, FMO, or iron-dependent enzymes (Nakashima et al., 2018), are desired because oxidation is one of the key reactions facilitating the structural diversity and complexity of plant secondary metabolites.

As was mentioned in Section "Introduction," the next goal of applying computational chemistry to natural product biosynthesis is to engineer or design novel biosynthetic pathways and enzymes to obtain desired products. One approach to achieve this objective is swapping the biosynthetic enzymes (genes) with other genes that can accept the biosynthetic intermediate, which could produce novel natural products. Another approach could be to design novel enzymes by computational chemistry (Kiss et al., 2013). The methodology of *de novo* enzyme design was reported a decade ago, although it has not yet been applied to plant secondary metabolism engineering. Three remarkable examples of artificially designed enzymes were published: retroaldolase (Jiang et al., 2008), Kemp elimination enzyme (Röthlisberger et al., 2008), and Diels-Alderase (Siegel et al., 2010). In this method, the catalytic cycle is designed first. Subsequently, DFT (theozyme) calculations are carried out to obtain the transition state structures. The next step involves searching for a template protein that has enough space in its cavity to accommodate the transition state structure. Finally, the enzyme is designed using Rosetta, which is also used in the terpene docking simulation, as described above. In this way, the desired artificial enzymes, which had not existed previously, were successfully designed. Moreover, the catalytic ability of those enzymes was improved using directed evolution for several generations. We believe this kind of approach is very useful for the efficient production of novel bioactive plant secondary metabolites. We think terpene synthase is a good candidate for designing an artificial enzyme, since (1) the computational approach is well developed, (2) many transition-state structures are already available, (3) the substrate is commercially available, and (4) most of the terpenes have unique biological activities. We hope that more biosynthetic studies utilize a computational chemistry-based approach for mechanistic investigations.

# AUTHOR CONTRIBUTIONS

HS wrote the entire manuscript under the supervision of KS and MY.

# FUNDING

This work is supported by the JSPS Grant-in-Aid for Scientific Research on Innovative Areas JP16H06454 (MY).

# ACKNOWLEDGMENTS

We gratefully acknowledge the Strategic Priority Research Promotion Program of Chiba University. We would also like to thank *Editage* (www.editage.jp) for English language editing.

# REFERENCES


and expressed in a forma-specific manner in *Perilla frutescens*. *Plant Mol. Biol.* 35, 915–927. doi: 10.1023/A:1005959203396


rearrangement in anditomin biosynthesis. *J. Am. Chem. Soc.* 140, 9743–9750. doi: 10.1021/jacs.8b06084


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Sato, Saito and Yamazaki. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Large-Scale Profiling of Saponins in Different Ecotypes of Medicago truncatula

Zhentian Lei<sup>1,2</sup>\*, Bonnie S. Watson<sup>3</sup>, David Huhman<sup>3</sup>, Dong Sik Yang<sup>3,4</sup> and Lloyd W. Sumner<sup>1,2</sup>\*

<sup>1</sup> University of Missouri Metabolomics Center, Columbia, MO, United States, <sup>2</sup> Department of Biochemistry, University of Missouri, Columbia, MO, United States, <sup>3</sup> Noble Research Institute, Ardmore, OK, United States, <sup>4</sup> Biomaterials Laboratory, Material Research Center, Samsung Advanced Institute of Technology, Gyeonggi-do, South Korea

#### Edited by:

Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan

#### Reviewed by:

Jacob Pollier, Flanders Institute for Biotechnology, Belgium; Ery Odette Fukushima, Regional College Amazon Ikiam, Ecuador; Cristian Daniel Quiroz Moreno, Regional College Amazon Ikiam, Ecuador, in collaboration with reviewer EF

#### \*Correspondence:

Zhentian Lei, leiz@missouri.edu; Lloyd W. Sumner, sumnerlw@missouri.edu

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 16 April 2019 Accepted: 13 June 2019 Published: 03 July 2019

#### Citation:

Lei Z, Watson BS, Huhman D, Yang DS and Sumner LW (2019) Large-Scale Profiling of Saponins in Different Ecotypes of Medicago truncatula. Front. Plant Sci. 10:850. doi: 10.3389/fpls.2019.00850

A total of 1,622 samples representing 201 Medicago truncatula ecotypes were analyzed using ultrahigh pressure liquid chromatography coupled to mass spectrometry (UHPLC-MS) to ascertain saponin profiles in different M. truncatula ecotypes and to provide data for a genome-wide association study and subsequent line selection for saponin biosynthesis. These ecotypes originated from 14 different Mediterranean countries, i.e., Algeria, Cyprus, France, Greece, Israel, Italy, Jordan, Libya, Morocco, Portugal, Spain, Syria, Tunisia, and Turkey. The results revealed significant differences in the saponin content among the ecotypes. European ecotypes generally contained higher saponin content than African ecotypes (p < 0.0001). This suggests that M. truncatula ecotypes modulate their secondary metabolism to adapt to their environments. Significant differences in saponin accumulation were also observed between the aerial and the root tissues of the same ecotypes (p < 0.0001). While some saponins were found to be present in both the aerial and root tissues, zanhic acid glycosides were found predominantly in the aerial tissues. Bayogenin and hederagenin glycosides were found mostly in roots. The differential spatially resolved accumulation of saponins suggests that saponins in the aerial and root tissues play different roles in plant fitness. Aerial saponins such as zanhic glycosides may act as animal feeding deterrent and root saponins may protect against soil microbes.

#### Keywords: Medicago truncatula, ecotypes, triterpene saponin, metabolomics, LC-MS/MS

# INTRODUCTION

Legumes are economically important and widely cultivated crops. They contain relatively high protein content and are important sources of protein for both humans and animals. Their high protein content may be attributed to their unique symbiotic relationship with nitrogen-fixing bacteria. Legumes also produce a vast array of natural products including flavonoids, isoflavonoids, anthocyanins, condensed tannins, lignin, and saponins (Dixon and Sumner, 2003). These natural products play important roles in many biological processes and are important to legume quality. For example, flavonoids serve as signaling compounds in the symbiotic plant-microbe interactions and induce the expression of Nod genes in the nitrogen-fixing bacteria. Condensed tannins can prevent bloat associated with animals grazing on legumes with high protein content such as alfalfa and clover. Saponins are triterpene glycosides composed of triterpenoid aglycones (normally referred to as sapogenins) conjugated with various carbohydrate residues. They have been documented to possess anti-fungal, anti-bacterial and anti-insect properties and contribute to plant development and defense against pathogens (Moses et al., 2014). However, due to their hemolytic activity and membrane-permeabilizing nature, saponins are considered anti-nutritional. They have been reported to cause bloat, reduce digestibility of proteins, interfere with uptake of nutrients in the gut, and result in reduced weight gain (Francis et al., 2002). These undesired anti-nutritional effects of saponins negatively affect the efficient use of high-protein legume forages such as alfalfa and clover as animal feeds. Manipulation of the saponin contents in legumes through genetic engineering and/or molecular breeding may provide an efficient way to improve the nutritional values or field performance of legume forages. However, this effort is hindered by our limited understanding of triterpene saponin biosynthesis. In addition, the effects of growth conditions and developmental stages on levels of individual saponins in legumes are still not clear. Information about saponin variation among the many different ecotypes is also lacking. This knowledge would be particularly useful for breeding low-saponin legume forages.

LC-MS based metabolomics is ideally suited for the analysis of saponins in complex plant extracts. It has been successfully used in the analyses of saponins in many legumes including Medicago truncatula (Huhman et al., 2005; Kapusta et al., 2005a,b), M. arborea (Tava et al., 2005), alfalfa (Sen et al., 1998; Bialy et al., 1999), clover (Perez et al., 2013), and soybean. Using LC-Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS), Pollier et al. (2011) revealed a complex mixture of saponins in the hairy roots of M. truncatula. More recently, saponins in 12 annual Medicago species have been profiled and compared (Tava and Pecetti, 2012). The saponin content was found to range from 0.38 to 1.35% (dry weight), depending on the species. In addition, differences in the aglycone moieties were observed among the 12 Medicago species. While some aglycones such as bayogenin and hederagenin were found in all the species, some including medicagenic acid and zanhic acid were species-dependent (Tava and Pecetti, 2012). The large body of MS/MS and NMR data collectively generated over the past years by a number of different groups constitutes an important and valuable resource for saponin annotation in M. truncatula (Bialy et al., 1999; Kapusta et al., 2005a,b; Tava et al., 2005; Pollier et al., 2011).

Saponins are broadly classified into hemolytic (oleanates) and non-hemolytic (soyasapogenol) (**Figure 1**). It is generally believed that the hemolytic activity of oleanate saponins is conferred by the presence of a C28-carboxylic group (Voutquenne et al., 2002). Glycosylation of the C28-carboxylic group dramatically reduces and even eliminates the hemolytic activity. Our understanding of triterpene saponin biosynthesis in M. truncatula is still very limited. All the saponins are believed to derive from beta-amyrin that is formed through the cyclization of 2,3-oxidosqualene catalyzed by beta-amyrin synthase (**Figure 1**) (Tava et al., 2011). Hydroxylation and subsequent oxidation of beta-amyrin lead to multiple pentacyclic triterpene aglycones (or sapogenins), glycosylation of which results in diverse and complex saponins. The hydroxylation and oxidation of aglycones are believed to be catalyzed by cytochrome P450 proteins (Tava et al., 2011). In M. truncatula, a well-established model legume and close relative of alfalfa (Medicago sativa), CYP716A12 has been identified to oxidize beta-amyrin to erythrodiol and then subsequently to oleanolic acid (Carelli et al., 2011; Fukushima et al., 2013). Mutants in CYP716A12, lha (lacking hemolytic activity), were found to lack hemolytic saponins and only produce non-hemolytic soyasaponins (Carelli et al., 2011). Oxidation of oleanolic acid to hederagenin, gypsogenin, and gypsogenic acid was found to be catalyzed by CYP72A68 (Tzin et al., 2019). Formation of non-hemolytic triterpene aglycones 24-hydroxy-beta-amyrin and soyasapogenol B from beta-amyrin is catalyzed subsequently by CYP93E2 and CYP72A61v2 (Fukushima et al., 2013). A cytochrome P450 (CYP72A67) involved in hemolytic sapogenin biosynthesis was also identified. It was found to be responsible for hydroxylation at the C-2 position of oleanolic acid for downstream sapogenin biosynthesis (Biazzi et al., 2015).
However, enzymes involved in the formation of other sapogenins are still unknown. In addition, the effects of the environment on the production of saponins in M. truncatula are not clear. To increase our understanding of saponin accumulation in different ecotypes and provide a basis for selecting appropriate M. truncatula lines for future correlated gene expression analyses and for the discovery of genes involved in triterpene aglycone biosynthesis, we profiled saponins in 1,622 samples representing 201 M. truncatula ecotypes from 14 different countries using UHPLC-MS (**Table 1**). Differential distributions of saponins in roots and aerial tissues were observed. Zanhic acid saponins were found predominantly in leaves, whereas hederagenin and bayogenin saponins were mostly found in roots. The differential spatial accumulation suggests that different classes of saponins may play different roles in plant defense responses. In addition, different ecotypes were found to accumulate different amounts of saponins, with the highest saponin-containing ecotypes found mostly in Europe and the lowest saponin-containing ecotypes found in Africa.

# MATERIALS AND METHODS

# Biological Materials

Seeds were scarified with sulfuric acid, sterilized with 5% bleach and germinated on damp sterile filter paper. Three days after germination, seedlings were transplanted into root cones (Stuewe & Sons Inc.) filled with Turface (BWI Texarkana) which had been rinsed in distilled water and autoclaved. The ecotypes were grown in a growth room under controlled conditions. Day length was 16 h, with a gradual increase in light. The light source included both fluorescent and incandescent bulbs, and light intensity averaged ∼225 µJ. Temperature was set at a constant 21°C. Plants were watered with Broughton and Dilworth (100 ppm N) fertilizer every day. Enough seedlings were planted to analyze four replicates of each ecotype. In most cases, this was accomplished, but a few ecotypes grew poorly and three or fewer replicates were harvested. The large number of plants necessitated growth in three separate groups, and A17 and R108 were grown with each group as controls. Plantings were staggered so that all plants could be harvested during the same morning time frame (3–5 h after full light) to ensure a uniform position in the diurnal photosynthetic cycle. Plants were harvested at 5 weeks of age before the onset of flowering. Roots and aerial tissues were separated and immediately frozen in liquid nitrogen.

TABLE 1 | Geographical origin of the 201 M. truncatula inbred lines used for saponin profiling.

<sup>a</sup>13 out of the 28 lines were from Corsica. <sup>b</sup>2 out of the 16 lines were from Crete. <sup>c</sup>1 out of the 11 lines was from Madeira.

The tissues were lyophilized and dry weights were recorded. All tissues were ground to a fine powder and 10 mg each was weighed for extraction. If plant material was limited, extraction volumes were reduced proportionally. Tissues were incubated with 80:20 methanol:water solution containing 18 µg/ml umbelliferone.

# UHPLC-QTOF-MS

Roots and aerial tissues were lyophilized until dry and ground to a fine powder. Ten milligrams (10 ± 0.06 mg) of powder for all tissues were accurately weighed and extracted on an orbital shaker for 2 h with 1 mL of 80% methanol containing 18 µg/mL umbelliferone as an internal standard. Samples were centrifuged at 2,900 × g for 30 min at 4°C, and the supernatants were collected. Five microliters of the supernatant were injected into a Waters UPLC system coupled to a quadrupole time-of-flight mass spectrometer (QTOF-MS, Waters QTOF Premier). Chromatographic separations were performed on a Waters reverse phase column (2.1 × 150 mm, BEH C18, 1.7 µm particles) using the following gradient: mobile phase B (acetonitrile) increased from 5 to 70% over 30 min, then to 95% in 3 min, held at 95% for 3 min, and returned to 95% mobile phase A (0.05% formic acid in water) for equilibration for 3 min. The flow rate of the mobile phases was 0.56 mL/min, and the column and autosampler temperatures were maintained at 60 and 4°C, respectively. Mass spectral data were acquired from m/z 50 to 2,000 in the negative electrospray ionization mode, with the nebulization gas set at 850 L/h (350°C) and the cone gas at 50 L/h (120°C). Raffinose (m/z 503.1612) was used as the reference compound in the independent lock-mass mode, with the lock mass scan (1 s) collected every 10 s for accurate mass measurements. The concentration of raffinose was 50 fmol/mL, and the flow rate was 0.2 mL/h.
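The gradient described above can be written down as a small lookup, which may be convenient when checking or re-implementing the method. The sketch below is only one reading of the text: it assumes linear ramps between the stated time points and an immediate step back to 5% B (95% A) at 36 min for the 3 min equilibration. The constant and function names are hypothetical.

```python
# (time in min, % mobile phase B) break points taken from the method text.
GRADIENT_POINTS = [(0.0, 5.0), (30.0, 70.0), (33.0, 95.0), (36.0, 95.0)]
EQUILIBRATION_END_MIN = 39.0   # 3 min equilibration after the 95% B hold
INITIAL_PERCENT_B = 5.0        # i.e., 95% mobile phase A
FLOW_RATE_ML_PER_MIN = 0.56


def percent_b(time_min):
    """Return % mobile phase B at a given time (assumes linear ramps)."""
    if time_min < 0.0 or time_min > EQUILIBRATION_END_MIN:
        raise ValueError("time is outside the 0-39 min run")
    if time_min > GRADIENT_POINTS[-1][0]:
        # Assumed immediate return to starting conditions for equilibration.
        return INITIAL_PERCENT_B
    for (t0, b0), (t1, b1) in zip(GRADIENT_POINTS, GRADIENT_POINTS[1:]):
        if t0 <= time_min <= t1:
            return b0 + (b1 - b0) * (time_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: break points cover 0-36 min")


def solvent_volume_ml():
    """Total mobile phase consumed per injection at the stated flow rate."""
    return FLOW_RATE_ML_PER_MIN * EQUILIBRATION_END_MIN


if __name__ == "__main__":
    for t in (0, 15, 30, 33, 36, 38):
        print(f"{t:>4} min: {percent_b(t):.1f}% B")
```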

The raw data files obtained from UHPLC-QTOF-MS analyses were processed with MarkerLynx software (version 4.1, Waters) for mass feature extraction and alignment with the following parameters: minimum peak intensity, 500 counts; mass tolerance, 0.05 Da; and retention time window, 0.2 min. The peak areas were normalized by dividing each peak area by the value of the internal standard peak (area of metabolite/area of internal standard × 1,000). Annotations of metabolites were performed by matching their m/z to those of the previously observed saponins in M. truncatula (Huhman et al., 2005; Kapusta et al., 2005a,b; Pollier et al., 2011). Retention time was also used in the Rt-m/z pair matching when it was available. Tandem MS was performed on a number of saponins, mainly the most abundant ones, to validate the annotations by matching to previously reported MS/MS data. The MS/MS experiments were performed using a UHPLC-Bruker QTOF MS. The tandem data were compared to previously published data for identification confirmation (Kapusta et al., 2005a; Pollier et al., 2011). The normalized data (i.e., data normalized to the internal standard) were used for statistical analyses. Multivariate statistical analyses were performed using JMP software from SAS (Cary, NC). Tukey HSD (Tukey Honest Significant Differences) was performed using the TukeyHSD function in R, and groups that were not significantly different were labeled with the same letters using the HSD.test function in the "agricolae" package.
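The normalization and Rt-m/z matching steps described above can be illustrated with a short sketch. This is not the MarkerLynx workflow itself: the function names are hypothetical, and reusing the 0.05 Da and 0.2 min alignment parameters as symmetric annotation tolerances is an assumption made for illustration. The single library entry uses the glucose-glucose-glucose-bayogenin feature (m/z 973.5069, Rt 12.48 min) reported in the Results.

```python
def normalize_peak_area(metabolite_area, internal_standard_area):
    """Peak area relative to the internal standard, scaled by 1,000."""
    if internal_standard_area <= 0:
        raise ValueError("internal standard area must be positive")
    return metabolite_area / internal_standard_area * 1000


# (name, reference m/z, reference Rt in min or None if Rt is unavailable)
LIBRARY = [
    ("Glc-Glc-Glc-bayogenin", 973.5069, 12.48),
]


def match_feature(mz, rt_min, library, mz_tol=0.05, rt_window=0.2):
    """Return names of library entries matching a feature.

    m/z must agree within mz_tol (Da); retention time is only checked
    when the library entry has one, mirroring "when it was available".
    The tolerance values are assumptions borrowed from the alignment step.
    """
    hits = []
    for name, ref_mz, ref_rt in library:
        if abs(mz - ref_mz) > mz_tol:
            continue
        if ref_rt is not None and abs(rt_min - ref_rt) > rt_window:
            continue
        hits.append(name)
    return hits


if __name__ == "__main__":
    print(normalize_peak_area(500.0, 2000.0))
    print(match_feature(973.52, 12.40, LIBRARY))
```

Candidate annotations obtained this way would still need confirmation by tandem MS, as was done in this study.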

# RESULTS AND DISCUSSION

The M. truncatula ecotypes (201 lines, 1,622 plants) analyzed in this work represented 14 geographical origins, i.e., Algeria, Cyprus, France, Greece, Israel, Italy, Jordan, Libya, Morocco, Portugal, Spain, Syria, Tunisia, and Turkey (**Table 1**). The raw data are freely available for download at https://sumnerlab.missouri.edu/download/. Aerial and root tissues were separated, lyophilized and weighed. The dry weight for each ecotype's aerial and root tissues is shown in **Figure 2**. Almost all the ecotypes (93.3%) were found to produce more aerial tissues than roots by weight. A significant difference in the average total dry weight was observed among the ecotypes (p < 0.0001), with a 90-fold difference between the lowest line (HM095, France, 6.42 ± 9.43 mg, mean ± standard deviation, n = 5) and the highest line (HM174, Spain, 548.98 ± 68.25 mg, mean ± standard deviation, n = 5). There was also a significant positive correlation between the aerial and the root dry weight (r = 0.86, p < 0.0001), indicating that ecotypes producing more aerial tissues also tended to produce more root tissues. The average total dry weight for ecotypes of the same country of origin is shown in **Table 2**. Most of the North African and Middle East ecotypes appeared to produce less biomass compared to the European ecotypes, but the difference was not statistically significant (p = 0.3). This is due to large variations of dry weight among the ecotypes from the same country, as evidenced by the large standard deviation associated with each average dry weight (**Table 2**). Thus, the dry weight of ecotypes, when grown in greenhouse, appeared to be ecotype specific and did not show a statistically significant correlation with geographic origin. This is probably due to the different growth rates of the different ecotypes. For example, the two well-characterized and most widely used ecotypes, A17 and R108, have previously been found to have different seed-to-seed generation times (Hoffmann et al., 1997). The generation time of R108 was 12–14 weeks, which is about 3 weeks shorter than that of A17. Ecotypes with even shorter seed generation times (e.g., 7 weeks shorter than A17) were also found (Hoffmann et al., 1997). As the ecotypes were harvested at the same time and not at the same developmental stage, the different generation times among the ecotypes may contribute to the high variance observed (**Table 2**). Although it was desirable to harvest all the ecotypes at the same developmental stage, it was simply not feasible given the scale of this experiment.
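The correlation between aerial and root dry weight reported above is a Pearson correlation coefficient. A minimal sketch of that calculation is shown below for readers who wish to repeat it on the downloadable data. The function name is hypothetical and the numbers in the usage example are invented placeholders, not values from this study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical per-ecotype dry weights (mg); placeholders only.
    aerial_mg = [120.0, 250.0, 310.0, 400.0]
    root_mg = [40.0, 90.0, 95.0, 150.0]
    print(f"r = {pearson_r(aerial_mg, root_mg):.2f}")
```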

Saponin profiling of the aerial and root tissues was performed in a random order and spanned a period of 5 months (November 2012–March 2013). The data are shown in **Supplementary Tables 1,2**. Reproducibility is always a concern when large-scale experiments of this magnitude are performed. To monitor reproducibility, a blank wash and a quality control (QC) mixture were run every 10 samples to detect any carry-over or changes in instrumental response. An internal standard solution (extraction solution only) was also analyzed every 20 samples. Eighty-seven injections of the internal standard solution were made during the 5 months of analyses, and the responses were used to calculate the relative standard deviation (RSD) to quantify reproducibility. The RSD was determined to be 15.3%, comparable to a previously reported value (15.9%) in a metabolomics project (Kirwan et al., 2013) and below that (20%) recommended for large-scale metabolomics (Food and Drug Administration [FDA], 2001; Zelena et al., 2009). Annotation of saponins was performed by matching the mass features' m/z to those of saponins found in M. truncatula and then confirmed by MS/MS (**Figure 3**). **Figure 3** shows a representative UHPLC-MS chromatogram of the aerial and root tissues of HapMap 135 (line number: L000332, country of origin: Israel). Significant differences were observed between the metabolic profiles of the aerial and root tissues (**Figures 3A,B**). For example, a peak (m/z 973.5069\_Rt12.48 min) found predominantly in the root tissues was annotated as glucose-glucose-glucose-bayogenin and confirmed by tandem MS (**Figures 3C,D**). The relative abundances of saponins (area of metabolite peak normalized to that of the internal standard in the sample) were used for statistical analyses such as principal component analysis (PCA) and multiple mean comparisons.
The results showed that there was a significant difference in the saponin content among different ecotypes (p < 0.0001) as well as between the aerial and the root tissues of the same ecotypes (p < 0.0001). These data indicated that the geographical origin has an impact on the production of saponins in plants and that saponins differentially accumulated in roots and aerial tissues. Their differences and biological implications are discussed below.
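The reproducibility check described above reduces to computing the RSD of the repeated internal standard responses and comparing it with the 20% guideline. The sketch below illustrates this under stated assumptions: it uses the sample standard deviation (n − 1 denominator), which the text does not specify, and the function names and example responses are hypothetical.

```python
import statistics


def relative_standard_deviation(responses):
    """RSD in percent: sample standard deviation divided by the mean."""
    mean = statistics.fmean(responses)
    if mean == 0:
        raise ValueError("RSD is undefined when the mean response is zero")
    return statistics.stdev(responses) / mean * 100.0


def within_guideline(responses, threshold_percent=20.0):
    """True if the RSD does not exceed the chosen acceptance threshold."""
    return relative_standard_deviation(responses) <= threshold_percent


if __name__ == "__main__":
    # Hypothetical internal standard peak areas from repeated injections.
    example_responses = [100.0, 110.0, 90.0]
    print(f"RSD = {relative_standard_deviation(example_responses):.1f}%")
    print("within 20% guideline:", within_guideline(example_responses))
```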

TABLE 2 | Average total dry weights and their standard deviations of ecotypes.


The average total dry weight was calculated by averaging the average weights of all individual M. truncatula lines from the same country of origin.

#### M. truncatula. Bayo: bayogenin; Hex: hexose.

# Saponin Content Among Ecotypes

Principal component analysis was performed for all samples, and means for the total, aerial and root saponin content in ecotypes of the same country of origin were calculated (**Figure 4**). **Figure 4** shows the PCA results (**Figure 4A**), mean of total saponins (**Figure 4B**), mean of aerial saponins (**Figure 4C**) and mean of root saponins (**Figure 4D**) in ecotypes from the same country. There was a significant difference in the total saponin content among the ecotypes of different countries (p < 0.0001). The lowest saponin-containing ecotypes were found to originate from Tunisia and contained only 60% of the amount of saponins in the highest saponin-containing ecotypes (Portugal). **Figure 4B** also shows that the low saponin-containing ecotypes were mostly from Africa, except those originating from Turkey. These ecotypes (Algeria, Libya, Tunisia, and Turkey) contained comparable amounts of total saponins (p = 0.4). This may be explained by similar environments, such as similar climates, seasonal changes and soil conditions, in these countries, as most of them are on the African continent. In contrast, ecotypes originating from Israel and European countries such as France, Italy, Portugal, and Spain typically contained higher amounts of saponins. The two highest saponin-containing ecotypes were from Portugal and Israel. This distinct difference between the African and the European ecotypes indicated a clear geographic segregation of M. truncatula around the Mediterranean area in terms of saponin production, with the ecotypes in the northern region (Europe) segregated from those in the southern region (Africa). Similar segregation has also been observed in a previous microsatellite diversity study of 346 inbred lines of M. truncatula ecotypes that revealed a stratification of the M. truncatula population between the North and the South of the Mediterranean basin (Ronfort et al., 2006). It was further suggested that the M. 
truncatula colonization of the Mediterranean region was via two routes from its original
habitat around the Middle East (Ronfort et al., 2006). However, the molecular mechanism responsible for the difference, i.e., higher saponin contents in European ecotypes and lower saponin contents in the African ecotypes, is not clear. Plant secondary metabolism is complex and influenced by both biotic and abiotic stimuli. Plants under different environments can modulate their secondary metabolism to increase their fitness and adapt to the environments. The effect of environments on plant saponin biosynthesis has been recently reviewed (Szakiel et al., 2011). Both biotic and abiotic factors (e.g., temperature, light, humidity, water availability, soil fertility, insects, herbivores, competition from neighboring plants) and their interactions all can affect saponin biosynthesis. For example, saponin content was found higher in aphid-infested alfalfa compared to the uninfested alfalfa (Goławska et al., 2012). A study of herbivore-induced responses in alfalfa further indicated that saponin content increased with higher herbivore densities (Agrell et al., 2003). Methyl jasmonate treatment that mimics mechanical wounding of plants was also found to increase the production of saponins in M. truncatula (Suzuki et al., 2002), suggesting that grazing also results in higher saponin content in forage legumes. The environmental abiotic factors also affect saponin content significantly. It has been shown that saponin content decreased significantly in medicinal plants exposed to drought stress (Solíz-Guerrero et al., 2002; Zhu et al., 2009) and the application of an appropriate amount of inorganic fertilizer was able to partly restore saponin content in Bupleurum (Zhu et al., 2009). In Brachiaria, the main forage for ruminants cultivated worldwide in both tropical and subtropical climates, the saponin content was found to correlate negatively with the duration of sunshine and maximum ambient temperature, but positively with relative humidity (Lima et al., 2012). 
This suggests that the combination of high temperature, long duration of sunshine and arid conditions in Africa might, at least in part, be responsible for the lower saponin content in the African ecotypes. A recent study using over 20,000 annotated genes from M. truncatula showed that genes involved in defense against pathogens and herbivores constituted the single largest functional class of genes under positive selection in adaptive evolution (Paape et al., 2013). Other genes under positive selection included those involved in mediating the symbiotic relationship with rhizobia and one-third of the annotated histone-lysine methyltransferases that could be involved in epigenetic modifications (Paape et al., 2013). The relatively higher saponin content in European ecotypes may reflect such a positive selection of genes related to disease and defense response, as saponins are anti-feeding and anti-microbial (Da Silva et al., 2012; Goławska et al., 2012). Significant variations in the response of M. truncatula ecotypes to Verticillium albo-atrum, a soil-borne pathogenic fungus, have been reported recently (Ben et al., 2013). Comparison of the resistant and susceptible M. truncatula ecotypes led to the identification of three QTLs associated with resistance to Verticillium wilt in the resistant lines. The resistance appeared to be selected within environments, as it did not seem to correlate with the population structure (Ben et al., 2013).

In addition to the total saponin content, the aerial and root tissues of the African ecotypes were also found to contain significantly less saponin than those of the European ecotypes (p < 0.0001) (**Figures 4C,D**). In the aerial tissues, the lowest saponin-containing ecotypes were from Algeria. They contained only 60% of the saponins found in the highest saponin-containing ecotypes (Israel). In roots, Tunisia's ecotypes contained the least saponin, only 55% of that in the highest saponin-containing ecotypes (Portugal). **Figure 4** also reveals that roots had a higher saponin content than the aerial parts. For example, compared to their aerial parts, the roots of Portugal's ecotypes contained about twice the amount of saponins (**Figures 4C,D**). The differences between the aerial and the root tissues were not only quantitative but also qualitative, as evidenced by the distinct saponin profiles in the aerial parts (**Figure 3A**) and the roots (**Figure 3B**). This was further supported by PCA, which showed that the aerial tissues were clearly segregated from the root tissues (**Figure 4A**), suggesting that the difference between the aerial and the root tissues is greater than the difference among ecotypes. The differential accumulation of saponins in the aerial and the root tissues may reflect the different roles of these saponins in plant fitness and defense response as discussed below.
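To illustrate how a tissue contrast of this kind can dominate a PCA, the following sketch computes the first principal component of a small, entirely synthetic abundance table. It assumes autoscaled variables and uses power iteration on the correlation matrix, which is a generic textbook approach and not necessarily the software or settings used for Figure 4A; all names and numbers are hypothetical.

```python
from statistics import mean, stdev


def autoscale(rows):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]
    return [[(x - c) / s for x, c, s in zip(row, centers, spreads)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Return PC1 loadings and sample scores via power iteration."""
    z = autoscale(rows)
    n, p = len(z), len(z[0])
    # Correlation matrix of the autoscaled variables.
    corr = [[sum(z[k][i] * z[k][j] for k in range(n)) / (n - 1) for j in range(p)]
            for i in range(p)]
    loading = [1.0] + [0.0] * (p - 1)  # arbitrary starting vector
    for _ in range(iterations):
        product = [sum(corr[i][j] * loading[j] for j in range(p)) for i in range(p)]
        norm = sum(v * v for v in product) ** 0.5
        loading = [v / norm for v in product]
    scores = [sum(x * w for x, w in zip(row, loading)) for row in z]
    return loading, scores


# Synthetic relative abundances (rows: samples; columns: three saponin classes).
# The values are invented to mimic a tissue contrast and are not study data.
aerial = [[5.0, 0.2, 2.0], [5.5, 0.1, 2.2], [4.8, 0.3, 1.9]]
root = [[0.1, 4.0, 2.1], [0.2, 4.4, 2.0], [0.0, 3.8, 2.3]]
loading, scores = first_principal_component(aerial + root)
```

In this toy table the first two columns differ strongly between tissues, so the PC1 scores of the two groups take opposite signs, which is the kind of separation the text describes.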

# Differential Accumulation of Saponins in the Aerial and Root Tissues

The results of the saponin profiling in the aerial and the root tissues are shown in **Figure 5**. **Figure 5** shows the accumulation of individual classes of saponins (i.e., bayogenin glycosides, hederagenin glycosides, medicagenic acid glycosides, soyasaponin B glycosides, soyasaponin E glycosides, and zanhic acid glycosides) in the aerial (Y-axis) and the root (X-axis) tissues of the ecotypes. It reveals clear differences in the spatial accumulation of saponins. Regardless of the country of origin, bayogenin glycosides, hederagenin glycosides and soyasaponin E glycosides were mostly found in the roots, while zanhic acid glycosides were only detected in the aerial tissues. In contrast, medicagenic acid glycosides and soyasaponin B glycosides were found in both the aerial and root tissues, although the aerial tissues appeared to contain more medicagenic acid glycosides. This finding is similar to a previous report that zanhic acid glycosides, medicagenic acid glycosides and soyasaponin B glycosides were the three dominant groups of saponins in M. truncatula foliar tissues (Kapusta et al., 2005b; Confalonieri et al., 2009). It is also consistent with several previous reports that zanhic acid was the major aglycone in the aerial parts of M. truncatula (Kapusta et al., 2005a,b) and that zanhic acid glycosides could not be detected in the roots of M. truncatula (Confalonieri et al., 2009). A more recent study of M. truncatula hairy roots showed that zanhic acid glycosides were also absent from hairy roots (Pollier et al., 2011). Zanhic acid glycosides are therefore leaf-specific saponins. This suggests that the P450 enzyme(s) involved in the biosynthesis of zanhic acid can best be studied using leaf tissues. Comparisons of P450 gene expression profiles between the aerial and root tissues may facilitate the identification of enzymes converting medicagenic acid into zanhic acid. 
Zanhic acid formation is the last step of the sapogenol biosynthesis pathway (**Figure 1**). The accumulation of zanhic acid glycosides and the lack of other saponins from the earlier steps of the pathway in the aerial tissues suggest that the physiological role of zanhic acid glycosides cannot be substituted for by other saponins such as hederagenin and bayogenin glycosides. The differential accumulation of saponins in the aerial and the root tissues may reflect the effect of the different environments on the aerial parts and the roots. Compared to roots, one of the unique stresses that the aerial parts face is herbivore feeding and wounding. This suggests that zanhic acid glycosides may be potent anti-feeding metabolites. Indeed, some zanhic acid glycosides have been found to be the most bitter and throat-irritating components among the complex saponins in alfalfa (Oleszek et al., 1992) and the most active compounds in disrupting the transmural potential difference in mammalian small intestine (Oleszek et al., 1994). Therefore, zanhic acid glycosides are generally considered anti-nutritional and the major anti-feeding agents against herbivores in leaves. Similar to alfalfa, medicagenic acid glycosides were also found in the aerial parts of M. truncatula ecotypes, and their amounts were higher in the aerial parts than in the roots (**Figure 5**). Unlike zanhic acid glycosides, medicagenic acid glycosides have been reported to possess, in addition to the anti-feeding property, a broad and strong anti-microbial activity (Oleszek et al., 1990; Oleszek, 1996; Avato et al., 2006; Jarecka et al., 2008). For example, medicagenic acid glycosides strongly inhibited the growth of Trichoderma viride, a fungus that is highly sensitive to alfalfa saponins and has been traditionally used to quantify saponins (Zimmer et al., 1967). In contrast, zanhic acid glycosides were found inactive against a wide range of fungi including T. 
viride, even at higher concentrations (Oleszek et al., 1992; Oleszek, 1996). Because the difference between medicagenic acid and zanhic acid is the presence of a C16-hydroxy group in zanhic acid, it has been suggested that the C16-hydroxy group is responsible for the strong bitter taste but low anti-fungal activity of zanhic acid glycosides (Oleszek, 1996). Soyasaponin B glycosides were also found to accumulate in the aerial parts and the roots (**Figure 5**). This is consistent with the previous report that soyasaponin B glycosides were found in both aerial and root tissues of alfalfa (Sen et al., 1998). Soyasaponin B glycosides are non-hemolytic saponins. They have been reported to possess anti-feeding and antifungal activities. The combination of saponins of distinct functions in leaves provides an excellent defense against herbivores and fungal attacks. Indeed, incorporation of dried alfalfa leaf tissue into the diet of European corn borer (Ostrinia nubilalis) larvae significantly inhibited their growth and development. In contrast, saponin fractions isolated from alfalfa root tissues, when incorporated into the diet at equivalent concentrations, had little effect on larval development although they inhibited larval growth (Nozzolillo et al., 1997). Feeding Spodoptera littoralis (Egyptian cotton leafworm) larvae a diet supplemented with saponins isolated from alfalfa also significantly reduced their growth, fecundity and fertility and increased their mortality (Adel et al., 2000). Medicagenic acid and its glycosides were found to be much more effective in inhibiting larvae than hederagenin and its glycosides, which normally accumulate in roots. In a more recent study, the number of aphids infesting alfalfa was found to be inversely related to the contents of zanhic acid and medicagenic acid glycosides in the leaves (Goławska et al., 2012). It was also demonstrated that these compounds were induced upon aphid infestation, indicating their anti-feeding properties. Taken together, these findings suggest that saponins in leaves have been tailored to defend against herbivore feeding through increased bitterness and toxicity.

Compared to leaf saponins, root saponins consisted of a different set of triterpene glycosides, with the major difference being the absence of zanhic acid glycosides and the presence of bayogenin glycosides and hederagenin glycosides (**Figure 5**). This suggests that the P450s responsible for converting medicagenic acid to zanhic acid are not active in roots. This is not surprising, as leaves and roots are in different environments and may require different saponins to respond to different abiotic and biotic stresses. The absence of the bitter zanhic acid glycosides in roots suggests that anti-feeding is less important in roots than in leaves. The abundance of bayogenin and hederagenin glycosides found in roots suggests that these root-specific compounds are important to root fitness. We hypothesize that they likely serve important roles in defense against soil pathogenic microbes. Indeed, saponins from Medicago hybrida roots were found to substantially inhibit six pathogenic fungi: Botrytis cinerea, Botrytis tulipae, Fusarium oxysporum f. sp. callistephi, F. oxysporum f. sp. narcissi, Phoma narcissi, and F. oxysporum Schlecht (Saniewska et al., 2006). Similarly, saponins isolated from roots of alfalfa were also found to have strong anti-fungal activity (Jarecka et al., 2008). The strong antifungal activity was attributed to some hederagenin and medicagenic acid glycosides (Saniewska et al., 2006; Jarecka et al., 2008). Bayogenin glycosides have also been reported to possess anti-fungal activity (Martyniuk and Biały, 2008). Their inhibitory effects against Cephalosporium gramineum, a soil fungus that infects roots of plants, were markedly higher than those of their hederagenin glycoside counterparts (Martyniuk and Biały, 2008). Bayogenin differs from hederagenin only in that it possesses a hydroxyl group at the C2 position (**Figure 1**). The hydroxyl groups at the C2 and C3 positions are important for antifungal activities, possibly due to the increased polarity and solubility (Oleszek, 1996). Selective methylation of the hydroxyl groups in medicagenic acid showed that the hydroxyl group at the C3 position is essential for antifungal activity (Levy et al., 1989). While glycosylation of the C3 hydroxyl group did not affect the overall antifungal activity, methylation or acetylation resulted in a significant loss of the antifungal activity. 
This suggests that the polarity of these compounds is important in their antifungal activity. The spatially differential accumulation of saponins in the aerial parts and roots shows that plant secondary metabolism is flexible and adaptable. Different tissues of the same plant can accumulate different sets of metabolites to increase their fitness and adapt to the environments. In M. truncatula, the bitter herbivore deterrent zanhic acid glycosides are predominantly found in the aerial tissues while the anti-fungal agents such as hederagenin and bayogenin glycosides are mostly found in roots to defend against soil borne fungi. In contrast, medicagenic acid glycosides, the broad spectrum and strong anti-microbial compounds, are accumulated in both the aerial and root tissues.

# CONCLUSION

Saponin profiling of 201 M. truncatula ecotypes revealed a clear differential spatial accumulation of saponins in the aerial parts relative to the roots. Zanhic acid glycosides were only found in the aerial tissues. In contrast, bayogenin glycosides, hederagenin glycosides, and soyasaponin B glycosides were predominantly accumulated in roots, thus suggesting interesting ecological roles for these compounds in plant defense. Overall, zanhic acid glycosides, medicagenic acid glycosides and soyasaponin B glycosides were the three major triterpene saponins found in the aerial parts, while medicagenic acid glycosides, bayogenin glycosides, hederagenin glycosides, SoyE glycosides, and SoyB glycosides constituted the major root saponins. Although medicagenic acid glycosides were found in both the aerial and root tissues, aerial parts were found to contain more medicagenic acid glycosides. Significant correlation between the quantity of saponins and the ecotypes' country of origin was observed. The European ecotypes were found to contain higher content of saponins than most of the African and the Middle-East ecotypes. This dataset also represents an extremely valuable resource for discovery of biosynthetic genes and deciphering saponin biosynthesis. Based on the correlation analysis of the metabolic profiling data and gene expression data, a number of P450 genes have been selected for further characterization to elucidate their roles in saponin biosynthesis in M. truncatula (Tzin et al., 2019).

# DATA AVAILABILITY

All processed data for this study are included in the manuscript and/or the **Supplementary Files**.

# AUTHOR CONTRIBUTIONS

LS conceptualized and designed the experiments, and secured the funding. BW and DY grew, harvested, dried, and ground the samples. DH extracted the ground samples and collected the LCMS data. ZL processed the data and drafted the manuscript. All authors were involved in the implementation of the experimental design and contributed to the final version of the manuscript.

# FUNDING

The Sumner laboratory has been graciously supported by several entities over the years for the development of natural products profiling and plant metabolomics. For this project, these specifically included NSF MCB Award 1024976 and the Oklahoma Center for the Advancement of Science and Technology (OCAST) #PSB10-027. The Sumner lab was also supported by the University of Missouri, The Samuel Roberts Noble Foundation, Bruker Daltonics GmbH, the NSF-JST Metabolomics for a Low Carbon Society, IOS Awards IOS-1139489 and IOS-1639618, the NSF MRI DBI Award 1126719, and the NSF RCN Award 1340058.

# ACKNOWLEDGMENTS

We gratefully acknowledge Shelagh Henson for her assistance in the plant care and harvest.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.00850/full#supplementary-material

# REFERENCES




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Lei, Watson, Huhman, Yang and Sumner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Assessing Chemical Diversity in Psilotum nudum (L.) Beauv., a Pantropical Whisk Fern That Has Lost Many of Its Fern-Like Characters

#### Dunja Šamec<sup>1,2</sup>, Verena Pierz<sup>1,3</sup>, Narayanan Srividya<sup>1</sup>, Matthias Wüst<sup>3</sup> and B. Markus Lange<sup>1</sup>\*

<sup>1</sup> Institute of Biological Chemistry and M.J. Murdock Metabolomics Laboratory, Washington State University, Pullman, WA, United States, <sup>2</sup> Ruđer Bošković Institute, Zagreb, Croatia, <sup>3</sup> Chair of Bioanalytics, Institute of Nutritional and Food Sciences, University of Bonn, Bonn, Germany

### Edited by:

Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan

#### Reviewed by:

A. Daniel Jones, Michigan State University, United States Zhigang Yang, Lanzhou University, China

> \*Correspondence: B. Markus Lange lange-m@wsu.edu

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 12 March 2019 Accepted: 18 June 2019 Published: 09 July 2019

#### Citation:

Šamec D, Pierz V, Srividya N, Wüst M and Lange BM (2019) Assessing Chemical Diversity in Psilotum nudum (L.) Beauv., a Pantropical Whisk Fern That Has Lost Many of Its Fern-Like Characters. Front. Plant Sci. 10:868. doi: 10.3389/fpls.2019.00868

Members of the Psilotales (whisk ferns) have a unique anatomy, with conducting tissues but lacking true leaves and roots. Based on recent phylogenies, these features appear to represent a reduction from a more typical modern fern plant rather than the persistence of ancestral features. In this study, extracts of several Psilotum organs and tissues were analyzed by Gas Chromatography – Mass Spectrometry (GC-MS) and High Performance Liquid Chromatography – Quadrupole Time of Flight – Mass Spectrometry (HPLC-QTOF-MS). Some arylpyrones and biflavonoids had previously been reported to occur in Psilotum, and these metabolite classes were found to be prominent constituents in the present study. Some of these were enriched and further characterized by Nuclear Magnetic Resonance (NMR) spectroscopy. HPLC-QTOF-MS and NMR data were searched against an updated Spektraris database (expanded by incorporating over 300 new arylpyrone and biflavonoid spectral records), which aided significantly with peak annotation. Principal Component Analysis (PCA) with combined GC-MS and HPLC-QTOF-MS data sets obtained with several Psilotum organs and tissues indicated a clear separation of the sample types. The principal component scores for below-ground rhizome samples corresponded to the vectors for carbohydrate monomers and dimers and small organic acids. Above-ground rhizome samples had principal component scores closer to the direction of vectors for arylpyrone glycosides and sucrose (which had high concentrations in above- and below-ground rhizomes). The unique position of brown synangia in a PCA plot correlated with the vector for biflavonoid glycosides. Principal component scores for green and yellow synangia correlated with the direction of vectors for arylpyrone glycosides and biflavonoid aglycones. Localization studies with cross sections of above-ground rhizomes, using Matrix-Assisted Laser Desorption/Ionization – Mass Spectrometry (MALDI-MS), provided evidence for a preferential accumulation of arylpyrone glycosides and biflavonoid aglycones in cells of the chlorenchyma. Our results indicate a differential localization of metabolites with potentially tissue-specific functions in defenses against biotic and abiotic stresses. The data are also a foundation for follow-up work to better understand chemical diversity in the Psilotales and other members of the fern lineage.

Keywords: arylpyrone, biflavonoid, mass spectrometry, metabolomics, nuclear magnetic resonance, whisk fern

# INTRODUCTION


Free-sporing vascular plants encompass two distinct evolutionary lineages, the lycophytes and ferns, with the latter resolved as more closely related to seed plants (Kenrick and Crane, 1997; Pryer et al., 2001). Whisk ferns (order Psilotales), which comprise two genera (Psilotum and Tmesipteris) in the family Psilotaceae, have conducting tissues but no veins, and lack true leaves and roots. Water and mineral absorption occurs through underground, horizontally creeping rhizomes, sometimes in association with symbiotic fungi (mycorrhizae) (Ducket and Ligrone, 2005). Plants grow mostly as epiphytes (using other plants as physical support) in moist habitats. The stem-like aerial portion of rhizomes of members of the Psilotaceae is covered by an epidermis, followed inward by extensive cortical areas, a single-layered endodermis, and a thick-walled protostele that accommodates the water and nutrient-conducting tissues (Pittermann et al., 2011). The epidermal layer of the photosynthetic above-ground rhizomes contains stomata for gas exchange (Nilsen, 1995). In the genus Psilotum, above-ground rhizomes have many branches with scale-like appendages called enations. These structural outgrowths resemble miniature leaves but, unlike true leaves, have no internal vascular tissues. Above these enations, positioned laterally along the distal portions of aerial shoots, are spore-containing synangia, which result from the fusion of three adjacent sporangia (Renzaglia et al., 2001).

Because of its unusual anatomical characteristics, P. nudum was traditionally thought to be descended from the earliest vascular plants (Banks, 1975), and conflicting views regarding the placement of the Psilotales remained in the literature for decades. Recent phylogenies based on both morphological characters and extensive sequence data provided strong evidence that Psilotales, Ophioglossales (moonworts) and Marattiales (king ferns) – all eusporangiate ferns – form a monophyletic clade that is sister to leptosporangiate ferns, the largest group of living ferns (Doyle, 2018; Rothwell et al., 2018). The unique anatomy of extant Psilotales therefore appears to represent a reduction from a more typical modern fern plant rather than the persistence of ancestral features. While recent progress has been made with regard to resolving the classification of vascular plants, there is still a notable lack of knowledge regarding the phytochemical diversification associated with the adaptive radiation of ferns.

We selected Psilotum nudum (L.) Beauv. to evaluate chemical diversity in the fern lineage, as only limited knowledge exists on this topic. Psilotin and 3'-hydroxypsilotin are unusual C<sub>11</sub> arylpyrone glycosides unique to the Psilotaceae (McInnes et al., 1965; Tse and Towers, 1967; Balza et al., 1985; Takahashi et al., 1990). Psilotic acid is a C6-C4 organic acid that is structurally related to the psilotin aglycone (psilotinin) (Shamsuddin et al., 1985). Prominent flavonoid glycosides in the Psilotaceae are O-glucosides of the biflavonoid, amentoflavone, and C- and O-glycosides of the flavone, apigenin (Cooper-Driver, 1977; Wallace and Markham, 1978; Markham, 1984). A survey across sixteen pteridophytes (ferns and fern allies), including P. nudum, concluded that the sterol composition is generally similar to that of spermatophytes (seed plants), with β-sitosterol, campesterol and stigmasterol as principal constituents (Chiu et al., 1988). P. nudum tissues were also demonstrated to contain representatives of several phytohormone classes (auxins, cytokinins and gibberellins) (Takahashi et al., 1984; Abul et al., 2010). In this pilot study, which is the beginning of efforts to chart out the most abundant classes of specialized metabolites in ferns, we demonstrate the utility of multi-platform analyses for capturing the unique chemical fingerprints of different P. nudum organs and tissues. In addition, we report the tissue-level localization of the most prominent arylpyrone glycoside and biflavonoid constituents.

# MATERIALS AND METHODS

# Chemicals and Solvents

Solvents for extraction and chromatography were of the highest commercial grade and obtained from Sigma-Aldrich (St. Louis, MO, United States). Deuterated solvents for nuclear magnetic resonance (NMR) spectroscopy were obtained from Cambridge Isotope Laboratories Inc. (Andover, MA, United States), with details in **Table 2**. All authentic standards, reference materials (red phosphorus, α-cyano-4-hydroxycinnamic acid, 9-anthracenecarboxylic acid, sinapic acid and vanillic acid) and reagents (N-methyl-N-(trimethylsilyl)trifluoroacetamide) were generally purchased from Sigma-Aldrich (St. Louis, MO, United States); exceptions: 2,5-dihydroxybenzoic acid (TCI America, Portland, OR, United States) and leucine enkephalin (Waters, Milford, MA, United States).

# Plant Growth

Psilotum nudum (L.) P. Beauv. plants had been established from rhizomes roughly 6 years before the initiation of the experiments described here. A voucher specimen was deposited with the John G. Searle Herbarium of the Field Museum (Chicago, IL, United States). Plants were maintained in a greenhouse under ambient lighting, with supplemental lighting during winter months provided by high-intensity discharge lamps. The daily light integral varied from 15 to 25 mol m<sup>−2</sup> d<sup>−1</sup>. Temperatures ranged between 22 and 27 °C and the humidity was set to 70%. At the time of harvesting, P. nudum produced synangia that, based on color (green, yellow or brown), could be differentiated into three developmental stages (immature, mature, and senescent). Five biological replicates were harvested at the same time of day (11:00 AM, Pacific Daylight Time) for the following organs: below-ground rhizome, above-ground rhizome (stem), and (separately) green, yellow and brown synangia. Samples were immediately frozen in liquid nitrogen and freeze-dried (aerial parts for 5 days, rhizomes for 7 days). Lyophilized material was homogenized to a fine powder under liquid nitrogen using mortar and pestle. Defined quantities of homogenate were weighed out, placed in a 2 ml microfuge tube, and stored as aliquots at –20 °C until further use.

**Abbreviations:** AMT tag, accurate mass-time tag; CHCA, α-cyano-4-hydroxycinnamic acid; DBA, 2,5-dihydroxybenzoic acid; GC–MS, gas chromatography – mass spectrometry; HPLC–QTOF–MS, high performance liquid chromatography – quadrupole time of flight – mass spectrometry; MALDI–MS, matrix-assisted laser desorption/ionization – mass spectrometry; NMR, nuclear magnetic resonance; PC, principal component; PCA, principal component analysis.

# Metabolite Extraction and Derivatization for Analysis by Gas Chromatography – Mass Spectrometry

Frozen tissue homogenate from each sample (15 ± 3 mg) was transferred to 8 ml glass tubes and overlaid with 700 µL methanol (containing myristic acid-d<sub>27</sub> (CDN Isotopes, Quebec, Canada) as internal standard at 1.5 mg/ml) and 25 µL water. Tubes were capped tightly and heated in a water bath to 70 °C for 15 min, centrifuged for 2 min at 3,500 × g, and supernatants transferred to new 8 ml glass vials. To each supernatant, 700 µL of water and 375 µL of chloroform were added and the contents of the tube mixed with a multi-tube vortexer (VWR Scientific, South Plainfield, NY, United States) for 15 min at a speed setting of 4. Extracts were centrifuged for 15 min at 3,500 × g, the upper aqueous phase was combined with the first methanol extract (henceforth referred to as aqueous methanol extract), and the lower organic phase was collected separately (chloroform extract). The two extracts were separately evaporated to dryness [Vacufuge Plus for aqueous methanol extract (Eppendorf, Hauppauge, NY, United States); EZ-Bio Evaporator for chloroform extract (GeneVac LTD, Ipswich, United Kingdom)]. Dried samples were derivatized just-in-time by adding 10 µL of a 40 mg/ml solution of methoxyamine hydrochloride in pyridine and shaking gently at 30 °C for 90 min, then adding 50 µL of N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA; Sigma-Aldrich, St. Louis, MO, United States) and shaking gently at 37 °C for 30 min. Samples were allowed to cool to room temperature, and the extract was transferred to a glass insert, which was then placed in a 2 ml glass reaction vial.

# Gas Chromatography – Mass Spectrometry Analysis

Gas chromatography – mass spectrometry (GC–MS) was performed under the following conditions: injection volume: 1 µL (splitless mode); GC instrument: 6890N (Agilent Technologies, Santa Clara, CA, United States); column: DB-5MS + DG (30 m × 0.25 mm × 0.25 µm; J&W Scientific, Santa Clara, CA, United States); inlet temperature: 250°C; temperature program: start at 60°C, ramp to 320°C at 3°C/min, hold for 10 min; retention time locking: myristic acid-d<sub>27</sub> at 42.06 min at an inlet pressure of 10.65 psi; MS instrument: 5975 MSD (Agilent Technologies, Santa Clara, CA, United States); transfer line temperature: 250°C; electron ionization at 70 eV. Data analysis was performed using ChemStation, version E.02.00.493 (Agilent Technologies, Santa Clara, CA, United States). Custom spectral databases (specifying retention time, a quantification signal and three qualifier ions) were created using authentic standards from our in-house library for the identification of GC–MS peaks (**Supplementary Table S1**). Peaks generated by unidentified analytes were annotated based on community reporting guidelines (Bino et al., 2004; Fiehn et al., 2007). Raw data values were normalized for sample weight and for the signal intensity associated with the internal standard. Normalized data values were z-transformed (autoscaled) prior to statistical analyses.
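The normalization and autoscaling steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the processing that was carried out in ChemStation: the function name `normalize_and_autoscale`, the array layout (samples in rows, metabolites in columns), and the choice to divide each raw value by the product of internal-standard signal and sample weight are assumptions introduced here.

```python
import numpy as np

def normalize_and_autoscale(peak_areas, istd_areas, sample_weights_mg):
    """Normalize raw peak areas, then z-transform each metabolite.

    peak_areas        : (n_samples, n_metabolites) raw signal values
    istd_areas        : (n_samples,) internal-standard signal per sample
    sample_weights_mg : (n_samples,) weighed homogenate per sample
    (Names and layout are hypothetical, chosen for this sketch.)
    """
    peak_areas = np.asarray(peak_areas, dtype=float)
    # Assumed normalization: divide by internal-standard signal and by weight.
    scale = (np.asarray(istd_areas, dtype=float)
             * np.asarray(sample_weights_mg, dtype=float))
    normalized = peak_areas / scale[:, None]

    # Autoscaling: subtract the column mean, divide by the column
    # standard deviation (sample standard deviation, ddof=1).
    mean = normalized.mean(axis=0)
    std = normalized.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant columns are mapped to zero
    return (normalized - mean) / std
```

After this transformation every metabolite column has a mean of zero and unit standard deviation, so that abundant and trace metabolites contribute on the same scale to downstream multivariate analyses.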

# Metabolite Extraction for High Performance Liquid Chromatography – Quadrupole Time-of-Flight – Mass Spectrometry

Frozen tissue homogenate from each sample (30 ± 5 mg) was transferred to a 2 ml reaction tube and extracted with 1 ml of 80% aqueous methanol (containing 6 mg/L anthracene-9-carboxylic acid as internal standard) for 10 min [multi-tube vortexer (VWR Scientific, South Plainfield, NY, United States) at highest speed setting] and subsequent sonication for 20 min (ultrasonic bath at highest intensity setting, Fisher Scientific, Hampton, NY, United States). Following centrifugation for 10 min at 13,000 × g, supernatants were filtered through 0.22 µm polypropylene filter material and collected in plastic inserts for 2 ml reaction vials.

# High Performance Liquid Chromatography – Quadrupole Time-of-Flight – Mass Spectrometry Analysis

High Performance Liquid Chromatography – Quadrupole Time-of-Flight – Mass Spectrometry (HPLC–QTOF–MS) was performed under the following conditions: HPLC system: 1290 system (Agilent Technologies, Santa Clara, CA, United States) consisting of thermo-controlled autosampler (set to 4°C), binary pump (operated at 0.6 ml/min), isocratic pump [operated at 0.1 ml/min flow rate to introduce a reference mass solution containing 300 nM purine (exact mass 120.043596 g/mol) and 250 nM hexakis-(1H,1H,3H-tetrafluoropropoxy)-phosphazine (exact mass 921.002522 g/mol) in acetonitrile/water (95:5; v/v)], thermo-controlled column compartment (set to 60°C), and diode array detector (scanning range 210–600 nm, resolution 1.2 nm); injection volume: 10 µl; column: HD Zorbax SB-Aq (100 × 2.1 mm; 1.8 µm particle size, Agilent Technologies, Santa Clara, CA, United States); mobile phase: 0.1% (v/v) formic acid in water (solvent A) and 0.1% (v/v) formic acid in acetonitrile (solvent B); gradient: 5% B at start, followed by linear gradients to 10% B at 5 min, 20% B at 10 min, 80% B at 35 min, and 95% B at 45 min; QTOF–MS instrument: 6530 series with electrospray ion source (Agilent Technologies, Santa Clara, CA, United States); polarity: positive; drying gas flow rate: 10 L/min; drying gas temperature: 325°C; nebulizer pressure: 2.4 bar; m/z range: 100–1,200 (high gain mode); scan rate: 1.4 scans/s for MS and 4 scans/s for MS/MS. Data analysis was performed using the MassHunter Workstation software package (B.07.00, Qualitative Analysis and B.06.00, Profinder; Agilent Technologies, Santa Clara, CA, United States). For each detected peak, molecular feature extraction [considering retention time (tolerance window 1.30 s) and high mass accuracy (m/z tolerance window 10 ppm)], deconvolution, and alignment across samples were performed using the recursive feature extraction algorithm (settings: threshold of 10,000 counts and peak spacing tolerance of 0.0025 m/z). Quasi-molecular ions and adducts were considered ([M+H]<sup>+</sup>, [M+Na]<sup>+</sup>, [M+K]<sup>+</sup>, [M+NH4]<sup>+</sup>), as were the corresponding dimers. The minimum absolute height required for feature extraction in the recursive step was set to 10,000 counts (sum of all peaks for a given molecular entity), which had to be fulfilled in at least three of five biological replicates. The global filter was limited to 2,000 results. Peak annotation was performed based on a combination of chromatographic and mass spectral data (accurate mass and MS/MS fragmentation patterns), evaluation of the literature, and searches against spectral databases (**Table 1**). Peaks generated by unidentified analytes were annotated based on community reporting guidelines (Bino et al., 2004; Fiehn et al., 2007). MS/MS spectra for identified peaks were submitted to MassBank (Horai et al., 2010) to expand a widely used community spectral resource.
Normalized data values for HPLC–QTOF–MS peaks were z-transformed (autoscaled) and combined with the normalized and z-transformed GC–MS data (**Supplementary Table S2**). The combined HPLC–QTOF–MS and GC–MS data set was processed by Principal Component Analysis (PCA) using the R statistical package<sup>1</sup>, for which the settings and outcomes are summarized in **Supplementary Table S3**.
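For readers who wish to reproduce the general workflow, combining two autoscaled data blocks and subjecting them to PCA can be sketched as follows. The analysis reported here was carried out in R; the Python/NumPy code below is only an illustrative equivalent based on singular value decomposition. The function name `pca_svd`, the matrix dimensions, and the random placeholder data are assumptions made for this sketch and do not represent the study data or the settings in **Supplementary Table S3**.

```python
import numpy as np

def pca_svd(z, n_components=3):
    """PCA of an autoscaled matrix z (samples x variables) via SVD."""
    z = np.asarray(z, dtype=float)
    z = z - z.mean(axis=0)                      # re-center defensively
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = vt[:n_components].T                    # variable weights
    explained = s**2 / np.sum(s**2)                   # variance fractions
    return scores, loadings, explained[:n_components]

# Placeholder data (random numbers, NOT study data):
# 25 samples (5 sample types x 5 replicates), two analytical platforms.
rng = np.random.default_rng(0)
gcms = rng.normal(size=(25, 8))     # hypothetical GC-MS block
lcms = rng.normal(size=(25, 12))    # hypothetical HPLC-QTOF-MS block

combined = np.hstack([gcms, lcms])
combined = (combined - combined.mean(axis=0)) / combined.std(axis=0, ddof=1)

scores, loadings, explained = pca_svd(combined)
```

The `scores` array holds the coordinates of each sample on the first three PCs, and `explained` gives the fraction of the total variance captured by each of them.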

# Metabolite Isolation and Analysis by Nuclear Magnetic Resonance Spectroscopy

Above-ground biomass from P. nudum was harvested and homogenized to a fine powder in the presence of liquid nitrogen. A 300 mg aliquot of the homogenate was extracted with 10 ml of 80% aqueous methanol by vigorous mixing for 10 min (Vortex Mixer, VWR Scientific, South Plainfield, NY, United States; operated at highest speed setting) and subsequent sonication in an ultrasonic bath for 20 min. Following centrifugation of this mixture for 10 min at 13,000 rpm, the supernatant was recovered and filtered through a 0.22 µm polypropylene filter. The extract was stored at –20°C until further processing. Aliquots (100 µl each) of the filtered extracts were injected onto a C18 reversed-phase column and absorbance at 280 and 360 nm was monitored (1100 Series HPLC system; Agilent Technologies, Santa Clara, CA, United States). The mobile phase consisted of two solvents [A: 0.2% (v/v) acetic acid in water; B: 0.2% (v/v) acetic acid in methanol] and the separation of metabolites was achieved using the following gradient: 2% B at start, with a series of linear gradients to 35% B at 10 min, 60% B at 21 min, 90% B at 40 min, and 98% B at 50 min. The flow rate was set to 1.3 ml/min. Trial runs indicated when metabolites of interest eluted, and fractions were collected accordingly. The eluents of several runs were pooled and each of these fractions was evaporated to dryness in a rotary evaporator. Each residue was dissolved in a deuterated solvent and NMR spectra were acquired with the settings listed in **Supplementary Table S4**. Spectral records for biflavonoids and arylpyrones were generated based on information extracted from the literature (listing in **Supplementary Table S5**) and integrated into the Spektraris database (Cuthbertson et al., 2013; Fischedick et al., 2015). The combined spectral data from HPLC–QTOF–MS and NMR were then used to search for matches in the Spektraris online resource (**Table 2**).
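The multi-step linear gradient used for fraction collection can be expressed as a piecewise-linear function of time, which is convenient when estimating the solvent composition at which a metabolite of interest elutes. The following is a minimal sketch: the breakpoints are those stated above, whereas the names `GRADIENT` and `percent_b` are hypothetical, and the calculation ignores the system dwell volume (an assumption of this illustration).

```python
import numpy as np

# Gradient breakpoints as stated in the text: (time in min, % solvent B)
GRADIENT = [(0, 2), (10, 35), (21, 60), (40, 90), (50, 98)]

def percent_b(t_min, gradient=GRADIENT):
    """Programmed % solvent B at time t_min (linear between breakpoints)."""
    times, pct = zip(*gradient)
    # np.interp holds the first/last value outside the programmed range.
    return float(np.interp(t_min, times, pct))
```

For example, `percent_b(5)` returns the composition halfway along the first linear segment (between 2% and 35% B).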

# Metabolite Imaging by Matrix-Assisted Laser Desorption/Ionization – Mass Spectrometry

Psilotum nudum above-ground rhizomes were cross-sectioned into 2 cm segments, embedded in 3% (w/v) agarose, and stored at –80°C until further processing. On the day of the metabolite imaging analysis, the chamber of a CM 1950 Cryostat (Leica Biosystems, Buffalo Grove, IL, United States) was set to –20°C, embedded samples were sectioned to 30 µm thickness, and sections were immediately transferred to an imaging target plate (Waters Corp., Milford, MA, United States). The ionization matrices tested for their suitability with P. nudum metabolites were 2,5-dihydroxybenzoic acid (DHB), α-cyano-4-hydroxycinnamic acid (CHCA), sinapic acid and vanillic acid [each at 40 mg/ml (w/v) in methanol/water (1:1; v/v)]. Matrices were applied with a sample preparation system (TM-Sprayer of HTX Technologies, Chapel Hill, NC, United States) connected to an 1100 Series HPLC Binary Pump (Agilent Technologies, Santa Clara, CA, United States). The settings were: flow rate at 0.05 ml/min; nozzle temperature at 80°C; spraying velocity at 1,250 mm/min; 12 passes; and track spacing of 1 mm. The final amount of matrix deposited was 0.19 mg mm<sup>−2</sup>. Besides matrix-covered samples, the following chemicals were also spotted onto the imaging target plates: red phosphorus for instrument calibration (10 mg/ml in acetone), leucine-enkephalin to generate a lock mass [10 mg/ml mixed with 3.4 mg/ml CHCA in methanol/water (1:1; v/v)], and authentic standards [1 mg/ml of amentoflavone, psilotin and 3'-hydroxypsilotin mixed with 5 mg/ml DHB in methanol/water (1:1; v/v)]. Metabolite imaging was performed by Matrix-Assisted Laser Desorption/Ionization Mass Spectrometry (MALDI–MS) on a Synapt G2-S instrument equipped with an ion mobility drift tube and operated with MassLynx software version 4.1 (Waters, Milford, MA, United States). The imaging target plate was introduced into the sample chamber and the laser operated with the following settings: 1,000 Hz firing rate; laser energy of 40 (arbitrary units); and a step size of 25 µm. Lock mass correction was repeated every 600 s for a duration of 5 s. Other settings: helium gas flow at 90 ml min<sup>−1</sup>; trap wave velocity at 311 m s<sup>−1</sup>; trap wave height at 4 V; ion mobility wave velocity at 650 m s<sup>−1</sup>; ion mobility wave height at 40 V; transfer wave velocity at 191 m s<sup>−1</sup>; transfer wave height of 0.1 V; and ion mobility wave delay of 450 µs. The highest signal intensity for the analytes of interest (and thus the most desirable signal-to-noise ratio) was achieved in positive polarity for psilotin and 3'-hydroxypsilotin, whereas negative polarity was preferable for amentoflavone. MS/MS experiments were performed by selecting a precursor ion and a collision energy of 30 eV in the transfer cell. MALDI–MS data were processed using the High Definition Imaging software version 1.2 (Waters, Milford, MA, United States) with lock mass correction. Metabolite identification was achieved by comparing accurate mass, MS/MS fragmentation patterns, and ion mobility drift time with those of authentic standards. Signals for isomers of amentoflavone, for example robustaflavone and hinokiflavone, were detectable with authentic standards but not in tissue samples, where their concentrations were too low for MS-based imaging.

<sup>1</sup>https://www.r-project.org/

TABLE 1 | Annotation of HPLC-QTOF-MS peaks.

fpls-10-00868 July 6, 2019 Time: 12:42 # 5

Šamec et al.

TABLE 2 | Peak annotation based on Spektraris searches with combined HPLC-QTOF-MS and NMR spectroscopy data.

# RESULTS

# Strategy for Multi-Platform Analysis of Metabolites in P. nudum Organs

In an attempt to capture chemically diverse metabolites, we used a strategy that accessed five P. nudum organs/tissues (below-ground rhizome, above-ground rhizome, and synangia harvested at different developmental stages [green (young), yellow (maturing) and brown (mature)]) and generated hydrophilic (methanol/water) and hydrophobic (chloroform/methanol) extracts (**Figure 1**). These two types of extracts were processed separately for GC–MS analysis (**Supplementary Figure S1**). Methanol/water extracts were also subjected to HPLC–QTOF–MS analysis in positive ionization mode only (preliminary experiments indicated that chromatographic runs in negative polarity did not add significant spectral information) (**Supplementary Figure S1**). These data sets were normalized, autoscaled, and then combined for multivariate statistical analyses (**Figure 1**). Fractions representing selected metabolites of interest were collected from chromatographic separations of extracts and further characterized by <sup>1</sup>H-NMR. The different metabolomics platforms (GC–MS and HPLC–QTOF–MS) were chosen because they provide complementary information about different metabolite classes (details presented in the upcoming paragraphs). Cryosections of P. nudum above-ground rhizomes were sprayed with a chemical matrix and the cell type-level localization of the most abundant metabolites was determined by MALDI–QTOF–MS (**Figure 1**).

Peak annotation for GC–MS data was achieved based on comparisons of retention times and mass spectral characteristics with those of authentic standards, which led to the high-confidence identification of 83 metabolites in our extracts. MS and MS/MS data from HPLC–QTOF–MS runs (acquired in positive polarity mode) were searched against comprehensive online databases (MassBank<sup>2</sup>, Metlin<sup>3</sup>, and National Institute of Standards and Technology<sup>4</sup>). However, these searches were mostly unsuccessful due to a lack of relevant reference spectra in these databases, and we therefore decided to expand the Spektraris online resource<sup>5</sup> with spectral records acquired as part of this study or extracted from information in the literature. Accurate mass, inferred molecular formula, NMR spectral data, and bibliographic information for 328 metabolites (arylpyrone and biflavonoid aglycones and corresponding glycosides) were integrated into Spektraris-NMR, which now contains spectral records for approximately 21,500 metabolites (status: February, 2019). The combination of accurate mass and retention time data (AMT-tags) acquired by HPLC–QTOF–MS was then searched against Spektraris records (for details of this approach see Cuthbertson et al., 2013), which provided tentative identifications for 27 metabolite peaks (4 annotations with high confidence because of available authentic standards) (**Table 1**). By also including <sup>1</sup>H-NMR data in Spektraris searches (Fischedick et al., 2015), eight metabolites could be identified with high confidence (**Table 2**). Relevant structures of arylpyrones and biflavonoids are shown in **Figure 2** and the annotation process for selected peaks is outlined in more detail in the following paragraphs.

Arylpyrone glycosides with a psilotinin aglycone eluted early (4.6–6.9 min), followed by flavonoid glycosides (11.2–14.6 min), biflavonoid glycosides (14.0–19.3 min), and then biflavonoid aglycones (19.9–24.7 min). In addition to the quasi-molecular ion ([M+H]<sup>+</sup>), adducts ([M+Na]<sup>+</sup> and [M+K]<sup>+</sup>) and dimers ([2M+Na]<sup>+</sup> and [2M+K]<sup>+</sup>) were detected consistently for almost all analytes (**Table 1**). An authentic standard of psilotin (R<sub>t</sub> 5.20 min; m/z 353.1250 ([M+H]<sup>+</sup>); C17H20O8) (McInnes et al., 1965) allowed us to investigate the typical fragmentation patterns of this class of arylpyrone glycosides. At low fragmentation energy (10 eV), we observed a fragment representing the psilotinin aglycone after loss of glucose (m/z 191.0699; [M–Glc]<sup>+</sup>; C11H10O3), a second fragment with additional loss of water (m/z 173.0598; [M–Glc–H2O]<sup>+</sup>; C11H8O2), a third fragment at m/z 123.0438 (C7H7O2), and a fourth fragment at m/z 107.0487 (C7H7O). The absorption spectra of the chromatographic peak and of the authentic psilotin standard were essentially identical (**Supplementary Figure S2**). Furthermore, following purification of the peak by HPLC, the NMR spectrum of the isolated metabolite matched literature reports (McInnes et al., 1965) (**Table 2**). The signature fragments generated from the psilotin HPLC–QTOF–MS peak (corresponding to C11H10O3 and C7H7O) were also detected in MS/MS spectra of four additional peaks (plus further common fragments at 50 eV). The molecular ions of two of these peaks indicated the potential presence of two hexose moieties (R<sub>t</sub> 5.12 and 6.83 min; m/z 515.1751 ([M+H]<sup>+</sup>); C23H30O13), which was corroborated by fragmentation patterns ([M – 2 × Glc]<sup>+</sup> and [M – 2 × Glc – H2O]<sup>+</sup>). The third peak of this series had a molecular ion consistent with 3'-hydroxypsilotin (R<sub>t</sub> 4.66 min; m/z 369.1198 ([M+H]<sup>+</sup>); C17H20O9) (Balza et al., 1985), an annotation that was confirmed based on the NMR spectrum of the isolated metabolite (**Tables 1**, **2**). The fourth peak appeared to contain the same aglycone but with two attached hexose moieties and was therefore tentatively annotated as 3'-hydroxypsilotinin-di-O-hexoside (R<sub>t</sub> 4.76 min; m/z 553.1530 ([M+Na]<sup>+</sup>); C23H30O14).

<sup>2</sup>https://massbank.eu/MassBank/

<sup>3</sup>https://metlin.scripps.edu/

<sup>4</sup>https://chemdata.nist.gov/

<sup>5</sup>http://langelabtools.wsu.edu/spektraris/

Based on the characteristics of the peak corresponding to apigenin 7-O-glucoside, for which an authentic standard was available (R<sub>t</sub> 14.26 min; m/z 433.1133 ([M+H]<sup>+</sup>); C21H20O10) (Wallace and Markham, 1978), the fragment indicative of an apigenin aglycone was m/z 271.0597 ([M–Glc]<sup>+</sup>; C15H10O5), with m/z 153.0175 (C7H5O4) representing a second prominent fragment obtained from this flavone aglycone (Kachlicki et al., 2016). Two additional peaks had comparable fragmentation patterns, one of which showed a quasi-molecular ion corresponding to apigenin-7-O-rhamnoglucoside (R<sub>t</sub> 14.50 min; m/z 579.1712 ([M+H]<sup>+</sup>); C27H30O14), a metabolite that had previously been reported to occur in P. nudum (Wallace and Markham, 1978). The mass spectrum of the second of these peaks was indicative of a metabolite with two hexose moieties (R<sub>t</sub> 11.20 min; m/z 595.1654 ([M+H]<sup>+</sup>); C27H30O15) and therefore likely corresponds to an apigenin di-hexoside. Interestingly, apigenin-6,8-di-C-glucoside (vicenin-2) was described before as an abundant constituent of P. nudum extracts (Markham, 1984), which was used as a tentative annotation for the peak of interest (**Table 1**).

All biflavonoid glycosides thus far characterized from P. nudum extracts contained an amentoflavone (3',8''-biapigenin) aglycone with likely O-linked hexose moieties (Wallace and Markham, 1978; Markham, 1984). The amentoflavone authentic standard (C30H18O10) eluted at 21.15 min (**Table 1**). Six additional peaks with the characteristic m/z 539.0963 fragment (corresponding to this aglycone) plus common MS/MS fragments of the aglycone (Zhang Y.X. et al., 2011; Feuereisen et al., 2017) were detected in our extracts. Based on their quasi-molecular ion, these peaks were tentatively annotated as amentoflavone-hexosides (R<sub>t</sub> 18.12 and 19.16 min; m/z 701.1490 ([M+H]<sup>+</sup>); C36H28O15), amentoflavone-di-hexosides (R<sub>t</sub> 15.78 and 16.60 min; m/z 863.2021 ([M+H]<sup>+</sup>); C42H38O20) or amentoflavone-tri-hexosides (R<sub>t</sub> 13.99 and 14.82 min; m/z 1025.2551 ([M+H]<sup>+</sup>); C48H48O25). Three peaks in the same retention time region (R<sub>t</sub> 17.56, 18.28, and 20.96 min) had a quasi-molecular ion (m/z 703.1642 ([M+H]<sup>+</sup>); C36H30O15) and characteristic aglycone fragment (m/z 541.0545; C30H20O10) indicative of two additional mass units compared to amentoflavone-hexosides. The aglycone in these cases is dihydroamentoflavone (C–C-linked dimer of apigenin and naringenin), and the peaks were therefore annotated as dihydroamentoflavone-hexosides (**Table 1**). The absorption spectrum of the amentoflavone standard in the ultraviolet and visible range was identical to that of the corresponding peak in our HPLC runs (**Supplementary Figure S2**). Two arylpyrone aglycones, 3'-hydroxypsilotinin (R<sub>t</sub> 5.52 min; m/z 207.0652 ([M+H]<sup>+</sup>); C11H10O4) and psilotinin (R<sub>t</sub> 8.31 min; m/z 191.0703 ([M+H]<sup>+</sup>); C11H10O3), were tentatively identified based on chromatographic properties and similarity of MS/MS fragmentation patterns to those of their glycosides (**Supplementary Figure S3**).
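The reasoning used to assign the number of hexose moieties (the mass difference between the quasi-molecular ion and the aglycone fragment is a multiple of the anhydrohexose unit, C6H10O5) can be illustrated with a short sketch. The function name `hexose_units`, the absolute tolerance of 0.01 Da, and the upper limit of four units are assumptions made for this illustration; the m/z values in the usage note are those reported in the text. Both ions are assumed to be singly charged and protonated, so that the adduct cancels in the difference.

```python
HEXOSE_LOSS = 162.0528  # monoisotopic mass of an anhydrohexose unit (C6H10O5)

def hexose_units(precursor_mz, aglycone_mz, tol_da=0.01, max_units=4):
    """Return n if (precursor - aglycone) matches n hexose losses, else None.

    tol_da and max_units are illustrative defaults, not study settings.
    """
    delta = precursor_mz - aglycone_mz
    for n in range(1, max_units + 1):
        if abs(delta - n * HEXOSE_LOSS) <= tol_da:
            return n
    return None
```

With the amentoflavone aglycone fragment at m/z 539.0963, calling `hexose_units(701.1490, 539.0963)`, `hexose_units(863.2021, 539.0963)` and `hexose_units(1025.2551, 539.0963)` assigns one, two and three hexose units, respectively, in line with the tentative annotations given above.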

To enable the differentiation of biflavonoid aglycones, four fractions collected by HPLC were subjected to <sup>1</sup>H-NMR spectroscopy. Taking into account the previously published elution order of biflavonoids on reversed-phase HPLC materials (Zhang Y.X. et al., 2011), MS and MS/MS data, and by combining this information with MS and NMR data searches against the Spektraris database, four peaks could be identified with high confidence (20.73 min, 2,3-dihydroamentoflavone; 21.15 min, amentoflavone; 22.03 min, robustaflavone; and 24.50 min, hinokiflavone) (**Figure 2** and **Tables 1**, **2**). The peak at 19.96 min (m/z 555.0931 ([M+H]<sup>+</sup>); C30H18O11) was tentatively identified as hydroxyamentoflavone based on its earlier elution (compared to amentoflavone) and MS/MS fragmentation patterns (Zhang Y.X. et al., 2011). Analogous comparisons allowed the tentative identification of dihydro-O-methyl-amentoflavone (R<sub>t</sub> 22.59 min; m/z 555.1270 ([M+H]<sup>+</sup>); C31H22O10), O-methyl-amentoflavone (R<sub>t</sub> 23.38 min; m/z 553.1127 ([M+H]<sup>+</sup>); C31H20O10), binaringenin (R<sub>t</sub> 23.80 min; m/z 543.1291 ([M+H]<sup>+</sup>); C30H22O10), and dihydrohinokiflavone (R<sub>t</sub> 24.14 min; m/z 541.1136 ([M+H]<sup>+</sup>); C30H20O10) (Markham, 1984; Zhang Y.X. et al., 2011; Wang et al., 2015; Feuereisen et al., 2017) (**Table 1**).
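The consistency between an observed quasi-molecular ion and an inferred molecular formula can be checked by computing the theoretical m/z of the protonated species and expressing the deviation in ppm. The sketch below is an illustration only: the helper names (`monoisotopic_mass`, `mz_protonated`, `ppm_error`) are hypothetical, the element table is restricted to C, H, N and O, and the simple parser assumes formulas without brackets or charges. The 10 ppm threshold mentioned in the usage note is the m/z tolerance window stated in the Methods for feature extraction and is reused here as an assumption.

```python
import re

# Monoisotopic masses of the most abundant isotopes (u)
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}
PROTON = 1.00727646  # mass of a proton (hydrogen atom minus one electron)

def monoisotopic_mass(formula):
    """Monoisotopic mass of a neutral formula such as 'C30H18O11'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONO[element] * (int(count) if count else 1)
    return total

def mz_protonated(formula):
    """Theoretical m/z of the singly charged [M+H]+ ion."""
    return monoisotopic_mass(formula) + PROTON

def ppm_error(observed, theoretical):
    """Signed mass deviation in parts per million."""
    return (observed - theoretical) / theoretical * 1e6
```

For instance, `ppm_error(555.0931, mz_protonated("C30H18O11"))` compares the observed ion of the peak at 19.96 min with the formula inferred for hydroxyamentoflavone; an absolute deviation below 10 ppm is consistent with the proposed formula.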

A third class of metabolites with high abundance in HPLC–QTOF–MS runs had the chromatographic and mass spectral properties of highly functionalized triterpenoids (steroids) (**Supplementary Table S2**). While the typical membrane sterols of P. nudum have been reported before (Akihisa et al., 1992), the more functionalized steroids detected here have not been mentioned in previous studies. A more detailed characterization of these underexplored specialized metabolites will be the subject of future endeavors to further evaluate chemical diversity in the fern lineage.

# Principal Component Analysis Differentiates Metabolomics Data Sets From Different P. nudum Organs

Multivariate statistical analyses, such as PCA, help to reduce the complexity of extensive data sets to a smaller number of Principal Components (PCs). When this approach was applied to our combined GC–MS and HPLC–QTOF–MS data sets, the first three PCs accounted for roughly 71% of the variance in the original variables (metabolite patterns across all sample types) and indicated a clear separation of the five sample types (P. nudum below-ground rhizome, above-ground rhizome, and synangia harvested at three different developmental stages), with a tight clustering of biological replicates (**Figures 3A,B**). Below-ground rhizome samples were characterized by positive scores in PC1 and PC3, with neutral values in PC2. Above-ground rhizome samples also had positive scores in PC1 but negative scores in PC2 and PC3 (**Figures 3A,B**). All samples from synangia had negative scores in PC1, but were differentiated by a combination of negative PC2/positive PC3 scores (green synangia), negative PC2/PC3 scores (yellow synangia) or positive PC2/neutral PC3 scores (brown synangia) (**Figures 3A,B**).

Component loadings were then evaluated for characteristics that contributed to the differences among sample clusters in PCA and visualized in a biplot (**Figure 3C** and **Supplementary Table S3**). The scores for below-ground rhizome samples (positive PC1 and PC2) corresponded to the vectors for carbohydrate monomers and dimers (e.g., glucose, fructose, galactose, and raffinose) and small organic acids (malic acid, citric acid, and succinic acid). Above-ground rhizome samples occupied a biplot position (positive PC1, negative PC2) closer to the direction of vectors for some arylpyrone glycosides (psilotin and psilotinin-di-O-hexoside II) and sucrose (which had high concentrations in both above- and below-ground rhizomes) (**Figure 3C**). The unique position of brown synangia (negative PC1/positive PC2) correlated with the vectors for biflavonoid glycosides (e.g., dihydroamentoflavone hexoside I and amentoflavone-tri-O-hexoside II). Scores for green and yellow synangia were similar (negative scores in both PC1 and PC2) and correlated with the direction of vectors for arylpyrone glycosides (e.g., psilotinin-di-O-hexoside I) and biflavonoid aglycones (e.g., amentoflavone and hinokiflavone) (**Figure 3C**).
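The evaluation of component loadings amounts to asking which variables carry the largest weights (positive or negative) on a given PC. A minimal sketch of such a ranking is shown below. The function name `top_contributors` is hypothetical, and the loading values and feature names are arbitrary placeholders chosen for illustration; they are not taken from **Supplementary Table S3**.

```python
import numpy as np

def top_contributors(loadings, names, component, k=3):
    """Return the k variables with the largest absolute loading on one PC."""
    col = np.asarray(loadings, dtype=float)[:, component]
    order = np.argsort(-np.abs(col))[:k]          # largest magnitude first
    return [(names[i], float(col[i])) for i in order]

# Placeholder loadings (rows: variables, columns: PC1 and PC2); NOT study data.
names = ["feature_A", "feature_B", "feature_C", "feature_D"]
loadings = np.array([[ 0.60,  0.10],
                     [ 0.55,  0.05],
                     [ 0.40, -0.50],
                     [-0.42, -0.85]])

pc1_top = top_contributors(loadings, names, component=0, k=2)
pc2_top = top_contributors(loadings, names, component=1, k=1)
```

The sign of each returned loading indicates the direction in which the corresponding vector points in a biplot, which is the basis for relating sample positions to metabolite classes.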

# Organ-Specific Accumulation of Metabolites

The PCA component loadings indicated that specific metabolite classes might explain the separation of sample types. We therefore generated a heatmap of metabolite accumulation patterns across P. nudum organs (**Figure 4**). The relative quantities of five biflavonoid glycosides, based on normalized peak areas, were quite high in brown synangia, followed by yellow and green synangia. These metabolites were also of fairly high abundance in samples of above-ground rhizomes, but extremely low in below-ground rhizomes (**Figure 4**). The quantities of six additional biflavonoid glycosides were considerably lower in all samples. Amentoflavone was by far the most abundant biflavonoid aglycone, with very high amounts present in yellow and brown synangia, relatively high quantities in above-ground rhizomes and green synangia, and fairly low levels in below-ground rhizomes (**Figure 4**). Similar patterns were observed for five additional biflavonoid aglycones (three dihydrobiapigenin isomers, binaringenin and hinokiflavone), albeit at much lower abundance compared to amentoflavone. Among arylpyrone glycosides, psilotin was most abundant in below-ground and above-ground rhizomes, but also accumulated to appreciable amounts in synangia (**Figure 4**). 3'-Hydroxypsilotin was primarily found in synangia, with an abundance comparable to that of psilotin. Three other arylpyrone glycosides were of relatively low abundance in all samples.

Sucrose was equally abundant in rhizomes, above-ground rhizomes and green synangia (**Figure 4**). Glucose, fructose and other small molecule carbohydrates were most abundant in rhizomes, with significantly lower amounts being present in all other samples. The highest levels of small organic acids were also found in rhizomes. While malic and citric acid were fairly abundant in all samples, other organic acids (e.g., α-ketoglutaric acid, glyceric acid and fumaric acid) were detected at considerably lower levels in rhizomes and yet lower levels in all other samples (**Figure 4**).

# MALDI–MS Imaging Indicates Preferential Accumulation of Amentoflavone and Arylpyrone Glycosides in Stem Epidermis and Outer Cortex

Building on recent successes with MS-based imaging of sesquiterpene alkaloids and triterpenoids (Lange et al., 2017), MALDI–MS was employed for localizing metabolites of interest in the current study. Two arylpyrone glycosides, psilotin and 3'-hydroxypsilotin, and a biflavonoid aglycone, amentoflavone, were selected because they were highly abundant in tissue samples (MS-based imaging is much less sensitive compared to tissue extraction followed by HPLC–QTOF–MS) and were available as authentic standards in sufficient quantities for methods development. Based on the results of preliminary experiments, 30 µm cryosections of above-ground rhizomes served as biological material, 2,5-dihydroxybenzoic acid was chosen as matrix substance to aid with ionization of metabolites desorbed from tissue sections, and leucine-enkephalin was selected to provide an external lock mass. The ionization of psilotin and 3'-hydroxypsilotin was most effective in positive ionization mode, where potassium adducts (m/z 391.0797 and 407.0750, respectively) were readily detectable with unique drift times in the ion mobility cell. Mass spectrometric signals for psilotin, 3'-hydroxypsilotin and amentoflavone were highest in the epidermal and outer cortex layers, which collectively form the chlorenchyma (**Figures 5A–D**). The 3'-hydroxypsilotin signal was also apparent, albeit at significantly lower abundance, in the protostele. Amentoflavone ionized particularly well in negative mode, with the quasi-molecular ion (m/z 537.0827) being more abundant than adducts and traveling through the ion mobility cell with a unique drift time (**Figures 5B,C,E**). Based on MALDI–MS experiments performed with above-ground rhizome extracts, the normalized peak area for amentoflavone was 5-fold higher than that of hinokiflavone and 47-fold higher than that of robustaflavone, and the abundance of the latter two metabolites was too low for localization studies.

# DISCUSSION

# Expanding the Coverage of Spectral Databases to Incorporate Information on Chemical Diversity in the Fern Lineage

Biflavonoids have long been known to accumulate prominently across the bryophytes, pteridophytes and gymnosperms, with only sporadic occurrence in the angiosperms (Geiger and Quinn, 1988; Iwashina, 2000). When we began processing the data presented as part of the current study with P. nudum, we noticed a surprising paucity of spectral data relating to biflavonoids in publicly available MS and NMR databases. We therefore embarked on a literature search to gather phytochemical and spectral data for this important class of metabolites, which was then used to generate 328 new spectral records for the Spektraris online resource (Cuthbertson et al., 2013; Fischedick et al., 2015). Additionally, electronic files representing the MS/MS data acquired with biflavonoids were submitted to MassBank, a widely used online mass spectral repository (Horai et al., 2010).

The orthogonal data sets acquired in this study [retention time on GC or HPLC, quasi-molecular ion (and inferred molecular formula), MS/MS data, and NMR spectra], combined with the use of authentic standards, aided substantially in peak annotation. The inclusion of NMR data was particularly impactful for the annotation of peaks for the biflavonoids (amentoflavone, robustaflavone, and hinokiflavone) that consist of two fused apigenin molecules (differing only in the coupling position). Using our integrative approach, a total of 83 GC–MS and 8 HPLC–QTOF–MS peaks were identified with very high confidence. An additional 23 HPLC–QTOF–MS peaks were tentatively identified (for example, amentoflavone-tri-O-hexoside I, where uncertainty pertains only to the position and exact nature of the hexose moiety) (**Table 1**). While we were able to determine the structures of some of the more abundant aglycones, the identification of biflavonoid glycosides, which occur as larger families of closely related structures, has proven much more difficult. Our data sets also contained a very large number of peaks that could not be identified. Some of these, based on peak area counts, appeared to be fairly abundant. These results indicate that significant efforts will be needed to generate a more comprehensive account of chemical diversity in P. nudum and, more broadly, in the fern lineage.

# Below-Ground Rhizome of P. nudum Contains High Levels of Soluble Sugars and Organic Acids, Possibly Indicating Differential Nutrient Allocation

The above- and below-ground portions of the P. nudum rhizome are part of the same organ and it is thus notable that, in our study, significantly higher amounts of soluble sugars (in order of abundance: fructose, glucose, raffinose and galactose) and organic acids (in order of abundance: malic acid, citric acid and phosphoric acid) were present in the below-ground part of the rhizome. The abundance of soluble sugars might be interpreted as evidence for a storage function for P. nudum rhizomes but, to the best of our knowledge, the corresponding storage sugar polymers have not been analyzed in this species. The chemical properties of rhizome starches have been reported for other ferns (Zhang S. et al., 2011; Yu et al., 2015) and this work indicates indirectly (based on the high abundance of sugar precursors) that storage function is a possibility. It is also conceivable that relatively high levels of soluble sugars and organic acids are a reflection of active metabolism to support horizontal rhizome growth in P. nudum. However, while information is available regarding the correlation of fern development and some classes of metabolites (White and Turner, 1995; Abul et al., 2010), we were not able to find literature on soluble sugar quantities in fern rhizomes. Further research is clearly necessary to begin to appreciate the tissue specialization within fern rhizomes.

# Rhizomes Accumulate Particularly High Amounts of Psilotin, an Arylpyrone Glycoside With Demonstrated Biological Activities

Our data indicated that psilotin and psilotinin, both arylpyrones unique to the Psilotaceae, were most abundant in the rhizome (both below- and above-ground), while being only half or one-third as abundant in samples from synangia. The biflavonoid amentoflavone was also highly abundant in the above-ground part of the rhizome but occurred at fairly low quantities in the below-ground parts (**Figure 4**). This raises the question of whether psilotin and its aglycone psilotinin might play a particular role in the below-ground rhizome, where arylpyrones are major constituents. Interestingly, it was demonstrated more than 40 years ago that psilotin acts as a germination inhibitor for turnip, onion and lettuce seeds (Siegel, 1976). It is, therefore, conceivable that psilotin (and possibly its aglycone as well) plays a defensive or allelochemical role in and around the below-ground rhizome. Psilotin was also shown to have antifeedant activities against the European corn borer (Ostrinia nubilalis) at concentrations below those present in P. nudum (Arnason et al., 1986). However, in the absence of more complete data on the bioactivities of arylpyrones, this interpretation is highly speculative. It is also unknown how psilotin might be secreted into the rhizosphere to exert allelochemical activities. The fact that the inhibitory effects of psilotin on germination can be reversed by the addition of GA<sub>3</sub> (Siegel, 1976), a gibberellin hormone, can be interpreted as evidence for a possible role of this arylpyrone in growth regulation, but the mechanism and target(s) of such an activity have not yet been explored. In the above-ground rhizome, psilotin and amentoflavone (the latter also exerting high bioactivity; Yu et al., 2017) may act collectively as defense metabolites. Currently, information about such activities has been inferred from in vitro assays only and it would thus be informative to also assess potential defensive functions of arylpyrones and biflavonoids in in vivo investigations.

# Occurrence of Biflavonoids and Arylpyrones in Chlorenchyma Is Consistent With Function as Sunscreen Pigments

Based on our MALDI–MS imaging data, psilotin and amentoflavone are accumulated preferentially in the photosynthetically active tissues of above-ground rhizomes (**Figure 5**). Considering the absorption characteristics of these metabolites (**Supplementary Figure S2**), a protective function against excess photosynthetically active radiation and radiation of certain wavelengths (e.g., high-energy ultraviolet-B) would be a reasonable hypothesis for their tissue-level localization (Yamaguchi et al., 2009; Waterman et al., 2017). Our localization data sets for amentoflavone (chlorenchyma) are also consistent with the literature for other plants. For example, amentoflavone was accumulated preferentially in the leaf epidermis in Agathis robusta (Gadek et al., 1984) and Ginkgo biloba (Beck and Stengel, 2016). An interesting, as yet unanswered, question pertains to the functional role of the differential subcellular localization one would predict for the metabolites of interest. Psilotin is likely stored in the vacuole, in analogy to other (polar) phenolic glycosides (Wink, 1993), while amentoflavone is an apolar biflavonoid aglycone that was previously found to be associated with cell walls (Gadek et al., 1984). Both locations allow for the sequestration of these bioactive metabolites, thereby protecting cellular metabolism in different subcellular locations (Agapakis et al., 2012). Another advantage of the differential localization of psilotin and amentoflavone could be that greater quantities of these pigments can be accumulated, but this hypothesis remains to be tested.

# DATA AVAILABILITY

All datasets generated for this study are included in the manuscript and/or the **Supplementary Files**.

# AUTHOR CONTRIBUTIONS

BL conceived the work and wrote the manuscript, with input from all authors. DŠ, VP, NS, and BL designed the experiments and analyzed the data. MW served as mentor for VP. VP generated GC-MS data. DŠ obtained HPLC-QTOF-MS and MALDI-MS data. NS produced and interpreted NMR data and contributed to the generation of new spectral records for the Spektraris online resource.


# FUNDING

This work was supported in part by seed funds from the USDA National Institute of Food and Agriculture, Hatch Umbrella project #1015621. DŠ acknowledges financial support by the European Union's Seventh Framework Programme under grant agreement # 291823, Marie Curie FP7-PEOPLE-2011-COFUND NEWFELPRO, project 64.

# ACKNOWLEDGMENTS

We would like to thank the Institute of Biological Chemistry's greenhouse staff, Ms. Julie Thayer and Mr. Devon Thrasher, for plant maintenance. We are also indebted to Dr. Jordan Zager for establishing a pipeline for PCA. Ms. Ingrid Wokey and Mr. Nicholas Elms are acknowledged for their work on generating Spektraris records.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.00868/full#supplementary-material

FIGURE S1 | Representative GC-MS and HPLC-QTOF-MS chromatograms (total ion current) of P. nudum extracts.

FIGURE S2 | Absorbance spectra of peaks detected in HPLC-QTOF-MS runs and comparison with those of authentic standards.

FIGURE S3 | MS/MS spectra (acquired at 30 eV collision energy) of psilotin and psilotinin.

TABLE S1 | Peak identification in GC-MS runs.

TABLE S2 | Normalized peak areas for GC-MS and HPLC-QTOF-MS data.

TABLE S3 | Component loadings of a PCA analysis with combined GC-MS and HPLC-QTOF-MS data.

TABLE S4 | NMR acquisition parameters.

TABLE S5 | Bioflavonoids and related metabolites for which new Spektraris spectral records were generated.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Šamec, Pierz, Srividya, Wüst and Lange. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Formation of Flavonoid Metabolons: Functional Significance of Protein-Protein Interactions and Impact on Flavonoid Chemodiversity

*Toru Nakayama\*, Seiji Takahashi and Toshiyuki Waki*

*Department of Biomolecular Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan*

Flavonoids are a class of plant specialized metabolites with more than 6,900 known structures and play important roles in plant survival and reproduction. These metabolites are derived from *p*-coumaroyl-CoA *via* the sequential actions of a variety of flavonoid enzymes, which have been proposed to form weakly bound, ordered protein complexes termed flavonoid metabolons. This review discusses the impacts of the formation of flavonoid metabolons on the chemodiversity of flavonoids. Specific protein-protein interactions in the metabolons of *Arabidopsis thaliana* and other plant species have been studied for two decades. In many cases, metabolons are associated with the ER membrane, with ER-bound cytochromes P450 hypothesized to serve as nuclei for metabolon formation. Indeed, cytochromes P450 have been found to be components of flavonoid metabolons in rice, snapdragon, torenia, and soybean. Recent studies illustrate the importance of specific interactions for the efficient production and temporal/spatial distribution of flavonoids. For example, in diverse plant species, catalytically inactive type-IV chalcone isomerase-like protein serves as an enhancer of flavonoid production *via* its involvement in flavonoid metabolons. In soybean roots, a specific isozyme of chalcone reductase (CHR) interacts with 2-hydroxyisoflavanone synthase, to which chalcone synthase (CHS) can also bind, providing a mechanism to prevent the loss of the unstable CHR substrate during its transfer from CHS to CHR. Thus, diversification in chemical structures and temporal/spatial distribution patterns of flavonoids in plants is likely to be mediated by the formation of specific flavonoid metabolons *via* specific protein-protein interactions.

Keywords: metabolon, flavonoids, chemodiversity, biosynthesis, protein-protein interaction, binary interaction, cytochrome P450, ER

# INTRODUCTION

Flavonoids are a class of plant specialized metabolites with a basic C6-C3-C6 skeleton, for which 10 major classes (i.e., chalcones, aurones, flavanones, flavones, isoflavones, dihydroflavonols, flavonols, leucoanthocyanidins, anthocyanidins, and flavan-3-ols) have been described (**Figure 1**). In nature, flavonoids generally occur as glycosides or acylglycosides, with more than 6,900 different structures (Arita and Suwa, 2008). Each plant lineage produces structurally distinct flavonoids in a lineage-specific manner, which play important roles in plant survival and reproduction. For example, in many cases, flower colors arise from anthocyanins and other flavonoids, which contribute to attracting pollinators (Andersen and Markham, 2006). In legumes, (iso)flavonoids in root exudates serve as chemoattractants for specific symbiotic bacteria as well as genetic inducers of nodulation (Barz and Welle, 1992; Subramanian et al., 2006, 2007). These (iso)flavonoids also play important roles in plant defensive mechanisms against infections by pathogens and invasion by herbivores (Aoki et al., 2000). Moreover, consumption of flavonoids is relevant for human nutrition, as illustrated by soybean [*Glycine max* (L.) Merr.] isoflavones, which exhibit estrogen-like and antioxidant activities and have been implicated in the ability of soy to prevent hormone-dependent cancers and cardiovascular diseases (Wiseman, 2006). These diverse bioactivities of flavonoids in plant biology and human nutrition are closely related to their diversity in chemical structure.

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Taira Miyahara, Chiba University, Japan Stefan Martens, Fondazione Edmund Mach, Italy*

*\*Correspondence: Toru Nakayama toru.nakayama.e5@tohoku.ac.jp*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 19 March 2019 Accepted: 07 June 2019 Published: 09 July 2019*

#### *Citation:*

*Nakayama T, Takahashi S and Waki T (2019) Formation of Flavonoid Metabolons: Functional Significance of Protein-Protein Interactions and Impact on Flavonoid Chemodiversity. Front. Plant Sci. 10:821. doi: 10.3389/fpls.2019.00821*

FIGURE 1 | Proposed general pathways of flavonoid biosynthesis. Ten flavonoid classes are shown within boxes. Enzymes that are discussed in terms of their protein-protein interactions in this review are shown within circles. Enzyme abbreviations are: PAL, phenylalanine ammonia lyase; C4H, cinnamate 4-hydroxylase; 4CL, 4-coumarate:CoA ligase; CHS, chalcone synthase; CGT, chalcone 4′-*O*-glucosyltransferase; AS, aureusidin synthase; IFS, 2-hydroxyisoflavanone synthase; HID, 2-hydroxyisoflavanone dehydratase; CHI, chalcone isomerase; FNS, flavone synthase; F3H, flavanone 3-hydroxylase; FLS, flavonol synthase; F3′H, flavonoid 3′-hydroxylase; F3′5′H, flavonoid 3′,5′-hydroxylase; DFR, dihydroflavonol 4-reductase; ANS, anthocyanidin synthase; LAR, leucoanthocyanidin 4-reductase; FGT, flavonoid 3-*O*-glucosyltransferase. Note that F3′H and F3′5′H may act on flavanones, flavones, and flavonols, depending on the plant species (not shown). Flavonoids and related metabolites are *p*-coumaroyl-CoA (1), 2′,4,4′,6′-tetrahydroxychalcone (THC) (2), and naringenin (3).

Flavonoids are derived from the amino acid L-phenylalanine *via* the general phenylpropanoid pathway, shown in **Figure 1** (Winkel-Shirley, 2001). Chalcone synthase (CHS), the first committed enzyme of the flavonoid pathway, catalyzes the production of 2′,4,4′,6′-tetrahydroxychalcone (THC, **2**; **Figure 1**), which serves as a precursor for the other flavonoids (Austin and Noel, 2003). Aurones are directly derived from chalcones in limited plant species (Nakayama et al., 2000, 2001; Kaintz et al., 2014), while other flavonoids, including flavones, isoflavones, flavonols, and anthocyanidins, are derived after the conversion of chalcones to flavanones catalyzed by chalcone isomerase (CHI) (Winkel-Shirley, 2001). While the core flavonoid pathway is well conserved among seed plants, specific lineages have developed specific flavonoid pathways to enhance fitness in particular environmental conditions. Enzymes involved in flavonoid biosynthesis (**Figure 1**) include polyketide synthases (e.g., CHS), 2-oxoglutarate-dependent dioxygenases [e.g., flavanone 3-hydroxylase (F3H, also termed FHT), anthocyanidin synthase (ANS; also termed leucoanthocyanidin dioxygenase, LDOX), flavonol synthase (FLS), flavone synthase I (FNSI)], short-chain dehydrogenases/reductases [e.g., dihydroflavonol 4-reductase (DFR)], aldo-keto reductases [e.g., chalcone reductase (CHR)], and cytochrome P450 monooxygenases [e.g., flavone synthase II (FNSII), flavonoid 3′-hydroxylase (F3′H), flavonoid 3′,5′-hydroxylase (F3′5′H), and 2-hydroxyisoflavanone synthase (IFS)]. These enzymes are hypothesized to have evolved from enzymes involved in primary metabolism (Weng and Noel, 2012; Moghe and Last, 2015). Cytochromes P450, shown with an asterisk in **Figure 1,** have been shown to be anchored to the cytoplasmic surface of the ER (Ralston and Yu, 2006), while most of the other enzymes are proposed to be soluble enzymes. A variety of regio-specific glycosyltransferases, acyltransferases, methyltransferases, and prenyltransferases acting on flavonoids have evolved in a lineage-specific manner to enhance the structural diversity of flavonoids (Ono et al., 2010; Sasaki and Nakayama, 2015).

It is generally accepted that the intracellular environment is in a state of macromolecular crowding (Fulton, 1982). Given our understanding of diffusion rates of small solutes and macromolecules in cells and organelles (Verkman, 2002), it is now recognized that cells and organelles are not simply bags of enzymes; rather, metabolic enzymes in the same pathway tend to be associated with each other in cellular environments, with each of these metabolic pathways confined to a specific region of the cell (microcompartmentalization of cellular metabolism) (Saks et al., 2008). The weakly bound, ordered complexes of enzymes involved in sequential metabolic pathways are referred to as "metabolons" (Ovadi and Srere, 2000; Srere, 2000; Ovadi and Saks, 2004; Jørgensen et al., 2005; Sweetlove and Fernie, 2013). The formation of a metabolon is believed to provide catalytic advantages *via* substrate channeling, including preventing the loss of intermediates by diffusion, reducing the transit time between active sites, protecting chemically labile intermediates, circumventing unfavorable equilibria, and segregating the intermediates of competing reactions (Ovadi, 1991). The formation of metabolons is well defined in primary metabolic pathways of prokaryotic and eukaryotic cells, including glycolysis (Giege et al., 2003; Graham et al., 2007), the tricarboxylic acid cycle (Wu et al., 2015; Wu and Minteer, 2015), the Calvin-Benson cycle (Suss et al., 1993), and nucleotide synthesis (An et al., 2010). In plant specialized metabolism, metabolons formed during the biosynthesis of cyanogenic glycosides (Laursen et al., 2016) and lignins (Gou et al., 2018) in *Arabidopsis thaliana* (L.) Heynh. and other plant species have been studied in detail (Sweetlove and Fernie, 2013). In many cases, metabolon formation takes place on biological membranes or cytoskeletal elements *via* specific interactions of soluble enzymes with these cellular structures. However, because protein-protein interactions in metabolons are weak in most cases, it is difficult to isolate metabolons in their intact forms.

The concept of flavonoid metabolons was first proposed in 1974 to explain the efficiency of flavonoid synthesis in plant cells (Stafford, 1974). Subsequently, the association of flavonoid enzymes on biological membranes (e.g., the ER) and the formation of complexes were supported by several lines of experimental evidence (Hrazdina and Wagner, 1985a,b; Hrazdina et al., 1987; reviewed by Winkel, 2004). Since then, flavonoid metabolons have been assumed to form in diverse plant species; a model of the flavonoid metabolon was proposed as a linear array of consecutive flavonoid enzymes along the ER (Hrazdina and Wagner, 1985a,b; Stafford, 1990). To date, specific protein-protein interactions in flavonoid metabolons have been studied in multiple plant species. Substrate channeling between DFR and leucoanthocyanidin 4-reductase (LAR) was predicted by computational studies, which also suggested the functional significance of metabolon formation during flavonoid synthesis (Diharce et al., 2016). Thus, elucidation of the structural organization of metabolons provides a basis for understanding how flavonoid structures are diversified, as well as how the temporal and spatial accumulations of flavonoids are regulated (Laursen et al., 2015). This review describes our current knowledge of specific protein-protein interactions identified in flavonoid metabolons and discusses their functional significance in flavonoid biosynthesis.

# CYTOCHROMES P450 CAN BE COMPONENTS OF FLAVONOID METABOLONS

Soluble enzymes involved in plant specialized metabolism have been shown to associate on the cytoplasmic surface of the ER to form metabolons, nucleated by ER-bound cytochromes P450. More than three decades ago, some of the soluble enzymes related to the general phenylpropanoid and flavonoid pathways, L-phenylalanine ammonia-lyase (PAL), CHS, and flavonoid glucosyltransferase, were found to be associated with the ER membrane in several plant species including *Hippeastrum* (amaryllis, order Asparagales) and *Fagopyrum esculentum* (order Caryophyllales) (Hrazdina and Wagner, 1985a,b; Hrazdina et al., 1987), suggesting the occurrence of ER-bound metabolons for the synthesis of phenylpropanoids and flavonoids. Meanwhile, the formation of metabolons during the syntheses of other classes of plant specialized metabolites, including cyanogenic glucosides and lignins, was shown to involve the anchoring of soluble enzymes by cytochromes P450 to specific domains of the ER membrane [reviewed by Ralston and Yu (2006)]. In 2004, in tobacco (*Nicotiana tabacum,* order Solanales), cinnamate 4-hydroxylase (C4H), a cytochrome P450 (CYP73A) that is involved in the general phenylpropanoid pathway (**Figure 1**), was found, using a combination of biochemical and fluorescence microscopic methods, to be responsible for the weak association of soluble isozymes of PAL (PAL1 and PAL2) with ER membranes (**Figure 1**) (Achnine et al., 2004).

Formation of a flavonoid metabolon on cytochrome P450 was first demonstrated in 2008 in rice (*Oryza sativa* L.; order Poales, a monocot) that accumulates flavones, flavonols, proanthocyanidins (oligomeric flavan-3-ols), and anthocyanins (Shih et al., 2008). In this plant, an isozyme of flavonoid 3′-hydroxylase (F3′H1), a cytochrome P450 (CYP75B) catalyzing the 3′-hydroxylation of the B-ring of flavonoids (**Figure 1**), was shown to bind to CHS1 (a CHS isozyme) (**Figure 2A**) by yeast two-hybrid (Y2H) assays (Shih et al., 2008). The series of binary interaction assays showed that CHS1 also interacts with F3H, DFR, and ANS1 (an isozyme of rice ANS), but not with CHI (**Figure 2A**). Interactions among CHI, F3H, F3′H1, DFR, and ANS1 were not detected. It was proposed that in rice, CHS could serve as a common platform for a flavonoid metabolon, which might be anchored to the cytoplasmic surface of the ER *via* F3′H1. In 2016, two groups independently published evidence supporting the association of soybean flavonoid enzymes in metabolons tethered to the ER-bound cytochromes P450 IFS (CYP93C) and C4H (**Figures 2Ca,Cb**; Dastmalchi et al., 2016; Waki et al., 2016; Mameda et al., 2018). Additionally, physical interactions among flavonoid enzymes in snapdragon (*Antirrhinum majus* L.) and torenia (*Torenia hybrida*) were clarified, in which FNSII (CYP93B1, **Figures 2D,E**) was found to be a component of flavonoid metabolons (Fujino et al., 2018) (see below).

# FLAVONOID DIVERSITY AND FLAVONOID METABOLONS

During the past two decades, physical interaction partnerships of flavonoid enzymes and related proteins have been studied in multiple phylogenetically distinct plants, including rice, *A. thaliana*, soybean, snapdragon, and hops (*Humulus lupulus* L. var. *lupulus*), each of which belongs to a different order of plants and accumulates different classes of flavonoids (see below). The data suggest that production of specific flavonoids in these plants is attained *via* spatially and temporally dependent interactions between specific proteins during plant growth and stress responses. These data are discussed in more detail below.

# FLAVONOID METABOLONS IN *ARABIDOPSIS*

*A. thaliana* (order Brassicales), in which 54 flavonoid species have been identified to date, primarily accumulates flavonols and proanthocyanidins and also produces anthocyanins under stress conditions (Saito et al., 2013). Studies of *A. thaliana* flavonoid enzymes provide one of the best-characterized flavonoid metabolons with respect to protein-protein interactions. Direct protein-protein interactions among soluble flavonoid enzymes in *A. thaliana* have been studied by Y2H assays, affinity chromatography (AC), immunoprecipitation (IP), and physicochemical methods: Förster resonance energy transfer (FRET) detected by fluorescence lifetime imaging microscopy (FLIM) and surface plasmon resonance refractometry (SPR) (Burbulis and Winkel-Shirley, 1999; Owens et al., 2008; Crosby et al., 2011; Watkinson et al., 2018). In this plant, interactions between the following enzyme pairs have been identified (followed by methods in parentheses): CHS-CHI (Y2H, AC, and IP), CHS-F3H (AC), CHS-DFR (Y2H and FRET), CHS-isozyme of FLS (FLS1) (Y2H and FRET), CHI-DFR (Y2H), CHI-F3H (AC and IP), FLS1-DFR (Y2H and FRET), and FLS1-F3H (Y2H) (**Figure 2B**, black arrows). Interactions of catalytically inactive paralogs of FLS with CHS and DFR were also found *via* Y2H (Owens et al., 2008). These binary interactions suggested a flavonoid metabolon model (**Figure 2B**) with CHS as the hub. Moreover, this model features a globular association, rather than a linear array, of flavonoid enzymes in the metabolons. FRET-FLIM analyses revealed that FLS1 and DFR, the key enzymes of branch pathways (**Figure 1**), interact with CHS in a mutually exclusive manner *in planta* (Crosby et al., 2011). This provides a possible *in planta* mechanism for regulating metabolic flux by changing physical interactors with CHS, which is pivotal in the pathway.
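The hub interpretation can be illustrated with a small sketch that encodes the eight binary enzyme pairs listed above as an undirected graph and counts the interaction partners of each enzyme. The variable and function names are illustrative assumptions, detection methods and non-enzyme partners are omitted, and a simple partner count is only a crude proxy for hub status rather than the analysis used in the cited studies.

```python
from collections import defaultdict

# Binary interactions among A. thaliana flavonoid enzymes as listed in the text
# (detection methods omitted for brevity).
INTERACTIONS = [
    ("CHS", "CHI"), ("CHS", "F3H"), ("CHS", "DFR"), ("CHS", "FLS1"),
    ("CHI", "DFR"), ("CHI", "F3H"), ("FLS1", "DFR"), ("FLS1", "F3H"),
]

def partner_counts(pairs):
    """Return {enzyme: number of distinct interaction partners}."""
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    return {enzyme: len(p) for enzyme, p in partners.items()}

counts = partner_counts(INTERACTIONS)
hub = max(counts, key=counts.get)  # enzyme with the most partners
print(hub, counts[hub])
```

With this edge list, CHS has four partners and every other enzyme has three, which is consistent with the CHS-centered model described in the text.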

It has been shown that not only flavonoid enzymes but also a protein with no catalytic activity can be a component of the flavonoid metabolon in *A. thaliana*. Phylogenetic analyses suggest that CHI enzymes have evolved from a non-catalytic ancestor related to fatty acid-binding proteins (FAPs) and land plant-specific CHI-like proteins (CHILs) with no catalytic activity (Ngaki et al., 2012; Kaltenbach et al., 2018). Thus, CHIs, FAPs, and CHILs, all of which are soluble proteins, constitute a larger structurally related family, the CHI-fold family, in which CHIs correspond to types I and II, and FAPs and CHILs, respectively, correspond to types III and IV within the family. In 2014, CHILs were shown to serve as enhancers of flavonoid production (EFPs), as loss-of-function mutations and suppression in morning glory (*Ipomoea nil*) and torenia, respectively, resulted in a significant diminution of flavonoid contents (Morita et al., 2014). CHIL is also produced by *A. thaliana*. Y2H analyses indicated that in *A. thaliana*, CHIL binds to CHI (**Figure 2B**, green arrow), suggesting that CHIL is a component of the flavonoid metabolon (Jiang et al., 2015). Recently, CHIL of *A. thaliana* (AtCHIL) was also shown to physically interact with CHS of the same plant species (AtCHS) (**Figure 2B**, green arrow) by Y2H and luciferase-complementation imaging assays (LuCIA) (Ban et al., 2018). The coexpression of *AtCHIL* with *AtCHS* in the yeast *Saccharomyces cerevisiae* resulted in a 1.8-fold enhancement of AtCHS-catalyzed production of THC. The interactions of CHIL with CHI and CHS might be related to the observed role of CHIL as an EFP. It must be mentioned that the binding of CHIL to CHS has also been observed in the flavonoid systems of hops, rice (**Figure 2A**), *Selaginella moellendorffii* (a lycophyte), and *Physcomitrella patens* (a bryophyte), as assayed by LuCIA (Ban et al., 2018). The coexpression of CHILs with CHS of these plants in *S. cerevisiae* also enhanced the CHS-catalyzed production of THC, suggesting the conservation of the EFP role of CHIL proteins throughout land plants.

The dynamic and versatile CHS-mediated protein-protein interactions likely take place in organ- and organelle-specific manners in *A. thaliana*. Immunofluorescence and immunoelectron microscopic analyses showed that CHS and CHI co-localize at the ER and tonoplasts in epidermal and cortex cells of *A. thaliana* roots (Saslowsky and Winkel-Shirley, 2001). This observation suggests that a subset of CHS and CHI enzymes in root cells might not be assembled into the metabolons described above. As both of these enzymes are soluble, these data suggest that one or more other proteins function in recruiting these enzymes to membranes, although such proteins remain to be identified. It remains to be determined whether CHS and CHI interact with cytochromes P450 in *A. thaliana*. In this context, in *A. thaliana* root cells, F3′H is unlikely to be involved in recruiting CHS and CHI to the ER, as suggested by the results of immunolocalization in the *A. thaliana* F3′H mutant *tt7*(88) (Saslowsky and Winkel-Shirley, 2001).

CHS and CHI have also been shown in the nucleus of *A. thaliana* by multiple immunolocalization methods (Saslowsky et al., 2005). CHS of *A. thaliana*, a dimeric enzyme, possesses sequences resembling a nuclear localization signal, which is located on the surface opposite from the dimerization interface. This signal could direct CHS and associated enzymes into the nucleus. Moreover, immunoblotting of nuclear CHI suggested post-translational modifications that also might be responsible for the nuclear localization of the enzyme. Interestingly, CHS was recently found to interact with MOS9 (a nuclear protein associated with epigenetic control of *R* genes that mediate effector-triggered immunity) as analyzed by Y2H, SPR, and FRET, with a *K*d of 210 nM (**Figure 2B,** blue arrow) (Watkinson et al., 2018). Addressing this finding further may uncover additional mechanisms for controlling flavonoid pathways, as well as linking them to defense mechanisms and other physiological functions.

# THE SOYBEAN ISOFLAVONOID METABOLON

The soybean (order Fabales) produces isoflavones, which are a class of flavonoids with a 3-phenylchromone structure that are distributed almost exclusively in legumes (Aoki et al., 2000). Isoflavones play important roles in symbiotic plant-microbe interactions and defensive mechanisms against pathogen infection in soybean (Barz and Welle, 1992). Moreover, soybean isoflavones show a variety of bioactivities that are beneficial to human health (Wiseman, 2006). The soybean produces two distinct types of isoflavonoids: 5-deoxyisoflavonoids (daidzein and its conjugates) and 5-hydroxyisoflavonoids (genistein and its conjugates) (**8** and **4**, respectively; **Figure 3A**). In unstressed soybean plants (cv. Enrei), 5-deoxyisoflavonoids accumulate in the roots (93% mol/mol of total root isoflavonoids) and seeds (60% mol/mol of the total seed isoflavonoids) (Mameda et al., 2018).

# Characterization of Isoflavonoid Metabolon

Protein-protein interaction analyses of soybean isoflavonoid enzymes suggested that biosynthesis of isoflavones takes place *via* the formation of a metabolon on cytochromes P450. Specifically, the analysis using split-ubiquitin Y2H and bimolecular fluorescence complementation (BiFC) assay systems revealed that each enzyme located upstream of the isoflavonoid pathway (CHS, CHI, and GmCHR5 (an isozyme of soybean CHR); **Figures 1, 3A**) interacts with isozymes of IFS (CYP93C) to form a metabolon (**Figure 2Ca**; Waki et al., 2016; Mameda et al., 2018). It has been proposed that C4H also serves as a nucleus for the metabolon formation as analyzed *via* BiFC and IP (**Figure 2Cb**; Dastmalchi et al., 2016). Moreover, arogenate dehydratase (ADT), a shikimate pathway enzyme that has primarily been reported to be a plastidial enzyme, was reported to interact with IFS on the basis of IP (**Figure 2Cb**; Dastmalchi et al., 2016). The fluorescence localizations observed during these BiFC analyses were consistent with P450-mediated interactions taking place at the ER (Waki et al., 2016). As the activities of IFS and C4H are indispensable for the formation of isoflavones, these cytochromes P450 are considered to play both catalytic and structural roles in the metabolon.

The affinity of the soybean isoflavonoid enzymes for IFS isozymes varies among paralogs. Isoflavonoid enzymes shown in **Figure 2C** are encoded by multiple genes in soybean. For example, there are at least nine paralogous genes (GmCHSs) encoding CHS (Schmutz et al., 2010; Shimomura et al., 2015), 12 encoding CHI (Shimada et al., 2003; Ralston et al., 2005), and two encoding IFS (Cheng et al., 2008). For each enzyme, different isozymes exert different physiological functions (Shimizu et al., 1999; Tuteja et al., 2004; Livingstone et al., 2010). The analysis using the split-ubiquitin Y2H system suggested that GmCHS1 has a higher affinity for GmIFS1 than GmCHS7 does (Waki et al., 2016), and that GmCHR5 binds to GmIFS isozymes whereas other GmCHR isozymes do not (see below for details) (Mameda et al., 2018). These observations could be related to differential regulation and physiological roles of each enzyme paralog.

## An Implication for Functional Significance of Protein-Protein Interactions During 5-Deoxyisoflavonoid Biosynthesis

Although 5-deoxyisoflavonoids accumulate in the roots and seeds of unstressed plants in a high ratio (Mameda et al., 2018), the mechanistic details remained unknown. During the course of 5-deoxyisoflavonoid biosynthesis, isoliquiritigenin (**Figure 3A, 6**) (a 6′-deoxychalcone) is produced *via* a CHS-catalyzed reaction coupled to CHR catalysis (Bomati et al., 2005). The soybean genome encodes 11 CHR paralogs (Mameda et al., 2018), among which only GmCHR1 had been characterized enzymatically (Welle and Grisebach, 1988; Welle et al., 1991). Although CHR has been referred to as chalcone reductase, it does not actually act on THC (**2**, **Figure 3A**) but instead on one of the diffusible intermediates of the CHS-catalyzed reaction, most likely *p*-coumaroylcyclohexantrione (**5**, **Figure 3A**), which is highly unstable and is rapidly aromatized to produce THC in an aqueous system (Bomati et al., 2005). THC and isoliquiritigenin then undergo the reactions catalyzed by CHI, IFS, and 2-hydroxyisoflavanone dehydratase (HID) to produce genistein (**4**) and daidzein (**8**), respectively (**Figure 3A**; Mameda et al., 2018). The amount of the CHR product isoliquiritigenin does not generally exceed 25% (mol/mol) of the total CHS products (isoliquiritigenin, THC, and naringenin) during the combined action of dilute CHS and GmCHR1 (0.05 μM each) *in vitro*. These low product ratios for CHR catalysis during *in vitro* assays could arise from the fact that only a small fraction (<25%) of **5** produced during CHS catalysis is transferred to the active site of GmCHR1 while the majority (>75%) escapes and diffuses to the aqueous system to give rise to THC.

FIGURE 3 | Biosynthesis of flavonoids in soybean (A) and hops (B). (A) Biosyntheses of 5-hydroxy- and 5-deoxyisoflavonoids in soybean. CHR, chalcone reductase. See Figure 1 for abbreviations for other enzymes. (B) Biosyntheses of prenylated flavonoids in hops. PT, aromatic prenyltransferase; OMT, *O*-methyltransferase; DMAPP, dimethylallyl diphosphate; PPi, pyrophosphate; SAM, *S*-adenosyl-l-methionine; SAH, *S*-adenosyl-l-homocysteine. Flavonoids and related metabolites are *p*-coumaroyl-CoA (1), THC (2), naringenin (3), genistein (4), *p*-coumaroylcyclohexantrione (5), isoliquiritigenin (6), liquiritigenin (7), daidzein (8), demethylxanthohumol (9), and xanthohumol (10).

To establish a high 5-deoxyisoflavonoid ratio in the cells of soybean roots and seeds (i.e., a high product ratio for CHR catalysis), **5** has to be immediately transferred, prior to aromatization, from the active site of CHS to that of CHR. One possible mechanism for achieving this would be binding of CHR to CHS, facilitating the channeling of **5** between them. However, the crystal structure of CHR suggested that direct association of the active sites of CHR and CHS is impossible and that passive diffusion may be the only way to transfer **5** from CHS to CHR (Bomati et al., 2005). In fact, Y2H assays showed that GmCHR1 (the only GmCHR paralog whose catalytic activity was confirmed) was unable to interact with any of the GmCHS isozymes (Waki et al., 2016; Mameda et al., 2018). Alternatively, a shorter distance or transit time for **5** between the two enzymes could be achieved if both enzymes were part of a metabolon and located very close to each other. Because CHS isozymes have been shown to interact with IFS isozymes (GmIFS) (**Figure 2Ca**; Dastmalchi et al., 2016; Waki et al., 2016), the involvement of GmCHR1 in the isoflavonoid metabolon was examined. However, GmCHR1 was not found to interact with any of the enzymes examined, including IFS isozymes (Waki et al., 2016; Mameda et al., 2018). Moreover, the product ratio for CHR catalysis did not exceed 50% even when high concentrations of GmCHR1 and CHS were used in *in vitro* enzyme assays (Oguro et al., 2004; Mameda et al., 2018). Therefore, GmCHR1 was considered unlikely to account for the observed high proportion of 5-deoxyisoflavonoids in the roots and seeds of unstressed plants.

Thus, 11 GmCHR paralogs were comprehensively analyzed for their possible involvement in biosynthesis of 5-deoxyisoflavonoids in the roots and seeds of unstressed plants, and the data obtained strongly suggested the involvement of a previously unappreciated soybean CHR, GmCHR5 (**Figure 2Ca**). Specifically, among the GmCHR paralogs examined, the expression patterns of *GmCHR5* were the most consistent with the observed patterns of the accumulation of daidzein conjugates in the roots and the seeds of unstressed plants. When interactions of these GmCHR isozymes with soybean isoflavonoid enzymes (**Figure 3A**) were analyzed by split-ubiquitin Y2H assays, GmCHR5 uniquely interacted with IFS isozymes (Mameda et al., 2018). Moreover, *in vitro* assay results suggested that the product ratio for CHR catalysis depended on the GmCHR5 concentration, with higher concentrations resulting in higher ratios (approaching 90%) (Mameda et al., 2018). Thus, the results of enzyme assays, transcription analyses, and protein-protein interaction assays were all consistent with the idea that GmCHR5, but not other CHR isozymes, is the key player in the accumulation of 5-deoxyisoflavonoids in the roots and seeds of unstressed plants. It is highly likely that the interactions of CHS and GmCHR5 with IFS allow the microcompartmentalization of the metabolic process, resulting in a product ratio for CHR catalysis high enough for the predominant accumulation of 5-deoxyisoflavonoids in the roots and seeds of unstressed plants. This illustrates the previously proposed functional significance of metabolon formation, i.e., preventing the loss of intermediates by diffusion and reducing the transit time between active sites. This also supports the hypothesis that specific spatial distributions of a flavonoid can be attained by inclusion of a specific isozyme in a flavonoid metabolon in a spatially specific manner.
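
The product ratios quoted in this section can be read as the outcome of a competition between capture of intermediate **5** by CHR and its escape into the aqueous phase. The following is a minimal sketch, assuming (hypothetically) that both processes are first order in **5**, so that the product ratio equals k_capture / (k_capture + k_escape). The function name and the rate-constant symbols are illustrative and do not come from the cited studies, and the input values are the approximate upper bounds stated in the text rather than measured data points.

```python
def capture_to_escape_ratio(product_ratio: float) -> float:
    """Return k_capture / k_escape implied by a CHR product ratio.

    Assumption (not from the cited studies): intermediate 5 is consumed by
    two competing first-order processes, capture by CHR and escape followed
    by aromatization to THC, so product_ratio = k_c / (k_c + k_e).
    """
    if not 0.0 <= product_ratio < 1.0:
        raise ValueError("product_ratio must be in [0, 1)")
    return product_ratio / (1.0 - product_ratio)


# Approximate product ratios quoted in the text (fraction of total CHS products).
quoted_ratios = {
    "GmCHR1, dilute enzymes (0.05 uM each)": 0.25,
    "GmCHR1, high enzyme concentrations": 0.50,
    "GmCHR5, high concentration": 0.90,
}

for label, ratio in quoted_ratios.items():
    print(f"{label}: capture/escape = {capture_to_escape_ratio(ratio):.2f}")
```

Under this simplified picture, moving from a product ratio of 25% to about 90% corresponds to capture changing from roughly one third as fast as escape to roughly nine times faster, which is the kind of shift that reduced transit distance within a metabolon would be expected to produce.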

## Functional Differentiation of GmCHR Isozymes in the Soybean

GmCHR1 and GmCHR6 are unable to interact with any of the isoflavonoid enzymes shown in **Figure 3A** (Mameda et al., 2018). However, this does not necessarily rule out their involvement in 5-deoxyisoflavonoid biosynthesis but is rather consistent with functional differentiation of GmCHR isozymes in the soybean. Previously, expression of *GmCHR1*, *GmCHR5*, and *GmCHR6* was shown to be induced upon microbial infection (Sepiol et al., 2017). Moreover, *GmCHR6* is located near a quantitative trait locus region linked to resistance to a pathogenic oomycete (Sepiol et al., 2017). In soybean, the production of both types (5-deoxy- and 5-hydroxy-) of isoflavonoids is induced by microbial pathogens. The production of both types of isoflavonoids, rather than the exclusive production of the 5-deoxy type, would be needed to fully implement relevant soybean defense mechanisms. The induced production of GmCHR1 and GmCHR6, which show a maximum product ratio for CHR catalysis of 50%, makes it possible to accumulate high levels of both types of isoflavonoids in infected plants. Thus, it is likely that GmCHR5 plays a key role in the predominant accumulation of 5-deoxyisoflavonoids in the roots and seeds of unstressed plants, while GmCHR1 and GmCHR6 play key roles in the induced defense mechanisms against microbial pathogens.

# FLAVONOID METABOLONS IN THE ORDER LAMIALES

Snapdragon and torenia are flowering ornamentals in which colorful petals are the most eye-catching trait. The petal colors in these Lamiales plants are mainly provided by flavonoids, which represent different flavonoid classes from those mainly found in *A. thaliana* and soybean. The petal colors of snapdragon (magenta, orange, red, pink, yellow, cream, and white) are produced by a combination of anthocyanins (orange, pink, red, and reddish purple), aurones (yellow), and flavones (co-pigments) (Ono and Nakayama, 2007). Torenia accumulates anthocyanins and flavones in its flower petals, which are responsible for the bluish purple and pink colors (Ueyama et al., 2002).

In 2018, physical interactions among flavonoid enzymes in snapdragon and torenia were clarified, illustrating the formation of flavonoid metabolons responsible for flower coloration (Fujino et al., 2018). Binary interactions found in split-ubiquitin Y2H and BiFC assays were: FNSII-CHS, FNSII-CHI, FNSII-DFR, CHS-CHI, CHI-DFR, and F3′H-CHI in snapdragon; and FNSII-CHI, FNSII-F3H, FNSII-DFR, FNSII-ANS, CHI-DFR, and F3′H-CHI in torenia (**Figures 2D,E**). Split-ubiquitin Y2H assays also suggested that binding of CHI and DFR to FNSII is not exclusive in snapdragon.

Interestingly, enzymes involved in the late stage of anthocyanin biosynthesis (DFR in snapdragon; DFR, F3H, and ANS in torenia) were found to interact with FNSII (cytochrome P450 CYP93B1) (**Figures 1, 2D,E**; Fujino et al., 2018). The activity of FNSII is not needed for anthocyanin biosynthesis, suggesting that FNSII could function as a scaffold for anthocyanin biosynthesis. Although further studies are needed to test this hypothesis, several findings are consistent with FNSII as an important component of the metabolon for anthocyanin biosynthesis. Previously, attempts were made to engineer torenia flowers showing a deeper petal color using metabolic engineering (Ueyama et al., 2002). To achieve this, *FNSII* was co-suppressed in blue-violet torenia flowers to diminish FNSII activity. As anthocyanin synthesis competes with flavone synthesis for flavanones as the shared precursors (**Figure 1**), this was predicted to favor anthocyanin production at the expense of flavone formation (see **Figure 1**). This strategy was inspired by the observations in black dahlia (*Dahlia variabilis*, order Asterales) accumulating large amounts of anthocyanins, in which FNSII production is suppressed by endogenous posttranscriptional gene silencing (Thill et al., 2012; Deguchi et al., 2013). In the black dahlia, suppression of *FNSII* increased production of anthocyanins while flavone production was decreased. Metabolic engineering of torenia showed that the co-suppression of *FNSII* diminished flavone and increased flavanone levels in petals, as expected (Ueyama et al., 2002). However, anthocyanin levels in the petals of the *FNSII*-suppressed torenia decreased considerably, producing a paler flower. The reason for this result was unknown, but this observation can now be explained by FNSII acting as a component of the metabolon related to anthocyanin production.

Interestingly, in the anthocyanin-accumulating snapdragon petals, flavones were accumulated first, followed by anthocyanins, and finally aurones (Toki, 1988; Fujino et al., 2018). This sequence of flavonoid accumulation is consistent with the transcriptional patterns of snapdragon flavonoid enzyme genes during flower development (Fujino et al., 2018). Thus, on the basis of interactions (**Figure 2D**) and temporal gene expression patterns of flavonoid enzymes in red snapdragon petal cells, a model of the flower stage-dependent formation of the flavonoid metabolon has been proposed (Fujino et al., 2018). In this model, CHS, CHI, and FNSII are expressed and form a flavone metabolon on the ER surface at the beginning of the flower development. Halfway through flower development, F3H and DFR are expressed to form an anthocyanin metabolon by using the preexisting flavone metabolon as a scaffold.

The similarity of interaction partnerships in the flavonoid metabolons of snapdragon and torenia (**Figures 2D,E**) is consistent with the close phylogenetic relationship of these plants. Taken together with the fact that the *A. thaliana* genome lacks *IFS* and *FNSII* genes, this suggests that interactions in flavonoid metabolons differ between plant species, while those of closely related plant species are more similar to each other (**Figure 2**). This is consistent with the observed structural diversity of flavonoids in plants and the fact that each plant lineage produces structurally distinct flavonoids in a lineage-specific manner.
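
One way to make the similarity between the two Lamiales metabolons concrete is to compare the binary interactions listed above as sets. The sketch below is illustrative only: it treats each reported interaction as an unordered pair and computes the overlap (Jaccard index) between the snapdragon and torenia lists. The helper name `pair` is hypothetical, and the comparison assumes that an interaction missing from one list was tested and not detected, which the text does not establish for every pair.

```python
def pair(a: str, b: str) -> frozenset:
    """Represent an undirected binary interaction as an unordered pair."""
    return frozenset((a, b))


# Binary interactions reported for each species (Fujino et al., 2018),
# as listed in the text. "F3'H" stands for flavonoid 3'-hydroxylase.
snapdragon = {
    pair("FNSII", "CHS"), pair("FNSII", "CHI"), pair("FNSII", "DFR"),
    pair("CHS", "CHI"), pair("CHI", "DFR"), pair("F3'H", "CHI"),
}
torenia = {
    pair("FNSII", "CHI"), pair("FNSII", "F3H"), pair("FNSII", "DFR"),
    pair("FNSII", "ANS"), pair("CHI", "DFR"), pair("F3'H", "CHI"),
}

shared = snapdragon & torenia
jaccard = len(shared) / len(snapdragon | torenia)

print("Shared interactions:", sorted("-".join(sorted(p)) for p in shared))
print(f"Jaccard index: {jaccard:.2f}")
```

With the lists as given, four of the eight distinct interactions are common to both species, and three of those four involve FNSII or its partner CHI, in line with the proposed role of FNSII as a shared scaffold.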

# PROTEIN-PROTEIN INTERACTIONS OF FLAVONOID ENZYMES AND PROTEINS IN HOPS

Hops uniquely accumulate the prenylated flavonoids xanthohumol (3′-prenyl-6′-*O*-methyl-THC) (**10**, **Figure 3B**) and demethylxanthohumol (3′-prenyl-THC) (**9**) in the glandular trichomes (lupulin glands) of female cones, which are a key ingredient in beer brewing (Stevens and Page, 2004). Recent studies of the synthesis of these prenylated flavonoids provide examples of the involvement of non-catalytic CHI-fold proteins in flavonoid metabolons as specialized auxiliary proteins (Ban et al., 2018).

In hops, the soluble, trichome-specific isozyme of CHS (CHS\_H1) is involved in the biosynthesis of prenylchalcones and catalyzes the production of THC (**2**, **Figure 3B**); THC is then prenylated by the membrane-bound, aromatic prenyltransferase PT1L, then 6′-*O*-methylated by the soluble *O*-methyltransferase OMT1 to produce xanthohumol (**10**, **Figure 3B**; Ban et al., 2018). Recent studies have shown that non-catalytic members of the CHI-fold protein family, CHIL1 (a type-III, FAP-related protein) and CHIL2 (a type-IV, EFP-related protein), are involved in the syntheses of prenylated flavonoids in hops. CHIL2 was found to interact with CHS\_H1 and PT1L by Y2H, LuCIA, and IP assays (Ban et al., 2018). As PT1L is a membrane-bound enzyme with eight predicted transmembrane domains and is proposed to localize in trichome plastids (Li et al., 2015), these results suggest a membrane-anchored metabolon for xanthohumol biosynthesis. *In vitro* enzymatic assays showed that CHIL2 slightly enhances the catalytic efficiencies of CHS\_H1 and PT1L. Specifically, the binding of CHIL2 to CHS\_H1 results in a 6–18-fold increase in *k*cat and a 5.5–6.0-fold increase in *K*m for *p*-coumaroyl-CoA (**1**, **Figure 3B**) and malonyl-CoA, with up to a 2.9-fold increase in *k*cat/*K*m values, whereas the binding of CHIL2 to PT1L results in a slight increase in *V*max and a slight decrease in *K*m for these substrates, with up to a 2.2-fold increase in *V*max/*K*m values. *S. cerevisiae* was engineered for the production of demethylxanthohumol. The engineered yeast co-expressing CHIL2 and CHS\_H1 with PT1L produced greater amounts of demethylxanthohumol than those expressing CHS alone, consistent with CHIL2 functioning as an EFP *in vivo*. Thus, specific binding of CHIL2 to CHS enhances the rate of CHS-catalyzed entry from the general phenylpropanoid pathway to the flavonoid pathway (**Figure 1**) to potentiate flavonoid production.

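
Because catalytic efficiency is the ratio *k*cat/*K*m, a simultaneous rise in both parameters only partly translates into a gain in efficiency. The short sketch below works through that arithmetic using the fold-change ranges quoted above for CHS\_H1. It is an illustration under stated assumptions, not a reanalysis of the data: the function name is hypothetical, and the bounds assume that any *k*cat fold change in the quoted range could pair with any *K*m fold change in its range, which may not hold for a given substrate.

```python
def efficiency_fold_bounds(kcat_fold_range, km_fold_range):
    """Bound the fold change in kcat/Km from fold-change ranges of kcat and Km.

    fold(kcat/Km) = fold(kcat) / fold(Km), so the smallest value pairs the
    lowest kcat fold with the highest Km fold, and the largest value pairs
    the highest kcat fold with the lowest Km fold.
    """
    kcat_lo, kcat_hi = kcat_fold_range
    km_lo, km_hi = km_fold_range
    return kcat_lo / km_hi, kcat_hi / km_lo


# Fold changes quoted in the text for CHS_H1 upon CHIL2 binding.
low, high = efficiency_fold_bounds(kcat_fold_range=(6.0, 18.0),
                                   km_fold_range=(5.5, 6.0))
print(f"kcat/Km fold change lies between {low:.1f} and {high:.1f}")
```

The quoted maximum of a 2.9-fold increase in *k*cat/*K*m falls inside these bounds, which shows how a large rise in *k*cat can coincide with only a modest gain in catalytic efficiency when *K*m rises as well.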
Unlike CHIL2, CHIL1 did not interact with CHIL2, PT1L, CHS\_H1, *p*-coumaroyl-CoA ligase (see **Figure 1**), or OMT, as found by multiple methods (Ban et al., 2018). Binding assays and computational docking studies suggested that CHIL1 binds to demethylxanthohumol and THC to stabilize their ring-opening conformations, circumventing isomerization of THC to naringenin flavanone (Ban et al., 2018). This role of CHIL1 is consistent with the high accumulation of xanthohumol and demethylxanthohumol in hop glandular trichomes, in which almost no THC and naringenin were detected. PT1L is also involved in bitter acid biosynthesis in this plant and physically interacts with another membrane-bound, plastidial, aromatic prenyltransferase, PT2, to form a metabolon that catalyzes the prenylations in the β-bitter acid pathway. In this pathway, PT1L catalyzes the first prenylation and PT2 catalyzes the subsequent two prenylations (Li et al., 2015; Ban et al., 2018). Thus, PT1L might serve as a key scaffold for the biosynthesis of both terpenophenolics (bitter acids and prenylated flavonoids) in hop glandular trichomes.

# CONCLUSION AND FUTURE PERSPECTIVES

Interactions between enzymes and proteins in the flavonoid metabolons clearly vary between plant species. This is consistent with the species-dependent structural diversity of flavonoids and points to a role for differential metabolon formation in producing different structures of flavonoids in a plant species-specific manner. Although specific protein-protein interactions in some flavonoid metabolons have been identified, it is still difficult to grasp the larger picture of flavonoid metabolons and understand how enzymes and proteins dynamically form metabolons to regulate flavonoid biosynthesis. Specifically, the flavonoid metabolon models proposed to date are primarily based on the results of binary interaction analyses and do not describe how three or more enzymes and/or scaffolding proteins are simultaneously and cooperatively associated. Moreover, the protein-protein interactions in flavonoid metabolons remain to be investigated at the atomic level, and their functional significance is yet to be addressed in most important plant taxa. Finally, it is likely that the interactions underlying metabolon formation have evolved to implement higher orders of metabolic functions in the cell, allowing for the structural diversification of flavonoids. Taking full advantage of these important phenomena in synthetic biology has the potential to enhance the efficiency of production of many useful flavonoids in heterologous systems.

# AUTHOR CONTRIBUTIONS

TN mainly wrote the manuscript and was responsible for the general opinions stated in the manuscript. All authors reviewed and agreed with the final version of the submitted manuscript.

# FUNDING

This study was supported in part by JSPS KAKENHI grant number 18H03938.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Nakayama, Takahashi and Waki. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Evolutionary Diversification of Primary Metabolism and Its Contribution to Plant Chemical Diversity

### Hiroshi A. Maeda\*

Department of Botany, University of Wisconsin–Madison, Madison, WI, United States

#### Edited by:

Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan

#### Reviewed by:

Anthony Michael, UT Southwestern Medical Center, United States

Philipp Zerbe, University of California, Davis, United States

#### \*Correspondence:

Hiroshi A. Maeda maeda2@wisc.edu

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 01 May 2019; Accepted: 20 June 2019; Published: 10 July 2019

#### Citation:

Maeda HA (2019) Evolutionary Diversification of Primary Metabolism and Its Contribution to Plant Chemical Diversity. Front. Plant Sci. 10:881. doi: 10.3389/fpls.2019.00881

Plants produce a diverse array of lineage-specific specialized (secondary) metabolites, which are synthesized from primary metabolites. Plant specialized metabolites play crucial roles in plant adaptation as well as in human nutrition and medicine. Unlike the well-documented diversification of plant specialized metabolic enzymes, primary metabolism that provides essential compounds for cellular homeostasis is under strong selection pressure and generally assumed to be conserved across the plant kingdom. Yet, some alterations in primary metabolic pathways have been reported in plants. The biosynthetic pathways of certain amino acids and lipids have been altered in specific plant lineages. Also, two alternative pathways exist in plants for synthesizing primary precursors of the two major classes of plant specialized metabolites, terpenoids and phenylpropanoids. Such primary metabolic diversities likely underlie major evolutionary changes in plant metabolism and chemical diversity by acting as enabling or associated traits for the evolution of specialized metabolic pathways.

Keywords: plant chemical diversity, metabolic enzymes, primary metabolism, specialized metabolism, evolution of plant metabolism, amino acid biosynthesis

# INTRODUCTION

Plants produce a diverse array of secondary or specialized metabolites, which play critical roles in plant adaptation under various environmental conditions. These phytochemicals are also widely used in human nutrition and medicine. Nearly one million metabolites are estimated to be produced throughout the plant kingdom (Afendi et al., 2012), though many of them are yet to be discovered. All of these specialized metabolites are synthesized from one or more primary metabolite precursors, such as sugars, amino acids, nucleotides, organic acids, and fatty acids, which are essential for maintaining cellular homeostasis and the life of whole organisms. Besides their vital nature, primary metabolic pathways are highly regulated and integrated into complex metabolic networks (Baghalian et al., 2014; Sulpice and McKeown, 2015; Beckers et al., 2016; Filho et al., 2018). Consequently, genes encoding primary metabolic enzymes are subjected to purifying selection and generally considered to be conserved across the plant kingdom, unlike highly diversified specialized metabolism (Pichersky and Lewinsohn, 2011; Weng et al., 2012; Moghe and Last, 2015; Moore et al., 2019). Yet, some primary metabolic pathways were altered during plant evolution, which had profound impacts on overall plant physiology, metabolism, and adaptation. This review describes examples of primary metabolic diversification in different plant lineages and discusses their potential roles in the evolution of downstream specialized metabolic pathways and plant chemical diversity as enabling or associated traits.

# ENABLERS OF EVOLUTIONARY DIVERSIFICATION OF THE PHOTOSYNTHETIC CARBON FIXATION PATHWAYS

One of the most fundamental metabolic pathways of plants, photosynthetic carbon fixation, has been modified in a number of plant lineages to what is known as C<sub>4</sub> photosynthesis and Crassulacean acid metabolism, though the former will be mainly discussed here. Whereas C<sub>3</sub> photosynthesis produces 3-phosphoglycerate (3PGA), a three-carbon molecule, through ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco), C<sub>4</sub> photosynthesis initially generates a four-carbon molecule, oxaloacetate, through phosphoenolpyruvate (PEP) carboxylase (PEPC). Oxaloacetate is further converted to malate or aspartate and shuttled from mesophyll to bundle sheath cells, where CO<sub>2</sub> is released for refixation by Rubisco (**Figure 1**) (Langdale, 2011; Sage et al., 2012; Furbank, 2016). This highly intricate mechanism is seemingly maladaptive due to high metabolic costs (e.g., fixing carbon twice, regeneration of PEP), but provides an adaptive advantage under arid, warm, and high light conditions by concentrating CO<sub>2</sub> and attenuating the oxygenation side reaction of Rubisco and hence photorespiration (Christin and Osborne, 2014; Sage and Stata, 2015). Thus, besides the decline in atmospheric CO<sub>2</sub> around 30 million years ago (Pagani et al., 2005), such extreme environmental conditions, in which some plants existed, likely acted as an "environmental enabler" for the evolutionary diversification of photosynthetic carbon fixation, the entry step of plant metabolic pathways.

The C<sub>4</sub> photosynthetic pathway evolved more than 60 times independently across the plant phylogeny (Sage et al., 2011, 2012). Notably, C<sub>4</sub> photosynthesis is unevenly distributed across the phylogeny and particularly prevalent in specific plant lineages, such as Poaceae and Caryophyllales (Christin et al., 2009, 2015; Sage et al., 2011). Recent comparative analyses of C<sub>3</sub> and C<sub>4</sub> plants as well as C<sub>3</sub>-C<sub>4</sub> transitory species revealed that the repeated evolution of C<sub>4</sub> photosynthesis was likely facilitated by certain "pre-conditions" or "enabling traits" that emerged or were present in certain plant lineages (Ludwig, 2013; Sage et al., 2014; Heckmann, 2016; Miyake, 2016; Schlüter and Weber, 2016). These enabling traits include "genetic enablers," such as C<sub>4</sub>-like cell-type specific expression of C<sub>4</sub> enzymes (e.g., PEPC, Williams et al., 2012; Christin et al., 2013a, 2015) and "anatomical enablers," such as proto-Kranz anatomy (Christin et al., 2013b; Lundgren et al., 2014; Sage et al., 2014), in C<sub>3</sub> ancestors. These pre-conditions further facilitated emergence of "metabolic enablers," such as shuttling of photorespiratory glycine from mesophyll to bundle sheath cells acting as a CO<sub>2</sub> pump (Sage et al., 2013; Schulze et al., 2013). This so-called C<sub>2</sub> photosynthesis is present in many sister species to C<sub>4</sub> lineages (Sage et al., 2011, 2012; Khoshravesh et al., 2016) and appears to be accompanied by shuttling of other metabolites, such as alanine/pyruvate or aspartate/malate, for balancing of nitrogen between the mesophyll and bundle sheath cells (Mallmann et al., 2014; Schlüter and Weber, 2016). Once these pre-conditions were established, C<sub>4</sub> photosynthesis could evolve relatively easily and thus repeatedly, such as through optimization of kinetic properties of C<sub>4</sub> enzymes (e.g., PEPC) and bundle sheath specific expression of Rubisco (Langdale, 2011; Sage et al., 2012; Furbank, 2016; Reeves et al., 2017). Thus, the combination of environmental, genetic, anatomical, and metabolic enablers allowed astounding alterations in the core primary metabolic pathway, photosynthetic carbon fixation, in certain plant lineages.

# DIVERSIFICATION OF AMINO ACID BIOSYNTHETIC PATHWAYS AT THE INTERFACE OF PRIMARY AND SPECIALIZED METABOLIC PATHWAYS

Amino acid biosynthetic pathways not only provide essential protein building blocks but connect central carbon metabolism to a variety of specialized metabolism. Some of these amino acid pathways have diversified in certain plant lineages and likely contributed to the chemical diversity of their downstream specialized metabolism.

Isopropylmalate synthase (IPMS) catalyzes the committed step of leucine biosynthesis (de Kraker et al., 2007). IPMS competes for the 3-methyl-2-oxobutanoate (3MOB) substrate with valine biosynthesis (**Figure 1**) and is typically feedback inhibited by the end product, leucine, through its C-terminal allosteric regulatory domain (Koon et al., 2004; de Kraker and Gershenzon, 2011). Glandular trichomes of Solanaceae plants accumulate insecticidal specialized metabolites, acylsugars, which have various aliphatic acids attached to a sugar backbone (e.g., sucrose, Fan et al., 2019). A wild tomato, Solanum pennellii, and the cultivated tomato, Solanum lycopersicum, have 2-methylpropanoic and 3-methylbutanoic acid (iC4 and iC5) acyl chains, which are derived from 3MOB and 3-isopropylmalate, intermediates of valine and leucine metabolism, respectively (**Figure 1**). Analysis of introgression lines between S. lycopersicum and S. pennellii, followed by expression and biochemical analyses, revealed that the C-terminal regulatory domain of the IPMS3 isoform is truncated in S. lycopersicum, making this isoform insensitive to leucine-mediated feedback inhibition (Schilmiller et al., 2010; Ning et al., 2015). In contrast, the IPMS3 isoform of S. pennellii is further truncated into its catalytic domain and has lost the enzyme activity. Thus, the de-regulated and inactive IPMS3 in S. lycopersicum and S. pennellii directs more carbon flow toward leucine and valine metabolism, respectively. Given the broad substrate specificity of the downstream acyl-CoA-dependent acyltransferase (Schilmiller et al., 2015), increased availability of 3MOB and 3-isopropylmalate contributes to the formation of iC4 and iC5 acylsugars, respectively. Brassicaceae species including Arabidopsis thaliana also have a truncated IPMS homolog, but with point mutations that alter its substrate specificity so that it functions as methylthioalkylmalate synthase in the initial step of methionine-derived glucosinolate biosynthesis (de Kraker and Gershenzon, 2011). Unlike the latter example of recruitment of specialized metabolic enzymes from primary metabolism, as discussed in previous reviews (Weng, 2014; Moghe and Last, 2015), the study by Ning et al. (2015) revealed a role of altered branched-chain amino acid biosynthesis in the acyl chain diversity of acylsugars in the Solanum genus.

Anthranilate synthase (AS) catalyzes the committed step of biosynthesis of an aromatic amino acid, L-tryptophan, and its enzyme activity is strictly regulated through feedback inhibition of one subunit of the AS enzyme complex, ASα, by tryptophan (Romero et al., 1995; Li and Last, 1996). Two copies of ASα genes, ASα1 and ASα2, were found in Ruta graveolens (the Rutaceae family), which uses anthranilate to produce unique specialized metabolites, acridone alkaloids (Bohlmann et al., 1995). While ASα2 was constitutively expressed, ASα1 was induced under elicitor treatment, which stimulates the accumulation of acridone alkaloids. Interestingly, the ASα1 enzyme was much more resistant than ASα2 to the tryptophan-mediated feedback inhibition, suggesting that the expression of the de-regulated ASα1 enzyme allowed elevated accumulation of the anthranilate precursor and hence efficient production of the downstream specialized metabolites, acridone alkaloids, in this unique plant lineage (Bohlmann et al., 1996; **Figure 1**). A naturally occurring feedback-insensitive ASα enzyme has also been identified in Nicotiana tabacum (the Solanaceae family, Song et al., 1998), but its in planta function is currently unknown. Further evolutionary analyses across the Rutaceae family can evaluate if the increased availability of anthranilate served as an enabling trait for later evolution of acridone alkaloid biosynthesis. Alternatively, the de-regulated ASα1 might have evolved after the emergence of the acridone alkaloid pathway as an associated trait and further elevated the alkaloid production.
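
The argument that a feedback-insensitive enzyme raises precursor availability can be illustrated with a toy steady-state calculation. The sketch below is a deliberately simplified model, not taken from the cited studies: it assumes a single product pool P that is made at rate Vmax / (1 + P/Ki) (simple product inhibition) and consumed at rate k·P. All parameter names and values are hypothetical and in arbitrary units; a larger Ki stands for weaker feedback inhibition, and an infinite Ki for a fully insensitive enzyme.

```python
import math


def steady_state_pool(vmax: float, k_use: float, ki: float) -> float:
    """Steady-state product pool for a feedback-inhibited biosynthetic step.

    Toy model (an assumption for illustration):
        synthesis   = vmax / (1 + P / ki)
        consumption = k_use * P
    Setting the two equal gives P**2 + ki * P - vmax * ki / k_use = 0.
    """
    if vmax <= 0 or k_use <= 0 or ki <= 0:
        raise ValueError("vmax, k_use, and ki must be positive")
    if math.isinf(ki):
        # No feedback inhibition: synthesis is constant at vmax.
        return vmax / k_use
    return (-ki + math.sqrt(ki * ki + 4.0 * vmax * ki / k_use)) / 2.0


# Hypothetical parameters in arbitrary units.
for label, ki in [("sensitive", 1.0), ("relaxed", 100.0), ("insensitive", math.inf)]:
    pool = steady_state_pool(vmax=10.0, k_use=1.0, ki=ki)
    print(f"{label:>11} enzyme (Ki = {ki}): steady-state pool = {pool:.2f}")
```

In this sketch the pool rises as inhibition is relaxed, which mirrors the qualitative reasoning in the text: de-regulating the committed enzyme raises the supply of the precursor available to a downstream specialized pathway. The model ignores competing branches, transport, and enzyme abundance, so it should be read only as an illustration of the direction of the effect.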

L-Tyrosine is another aromatic amino acid required for protein synthesis but also used to produce diverse plant natural products, such as tocochromanols, benzylisoquinoline alkaloids, cyanogenic glycosides (e.g., dhurrin), and rosmarinic acids (Schenck and Maeda, 2018). Tyrosine is typically produced via arogenate dehydrogenase (TyrA<sub>a</sub>) that is localized within the plastids (Rippert et al., 2009; Wang et al., 2016) and strongly feedback inhibited by tyrosine (**Figure 1**; Connelly and Conn, 1986; Rippert and Matringe, 2002a,b). Recent studies, however, uncovered diversification of the tyrosine biosynthetic pathways in different plant lineages. In addition to the highly regulated plastidic TyrA<sub>a</sub>-mediated pathway, many legumes including Glycine max (soybean) and Medicago truncatula have an additional tyrosine biosynthetic pathway mediated by prephenate dehydrogenase (TyrA<sub>p</sub>) (Rubin and Jensen, 1979; Schenck et al., 2015), which is often found in microbes (Bonner and Jensen, 1987; Bonner et al., 2008; Schenck et al., 2017b). Notably, these legume TyrA<sub>p</sub> enzymes are localized outside of the plastids and completely insensitive to feedback inhibition by tyrosine (Schenck et al., 2015, 2017a), suggesting that the alternative tyrosine pathway is physically separated from the canonical plastidic pathway and has escaped feedback inhibition by tyrosine (**Figure 1**). While the metabolic and physiological functions of the alternative cytosolic TyrA<sub>p</sub> pathway in legumes are largely unknown, some legumes accumulate very high levels of tyrosine and tyrosine-derived compounds (e.g., L-DOPA in Mucuna pruriens, Wichers et al., 1993; Lokvam et al., 2006). A recent study found that the expression of the gene encoding the tyrosine-insensitive TyrA<sub>p</sub> enzyme is elevated in Inga species that accumulate tyrosine and its derived secondary metabolites (e.g., tyrosine-gallates) at 5 to 20% of seedling dry weight (Coley et al., 2019). Thus, the presence of the feedback-insensitive TyrA<sub>p</sub> enzyme in the legume family likely provided a unique pre-condition that enabled increased tyrosine biosynthetic activity and hyperaccumulation of tyrosine-derived compounds in this specific genus of legumes.

Betalains are red to yellow alkaloid pigments uniquely produced in the plant order Caryophyllales, which includes Beta vulgaris (beet), spinach, quinoa, and cactus. Betalain pigments are derived from tyrosine and have replaced the more ubiquitous red to purple anthocyanin pigments derived from phenylalanine in many Caryophyllales species (Tanaka et al., 2008; Brockington et al., 2011; Polturak and Aharoni, 2018; **Figure 1**). Like Arabidopsis and unlike legumes, Caryophyllales species only have arogenate-specific TyrA<sub>a</sub> enzymes; however, one TyrA<sub>a</sub> isoform (TyrA<sub>a</sub>α) exhibits relaxed sensitivity to tyrosine inhibition (Lopez-Nieves et al., 2018; **Figure 1**). The presence of the de-regulated TyrA<sub>a</sub>α enzymes correlates positively with betalain pigmentation and negatively with anthocyanin pigmentation across Caryophyllales. Evolutionary analyses utilizing transcriptome data of over one hundred Caryophyllales species (Brockington et al., 2015) revealed that the de-regulated TyrA<sub>a</sub>α enzymes emerged before the evolution of the betalain biosynthetic pathway (Lopez-Nieves et al., 2018). Thus, the enhanced supply of the tyrosine precursor, due to relaxed regulation of the TyrA<sub>a</sub> enzyme, likely acted as a metabolic enabler for the subsequent evolution of a novel downstream specialized metabolic pathway, betalain biosynthesis, in this specific plant order (**Figure 1**). Further evolutionary analyses of associated genes and enzymes involved in the betalain pathway and the competing phenylalanine and phenylpropanoid pathways will provide novel insight into how primary and specialized metabolism evolved coordinately on a macroevolutionary scale beyond the levels of species and genera.

# ANCIENT DIVERSIFICATION OF IPP AND PHENYLALANINE BIOSYNTHETIC PATHWAYS IN PLANTAE

In the ancient history of Plantae, alternative primary metabolic pathways evolved and likely contributed to later evolution of plant specialized metabolism and chemical diversity. Terpenoids and phenylpropanoids are the two major classes of plant natural products, which are synthesized from the primary metabolite precursors, isopentenyl pyrophosphate (IPP) and phenylalanine, respectively (McGarvey and Croteau, 1995; Gershenzon and Dudareva, 2007; Vogt, 2010; Tohge et al., 2013). Notably, plants possess two alternative pathways to synthesize IPP and phenylalanine.

In addition to sterols and quinones, plants use IPP to synthesize photosynthetic pigments (chlorophylls, carotenoids), plant hormones (brassinosteroids, abscisic acid, gibberellins), and a diverse array of terpenoid compounds (McGarvey and Croteau, 1995; Gershenzon and Dudareva, 2007; Tholl, 2015). Such a high demand for IPP for the synthesis of diverse terpenoid compounds in plants is supported by the two alternative IPP biosynthetic pathways, the methylerythritol phosphate (MEP) and mevalonate (MVA) pathways, which take place in the plastidic and extra-plastidic subcellular compartments, respectively (Vranová et al., 2013; Rodríguez-Concepción and Boronat, 2015). The MEP pathway utilizes glyceraldehyde 3-phosphate derived from the pentose phosphate pathways in the plastids and hence can draw carbon flux directly from photosynthetic carbon fixation (**Figure 1**). While the MVA pathway appears to be an ancestral pathway that evolved in all three domains of life (i.e., eukaryotes, archaea, and most bacteria) or in their last universal ancestor (i.e., cenancestor) (Lombard and Moreira, 2011), the plastidic MEP pathway has mosaic evolutionary origins (Lange et al., 2000; Matsuzaki et al., 2008). A common ancestor of plastid-bearing eukaryotes likely acquired MEP pathway enzymes from various bacterial ancestors (i.e., cyanobacteria, α-proteobacteria, Chlamydia) through horizontal gene transfers (Matsuzaki et al., 2008), and the MEP pathway was then vertically transmitted to the descendants, the entire Plantae including algae and plants.

L-Phenylalanine is the primary metabolite precursor of phenylpropanoids and is synthesized via two alternative pathways in plants (Tzin and Galili, 2010; Maeda and Dudareva, 2012; Yoo et al., 2013; Qian et al., 2019). In many microbes, phenylalanine is synthesized via the phenylpyruvate intermediate, catalyzed by prephenate dehydratase (PDT) and phenylpyruvate aminotransferase (**Figure 1**) (Bentley, 1990). Although an analogous phenylpyruvate pathway also exists in the plant cytosol (Yoo et al., 2013; Qian et al., 2019), plants synthesize phenylalanine mainly in the plastids via the L-arogenate intermediate: prephenate is first transaminated by prephenate aminotransferase (PPA-AT) to arogenate (Graindorge et al., 2010; Dal Cin et al., 2011; Maeda et al., 2011), which is then converted to phenylalanine by arogenate dehydratase (ADT; Siehl and Conn, 1988; Cho et al., 2007; Maeda et al., 2010; **Figure 1**). Evolutionary analyses of the PPA-AT and ADT enzymes suggested that an ancestor of green algae and land plants acquired both enzymes from an ancestor of Chlorobi/Bacteroidetes bacteria, likely through horizontal gene transfer (Dornfeld et al., 2014). Some cyanobacteria also have PPA-AT enzymes but with a distinct evolutionary origin from those of plants and Chlorobi/Bacteroidetes bacteria (Graindorge et al., 2014; Giustini et al., 2019). Thus, these dual primary metabolic pathways of isoprenoid and phenylalanine biosynthesis appear to have evolved in a common ancestor of Plantae. Although evolutionary analyses of such deep phylogenetic nodes are challenging, these dual precursor supply pathways potentially served as metabolic enablers for the evolutionary expansion of terpenoids and phenylpropanoids, the hallmarks of chemical diversity uniquely seen in the plant kingdom today.

# DIVERSIFICATION OF LIPID METABOLISM IN PLANTS

Notable chemical diversity also exists in plant lipid metabolism (Badami and Patil, 1980; Ohlrogge et al., 2018), which makes the boundary between primary and specialized (secondary) metabolism difficult to define. Besides major acyl chains (e.g., oleic 18:1, linolenic 18:3) found in most plant lipids, some plants produce unusual fatty acids. For example, oils of castor (Ricinus communis, Euphorbiaceae family) and Vernonia galamensis (Asteraceae family) consist primarily (80–90%) of hydroxylated and epoxy fatty acids, respectively (Canvin, 1963; Ayorinde et al., 1990). Also, diverse acetylenic natural products having one or more carbon-carbon triple bonds (alkynyl functional groups) can be produced by modification of the fatty acid precursors (Minto and Blacklock, 2008; Negri, 2015). The production of these hydroxylated fatty acids and polyacetylenes is mediated by divergent fatty acid desaturases with altered product specificities and catalytic properties (van de Loo et al., 1995; Broun et al., 1998; Liu et al., 1998; Broadwater et al., 2002; Minto and Blacklock, 2008; Negri, 2015). Tremendous diversity of cuticular waxes has also been documented across the plant kingdom, likely due to the presence of specialized acyl chain elongation and modifying enzymes (Jetter et al., 2007; Busta and Jetter, 2018).

Recent studies also revealed an intriguing alteration in the core lipid metabolic pathway, triacylglycerol (TAG) biosynthesis, in a specific plant lineage. The fruits of Bayberry (Myrica pensylvanica, Myricaceae family) accumulate abundant and unusual extracellular glycerolipids: TAG, diacylglycerol (DAG), and monoacylglycerol with completely saturated acyl chains at up to 30% of fruit dry weight (Harlow et al., 1965; Simpson and Ohlrogge, 2016). This unique surface wax attracts birds for seed dispersal and is used for making scented candles (Fordham, 1983). Fleshy fruits of oil palm, olive, and avocado also accumulate a large quantity of glycerolipids, but they do so intracellularly and by upregulating conventional fatty acid and TAG biosynthetic pathways (Bourgis et al., 2011; Kilaru et al., 2015). In contrast, a novel TAG biosynthetic pathway evolved in Bayberry through "re-purposing" genes and enzymes involved in cutin biosynthesis by altering their gene expression (Simpson and Ohlrogge, 2016; Simpson et al., 2016). These alterations include elevated expression of genes encoding the G subfamily of ABC (ABCG) transporters and lipid transporter proteins, likely required for lipid transport across cell membranes and walls, respectively, which allows extracellular formation of TAG (Simpson et al., 2016). It will be interesting to examine how such reprogramming of existing lipid metabolic pathways occurs in a step-wise manner during evolution, which will provide useful information for engineering other plants to produce and secrete abundant extracellular glycerolipids.

# SUMMARY AND PERSPECTIVE

Accumulating evidence indicates that pathways and enzymes of primary metabolism can diversify during plant evolution, although less frequently than those of specialized metabolism. Such relatively rare alterations in primary metabolism likely contributed to major evolutionary innovations in the plant kingdom, including the evolution of downstream specialized metabolic pathways and hence plant chemical diversity. Some alterations in primary metabolism appear to have acted as enabling traits for the evolution of novel specialized metabolism, at least in the case of de-regulated tyrosine biosynthesis in Caryophyllales that preceded the emergence of betalain pigmentation (Lopez-Nieves et al., 2018). In other instances, primary metabolic alterations likely co-evolved with, and support the efficient operation of, specialized metabolic pathways. It remains to be examined how prevalent the phenomenon is beyond the pathways and plant lineages that have been examined so far and what impacts such primary metabolic diversification had on overall metabolism, physiology, and environmental adaptation of diverse plant species. Another intriguing question is how seemingly maladaptive alterations in highly conserved and constrained primary metabolism were maintained in certain plant lineages, especially until the emergence of a new downstream pathway which might have eventually provided an adaptive advantage. What are the environmental, anatomical, and genetic enablers underlying primary metabolic diversification? In the case of tomato feedback-insensitive IPMS and legume TyrA<sup>p</sup> enzymes, their specific expression in the apical trichome cells (Ning et al., 2015) and extra-plastidic subcellular compartment (Schenck et al., 2015) likely allow minimal disturbance to de novo biosynthesis of branched-chain and aromatic amino acids, respectively. Further addressing these questions will lead to a broader understanding of the evolution of plant metabolism at a macroevolutionary scale. The acquired knowledge of primary metabolic diversification and its underlying genetic and biochemical basis will also allow us to redesign plant metabolism in a holistic manner from primary to specialized metabolism.

# AUTHOR CONTRIBUTIONS

HM wrote the manuscript.


# FUNDING

This work was supported by the National Science Foundation grant IOS-1354971 and the Agriculture and Food Research Initiative Competitive grant (2015-67013-22955) from the USDA National Institute of Food and Agriculture to HM.

# ACKNOWLEDGMENTS

I would like to thank Dr. Luke Busta for helpful discussion and suggestions.

# REFERENCES

Bayberry (Myrica pensylvanica) fruits. Biochim. Biophys. Acta 1861, 1243–1252. doi: 10.1016/j.bbalip.2016.01.022


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Maeda. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Identification and Functional Characterization of Genes Involved in the Biosynthesis of Caffeoylquinic Acids in Sunflower (*Helianthus annuus* L.)

*Ketthida Cheevarungnapakul<sup>1</sup>, Gholamreza Khaksar<sup>1</sup>, Pawinee Panpetch<sup>1</sup>, Patwira Boonjing<sup>1</sup> and Supaart Sirikantaramas<sup>1,2</sup>\**

#### *Edited by:*

*Danièle Werck, Centre National de la Recherche Scientifique (CNRS), France*

#### *Reviewed by:*

*David Gagneul, Lille University of Science and Technology, France; Alain Hehn, Université de Lorraine, France*

### *\*Correspondence:*

*Supaart Sirikantaramas supaart.s@chula.ac.th*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 01 March 2019; Accepted: 10 July 2019; Published: 31 July 2019*

#### *Citation:*

*Cheevarungnapakul K, Khaksar G, Panpetch P, Boonjing P and Sirikantaramas S (2019) Identification and Functional Characterization of Genes Involved in the Biosynthesis of Caffeoylquinic Acids in Sunflower (Helianthus annuus L.). Front. Plant Sci. 10:968. doi: 10.3389/fpls.2019.00968*

*<sup>1</sup> Molecular Crop Research Unit, Department of Biochemistry, Faculty of Science, Chulalongkorn University, Bangkok, Thailand, <sup>2</sup> Molecular Sensory Science Center, Faculty of Science, Chulalongkorn University, Bangkok, Thailand*

Sunflower (*Helianthus annuus* L.) sprouts accumulate high amounts of caffeoylquinic acids (CQAs) including chlorogenic acid (5-CQA) and 1,5-diCQA. These compounds, which can be found in many plants, including tomato, globe artichoke, and chicory, have many health benefits, including antioxidant, antihepatotoxic, and antiglycative activities. However, CQA profiles and biosynthesis have not previously been studied in sunflower sprouts. In the present study, we found that 5-CQA and 1,5-diCQA were the major CQAs found in sunflower sprouts. We also identified minor accumulation of other CQAs, namely 3-CQA, 4-CQA, 3,4-diCQA, and 4,5-diCQA. According to genome-wide identification and phylogenetic analysis of genes involved in CQA biosynthesis in sunflower, three genes (*HaHQT1*, *HaHQT2*, and *HaHQT3*) encoding hydroxycinnamoyl CoA:quinate hydroxycinnamoyl transferase (HQT) and two genes (*HaHCT1* and *HaHCT2*) encoding hydroxycinnamoyl CoA:shikimate/quinate hydroxycinnamoyl transferase (HCT) were identified. Expression analysis of these five genes in hypocotyls and cotyledons strongly suggested that HaHQT2 could be the main enzyme responsible for CQA biosynthesis, as *HaHQT2* had the highest expression levels. In addition, when transiently expressed in the leaves of *Nicotiana benthamiana*, all three HaHQTs, which were soluble and not membrane-bound enzymes, could increase the content of 5-CQA by up to 94% compared to that in a control. Overall, our results increase understanding of CQA biosynthesis in sunflower sprouts and could be exploited by plant breeders to enhance accumulation of health-promoting CQAs in these plants.

Keywords: hydroxycinnamoyl-coenzyme A:quinate hydroxycinnamoyl transferase, hydroxycinnamoyl-coenzyme A:shikimate/quinate hydroxycinnamoyl transferase, caffeoylquinic acid, sunflower (*Helianthus annuus* L.), sprout, functional characterization

# INTRODUCTION

Increased health consciousness among consumers and concerns about the negative health effects of chemical preservatives used in the food industry have led to increased interest in natural and herbal substances. Fruits and vegetables accumulate a wide range of bioactive compounds with many health-promoting benefits. Among these bioactive compounds, phenolics are of high importance, and caffeoylquinic acids (CQAs) comprise one of the most common phenolic groups. CQAs are formed when caffeoyl moieties are esterified to quinic acid. They can be categorized into various groups based on the position, number, and identity of their acyl groups. The monocaffeoylquinic acid (monoCQA) group includes 1-CQA, 3-CQA (known as neochlorogenic acid), 4-CQA (known as cryptochlorogenic acid), and 5-CQA (known as chlorogenic acid). The dicaffeoylquinic acid (diCQA) group includes 1,3-diCQA, 1,4-diCQA, 1,5-diCQA, 3,4-diCQA, 3,5-diCQA, and 4,5-diCQA (**Figure 1**).

CQAs can be found in numerous plant species, including *Cichorium intybus* (chicory; Legrand et al., 2016), *Cynara cardunculus* L. var. *scolymus* (globe artichoke; Moglia et al., 2016), and *Helianthus annuus* L. (sunflower; Sun et al., 2012) in the family Asteraceae; *Solanum lycopersicum* L. (tomato) and *Nicotiana tabacum* L. (tobacco) in the family Solanaceae (Niggeweg et al., 2004); *Coffea* spp. (coffee; Lallemand et al., 2012) in the family Rubiaceae; and *Ipomoea batatas* L. (sweet potato; Kojima and Kondo, 1985) and *Ipomoea aquatica* (water spinach; Lawal et al., 2016) in the family Convolvulaceae. The antioxidant, anti-inflammatory, anti-hypertension, and antimicrobial properties of CQAs are well documented by previous studies. In addition, several *in vitro* and *in vivo* studies have shown additional benefits of CQAs, such as a reduction in the risk of cardiovascular diseases (de Sotillo and Hadley, 2002), hepatoprotective properties (Salomone et al., 2017), and inhibition of HIV replication and integration (McDougall et al., 1998; Gu et al., 2007). Notably, CQAs can confer resistance to abiotic stressors such as UV light (Cle et al., 2008) and to biotic stressors (Niggeweg et al., 2004; Leiss et al., 2009; Legrand et al., 2016) in plants. The bioactivity of CQAs mainly depends on their isomerization. For example, the number and position of caffeic acid moieties in diCQAs affect their antioxidant properties (Xu et al., 2012).

Sunflower (*Helianthus annuus* L.) seeds and sprouts are rich in phenolic compounds and vitamins and thus exhibit a wide variety of potential health-beneficial characteristics, including anti-inflammatory, antimicrobial, antioxidant, antihypertensive, and wound-healing properties (Fowler, 2006; Bashir et al., 2015; Guo et al., 2017). In a study by Weisz et al. (2009), monoCQA and diCQA contents reached up to ~3,359 and 460 mg per 100 g dry weight of seed kernels, respectively. Among the 11 phenolic compounds analyzed in sunflower seed kernels, 5-CQA was the most abundant. In another study by Sun et al. (2012), the antiglycative and antioxidant characteristics of four edible sprouts were investigated, and sunflower sprout extract was found to exhibit antiglycative properties similar to those of aminoguanidine, a well-known synthetic antiglycative agent. The strong antioxidant and antiglycative properties of sunflower sprouts were attributed to their rich 1,5-diCQA content. Pająk et al. (2014) examined the effect of germination on nutritional value (total phenolics and flavonoids) and antioxidant properties of sunflower seeds. Interestingly, they found that total phenolics, flavonoids, and antioxidant capabilities were significantly higher in sunflower sprouts than in seeds. Moreover, HPLC profiling of sunflower phenolics revealed that CQA content was 3.7-fold higher in sprouts than in seeds. Taken together, these observations prompted us to investigate sunflower sprouts in our study.

FIGURE 1 | Structures of caffeoylquinic acids found in plants. The names, abbreviations, common names, and chemical structures of caffeoylquinic acid derivatives, including monocaffeoylquinic acids (monoCQAs) and dicaffeoylquinic acids (diCQAs) are shown.

CQAs are biosynthesized *via* the phenylpropanoid pathway in plants (Comino et al., 2009). The starting point of this pathway is the aromatic amino acid phenylalanine (Phe), which is deaminated by phenylalanine ammonia lyase (PAL) to form cinnamic acid. Then, cinnamate-4-hydroxylase (C4H) and 4-coumarate coenzyme A ligase (4CL) sequentially convert cinnamic acid to *p*-coumaroyl-CoA. Two possible routes have been proposed for the next step in CQA synthesis. In the first route, hydroxycinnamoyl-CoA:quinate hydroxycinnamoyl transferase (HQT) converts *p*-coumaroyl-CoA to coumaroylquinate, which is then hydroxylated by *p*-coumarate-3′-hydroxylase (C3′H) to form CQA (Niggeweg et al., 2004; Comino et al., 2009; Menin et al., 2010). In the alternative second route, hydroxycinnamoyl-CoA:shikimate/quinate hydroxycinnamoyl transferase (HCT) catalyzes the formation of *p*-coumaroylshikimate from *p*-coumaroyl-CoA. The *p*-coumaroylshikimate is subsequently hydroxylated to caffeoylshikimic acid by C3′H (Mahesh et al., 2007; Moglia et al., 2009). Notably, HQT and HCT both catalyze reversible reactions (**Figure 2**).
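
The two proposed routes can be written down as a small lookup table, which makes it easy to see where they share steps and where they diverge. This is only an illustrative sketch: the data structure and the `trace` helper are hypothetical, and the *p*-coumaric acid intermediate between C4H and 4CL is standard phenylpropanoid chemistry rather than something named in the text above. The HCT route is listed only as far as the text describes it.

```python
# Steps shared by both routes, as (enzyme, product) pairs.
SHARED = [
    ("PAL", "cinnamic acid"),
    ("C4H", "p-coumaric acid"),  # intermediate not named in the text (assumption)
    ("4CL", "p-coumaroyl-CoA"),
]

# Route-specific steps, following the description in the text.
ROUTES = {
    "HQT": SHARED + [
        ("HQT", "p-coumaroylquinate"),
        ("C3'H", "caffeoylquinic acid"),
    ],
    "HCT": SHARED + [
        ("HCT", "p-coumaroylshikimate"),
        ("C3'H", "caffeoylshikimic acid"),
    ],
}


def trace(route, start="phenylalanine"):
    """Render one route as 'substrate --enzyme--> product ...'."""
    parts = [start]
    for enzyme, product in ROUTES[route]:
        parts.append(f"--{enzyme}--> {product}")
    return " ".join(parts)


print(trace("HQT"))
print(trace("HCT"))
```

Both routes share the first three steps and differ only in the acyl acceptor (quinate versus shikimate) used by the transferase.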

HQT and HCT enzymes belong to the BAHD superfamily of plant-specific acyl-CoA-dependent acyltransferases (Yu et al., 2009; Tuominen et al., 2011). However, our knowledge regarding the biosynthesis of diCQAs is still limited. In a study by Lallemand et al. (2012), a recombinant HCT enzyme cloned from coffee was shown to synthesize diCQAs from 5-CQA. In addition, the enzyme HQT was reported to convert 5-CQA to diCQAs in tomato (Moglia et al., 2014). Legrand et al. (2016) identified two HCTs (HCT1 and HCT2) and three HQTs (HQT1, HQT2, and HQT3) in chicory. Notably, increased levels of 3-CQA were detected in *N. benthamiana* leaves transiently expressing HQT1 or HCT1. Moreover, genes involved in CQA biosynthesis, including *HQT*, *HCT*, *C3*′*H*, *Acyltransf\_1*, and *Acyltransf\_2*, have been isolated and characterized in artichoke (Comino et al., 2007, 2009; Moglia et al., 2009; Menin et al., 2010). Moglia et al. (2016) found two *HCT*s and three *HQT*s in artichoke, which is the same number of *HCT*s and *HQT*s found in chicory. To the best of our knowledge, no published studies describe how CQAs are formed during sprouting stages or how high the CQA content is. Therefore, we focused on sunflower sprouts, which are currently gaining popularity among health-conscious consumers. In addition, no HQTs or HCTs have been identified in sunflower. Acquiring a deeper understanding of CQA biosynthetic enzymes in sunflower is critically important to aid efforts to biofortify sunflower sprouts as functional foods. In the present study, we report the identification and functional characterization of key genes involved in CQA biosynthesis in sunflower sprouts.

# MATERIALS AND METHODS

## Plant Material and Growth Conditions

Sunflower (*H. annuus* L.) seeds were purchased from a local supplier ("Super Top," Thailand). The seeds were first washed, soaked in tap water for 8 h, and wrapped with wet cheesecloth overnight. Next, they were germinated on coconut dust at 30°C under 60% relative humidity for 9 days, with dark conditions for the first 48 h, followed by a 12/12 h light/dark photoperiod for the remaining days. The seed coat was gently removed at hour 56 (after exposure to the light for 8 h; **Supplementary Figure S1A**). To compare metabolite levels and gene expression among samples from different developmental stages and tissues, the experiment was divided into two sets. For the first set, whole plants excluding roots were collected at different time points after germination (days 3–9, 2 h after light exposure each day; **Supplementary Figure S1B**). For the second set, cotyledons and stems were collected similarly at day 5. For both sets, there were five replicates per time point, and each replicate consisted of 10 sprouts. Replicates were collected separately, frozen in liquid nitrogen, ground into a fine powder using an MM400 mixer mill (Retsch®, Germany) at 30 Hz for 1 min, and then stored at −80°C until further use. In addition, half of the samples were freeze dried for HPLC analysis.

To grow five-week-old plants for an agroinfiltration experiment, *N. benthamiana* seeds were sown on peat moss and grown in a controlled-climate room at 25°C and with a 16/8 h light/dark photoperiod (artificial light of 4,500 Lux). Two-week-old plants were transplanted individually into pots and were left to continue growing under the same conditions.

## Determination of Caffeoylquinic Acids Contents of Plant Tissues

To analyze CQA contents of plant tissues, 20 mg dry weight of sunflower sprout tissue and 20 mg fresh weight of *N. benthamiana* leaves were extracted with 1 ml of 80% (v/v) methanol containing an internal standard, 0.05 g L<sup>−1</sup> puerarin. The mixtures were shaken vigorously at 1,500 rpm and 15°C for 15 min and then centrifuged at 12,000 × *g* for 15 min. The supernatant was collected and filtered through 0.2 μm nylon syringe filters.

A Shimadzu UFLC system equipped with an SPD-M20A photodiode array detector (Shimadzu, Japan) and a Kinetex® C18 column (250 mm × 4.6 mm, 5 μm; Phenomenex®, USA) was used to analyze 10 μl of the extract from sunflower sprouts and *N. benthamiana* leaves. Chromatographic separation was performed using 0.1% (v/v) TFA in water (solvent A) and 0.1% (v/v) TFA in acetonitrile (solvent B) as the mobile phase. The following elution gradient was used: 5% B for 5 min, 5–15% B for 10 min, a 25-min hold, 15–100% B for 4 min, a 2-min hold, 100–5% B for 4 min, and a 5-min hold. The flow rate was set at 1.5 ml min<sup>−1</sup>, and the column oven temperature was maintained at 40°C. UV spectra were acquired in the range of 190–800 nm, and chromatograms were obtained at 320 nm. Peaks matching the retention time and UV spectrum of a commercial standard were identified as CQAs. Amounts of each CQA were calculated according to the calibration curve in the range of 0.5–0.007825 mg ml<sup>−1</sup>. Puerarin was used as an internal standard (Sigma-Aldrich, USA). All CQA standards used in this study were purchased from Carbosynth, England.
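
The quantification step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes a linear calibration fitted by ordinary least squares and analyte peak areas normalized to the puerarin internal-standard peak area, neither of which is spelled out in the text. The function names and all numbers are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def quantify(area_analyte, area_istd, slope, intercept):
    """Convert an internal-standard-normalized area ratio to mg/ml."""
    ratio = area_analyte / area_istd
    return (ratio - intercept) / slope


# Hypothetical calibration points: standard concentration (mg/ml)
# versus analyte/internal-standard peak area ratio.
conc = [0.5, 0.25, 0.125, 0.0625]
ratios = [4.0, 2.0, 1.0, 0.5]
slope, intercept = fit_line(conc, ratios)

# Hypothetical sample with analyte area 300 and internal-standard area 200.
print(quantify(300.0, 200.0, slope, intercept))
```

Normalizing to the internal standard compensates for variation in extraction recovery and injection volume between samples, which is the usual reason for adding one to the extraction solvent.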

Additionally, to confirm the identities of the CQAs, the components were analyzed using an Agilent UHPLC system (Agilent Technologies, USA) with a Kinetex® C18 column (250 mm × 4.6 mm, 5 μm; Phenomenex®, USA). The following elution gradient was used: 0–5% B for 5 min, 5–15% B for 30 min, a 65-min hold, 15–100% B for 5 min, a 5-min hold, 100–5% B for 5 min, and a 10-min hold. The flow rate was set at 0.5 ml min<sup>−1</sup>, and the column oven temperature was maintained at 40°C. For MS/MS analysis, a QTRAP® 4500 MS/MS System (AB Sciex™, USA) was used in multiple reaction monitoring (MRM) and negative ionization (ESI−) mode. Operating conditions for MS analysis were as follows: heat block temperature of 500°C, curtain nitrogen gas at 30 psi, nebulizer and auxiliary gases at 50 psi, collision nitrogen gas at medium position, ionization voltage of −4,500 V, and entrance potential (EP) of −10 V. For the tested compounds, transitions were monitored under optimized instrument conditions: collision energy (CE) of −35 eV, declustering potential (DP) of −50 V, and collision cell exit potential (CXP) of −12 V.

## Identification of Putative *HQT* and *HCT* Genes in Sunflower (*HaHQT*s and *HaHCT*s)

The HQT of tomato (*Solanum lycopersicum*; NP\_001234850.2; Moglia et al., 2014) was used as a query for a tBLASTn search against the sunflower genome database HA412HO bronze assembly<sup>1</sup>. The open reading frames of *HQT* and *HCT* were identified. Then, using EMBL-EBI Clustal Omega (McWilliam et al., 2013), the amino acid sequences of putative HaHQTs and HaHCTs were aligned with well-characterized HQTs/HCTs from different plant species (**Supplementary Figure S2**). The candidates *HaHQT1* (accession number MK598073), *HaHQT2* (accession number MK598074), *HaHQT3* (accession number MK598075), *HaHCT1* (accession number MK598076), and *HaHCT2* (accession number MK598077) were selected for further study.

## Phylogenetic Analysis

Amino acid sequences of putative HaHQTs and HaHCTs were aligned with sequences of previously characterized enzymes using BioEdit ClustalW multiple alignment (Hall, 1999), and a Neighbor Joining (NJ) tree was created using MEGA7 software (Kumar et al., 2016) with 1,000 bootstrapped data sets.

## RNA Isolation, cDNA Synthesis, and Cloning of Putative *HaHQT*s

Total RNA was isolated from 100 mg fresh weight of sunflower sprouts using TRI reagent® (Molecular Research Center, Inc., USA). Next, RNA concentration and integrity were analyzed by measuring A<sub>260</sub> and A<sub>280</sub> on an Eppendorf Biophotometer® D30 (Eppendorf, Germany) and by agarose gel electrophoresis, respectively. The RNA was treated with RNase-free DNase I (Thermo Fisher Scientific, USA), and then the first-strand cDNA was synthesized by RevertAid Reverse Transcriptase using oligo(dT)<sub>20</sub> primers (Thermo Fisher Scientific, USA) according to the manufacturer's instructions.

The full-length putative *HaHQT*s were amplified with Phusion Hot Start II High-Fidelity DNA Polymerase (Thermo Fisher Scientific, USA) using the prepared cDNA of sunflower sprout as a template. Then, the amplified DNA was cloned into pCR™8/GW/TOPO® TA vectors (Invitrogen, USA), resulting in pCR™8/GW/TOPO®-*HaHQT*s, and subsequently sequenced. One clone of each putative gene was used for further study (sections Promoter Analysis and Transient Overexpression of *HaHQT*s in *N. benthamiana*).

## Gene Expression Analysis of Sunflower Sprout

Total RNA was extracted from sunflower sprouts as described above (section RNA Isolation, cDNA Synthesis, and Cloning of Putative *HaHQT*s). Then, qRT-PCR was performed using gene-specific primers (**Supplementary Table S1**). Eukaryotic translation initiation factor 5A (*ETIF5A*; XM\_022156448.1), elongation factor 2 (*EF2*; XM\_022137686.1), and actin 7 (*ACT7*; XM\_022154554.1) of sunflower were used as reference genes (Ochogavía et al., 2017). Reactions were conducted in volumes of 10 μl in a 96-well PCR plate using Luna® universal qPCR master mix (New England Biolabs®, USA). A CFX Connect™ Real-Time PCR Detection System and CFX Manager™ Software (BIO-RAD, USA) were used to conduct PCR, and melting curve analysis was used to confirm the existence of a single product. The relative expression level of each gene was calculated using the 2<sup>−ΔCt</sup> method (Schmittgen and Livak, 2008) according to the average Ct values of three reference genes (Beekman et al., 2011).
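
The 2<sup>−ΔCt</sup> calculation described above amounts to subtracting the averaged reference-gene Ct from the target-gene Ct and raising 2 to the negative of that difference. The sketch below assumes an arithmetic mean of the three reference Ct values and a doubling of product per cycle (implicit in the base of 2); the function name and Ct values are hypothetical.

```python
from statistics import mean


def relative_expression(ct_target, ct_references):
    """Return 2^-(Ct_target - mean(Ct_references))."""
    delta_ct = ct_target - mean(ct_references)
    return 2.0 ** (-delta_ct)


# Hypothetical Ct values: one target gene and three reference genes
# (standing in for ETIF5A, EF2, and ACT7).
print(relative_expression(24.0, [20.0, 21.0, 22.0]))
```

A target that crosses the threshold three cycles later than the averaged references is therefore reported at one eighth of the reference level.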

For droplet digital PCR (ddPCR), a 20-μl reaction mixture containing gene-specific primers (**Supplementary Table S1**), QX200™ ddPCR™ EvaGreen Supermix (BIO-RAD, USA), and cDNA was generated as a droplet with QX200™ Droplet Generation Oil for EvaGreen (BIO-RAD, USA) using QX200™ droplet generator (BIO-RAD, USA). EF2 was used as a reference gene. The PCRs were performed in a 96-well PCR plate using a T100™ Thermal Cycler (BIO-RAD, USA). After amplification, QX200™ Droplet Reader (BIO-RAD, USA) was used to measure the fluorescence intensity of each individual droplet. Absolute transcript levels (copies/20 μl reaction) were processed using QuantaSoft™ Software (BIO-RAD, USA). Relative transcript number of each gene was presented as a ratio of the absolute transcript levels (copies/20 μl reaction) of the target gene to the reference gene (Taylor et al., 2017).

## Promoter Analysis

The 2,000-bp regions upstream of the start codons of the putative *HaHQT*s were scanned *in silico* for regulatory elements using MatInspector (Cartharius et al., 2005). The genomic locations of the analyzed promoters were Chr10: 227227684…227229684 for *HaHQT2* and Chr2: 166074069…166076069 for *HaHQT3*.
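
Selecting the region upstream of a start codon depends on the strand of the gene, which the text does not state. The following sketch shows one way to compute such a window, assuming 1-based inclusive coordinates and a hypothetical helper name; it is not taken from the paper or from MatInspector.

```python
def upstream_window(start_codon_pos, strand, length=2000):
    """Return 1-based inclusive (begin, end) of the region upstream of a start codon.

    start_codon_pos is the genomic coordinate of the first base of the
    start codon; "upstream" means lower coordinates on the plus strand
    and higher coordinates on the minus strand.
    """
    if strand == "+":
        begin, end = start_codon_pos - length, start_codon_pos - 1
    elif strand == "-":
        begin, end = start_codon_pos + 1, start_codon_pos + length
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the chromosome start; the chromosome end is not known here.
    return max(begin, 1), end


# Hypothetical start codon at position 5001.
print(upstream_window(5001, "+"))
print(upstream_window(5001, "-"))
```

For a minus-strand gene the extracted sequence would additionally need to be reverse-complemented before scanning for motifs.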

## Transient Overexpression of *HaHQT*s in *Nicotiana benthamiana*

The putative *HaHQT*s from pCR™8/GW/TOPO®-*HaHQT*s were transferred into pEAQ-HT-DEST1 (pEAQ1) expression vectors (Peyret and Lomonossoff, 2013) using Gateway® LR Clonase® II (Invitrogen, USA). The resultant pEAQ1-*HaHQT*s were then transformed into *Agrobacterium tumefaciens* LBA4404 by electroporation.

*A. tumefaciens* colonies containing each construct were grown overnight in 25 ml of LB broth containing 50 mg L<sup>−1</sup> kanamycin, 50 mg L<sup>−1</sup> streptomycin, and 50 mg L<sup>−1</sup> rifampicin, with shaking at 250 rpm and 30°C. Cells were harvested by centrifugation at 3,000 × *g* for 10 min and washed twice in MM buffer (10 mM MES and 10 mM MgCl<sub>2</sub>, pH 5.6). Then, the pellet was resuspended in MM buffer to an OD<sub>600</sub> of 0.4, and acetosyringone was added to a final concentration of 100 mg L<sup>−1</sup>. The culture suspension was incubated at room temperature for 2 h. Genes of interest were introduced into the abaxial side of leaves of 5-week-old plants by first nicking the underside of the leaf with a needle and then infiltrating the gene-harboring *A. tumefaciens* using a needleless 1-ml syringe. After 5 days, the infiltrated leaves were collected, frozen in liquid nitrogen, and ground into a fine powder for HPLC analysis.
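
Adjusting a resuspended culture to a target optical density is a simple dilution calculation. The sketch below assumes that OD<sub>600</sub> scales linearly with cell density (reasonable only for dilute suspensions) and uses a hypothetical function name and hypothetical readings.

```python
def dilution_volumes(od_measured, od_target, final_volume_ml):
    """Return (suspension_ml, buffer_ml) needed to reach od_target.

    Uses C1 * V1 = C2 * V2, treating OD600 as proportional to cell density.
    """
    if not 0 < od_target <= od_measured:
        raise ValueError("target OD must be positive and not exceed the measured OD")
    suspension_ml = final_volume_ml * od_target / od_measured
    return suspension_ml, final_volume_ml - suspension_ml


# Hypothetical reading: the resuspended pellet measures OD600 = 1.6 and
# 10 ml of suspension at OD600 = 0.4 is wanted.
print(dilution_volumes(1.6, 0.4, 10.0))
```

In practice the diluted suspension is usually re-measured, because readings above roughly OD 1 tend to underestimate cell density.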

<sup>1</sup> www.sunflowergenome.org

## Subcellular Localization

*In silico* prediction of subcellular localization was performed using the iPSORT (Bannai et al., 2002), WoLF PSORT (Horton et al., 2007), LOCALIZER (Sperschneider et al., 2017), TargetP (Emanuelsson et al., 2000), and ChloroP (Emanuelsson et al., 1999) servers.

For *in planta* experiments on subcellular localization, four biological replicates were used. First, *HaHQT*s were amplified with Phusion Hot Start II High-Fidelity DNA Polymerase (Thermo Fisher Scientific, USA) using pCR™8/GW/TOPO®-*HaHQT*s as templates. The primers (excluding stop codons) listed in **Supplementary Table S1** were used in the PCRs. The PCR products were cloned into pCR™8/GW/TOPO®TA vectors (Invitrogen, USA), and nucleotide sequences were verified. Then, the *HaHQT*s were transferred into the C-terminal green fluorescent protein (GFP)-fused destination vector pGWB5 (Nakagawa et al., 2007) using Gateway® LR Clonase® II (Invitrogen, USA), generating pGWB5-*HaHQT*s. The pGWB5-*HaHQT*s were then transformed into *A. tumefaciens* LBA4404 by electroporation.

*A. tumefaciens* containing each construct and *A. tumefaciens* containing the silencing suppressor gene *p19* (Lindbo, 2007) were co-infiltrated into 5-week-old plants (section Plant Material and Growth Conditions) as described in section Transient Overexpression of *HaHQT*s in *N. benthamiana*, with some modifications. In brief, cells obtained from each culture were washed, suspended in MM buffer, and adjusted to an optical density of 0.8 at OD600. The suspension of each *A. tumefaciens* strain harboring a pGWB5-*HaHQT* construct was then mixed with that of *A. tumefaciens* harboring *p19* at a ratio of 1:1. Then, acetosyringone was added to a final concentration of 100 mg L−1. At 3 days after infiltration, protein localization was visualized under a FluoView® FV10i-DOC confocal laser scanning microscope (Olympus, Japan). Excitation/emission of GFP, autofluorescence of chloroplast, and phase contrast detection were recorded at 473/510, 559/600, and 559/600 nm, respectively.
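The adjustment of a suspension to a target OD600 can be illustrated with a simple dilution calculation. The sketch below is not part of the published protocol: it assumes that a concentrated resuspension is measured first and then diluted with MM buffer, and that OD600 scales linearly with cell density over the working range. The function name and the example values are hypothetical.

```python
def dilution_volumes(od_measured, od_target, final_volume_ml):
    """Return (suspension_ml, buffer_ml) needed to reach od_target.

    Assumes OD600 is proportional to cell density (C1 * V1 = C2 * V2).
    """
    if od_measured < od_target:
        raise ValueError("suspension is already more dilute than the target")
    suspension_ml = od_target * final_volume_ml / od_measured
    buffer_ml = final_volume_ml - suspension_ml
    return suspension_ml, buffer_ml


# Hypothetical example: a resuspension measured at OD600 = 2.0,
# diluted to OD600 = 0.8 in a final volume of 10 ml.
suspension_ml, buffer_ml = dilution_volumes(2.0, 0.8, 10.0)
print(f"{suspension_ml:.1f} ml suspension + {buffer_ml:.1f} ml MM buffer")
```

Under the same linearity assumption, mixing two suspensions at OD600 0.8 in a 1:1 ratio leaves each strain at an effective OD600 of about 0.4 in the infiltration mixture.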

## Statistical Analyses

Statistical analyses were performed using IBM® SPSS® Version 22.0 (IBM, USA) statistical software. Following one-way ANOVA, mean concentrations of CQAs and expression levels of genes were compared between days for each CQA or gene type using Duncan's multiple-range test (*p* < 0.05). In addition, concentration of each CQA and expression levels of genes were compared between hypocotyl and cotyledon tissue types by Student's *t* test (*p* < 0.05).
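The test statistics behind these comparisons can be sketched in a few lines. The example below is only an illustration of the one-way ANOVA F statistic and the pooled-variance Student's *t* statistic; it is not the SPSS workflow used in the study, the data are hypothetical, and Duncan's multiple-range test and the *p*-values (which require the F and *t* distributions) are not computed.

```python
from statistics import mean, variance


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def student_t(a, b):
    """Return (t, df) for Student's t test assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t_stat = (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return t_stat, na + nb - 2


# Hypothetical concentrations (mg/g dry weight), NOT data from this study
day3, day5, day9 = [1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]
print(one_way_anova_f([day3, day5, day9]))
print(student_t(day3, day9))
```

The resulting statistics would then be compared against the critical values for the stated degrees of freedom at *p* < 0.05.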

# RESULTS

# Caffeoylquinic Acids Profiling in Sunflower Sprouts

Although CQAs in sunflower seeds have been reported (Weisz et al., 2009), to the best of our knowledge, previous studies have not clearly quantified or characterized CQA content in sunflower sprouts. Therefore, we analyzed CQAs during germination from days 3 to 9. Six CQAs were identified (**Figure 3A**), and 1,5-diCQA was the most abundant. Accumulation of 1,5-diCQA increased during germination, reaching a maximum of ~15 mg/g dry weight (**Figure 3B**). This increase in accumulation during sprouting was also observed for other CQAs, including 3-CQA, 4-CQA, 3,4-diCQA, and 4,5-diCQA. Notably, the amount of the second most abundant derivative, 5-CQA, did not change significantly over the period. In addition, at day 5 post-germination, we profiled CQA content in two sunflower sprout tissue types: hypocotyl and cotyledon. Cotyledons accumulated much higher levels of 5-CQA and 1,5-diCQA than hypocotyls and contained ~6-fold higher concentrations of 5-CQA. The other CQAs were detected at much lower levels in both tissues. The identities of all CQAs were also confirmed using LC–MS to compare their fragmentation patterns and molecular masses with those of authentic standards (**Supplementary Figure S3**).

# Genome-Wide Identification and Phylogenetic Analysis of *HaHQT*s and *HaHCT*s

A total of five genes encoding sunflower HQTs and HCTs were identified. Multiple alignment of the amino acid sequences of all identified HQTs and HCTs showed the conserved motifs HXXXD and DFGWG, which are signatures of members of the BAHD superfamily (**Supplementary Figure S2**). Phylogenetic analysis revealed that three HaHQTs and two HaHCTs clustered together with HQTs and HCTs from chicory and globe artichoke (**Figure 4**). The number of HQT and HCT isoforms found in sunflower was identical to the number identified previously in chicory and globe artichoke. We annotated HaHQTs (HaHQT1, HaHQT2, and HaHQT3) and HaHCTs (HaHCT1 and HaHCT2) based on their clustering with the previously characterized HQTs and HCTs from those two species. Moreover, these *HaHQT*s and *HaHCT*s were located on different chromosomes, namely *HaHQT1* on chromosome 9, *HaHQT2* on chromosome 10, *HaHQT3* on chromosome 2, *HaHCT1* on chromosome 16, and *HaHCT2* on chromosome 5.
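As an illustration of how the two BAHD signature motifs can be located in a protein sequence, the following minimal sketch scans a sequence with regular expressions. It is not the alignment-based procedure used in this study; the function name and the example sequence are hypothetical, and because HXXXD is a short pattern, matches need to be checked against a multiple alignment before being treated as the conserved motif.

```python
import re

# Lookahead patterns so that overlapping matches are also reported.
# "X" in HXXXD stands for any residue, hence the three dots.
BAHD_MOTIFS = {
    "HXXXD": re.compile(r"(?=H...D)"),
    "DFGWG": re.compile(r"(?=DFGWG)"),
}


def find_bahd_motifs(protein_seq):
    """Return a dict mapping motif name to 1-based start positions."""
    seq = protein_seq.upper()
    return {
        name: [m.start() + 1 for m in pattern.finditer(seq)]
        for name, pattern in BAHD_MOTIFS.items()
    }


# Hypothetical toy sequence, not a real HaHQT or HaHCT sequence
toy_seq = "MKIHAAGDLLSDFGWGKP"
print(find_bahd_motifs(toy_seq))
```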

# Gene Expression Profiles of *HaHQT*s and *HaHCT*s in Sunflower Sprouts

To investigate a possible correlation between CQA content and expression levels of the corresponding biosynthetic genes, qRT-PCR was used to analyze the expression profiles of the identified *HaHQT*s and *HaHCT*s. Sunflower sprouts for this analysis were sampled at seven time points, from days 3 to 9 after germination. As shown in **Figure 5A**, the expression level of *HaHQT1* did not differ markedly among these time points. However, the expression level of *HaHQT2* peaked at day 3 and decreased significantly from day 4. Similarly, the expression level of *HaHQT3* peaked at day 3 and decreased until day 9, whereas the expression level of *HaHCT1* was constant from days 3 to 7, decreased significantly from days 7 to 8, and then remained constant. Notably, the expression level of *HaHCT2* increased during the germination period. In addition, we investigated the expression levels of these genes in hypocotyls and cotyledons at day 5. Interestingly, *HaHQT1* and *HaHQT2* were expressed at significantly higher levels in cotyledons than in hypocotyls, whereas *HaHQT3* expression was significantly higher in hypocotyls than in cotyledons (**Figure 5B**). However, the expression levels of the two *HaHCT*s were not significantly different between the two tissues. Gene expression analysis by ddPCR confirmed that *HaHQT2* was expressed at a higher level than the other *HaHQT*s and *HaHCT*s (**Figure 5C**). These results suggested that *HaHQT2* could be the main CQA biosynthetic gene in sunflower sprouts. Therefore, we selected *HaHQT*s for further functional characterization.

FIGURE 3 | … different time points within each caffeoylquinic acid derivative. Bars represent the mean values ± standard deviation (SD) of five biologically independent replicates; for each derivative, different letters indicate significant differences according to Duncan's multiple-range test (*p* < 0.05) (B); and tissue-specific concentrations of caffeoylquinic acid derivatives measured in the hypocotyls and cotyledons of sunflower sprouts sampled at day 5 post-germination; an asterisk (∗) above the bars indicates a significant difference between the two tissues (Student's *t* test, *p* < 0.05) (C).

# Promoter Analysis

To gain more insight into the regulatory network controlling the expression of the *HaHQT* genes, we analyzed their promoter regions. Because the genome database was incomplete, only the promoter regions of *HaHQT2* and *HaHQT3* could be analyzed. As shown in **Supplementary Table S2**, these two promoters share common motifs, including elements responsive to phytochrome, defense, circadian rhythm, axillary bud outgrowth, light, stress, and phytohormones (auxin, salicylic acid, abscisic acid, jasmonate, gibberellin, ethylene, cytokinin), as well as sulfur- and sucrose-responsive elements. In addition, we found several binding elements for MYB and Dof transcription factors. Both transcription factor families have been reported to be involved in phenylpropanoid biosynthesis.

# Subcellular Localization of HaHQTs

*In silico* subcellular analysis of sunflower HaHQTs predicted their localization in either the cytosol or the chloroplast (**Figure 4**), and no nuclear localization signal (NLS) was detected in any HaHQT. For assessment *in planta*, *Agrobacteria* harboring each GFP-fused expression construct (pGWB5-*HaHQT1*, pGWB5-*HaHQT2*, or pGWB5-*HaHQT3*) together with the gene-silencing suppressor *p19* were infiltrated into *N. benthamiana* leaves. Protein localization was analyzed using a confocal laser scanning microscope. *In planta*, all three GFP-tagged HaHQTs were soluble rather than membrane-bound proteins and were possibly localized in the cytosol (**Figure 6; Supplementary Figure S4**). Fluorescence signals were also detected in the nucleus. In addition, observation of the mesophyll confirmed that these HaHQTs were not localized in chloroplasts (**Supplementary Figure S5**).

# Transient Expression of *HaHQT*s in *Nicotiana benthamiana*

In *N. benthamiana* leaves infiltrated to transiently express *HaHQT*s, levels of 5-CQA were significantly higher than in the control (**Figure 7**). *HaHQT2* and *HaHQT3* also increased the level of 4-CQA. *HaHQT3* was the only isoform that increased the levels of all monoCQAs and 1,3-diCQA. The levels of 1,5-diCQA and 4,5-diCQA did not differ significantly between the infiltrated leaves and the control. These results indicated that all *HaHQT*s were involved in CQA biosynthesis.

# DISCUSSION

Mono- and diCQAs are known to be beneficial to human health. The monoCQA 5-CQA, also known as chlorogenic acid, has a wide variety of health benefits (see review by Naveed et al., 2018) and is one of the most abundant CQAs. Although sunflower seed kernels accumulate much higher amounts of monoCQAs than diCQAs (Weisz et al., 2009), we found significantly higher content of diCQAs than of monoCQAs in sunflower sprouts. Our results concur with those of Sun et al. (2012), in which 1,5-diCQA was the most abundant CQA in sprouts (**Figure 3B**). The lower level of monoCQAs found in sprouts could be related to the role of monoCQAs as intermediates used for both diCQA biosynthesis and lignification in growing hypocotyls during sprouting (Escamilla-Treviño et al., 2014). Although 3-CQA, 4-CQA, 1,5-diCQA, 3,4-diCQA, and 4,5-diCQA contents tended to increase in sunflower sprouts until day 9 (**Figure 3B**), we did not analyze their content after day 9. This was because sunflower sprouts available in markets are generally harvested at day 5 or 6, when the average hypocotyl length is 12–15 cm (**Supplementary Figure S1B**). Therefore, we designed our experiment to include sampling shortly before (days 3–4) and after (days 7–9) the common harvesting period. Harvest after day 9 would negatively affect the texture of the sprouts, e.g., they would be less crisp, which is inconsistent with consumer preferences. Among CQAs, only 5-CQA did not increase in concentration during the germination period, perhaps because it was being converted to 1,5-diCQA or used for lignin biosynthesis (**Figure 2**). Competing use of 5-CQA for lignin biosynthesis versus production of 1,5-diCQA during sprouting could contribute to the lower abundance of monoCQAs relative to diCQAs in sprouts. Regarding the distribution of CQAs in hypocotyls and cotyledons, the higher amounts of CQAs in cotyledons (**Figure 3C**) probably reflect the role of these compounds in protecting against herbivores, pathogens, and harmful UV light, to which cotyledons are exposed in nature (Niggeweg et al., 2004; Cle et al., 2008; Leiss et al., 2009).

FIGURE 5 | … (SD) of five biologically independent replicates; comparisons are shown among different time points within each gene; different letters indicate significant differences according to Duncan's multiple-range test (*p* < 0.05) (A). Tissue-specific gene expression analysis of *HaHQT*s and *HaHCT*s sampled at day 5 post-germination; for each gene, comparisons are shown between the hypocotyl and cotyledon; an asterisk (\*) above the bars indicates a significant difference between the two tissues (Student's *t* test, *p* < 0.05) (B). Absolute gene expression analysis of *HaHQT*s and *HaHCT*s from whole plant, hypocotyl, and cotyledon using digital droplet PCR (ddPCR). Bars represent the mean values ± standard deviation (SD) of five biologically independent replicates; comparisons are shown for all genes among different tissues; different letters indicate significant differences according to Duncan's multiple-range test (*p* < 0.05) (C).

We performed a genome-wide analysis to increase our understanding of CQA biosynthesis and discovered three *HaHQT*s and two *HaHCT*s. The encoded enzymes clustered well with the previously characterized HQTs and HCTs from chicory and globe artichoke (**Figure 4**). These results demonstrated close evolutionary relationships and some level of domain consensus among HQT and HCT isoforms of sunflower and other Asteraceae plants (chicory and globe artichoke). Although our *in silico* analysis with ChloroP predicted that HaHQT3 would be localized to chloroplasts (**Figure 4**), we found that all HaHQTs fused with GFP localized in the cytoplasm (**Figure 6**); we also observed fluorescence signals in the nucleus. This may result from protein diffusion into the nucleus, consistent with the previous characterization of HQT2 from globe artichoke (Moglia et al., 2016). A nucleocytoplasmic localization has also been documented for other members of the BAHD superfamily, namely spermidine hydroxycinnamoyl transferases (Delporte et al., 2018).

Both HQTs and HCTs are involved in biosynthesis of CQAs in chicory (Legrand et al., 2016). Our expression analysis showed that *HaHQT2* was expressed at a dramatically higher level than the other *HQT*s and *HCT*s in both hypocotyl and cotyledon tissues (**Figure 5C**). In addition, the higher expression of *HaHQT2* in cotyledons than in hypocotyls coincided with a higher content of CQAs (**Figures 5B,C**). This suggests that HaHQT2, rather than the other enzymes, could be the main enzyme involved in CQA biosynthesis during sunflower sprout germination. Studies in tomato (Niggeweg et al., 2004), potato (Payyavula et al., 2015), globe artichoke (Moglia et al., 2016), and Japanese honeysuckle (*Lonicera japonica*; Zhang et al., 2017) reached a similar conclusion that HQTs have an important role in CQA biosynthesis. By contrast, based on functional characterization in *N. benthamiana*, Legrand et al. (2016) showed that HCT1 from chicory played a major role in CQA biosynthesis. These results suggest that the mechanisms of CQA biosynthesis regulation could vary between closely related plant species.

To gain a better understanding of factors controlling the expression of *HaHQT*s, we searched the promoter regions of *HaHQT2* and *HaHQT3* for relevant regulatory elements. Multiple elements associated with stress response and hormonal signaling were found in the promoters of both genes. In addition, several MYB and Dof regulatory elements were found in the promoter regions of both genes. MYB transcription factors are known to regulate CQA biosynthesis. Transient overexpression of MYB1 from eggplant (*Solanum melongena*) in *N. benthamiana* enhanced the accumulation of 5-CQA (Docimo et al., 2016). Moreover, 5-CQA content was significantly increased in leaves and fruit of transgenic tomato lines overexpressing Arabidopsis MYB12 (Pandey et al., 2015). The involvement of Dof transcription factors in phenylpropanoid biosynthesis has also been reported. Arabidopsis Dof4.2 negatively affected flavonoid biosynthesis but positively regulated hydroxycinnamic acid biosynthesis (Skirycz et al., 2007). Since sunflower sprouts accumulate comparably high levels of CQAs, they might be suitable for further identification of novel transcription factors controlling CQA biosynthesis. Nevertheless, the different expression levels of those two *HaHQT*s might be due to other regulating factors, such as different positions or numbers of the cis-acting elements in the promoter regions.

Further *in planta* functional analysis confirmed the role of all HaHQTs in CQA biosynthesis. Compared with control *N. benthamiana* plants infiltrated with *GFP*, leaves infiltrated with *HaHQT*s had higher 5-CQA content, and the highest increase (94%) occurred in leaves infiltrated with *HaHQT2* (**Figure 7**). These results suggest that *HaHQT*s are involved in the biosynthesis of 5-CQA, the most abundant monoCQA found in sunflower sprouts. Consistently, in *N. benthamiana* leaves infiltrated with artichoke *HQT*s, 5-CQA was the monoCQA whose content was mainly affected (Moglia et al., 2016). As for diCQAs, only 1,3-diCQA content was increased slightly by the infiltration of *N. benthamiana* leaves with the *HaHQT3* construct. This observation differed from that of Moglia et al. (2016), who found that diCQA contents in *N. benthamiana* leaves infiltrated with artichoke HQTs increased to a much higher level than in leaves transiently expressing *HaHQT3*. Because of this, we hypothesized that the increased amount of diCQA in *N. benthamiana* leaves infiltrated with artichoke HQTs might be due to the dramatically increased level of monoCQAs (up to 500% for 5-CQA; Moglia et al., 2016), as diCQAs are synthesized from monoCQAs. However, in our study, we did not observe as high an amount of 3-CQA in the *HaHQT*s-infiltrated *N. benthamiana* leaves as Moglia et al. (2016) did, which might explain why we did not observe an increase in diCQA contents. In a study by Legrand et al. (2016), infiltration of *N. benthamiana* leaves with either *HQT1* or *HCT1* from chicory did not lead to a great increase in 3-CQA content (only up to ~19 and ~56%, respectively), which is similar to the results of our study. Unfortunately, Legrand et al. (2016) did not mention any changes occurring in diCQA contents. Differences in the expression vectors used and in the growth stages and conditions of *N. benthamiana* between studies may partially explain why production of monoCQAs in *N. benthamiana* infiltrated with *HQT*s from different sources is variable. In addition, there may be differences in the catalytic efficiency of the HQTs investigated in different studies, and this possibility should be investigated further.

Although sunflower sprouts accumulate much higher levels of 1,5-diCQA than globe artichoke and chicory (Pandino et al., 2011; Willeman et al., 2014), our results did not provide strong evidence to support a role of HaHQTs and HaHCTs in diCQA biosynthesis. Therefore, the mechanisms behind biosynthesis of diCQAs remain unclear. It is possible that other acyltransferases are involved in diCQA biosynthesis. Since the sunflower genome has recently been reported (Badouin et al., 2017), identification of additional acyltransferases in the genome, together with transcriptome analysis, might help in identifying novel candidate genes involved in diCQA production. So far, the only HQT enzyme known to produce both monoCQAs and diCQAs is from tomato, where it is localized to both the cytosol and vacuoles (Moglia et al., 2014). The amino acid residue His276 has been identified as partially responsible for the dual function of tomato HQT, and mutation of this residue to Tyr decreases the production of diCQA *in vitro*. In chicory, globe artichoke, and sunflower, a Tyr residue is found at this position (**Supplementary Figure S2**). Therefore, it is unlikely that HQTs in these three species have the dual function seen in tomato HQT.

In conclusion, we have reported biosynthesis of CQAs in sunflower sprouts for the first time. Sunflower sprouts are a rich source of monoCQAs and diCQAs, and HaHQT2 was found to be the major isoform, which could be responsible for CQA biosynthesis during germination in both hypocotyls and cotyledons. Therefore, manipulation of this gene could positively affect CQA contents in sunflower sprouts. Thus, our results provide informative data, which could be applied to further biofortify sunflower sprouts as functional foods.

# DATA AVAILABILITY

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

# AUTHOR CONTRIBUTIONS

SS conceived the research. KC performed most of the experiments and analyzed the data. PB participated in genome-wide identification. KC, GK, PP, and SS interpreted the data and drafted the manuscript. All authors have read and approved the final manuscript.

# FUNDING

This research work was financially supported by the 90th Anniversary of Chulalongkorn University (Ratchadaphisek Somphot Endowment Fund to KC) and Chulalongkorn research funding (GRU 6203023003-1 and CU-57-014-FW to SS).

# ACKNOWLEDGMENTS

We would like to thank Tsuyoshi Nakagawa (Shimane University, Japan) for providing the Gateway® expression vector pGWB5, Sophien Kamoun (Sainsbury Laboratory, UK) for providing *A. tumefaciens* GV3101 carrying pJL3:p19, and George Lomonossoff (John Innes Centre, Norwich, UK) and Plant Bioscience Limited (Norwich, UK) for supplying the pEAQ vectors. We also thank Chadin Kulsing (Chulalongkorn University, Thailand) for assisting in LC–MS analysis and Kornlawat Tantivit for assisting in confocal microscopy. We appreciate a scholarship to develop research potential for the Department of Biochemistry, Faculty of Science, Chulalongkorn University, Ratchadapisek Somphot Fund for Postdoctoral Fellowships, Chulalongkorn University (to GK and PP) and the Chulalongkorn Academic Advancement into its 2nd Century Project.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.00968/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Cheevarungnapakul, Khaksar, Panpetch, Boonjing and Sirikantaramas. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# The Origin and Evolution of Plant Flavonoid Metabolism

#### Keiko Yonekura-Sakakibara\*, Yasuhiro Higashi and Ryo Nakabayashi

*RIKEN Center for Sustainable Resource Science, Yokohama, Japan*

During their evolution, plants have acquired the ability to produce a huge variety of compounds. Unlike the specialized metabolites that accumulate in limited numbers of species, flavonoids are widely distributed in the plant kingdom. Therefore, a detailed analysis of flavonoid metabolism using genomics and metabolomics is an ideal way to investigate how plants have developed their unique metabolic pathways during the process of evolution. More comprehensive and precise metabolite profiling integrated with genomic information helps to uncover unexpected gene functions and/or pathways. The distribution of flavonoids and their biosynthetic genes in the plant kingdom suggests that flavonoid biosynthetic pathways evolved through a series of steps. The enzymes that form the flavonoid scaffold structures probably first appeared by recruitment of enzymes from primary metabolic pathways, and later, enzymes that belong to superfamilies such as 2-oxoglutarate-dependent dioxygenase, cytochrome P450, and short-chain dehydrogenase/reductase modified and varied the structures. It is widely accepted that the first two enzymes in flavonoid biosynthesis, chalcone synthase and chalcone isomerase, were derived from common ancestors with enzymes in lipid metabolism. Later enzymes acquired their function by gene duplication and the subsequent acquisition of new functions. In this review, we describe the recent progress in metabolomics technologies for flavonoids and the evolution of flavonoid skeleton biosynthetic enzymes to understand the complicated evolutionary traits of flavonoid metabolism in the plant kingdom.

#### Edited by:

*Kevin Davies, The New Zealand Institute for Plant and Food Research Ltd., New Zealand*

#### Reviewed by:

*Stefan Martens, Fondazione Edmund Mach, Italy John A. Morgan, Purdue University, United States*

#### \*Correspondence:

*Keiko Yonekura-Sakakibara yskeiko@riken.jp*

#### Specialty section:

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

> Received: *26 April 2019* Accepted: *08 July 2019* Published: *02 August 2019*

#### Citation:

*Yonekura-Sakakibara K, Higashi Y and Nakabayashi R (2019) The Origin and Evolution of Plant Flavonoid Metabolism. Front. Plant Sci. 10:943. doi: 10.3389/fpls.2019.00943*

Keywords: secondary metabolites, flavonoid, polyketide synthase, 2-oxoglutarate-dependent dioxygenase, cytochrome P450, short-chain dehydrogenase/reductase, plant

# INTRODUCTION

Plants have the ability to produce a huge variety of metabolites. Over 1,000,000 metabolites are predicted to be present in the entire plant kingdom (Afendi et al., 2012). Most of these are secondary metabolites (also referred to as specialized metabolites) that play a wide range of physiological and ecological roles including defense against herbivores and pathogens, attractants for pollinators and seed carriers, and signaling. During the long process of evolution, plants have gained, expanded, and sometimes lost their capabilities to produce this huge array of metabolites, which provides the adaptive mechanisms needed for survival in changing environments.

Flavonoids form one of the major groups of specialized metabolites, and include over 9,000 compounds (Williams and Grayer, 2004; Anderson and Markham, 2006). According to the Nomenclature of flavonoids (IUPAC Recommendations, 2017), the term "flavonoid" is applied to (1) compounds structurally based on derivatives of a phenyl-substituted propylbenzene having a C15 skeleton, (2) compounds with a C16 skeleton that are phenyl-substituted propylbenzene derivatives (rotenoids), and (3) flavonolignans based on derivatives of phenyl-substituted propylbenzene condensed with C6-C3 lignan precursors (Rauter et al., 2018). In a restricted sense, the term "flavonoid" is used only for those compounds with a C6-C3-C6 carbon framework exhibiting the structure of a chromane or that of a chromene, such as flavans, flavones, flavonols, and anthocyanidins. Chalcones, dihydrochalcones, and aurones are flavonoids in the broad sense, but not in the restricted sense.

Flavonoids, including chalcones, flavones, flavonols, anthocyanins, and proanthocyanidins, are widely distributed in the plant kingdom, and their metabolic pathways have been extensively studied using both biochemical and molecular biological techniques. Until recently, it was believed that liverworts and mosses were the oldest flavonoid-producing plants (Rausher, 2006; Bowman et al., 2017). The genes encoding enzymes in the phenylpropanoid biosynthetic pathway, including the first two enzymes for flavonoid biosynthesis (chalcone synthase and chalcone isomerase), had not been found in the algal genera *Chlamydomonas*, *Micromonas*, *Ostreococcus*, and *Klebsormidium*, although genes encoding enzymes in the shikimate pathway were found in algae, liverworts, mosses, lycophytes, ferns and horsetails, gymnosperms, and angiosperms (Bowman et al., 2017). However, flavones, isoflavones, and flavonols were detected in microalgae from five different evolutionary lineages (Cyanobacteria, Rhodophyta, Chlorophyta, Haptophyta, and Ochrophyta) using ultra-high performance liquid chromatography with tandem mass spectrometry (Goiris et al., 2014). This suggests that plants may have acquired the ability to produce flavonoids earlier than previously thought. Furthermore, in extant plants flavonoids play important roles as ultraviolet-B (UV-B) protectants, pigments that attract pollinators, phytoalexins, signaling molecules, and regulators of auxin transport and fertility (Gould and Lister, 2006). It has been proposed that defense against UV irradiation and regulation of plant hormone action were the original functions of flavonoids in the earliest flavonoid-producing plants (Stafford, 1991; Shirley, 1996; Rausher, 2006). These functions have been considerably diversified during plant evolution.
Thus, the study of flavonoids is a useful approach to understand how plants acquired the ability to produce specialized metabolites, and then build the metabolic pathways needed to produce such a huge variety of metabolites during the course of their evolution. This study will shed light on the relationships between genes/proteins and metabolites, and between the metabolites and their physiological functions. In this review, we describe the structural diversity of flavonoids distributed in the plant kingdom, and how plants acquired flavonoid biosynthetic genes during their evolution.

# METABOLOMICS SHEDS LIGHT ON THE EVOLUTION OF FLAVONOIDS

# Distribution of Flavonoids in the Plant Kingdom

It is estimated that there are over 9,000 flavonoids in the plant kingdom (Williams and Grayer, 2004; Anderson and Markham, 2006). Research on flavonoids has shown that they are distributed across the plant kingdom, including angiosperms, gymnosperms, and pteridophytes (Harborne, 1988; Tohge et al., 2013). The abundance of information about flavonoids in different species allows us to identify which flavonoid subclasses (e.g., chalcones, flavones, flavonols, anthocyanins, and proanthocyanidins) are found in each subgroup of plants (**Figure 1**). Flavones and flavanones are found in all plant groups except for hornworts. To our knowledge, no flavonoids have been reported in hornworts. As plant groups have evolved and diversified, so have the flavonoid subclasses produced within each group. For example, the flavonoid aglycones are most diversified in the angiosperms. In addition to flavanones and flavones, chalcones, flavonols, and proanthocyanidins are found in multiple groups. Interestingly, the prenylflavonoids are found in both liverworts and angiosperms: while more than 1,000 prenylflavonoids have been isolated from legumes (Yazaki et al., 2009), prenyldihydrochalcone has been found in the liverworts *Radula variabilis* and *Radula* spp. (Asakawa et al., 1978, 1982). These data suggest that the two groups of plants gained the ability to produce prenylflavonoids independently, or that many groups have lost the ability to produce prenylflavonoids during their evolution. Flavonoid molecules provide concrete evidence for the existence of the corresponding flavonoid biosynthetic genes in the plants. Analytical approaches for identifying flavonoids are therefore important for understanding the evolution of flavonoid metabolism in the plant kingdom.

# Cutting-Edge Metabolomics Technologies Contribute to Understanding the Evolution of Flavonoid Metabolism

So far, chromatographic and spectroscopic approaches have been used to analyze the structures of flavonoids and to reveal their chemical diversity. Previously, paper chromatography, thin layer chromatography, column chromatography, and liquid chromatography (LC) were the main techniques used to study flavonoids. Crude extracts containing flavonoids are obtained through liquid-liquid partitioning. For instance, flavonoid aglycones and their mono-glycosides are mainly extracted in ethyl acetate, whereas the flavonoid di- or tri-glycosides are extracted in n-butanol. Chromatographic techniques are then used to purify the flavonoids from the ethyl acetate or n-butanol fractions in several steps. Finally, the isolated flavonoids are analyzed by nuclear magnetic resonance and mass spectrometry to elucidate their structures. This is an unequivocal and straightforward approach to identifying flavonoids; however, it is time-consuming.

With the development of metabolomics technologies, flavonoids can now be analyzed much more accurately and precisely than before. LC-tandem mass spectrometry (LC-MS/MS) has become the preferred approach for analyzing flavonoids. Algae had been considered to lack flavonoids. However, the identification of flavonoids, including intermediates and end products, by an LC-MS/MS approach demonstrated that microalgae are capable of flavonoid biosynthesis (Goiris et al., 2014). This suggests that applying cutting-edge metabolomics technologies to individual plant species can unveil flavonoids that have previously been missed.

The integration of LC-MS/MS with cheminformatics approaches provides a powerful tool for surveying flavonoid diversity in a high-throughput way (Tsugawa et al., 2016; Akimoto et al., 2017). Glycosylated, acylated, and prenylated flavonoid molecules and their aglycones can be separated using simple combinations of solvents and LC columns. The separated molecules are then ionized for MS/MS analysis. The product ions are derived from their precursors through the cleavage of ether and ester bonds or even the prenyl moiety. These simple fragmentation steps allow us to estimate the structures of the flavonoids. Cheminformatic tools are used for high-throughput and (semi-)automatic data analysis. The results are generally classified into four classes (levels) according to the guidelines of the Metabolomics Standards Initiative: Level 1, identification using authentic standard compounds; Level 2, annotation using public databases; Level 3, characterization by deciphering MS/MS data; and Level 4, unknown (not noise) (Sumner et al., 2007). The use of these levels reduces the production of false positive data.
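The fragmentation logic described above, in which product ions arise from the precursor by loss of a sugar, acyl, or prenyl moiety, can be sketched as a neutral-loss lookup. The example below is a simplified illustration and not the cheminformatics pipeline of the cited studies: the function name, the tolerance, and the short list of losses are assumptions, singly charged ions are assumed, and prenylated flavonoids may show other prenyl-related losses than the C5H8 loss listed here. The masses are monoisotopic values calculated from the elemental formulas.

```python
# Monoisotopic masses (Da) of common neutral losses, from elemental formulas
NEUTRAL_LOSSES = {
    "hexose (C6H10O5)": 162.0528,
    "deoxyhexose (C6H10O4)": 146.0579,
    "pentose (C5H8O4)": 132.0423,
    "malonyl (C3H2O3)": 86.0004,
    "prenyl (C5H8)": 68.0626,
}


def annotate_neutral_losses(precursor_mz, product_mzs, tol=0.005):
    """Return (product m/z, loss name) pairs for singly charged ions.

    A product ion is annotated when the precursor-product difference
    matches a listed neutral loss within `tol` Da.
    """
    annotations = []
    for mz in product_mzs:
        delta = precursor_mz - mz
        for name, mass in NEUTRAL_LOSSES.items():
            if abs(delta - mass) <= tol:
                annotations.append((mz, name))
    return annotations


# Example: [M+H]+ of a quercetin hexoside (m/z 465.1028) giving the
# quercetin aglycone ion (m/z 303.0499); 200.0 is an unrelated peak.
print(annotate_neutral_losses(465.1028, [303.0499, 200.0]))
```

A match of this kind supports only a Level 3 characterization in the scheme above; Level 1 identification still requires an authentic standard.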

An integrated approach using LC-MS/MS with <sup>13</sup>C (carbon) labeling and cheminformatics can be used to assign structures to metabolites whose structures have never been recognized before. In most general approaches, unknown metabolites remain uncharacterized. The integrative approach is used to identify the elemental compositions of the unknown metabolites based on the numbers of C atoms, which are determined by comparing <sup>13</sup>C-labeled and non-labeled MS spectra. The MS/MS spectra of the unknown metabolites are then subjected to fragment set enrichment analysis (FSEA) to determine candidates for each metabolite class. This integrative approach was used to characterize 1,133 metabolites, including flavonoids, in 12 angiosperm species (Tsugawa et al., 2019). The results for the characterized flavonoids are summarized in **Table 1**, along with the flavonoids found in the liverwort Marchantia polymorpha, which were analyzed using the same method (Kubo et al., 2018). Our analyses using LC-MS/MS with <sup>13</sup>C labeling and cheminformatics led to the identification of more flavonoid molecules, especially in the flavonol and flavone groups. Prenylated flavonoids have been analyzed by LC-MS/MS less extensively than other flavonoid groups, although a chemical assignment strategy is effective for profiling them. This is probably due to the lack of authentic standards and publicly available MS/MS spectra. The exact mass of a known structure, or of one that is presumed to be prenylated, can also be useful for profiling prenylated flavonoids in cases where the ontologies for prenylated flavonoids do not work, that is, where the observed MS/MS spectra cannot be matched to the library derived from the FSEA.
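The carbon-counting step can be sketched as follows. This is a minimal illustration of the principle only, assuming a fully <sup>13</sup>C-labeled ion and a known charge state; the function names, the candidate formulae, and the example m/z values (calculated for a C15 flavonol-like [M+H]<sup>+</sup> ion) are assumptions and are not taken from Tsugawa et al. (2019).

```python
import re

C13_C12_DELTA = 1.003355  # mass difference between 13C and 12C (Da)


def carbon_count(mz_unlabeled, mz_labeled, charge=1):
    """Estimate the number of C atoms from the m/z shift between a fully
    13C-labeled ion and its non-labeled counterpart."""
    shift = (mz_labeled - mz_unlabeled) * charge
    return round(shift / C13_C12_DELTA)


def carbons_in_formula(formula):
    """Count C atoms in a simple formula string such as 'C15H10O7'."""
    match = re.search(r"C(\d*)(?![a-z])", formula)
    if match is None:
        return 0
    return int(match.group(1)) if match.group(1) else 1


# Hypothetical pair of ions: non-labeled and fully 13C-labeled.
n_c = carbon_count(303.0499, 318.1002)

# Hypothetical candidate formulae; only those with the observed C count are kept.
candidates = ["C15H10O7", "C16H14O6", "C14H6O8"]
kept = [f for f in candidates if carbons_in_formula(f) == n_c]
print(n_c, kept)
```

Restricting candidate elemental compositions to those with the observed number of carbons is what narrows the search space before the MS/MS spectra are interpreted.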

Similarly, recent studies suggest that tricin metabolism seems to have evolved specifically in monocot plants. Tricin serves as a lignin co-monomer and cross-couples with monolignols and γ-p-coumaroylated monolignols upon cell wall lignification. Recently, tricin was identified as a nucleation site for lignification in grasses (Lan et al., 2015, 2016a; Lam et al., 2017). Poales plants including Avena sativa, Brachypodium distachyon, Oryza sativa, Triticum durum, and Zea mays accumulate tricin, as does the Fabales plant Medicago sativa (Lan et al., 2016b). An integrative approach was used to detect derivatives of apigenin, luteolin, and tricin in O. sativa, Z. mays, and M. truncatula, and it appears that the total tricin content is much higher than the extractable tricin content in these plants. The integrative approach with an FSEA may be useful for further understanding the role of tricin and its derivatives in the cell wall.

LC-MS/MS-based metabolome analyses have been used to comprehensively analyze the flavonoid molecules in individual plant species. Recent technological developments allow us to perform high-throughput LC-MS/MS analyses. By tracing the ions derived from the aglycones and modified parts of the molecules, or by finding MS/MS similarities, the chemical assignment of flavonoids is easily performed in (semi-)automatic ways with cheminformatic tools across plant species.

Biosynthetic genes for aglycone formation or modification can be predicted on the basis of flavonoid structures. Functional genomics that combines cutting-edge sequencing with more comprehensive and accurate flavonoid profiling allows us to address the evolutionary questions of when and why flavonoids appeared and how their biosynthetic pathways diversified in plants. Furthermore, along with the progress of numerous plant genome projects, this approach will provide valuable clues for understanding the evolution of flavonoid metabolism in the plant kingdom, including the inconsistencies in the relationships between metabolites and genes that are discussed in later sections.

# THE FLAVONOID BIOSYNTHETIC PATHWAYS

The first step in flavonoid biosynthesis is catalyzed by chalcone synthase (CHS) (**Figure 2**). The substrates p-coumaroyl-CoA, derived from the cinnamate/monolignol (phenylpropanoid) pathway, and malonyl-CoA from the acetate-malonate (polyketide) pathway are converted by CHS to naringenin chalcone. The stereospecific cyclization of naringenin chalcone to naringenin is catalyzed by chalcone isomerase (CHI). This step can also proceed spontaneously.

TABLE 1 | Flavonoids experimentally characterized by LC-MS/MS in 12 angiosperm species and *Marchantia polymorpha*.

*The numbers of flavonoids characterized by Tsugawa et al. (2019) are shown. The numbers of flavonoids stored in the Dictionary of Natural Products are shown in parentheses. A conventional type of reverse phase LC-MS/MS was used for these analyses. On the basis of the ontology in the FSEA, the flavonoids are summarized into seven groups. Flavonoids with the prenyl moiety are grouped together as the prenylated flavonoid group. Prenylations (29 patterns) occur in chalcones/dihydrochalcones, flavanones, flavones, flavonols/dihydroflavonols, and isoflavones (Barron and Ibrahim, 1996). Flavonoids consisting of unmodified aglycones and glycosylated, acylated, etherated, and/or prenylated flavonoids were characterized in this study.*

Naringenin is a general precursor for flavonols, anthocyanins, proanthocyanidins, flavones, and isoflavones. It is converted to dihydrokaempferol by flavanone 3-hydroxylase (F3H) (also referred to as flavanone 3β-hydroxylase). Flavonoid 3′-hydroxylase (F3′H) and flavonoid 3′,5′-hydroxylase (F3′5′H) catalyze the hydroxylation of the C3′ and C3′/C5′ positions of dihydrokaempferol, respectively. Dihydroflavonol 4-reductase (DFR) catalyzes the reduction of the dihydroflavonols to leucoanthocyanidins, which are further converted to anthocyanidins by leucoanthocyanidin dioxygenase/anthocyanidin synthase (LDOX/ANS). The leucoanthocyanidins and anthocyanidins are reduced to flavan 3-ols (e.g., catechin and epicatechin) by leucoanthocyanidin reductase (LAR) and anthocyanidin reductase (ANR), respectively. The dihydroflavonols are also converted to flavonols by flavonol synthase (FLS).

Flavones are synthesized from naringenin by flavone synthase I (FNS I) or flavone synthase II (FNS II) (Martens and Mithofer, 2005). Flavanone 2-hydroxylase (F2H) catalyzes the hydroxylation of flavanones (including naringenin) to 2-hydroxyflavanones, which are subsequently converted to flavones, possibly by an unknown dehydratase (Akashi et al., 1998).

Isoflavone synthase (2-hydroxyisoflavanone synthase, IFS) catalyzes the first step of isoflavone biosynthesis. IFS converts flavanones (e.g., liquiritigenin and naringenin) to 2-hydroxyisoflavanones, and the 2-hydroxyisoflavanones are then dehydrated to isoflavones by 2-hydroxyisoflavanone dehydratase (HID) (Akashi et al., 2005).
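For readers who prefer a compact overview, the steps described above can be written as a small directed graph. The sketch below is a simplification made for illustration: dihydroflavonols are represented by dihydrokaempferol alone, the F3′H and F3′5′H hydroxylations and all decorating enzymes are omitted, and the node names and the `enzyme_route` helper are assumptions rather than part of any published tool.

```python
from collections import deque

# (substrate, enzyme, product) triples following the steps named in the text.
STEPS = [
    ("p-coumaroyl-CoA + malonyl-CoA", "CHS", "naringenin chalcone"),
    ("naringenin chalcone", "CHI", "naringenin"),
    ("naringenin", "F3H", "dihydrokaempferol"),
    ("dihydrokaempferol", "DFR", "leucoanthocyanidin"),
    ("leucoanthocyanidin", "LDOX/ANS", "anthocyanidin"),
    ("leucoanthocyanidin", "LAR", "flavan 3-ol"),
    ("anthocyanidin", "ANR", "flavan 3-ol"),
    ("dihydrokaempferol", "FLS", "flavonol"),
    ("naringenin", "FNS I/FNS II", "flavone"),
    ("naringenin", "F2H", "2-hydroxyflavanone"),
    ("2-hydroxyflavanone", "unknown dehydratase", "flavone"),
    ("naringenin", "IFS", "2-hydroxyisoflavanone"),
    ("2-hydroxyisoflavanone", "HID", "isoflavone"),
]


def enzyme_route(start, target):
    """Breadth-first search for the shortest enzyme sequence from start to target.
    Returns a list of enzyme names, or None if the target is unreachable."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for substrate, enzyme, product in STEPS:
            if substrate == node and product not in seen:
                seen.add(product)
                queue.append((product, path + [enzyme]))
    return None


print(enzyme_route("naringenin", "anthocyanidin"))
print(enzyme_route("naringenin", "isoflavone"))
```

Because the graph is directed, a query in the reverse direction (for example from a flavone back to naringenin) returns `None`, which mirrors the fact that the text describes biosynthetic rather than degradative steps.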

The flavonoid skeletons mentioned above are highly modified by enzymes such as glycosyltransferases (GTs), acyltransferases, and methyltransferases. Here, we focus on the enzymes involved in the biosynthesis of the flavonoid skeleton molecules.

# EVOLUTIONARY HISTORY OF THE FLAVONOID BIOSYNTHETIC PATHWAYS

Generally, the enzymes in secondary metabolic pathways have been derived from those involved in primary metabolism (Pichersky and Gang, 2000; Moghe and Last, 2015; Carrington et al., 2018), and the enzymes in flavonoid metabolism are no exception. CHS and CHI are derived from enzymes involved in fatty acid metabolism: β-ketoacyl ACP synthase and fatty acid binding protein, respectively (Ngaki et al., 2012; Weng and Noel, 2012a). CHS and β-ketoacyl ACP synthase are members of the type III polyketide synthase (PKS) family (Schuz et al., 1983). The enzymes in later steps of the flavonoid biosynthetic pathways belong to families such as the 2OGD, CYP, and short-chain dehydrogenase/reductase (SDR) superfamilies (**Table 2**). The members of these superfamilies are widely involved in primary and secondary metabolisms, suggesting that plants have acquired the enzyme functions in later biosynthetic pathways via gene duplication and evolution of new functions for the duplicated gene products.

TABLE 2 | The origin or gene family of flavonoid biosynthetic genes.

Based on the distribution patterns of the flavonoid subclasses, it has been suggested that the flavonoid biosynthetic pathways may have evolved via a series of steps, and that the first flavonoid biosynthetic enzymes were CHS, CHI, and F3H (Stafford, 1991; Rausher, 2006). It has also been proposed that CHS evolved first, followed by F3H and then CHI, because CHS catalyzes the first committed step in the pathways, and the step catalyzed by CHI can also proceed spontaneously (Rausher, 2006). Flavonoids are widely distributed among mosses, liverworts, and vascular plants, but are not found in hornworts. Algae generally contain no flavonoids (Rausher, 2006), but flavonoids have been found in a few evolutionarily divergent lineages of microalgae (Goiris et al., 2014). These observations suggest that the ability to produce flavonoids may have evolved multiple times, or that the ability was widely lost during evolutionary processes. An analysis of the evolutionary rates of six genes involved in anthocyanin biosynthesis indicated that the upstream genes (CHS, CHI, and F3H) evolved more slowly than the downstream genes (DFR, LDOX/ANS, and UDP-glucose:flavonoid 3-O-glucosyltransferase) (Rausher et al., 1999). The upstream genes may be evolutionarily constrained due to their profound effects on the pathways. In addition, the genes encoding transcription factors that regulate anthocyanin biosynthesis have evolved more rapidly than the structural genes (Rausher et al., 1999).

FIGURE 2 | General flavonoid biosynthetic pathways in plants. The arrows in green, blue, and magenta indicate enzymes in the CYP, 2OGD, and SDR superfamilies, respectively. ANR, anthocyanidin reductase; ANS, anthocyanidin synthase; CHI, chalcone isomerase; CHR, chalcone reductase; CHS, chalcone synthase; DFR, dihydroflavonol 4-reductase; F2H, flavanone 2-hydroxylase; F3H, flavanone 3-hydroxylase; F3′H, flavonoid 3′-hydroxylase; F3′5′H, flavonoid 3′,5′-hydroxylase; FLS, flavonol synthase; FNS, flavone synthase; HID, 2-hydroxyisoflavanone dehydratase; IFS, isoflavone synthase; LAR, leucoanthocyanidin reductase; LDOX, leucoanthocyanidin dioxygenase.

# THE EVOLUTIONARY HISTORY OF FLAVONOID BIOSYNTHETIC GENES

# CHS Is a Representative of the Type III PKS Superfamily

The CHS genes are widely distributed in plants, from bryophytes to angiosperms (Jiang et al., 2008; Shimizu et al., 2017; Liou et al., 2018), but they have not been found in other organisms. Every land plant species with available genomic data has at least one putative CHS gene (Shimizu et al., 2017). CHS is a member of the type III PKS superfamily, which provides diverse polyketide scaffolds of secondary metabolites (**Figure 3**) (Winkel-Shirley, 2001; Austin and Noel, 2003; Abe and Morita, 2010). The type III PKSs belong to the thiolase superfamily (Jiang et al., 2008). They are found in land plants, microalgae, fungi, and bacteria, but are not found in animals or archaea (Shimizu et al., 2017). The plant type III PKSs retain the overall folded protein structure and the Cys-His-Asn catalytic triad that characterize the Escherichia coli 3-ketoacyl-ACP synthase isoform III (KASIII) enzyme, which is involved in de novo fatty acid synthesis (Ferrer et al., 1999; Austin and Noel, 2003). The number of land plant type III PKS genes is highly variable among species within the same taxon; for instance, eudicot species have two to 42 genes, whereas the numbers in fungi and bacteria are relatively low (less than five) (Shimizu et al., 2017). The abundance of type III PKSs may contribute to the variety of specialized plant metabolites.

The functional diversity of the type III PKSs is generally derived from differences in the starter molecules, the numbers of chain elongation steps, and the mechanisms of the cyclization reactions (Austin and Noel, 2003; Abe and Morita, 2010). The stilbene synthases (STSs), which produce stilbenes such as resveratrol, are also type III PKSs. CHSs and STSs generate the same intermediates from the same starter molecules using the same chain elongation steps, but they catalyze different intra-molecular cyclizations and produce different products (Austin et al., 2004). The CHSs and STSs from the same plant genera are usually classified as the closest neighbors in phylogenetic trees (**Figure 3**). Data from angiosperms (Vitis vinifera, Arachis hypogaea, and Sorghum bicolor), gymnosperms (P. sylvestris), and ferns (Psilotum nudum) suggest that after diverging, the STSs and CHSs evolved independently (Yu et al., 2005; Weng and Noel, 2012b).

# Basal Land Plant CHSs and Non-CHSs

The eudicot Arabidopsis thaliana contains four type III PKS genes, including a single functional CHS (Kim et al., 2010). In contrast, the bryophyte moss Physcomitrella patens has around 19 CHS family genes and four other type III PKS genes (Jiang et al., 2006; Wolf et al., 2010). Five of the moss CHS family genes were derived from a whole genome duplication, and four are suggested to be derived from segmental duplication and transposition (Wolf et al., 2010). Four of the genes are upregulated by broadband UV-B irradiation, and the moss plants show increased levels of a flavonol derivative under UV-B illumination. These observations support the hypothesis that the genes and enzymes involved in the UV stress response evolved with the water-to-land transition, when early plants were exposed to increased levels of sunlight (Wolf et al., 2010).

In the bryophyte liverwort M. polymorpha, UV-B irradiation and nutrient deprivation significantly increase the total flavone glycoside content (Albert et al., 2018; Clayton et al., 2018). M. polymorpha has 24 CHS family genes (Bowman et al., 2017), and one of these is significantly upregulated by UV-B stress treatment. This induction was enhanced in transgenic Marchantia plants that overexpressed the gene encoding the UV RESISTANCE LOCUS8 (UVR8) photoreceptor (Clayton et al., 2018). Furthermore, transgenic Marchantia plants overexpressing the MpMyb14 transcription factor gene showed increased expression levels of the same CHS gene under normal growth conditions (Albert et al., 2018). In knockout Mpmyb14 mutants, the increase in CHS expression levels under nitrogen-deficient conditions was partially impaired (Kubo et al., 2018). The results suggest that this liverwort species has at least one CHS gene that is activated by the UVR8 signal transduction pathway.

Phylogenetic analyses show close relationships between basal land plant (bryophyte and lycophyte) CHSs and non-CHS type III PKSs (**Figure 3**) (Wanibuchi et al., 2007; Yu et al., 2018). The non-CHS group includes many enzymes involved in the biosynthesis of secondary metabolites, such as acridone synthases, pyrone synthases, bibenzyl synthases, and p-coumaroyltriacetic acid synthases (Winkel-Shirley, 2001). These non-CHS PKS enzymes evolved through repeated gene duplication, mutation, and functional diversification from their ancestral plant enzymes.

Ectopic expression of a CHS gene from either the bryophyte P. patens or the lycophyte Selaginella moellendorffii can partially complement the phenotype of an A. thaliana CHS-null mutant, transparent testa 4 (Liou et al., 2018). Crystal structures of CHSs from P. patens, S. moellendorffii, the monilophyte Equisetum arvense, the gymnosperm Pinus sylvestris, and the angiosperm A. thaliana revealed that the reactivity of the catalytic Cys residue (Cys164 in M. sativa CHS2) has changed during the 500 million years of evolution of land plants. The Cys residues in the three recent lineages (monilophytes, gymnosperms, and angiosperms) are present in the thiolate anion form, which gives them stronger nucleophilic power (Liou et al., 2018).

The type III PKSs show broad substrate promiscuity. CHSs do not accept bulky substrates, but the lycophyte Huperzia serrata HsPKS1 exhibits remarkable substrate tolerance and catalytic potential (Wanibuchi et al., 2007; Morita et al., 2011). In vitro, HsPKS1 produces naringenin chalcone and other polyketides, including aromatic tricyclic pyridoisoindole compounds, which are not found in natural products. A single amino acid replacement in HsPKS1 increases its active-site cavity volume and alters the product chain length and the mechanism of the cyclization reaction. This substrate promiscuity in the type III PKSs provides diverse polyketide scaffolds for the subsequent biosynthesis of secondary metabolites in land plants.

non-CHS proteins of the plant type III PKSs. The non-CHS proteins include the ARAS/ARS proteins and the ASCL families. The overall three-dimensional protein structure is conserved in the type III PKSs and an *E. coli* KASIII enzyme (the αβαβα-fold). (B) Examples of type III PKS products. (C) The CHI-fold proteins in the CHI, CHIL, and FAP families share a common folded protein structure (the open-faced β-sandwich fold). ARAS, alkylresorcylic acid synthase; ARS, alkylresorcinol synthase; ASCL, anther-specific chalcone synthase-like enzyme; CHIL, CHI-like protein; CHI, chalcone isomerase; CHS, chalcone synthase; FAP, fatty-acid-binding protein; KASIII, 3-ketoacyl-ACP synthase isoform III enzyme; PKSs, polyketide synthases; STS, stilbene synthase.

# The ARAS/ARS and ASCL Families

Plant non-CHS type III PKSs also synthesize polyketides from fatty acyl-CoA substrates. Phenolic lipids such as alkylresorcinols are synthesized by alkylresorcinol synthases (ARSs) and alkylresorcylic acid synthases (ARASs) in the monocots S. bicolor and O. sativa, respectively (Cook et al., 2010; Matsuzawa et al., 2010). Alkylresorcinols in grain crop species show anti-fungal and allelopathic activities.

Hydroxyalkyl-α-pyrone compounds, which are precursors of sporopollenin in the pollen wall exine, are synthesized by anther-specific chalcone synthase-like enzymes (ASCLs) that are specifically and transiently expressed in A. thaliana anthers (PKSA/LAP6 and LAP5/PKSB) (Dobritsa et al., 2010; Kim et al., 2010). The pksa pksb double mutant plants are male sterile.

Phylogenetic analyses show that the ARAS/ARS and ASCL families are classified into distinct groups (**Figure 3**) (Shimizu et al., 2017). Bacteria, fungi, and mosses also have genes encoding type III PKSs that produce these long-chain fatty acyl-containing polyketides (Colpitts et al., 2011; Shimizu et al., 2017; Li et al., 2018). Therefore, these type III PKSs are involved in lipid metabolism across three kingdoms of living organisms.

# CHI-Fold Proteins: CHIs and CHILs

The CHIs and CHI-like proteins (CHILs) are members of the CHI-fold family, which also includes the fatty-acid-binding proteins (FAPs) that are involved in fatty acid biosynthesis (**Figure 3**). These proteins share a common folded three-dimensional structure (Jez et al., 2000; Ngaki et al., 2012; Kaltenbach et al., 2018). The FAP family is distributed in many bacteria, fungi, and plant species. CHILs partly lack the catalytic amino acid residues conserved in CHIs, but they bind with CHSs and enhance their activity (Ban et al., 2018). Phylogenetic and genomic analyses of CHI-fold proteins suggest that CHILs first appeared in mosses and evolved from FAPs, and then served as the ancestors of CHIs (Ngaki et al., 2012; Morita et al., 2014; Jiang et al., 2015). The CHILs form a group that is distinct from the CHI and FAP groups (**Figure 3**). The moss P. patens has two CHIL and four FAP genes, but does not appear to have any CHI genes (Ngaki et al., 2012; Cheng et al., 2018). The liverwort M. paleacea has a CHI, a CHIL, and two FAP genes (Cheng et al., 2018), and the lycophyte S. moellendorffii has a CHI, a CHIL, and three FAP genes. A. thaliana has a CHI, a CHIL, and three FAP genes, whereas the legume Glycine max has a type I CHI, three type II CHIs, two CHILs, and six FAP genes (Dastmalchi and Dhaubhadel, 2015; Ban et al., 2018). These results suggest that the number of CHI and CHIL genes remains low in many plant lineages, but that leguminous plants have several CHI genes.

# Type I and Type II CHIs in Vascular Plants

There are two types of CHI. Type I CHIs are ubiquitous in vascular plants, whereas type II CHIs are specific to legumes and are involved in flavonoid synthesis during nitrogen-fixing root nodule symbioses (Shimada et al., 2003; Subramanian et al., 2007). Phylogenetic analyses show that the type II CHIs form a distinct group from the type I CHIs (**Figure 3**). The type II CHIs likely evolved from ancestral CHI-fold proteins (Cheng et al., 2018).

The type II CHIs isomerize both naringenin chalcone and isoliquiritigenin (6′-deoxychalcone) to produce (2S)-naringenin and liquiritigenin (5-deoxyflavanone), respectively. The model legume Lotus japonicus has a type I and three type II CHI genes (Shimada et al., 2003). These four CHI genes form a tandem cluster within a 15-kb region of the genome. In soybean, a type I and two type II CHI genes are organized in a gene cluster on chromosome 20 (Dastmalchi and Dhaubhadel, 2015), and are probably derived from tandem gene duplications. The role of the CHIs in legume symbiosis remains unclear. Nitrogen-fixing root nodule symbioses are found in four angiosperm orders: Fabales, Fagales, Cucurbitales, and Rosales (the latter three are known as actinorhizal plants). The expression of a CHI gene is upregulated in nodules of the actinorhizal plant Datisca glomerata (Gifford et al., 2018), suggesting that these CHIs contribute to increases in flavonoid contents.

# CHIs in Basal Land Plants

The A. thaliana chi-deficient tt5 mutants produce pale yellow seeds due to significant reductions in proanthocyanidin production. This phenotype is largely complemented by ectopic expression of a CHI gene from either the liverwort M. paleacea (MpCHI1) or the lycophyte S. moellendorffii (SmCHI1) (Cheng et al., 2018). The CHIs from basal land plants have broad substrate specificities and are more like the type II CHIs than the type I CHIs. Phylogenetic analyses show that MpCHI1 and SmCHI1 form separate groups from the type I and type II CHIs (**Figure 3**) (Cheng et al., 2018; Kaltenbach et al., 2018).

The liverwort M. polymorpha also has a single CHI gene (Clayton et al., 2018), and Marchantia chi mutants do not contain detectable levels of flavone compounds. These chi mutants are highly sensitive to UV-B stress treatment. These results indicate that this basal plant species already has a gene encoding a bona fide CHI to catalyze the cyclization of naringenin chalcone in the flavonoid biosynthetic pathway.

# The CHIL Family

CHILs are categorized as type IV CHI-fold proteins and are found in basal and higher plant species including mosses, liverworts, lycophytes, ferns, gymnosperms, and angiosperms. CHILs do not have bona fide CHI activity. However, the RNAi knockdown of CHIL expression in Petunia hybrida and Torenia hybrida resulted in decreased levels of total flavonoids in the flowers (Morita et al., 2014). Three independent Ipomoea nil (Japanese morning glory) mutants with alterations in enhancer of flavonoid production (a CHIL gene) showed pale-colored flower phenotypes (Morita et al., 2014). CHIL loss-of-function mutants in A. thaliana show reductions in the levels of proanthocyanidin and flavonols in seeds, and flavonols in leaves (Jiang et al., 2015). However, the A. thaliana CHIL gene could not rescue the phenotypes of the tt5 mutants (Jiang et al., 2015). The liverwort M. polymorpha has a CHIL gene (Clayton et al., 2018), and the total flavone content is reduced in Marchantia chil mutants under normal growth conditions and under UV-B treatment. Thus, the CHILs enhance total flavonoid production but have roles that are distinct from those of the CHIs. As mentioned above, it was recently shown that CHILs from various plant lineages (A. thaliana, O. sativa, S. moellendorffii, and P. patens) can bind CHSs and boost CHS activity (Ban et al., 2018). CHIs also bind CHSs (Jorgensen et al., 2005), and it may be that the type I and type II CHIs evolved from CHILs and gained their CHI activity during subsequent evolutionary processes (Ban et al., 2018).

# Acquisition of Enzymatic CHI Activity During Evolution

The emergence of enzymatic CHIs in plants is an interesting topic in protein evolution, because the CHILs and FAPs are non-enzymatic proteins. The cleft in the CHI active site consists of three highly conserved amino acid residues (Arg36, Thr48, and Tyr106 in the M. sativa type II CHI sequence), and their neighboring residues are also conserved (Jez et al., 2000; Ngaki et al., 2012; Kaltenbach et al., 2018). The common ancestral proteins of the CHIs and CHILs were inferred in an extensive phylogenetic analysis, and it appears that all three key catalytic residues were conserved in the ancestral proteins, although these proteins were inactive (Kaltenbach et al., 2018). The authors performed a stepwise, activity-based screening of recombinant ancestral proteins using an E. coli expression system. The results indicated that mutations in amino acid residues other than the catalytic residues were required to initiate CHI evolution and to acquire CHI catalytic activity.

# The 2-Oxoglutarate-Dependent Dioxygenase Family: F3H, FNS I, FLS, and LDOX/ANS

The 2OGD superfamily is one of the largest protein families in the plant kingdom. Its members are widely distributed in bacteria, fungi, plants, vertebrates, and even viruses (van den Born et al., 2008; Farrow and Facchini, 2014; Markolovic et al., 2015; Wu et al., 2016). The 2OGDs are non-heme iron-containing enzymes that are localized in the cytosol. 2OGDs incorporate 2-oxoglutarate (2OG or α-ketoglutarate) and activated O<sub>2</sub> into a variety of substrates to form the oxidized products along with succinate and CO<sub>2</sub> (RH + 2OG + O<sub>2</sub> → ROH + succinate + CO<sub>2</sub>). 2OGDs catalyze various oxidative reactions including hydroxylation, halogenation, desaturation, and epimerization (Martinez and Hausinger, 2015) and play important roles in DNA and RNA repair, fatty acid metabolism, oxygen sensing, and biosynthesis of natural products (Farrow and Facchini, 2014; Hagel and Facchini, 2018; Herr and Hausinger, 2018; Islam et al., 2018).

In plants, 2OGDs are involved in histone demethylation, iron sensing, phytohormone metabolism, and the biosynthesis of secondary metabolites (reviewed in Farrow and Facchini, 2014). A phylogenetic analysis of plant 2OGDs from Chlamydomonas reinhardtii, P. patens, S. moellendorffii, Picea abies, O. sativa, and A. thaliana identified three classes, which the authors named DOXA, DOXB, and DOXC (Kawai et al., 2014). Each class contains proteins from all six species. The DOXA class contains homologs of E. coli AlkB; these enzymes are involved in DNA repair (Lindahl et al., 1988; Meza et al., 2012; Mielecki et al., 2012). The DOXB class contains prolyl 4-hydroxylases that catalyze the hydroxylation of proline residues in plant cell wall proteins (Hieta and Myllyharju, 2002). Proteins in the DOXC class are involved in phytohormone metabolism and the biosynthesis of secondary metabolites including flavonoids, terpenoids, alkaloids, and glucosinolates. The numbers of genes encoding DOXA and DOXB enzymes are limited in the six species; however, the DOXC genes are significantly expanded in the land plants (Kawai et al., 2014).

Four flavonoid biosynthetic enzymes, F3H, FNS I, FLS, and LDOX/ANS belong to the DOXC class. A phylogenetic analysis showed that the genes in the DOXC class can be classified into over 50 clades (Kawai et al., 2014). The F3H and FNS I genes are in the DOXC28 clade while the FLS and LDOX/ANS genes are in DOXC47. Among the flavonoid biosynthetic genes in the 2OGD superfamily, it has been proposed that F3H was the first to appear (Rausher, 2006). FNS I seems to exist only in the Apiaceae, and it is likely that FNS I evolved from F3H as a paraphyletic gene (Martens et al., 2003; Gebhardt et al., 2005).

Arabidopsis thaliana contains one F3H gene, six FLS genes (AtFLS1–AtFLS6), one LDOX/ANS gene, and no FNS gene. AtFLS1 is the major FLS (Owens et al., 2008a; Saito et al., 2013). An analysis of structural divergence between duplicated genes showed that transposed duplication (<16 million years ago) explains the relationship between AtFLS6 (At5g43935) and AtF3H (At3g51240) (Wang et al., 2013). Furthermore, the relationship between AtFLS1 (At5g08640) and AtFLS5 (At5g63600) is likely explained by a whole genome duplication, and those between AtFLS2 (At5g63580) and AtFLS3 (At5g63590), AtFLS3 and AtFLS4 (At5g63595), and AtFLS4 and AtFLS5 are most likely explained by tandem local duplications (Wang et al., 2013). The A. thaliana AtLDOX/ANS has been shown to produce flavonols in planta (Preuss et al., 2009). Furthermore, FLS and LDOX/ANS can partially complement F3H function in vivo, and this results in the leaky phenotype of tt6 mutants with null mutations in AtF3H (Owens et al., 2008b).

Together, the results suggest that F3H is the ancestral 2OGD gene for flavonoid biosynthesis, and that FLS and FNS I evolved via divergence from F3H. No apparent orthologs of either the DOXC28 or DOXC47 clade genes were found in P. patens and S. moellendorffii, even though these plants produce flavonols (P. patens) and flavones (S. moellendorffii) (Wolf et al., 2010; Weng and Noel, 2013). However, the liverwort Plagiochasma appendiculatum has an active FNS I (PaFNS I), and a phylogenetic analysis revealed that PaFNS I is related to the angiosperm FNS I and F3H proteins, even though it is not in the same clade as them (Han et al., 2014). These data suggest that in Physcomitrella and Selaginella, the 2OGDs are present in distinct clade(s), or that unrelated enzymes perform the same functions as F3H, FNS, and/or FLS.

PaFNS I can convert naringenin to either 2-hydroxynaringenin or apigenin (Han et al., 2014). The common horsetail E. arvense L. also has 2OGD-type FNS I activity (Bredebach et al., 2011). Further research on the 2OGD genes of bryophytes, lycophytes, and ferns will help to clarify their evolutionary processes.

#### The Cytochrome P450 Superfamily: FNS II, F2H, IFS, F3′H, and F3′5′H

The CYPs are widely distributed in viruses, archaea, bacteria, and eukaryotes. They catalyze monooxygenase/hydroxylation reactions in various primary and secondary metabolic processes by insertion of an O atom from molecular O<sub>2</sub> (Mizutani and Ohta, 2010). In eukaryotes, the CYPs are heme-containing membrane proteins localized on the cytosolic surface of the endoplasmic reticulum.

In plants, the CYPs form the largest superfamily of enzymes and account for about 1% of the total number of gene products (Mizutani and Ohta, 2010; Nelson and Werck-Reichhart, 2011; Kawai et al., 2014). The CYPs are categorized into families (e.g., CYP75) that have ∼40% or more amino acid sequence identity, and those with 55% or more identity are categorized into subfamilies (e.g., CYP75A). Furthermore, the plant CYP families (CYP71 to CYP99 and CYP701–) can be classified into clans whose members are derived from single ancestors. The land plant CYPs form 11 clans, and seven of these (clans 51, 74, 97, 710, 711, 727, and 746) consist of single CYP families, while the remaining four (clans 71, 72, 85, and 86) include proteins in multiple CYP families (Nelson and Werck-Reichhart, 2011). Green algae contain CYPs in five single-family clans (clans 51, 97, 710, 711, and 746), and members of these clans are involved in fundamental biological processes such as biosynthesis of sterols, xanthophylls, and phytohormones (Nelson, 2006; Nelson and Werck-Reichhart, 2011). Therefore, it is likely that these clans include the ancestral CYPs. The multi-family CYP clans have become highly diversified during plant evolution (Nelson and Werck-Reichhart, 2011). Some CYP families in these clans (clans 71, 72, 85, and 86) are present in mosses and/or liverworts but not in green algae. Additional novel CYP families were gained in stepwise processes following the evolution of vascular plants.
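The naming convention described above can be summarized in a short sketch. The thresholds follow the approximate values given in the text (about 40% identity for a family and 55% for a subfamily); in practice, assignments also take phylogeny into account, so the hard cut-offs, the function names, and the example identity values below are simplifying assumptions for illustration.

```python
import re

FAMILY_THRESHOLD = 40.0     # approximate % amino acid identity for the same family
SUBFAMILY_THRESHOLD = 55.0  # % amino acid identity for the same subfamily


def cyp_relationship(identity_pct):
    """Classify a pair of CYP sequences from their pairwise identity (%)."""
    if identity_pct >= SUBFAMILY_THRESHOLD:
        return "same subfamily"
    if identity_pct >= FAMILY_THRESHOLD:
        return "same family"
    return "different families"


def parse_cyp_name(name):
    """Split a name such as 'CYP75B1' into family, subfamily, and member.
    Missing parts (e.g., in 'CYP75' or 'CYP701A') are returned as None."""
    match = re.fullmatch(r"CYP(\d+)([A-Z]*)(\d*)", name)
    if match is None:
        raise ValueError(f"not a CYP name: {name!r}")
    family, subfamily, member = match.groups()
    return {
        "family": f"CYP{family}",
        "subfamily": f"CYP{family}{subfamily}" if subfamily else None,
        "member": int(member) if member else None,
    }


print(cyp_relationship(62.0))     # hypothetical pairwise identity
print(parse_cyp_name("CYP75B1"))
```

Under this convention, CYP75A and CYP75B enzymes share the family CYP75 but fall into different subfamilies, which is the distinction that matters for the F3′H and F3′5′H enzymes discussed in this review.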

Flavonoid biosynthetic enzymes (members of the CYP75 and CYP93 families) are classified into clan 71. This clan contains the largest number of CYP families, and in addition to the flavonoid biosynthetic enzymes, it includes families involved in the biosynthesis of phenylpropanoids (CYP73, CYP84, CYP98), alkaloids (CYP80, CYP82, CYP719), terpenoids (CYP76, CYP99, CYP705, CYP706, CYP726), and glucosinolates (CYP79, CYP83). The clan 71 families CYP73, CYP74, CYP78, CYP88, CYP98, CYP701, CYP703, CYP736, and CYP761 are present in bryophytes, suggesting that these may be ancestral families.

#### The CYP75 Family: F3′H and F3′5′H

The F3′H and F3′5′H enzymes generally belong to the CYP75B and CYP75A families, respectively. The CYP75 family enzymes from monocots and dicots form distinct clusters within each subfamily, suggesting that the F3′H and F3′5′H functions were established before the divergence of monocots and dicots. However, there are some exceptions. In the Asteraceae, F3′5′H belongs to the CYP75B subfamily and forms a distinct cluster from F3′H (Seitz et al., 2006). This suggests that the Asteraceae F3′H gained F3′5′H activity before speciation but after the separation of the monocots and dicots. Similarly, the rice CYP75B4 catalyzes the 5′-hydroxylation of the 3′-methoxyflavone chrysoeriol, and also functions as an F3′H (Lam et al., 2015). Rice contains CYP75B3 as an F3′H and CYP75A11 as a nonfunctional F3′5′H. Such CYP gene distributions in the CYP75A and CYP75B subfamilies are also found in other Poaceae plants. These data suggest that the CYP75A and CYP75B subfamilies separated before divergence of monocots and dicots, and that genes in the CYP75B subfamily later gained F3′5′H activity, at least in the Asteraceae and Poaceae.

Arabidopsis thaliana has a single gene for F3′H (CYP75B1, At5g07990) and no genes corresponding to F3′5′H, FNS II, or F2H. The CYP75B1 gene appears to be related to the CYP701A (At5g25900) gene, which encodes an enzyme involved in gibberellin biosynthesis, via a transposed duplication that occurred 16–107 million years ago, and it was proposed that CYP75B1 is the parental locus of CYP701A (Wang et al., 2013). However, the CYP701 family is distributed among bryophytes and vascular plants while the CYP75 family is found in gymnosperms and angiosperms. In addition, a phylogenetic analysis suggested that the moss/liverwort-specific CYP761 family is closely related to the CYP75 family (Nelson and Werck-Reichhart, 2011). These data suggest that either CYP701 or CYP761 is the ancestral family of CYP75. Although 3′-hydroxylated flavone derivatives were detected in M. polymorpha and S. moellendorffii (Markham et al., 1998), CYP75 family members can be found only in gymnosperms and angiosperms (Nelson and Werck-Reichhart, 2011). Therefore, in bryophytes, lycophytes, and ferns, enzymes from other CYP families may function as F3′H and/or F3′5′H.

Such inconsistencies in the relationships between metabolites and genes are also observed in phenylpropanoid metabolism. Among the CYPs involved in phenylpropanoid metabolism from clan 71, CYP73A (cinnamate 4-hydroxylase, C4H), CYP98A (p-coumaroyl shikimate 3′-hydroxylase, C3′H), CYP73, and CYP98 first appeared in liverworts and mosses, while CYP84 (ferulate 5-hydroxylase, F5H) is found only in angiosperms (Nelson and Werck-Reichhart, 2011). However, the syringyl lignin units from the phenylpropanoid pathway are distributed in land plants (Bowman et al., 2017), suggesting that genes in other CYP families may have F5H activities. In fact, the Selaginella CYP788A1 functions as an F5H (Weng et al., 2008).

#### The CYP93 Family: FNS II, F2H, and IFS

A genome-wide analysis of the CYP93 family genes from 60 green plants indicated that the CYP93 family is found only in angiosperms (Du et al., 2016). Among the 10 subfamilies (CYP93A–CYP93K), CYP93A is the ancestral group distributed in both monocots and dicots; CYP93B and CYP93C are distributed only in dicots; CYP93G and CYP93J are found only in monocots; and CYP93E and CYP93F are specific to legumes and grasses, respectively. Thus, the CYP93 family shows plant lineage-specific evolution.

The CYP93 family contains enzymes involved in the biosynthesis of flavones and isoflavones (FNS II, F2H, and IFS). The monocot FNS II and F2H belong to the CYP93G subfamily, while those in the dicots are categorized in the CYP93B subfamily. Therefore, phylogenetic analyses suggest that the functions of the FNS II and F2H enzymes were established after the divergence of monocots and dicots. The IFS enzyme in legumes is a member of the CYP93C subfamily, which may be derived from the CYP93B subfamily (Du et al., 2016).

As with the CYP75 family, there are inconsistencies in the relationships between metabolites and genes in the CYP93 family. The CYP93 family genes are found only in angiosperms (Nelson and Werck-Reichhart, 2011), but flavones are widely distributed in plants, from bryophytes to angiosperms (**Figure 1**). This suggests that enzymes involved in flavone biosynthesis belong to the CYP93 family in some species of angiosperms, but that different CYP families and/or other enzymes play the same roles in other plant taxa. For example, the Apiaceae family adopted the 2OGD-type FNS I rather than the CYP-type FNS II to produce flavones (Gebhardt et al., 2007).

Isoflavonoids are the typical flavonoids found in legumes, and isoflavone biosynthetic genes are found only in legumes. However, isoflavonoids are also found in non-legume plants including Iris, which contains a wide variety of isoflavones, some mosses (e.g., Bryum capillare), gymnosperms, and monocot and dicot angiosperms (Dewick, 1994; Lapcik, 2007). This suggests that plants independently acquired the ability to produce isoflavonoids, and that CYP families other than CYP93C and/or non-CYP enzymes function as IFSs in these taxa.

#### The Short-Chain Dehydrogenase/Reductase Family: DFR, ANR, and LAR

DFR, ANR, and LAR are members of the SDR superfamily, which is widely distributed in viruses, archaea, prokaryotes, and eukaryotes (Jornvall et al., 1999; Kavanagh et al., 2008). The SDRs constitute one of the largest NAD(P)(H)-dependent oxidoreductase families and are involved in the primary metabolism of lipids, carbohydrates, and hormones, and the secondary metabolism of molecules such as terpenoids, alkaloids, and phenolic compounds (Jornvall et al., 1999; Kavanagh et al., 2008; Tonfack et al., 2011). In spite of the low overall sequence similarities among SDRs (15–30%), the SDRs possess a conserved 3D structure consisting of a Rossmann-fold β-sheet surrounded by α-helices for nucleotide binding. Generally, the SDRs can be classified into several types ("classical," "extended," "intermediate," "divergent," and "complex") based on their primary structures, cofactor binding motifs, and active sites (Kavanagh et al., 2008; Moummou et al., 2012). A recent study revealed that the "intermediate" and "complex" types are not found in plants, while the "atypical" and "unknown" types are. Thus, the plant SDRs can be categorized into five types: "classical," "divergent," "extended," "atypical," and "unknown" (Moummou et al., 2012). Among the plant SDRs, "classical" and "extended" are the major types, as they are in other organisms. The "classical" type is composed of about 250 amino acid residues, and the "extended" type has a domain of 100 additional amino acid residues in the C-terminal region. The "atypical" type, an uncommon type of SDR, was included in the SDR family because of its Rossmann-fold structure, which is typical of SDRs (Moummou et al., 2012).

A genome sequence analysis using 10 species including C. reinhardtii, P. patens, S. moellendorffii, four dicots (A. thaliana, Populus trichocarpa, V. vinifera, and G. max) and three monocots (O. sativa, S. bicolor, and Z. mays) showed that most plant SDRs can be classified into 49 families that are distributed among the five types mentioned above (Moummou et al., 2012). The DFR and ANR enzymes are classified into the SDR108E family in the "extended" type. The LAR enzymes belong to the SDR460A family, which is an "atypical" type (Kallberg et al., 2010; Moummou et al., 2012).
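The family names used here (e.g., SDR108E, SDR460A) combine a family number with a one-letter type code, which can be sketched as a small parser. This is a hedged illustration only: the mappings of "E" to "extended" and "A" to "atypical" follow the text, whereas the letters assumed for the "classical," "divergent," and "unknown" types, as well as the function and dictionary names, are assumptions made for this example.

```python
import re
from typing import Tuple

# "E" (extended) and "A" (atypical) are stated in the text; the other
# letters are assumed here to follow the same one-letter convention.
SDR_TYPE_BY_LETTER = {
    "C": "classical",
    "E": "extended",
    "D": "divergent",
    "A": "atypical",
    "U": "unknown",
}


def parse_sdr_family(name: str) -> Tuple[int, str]:
    """Split a plant SDR family name into (family number, SDR type)."""
    match = re.fullmatch(r"SDR(\d+)([A-Z])", name)
    if match is None or match.group(2) not in SDR_TYPE_BY_LETTER:
        raise ValueError(f"not a recognized plant SDR family name: {name!r}")
    return int(match.group(1)), SDR_TYPE_BY_LETTER[match.group(2)]
```

For example, `parse_sdr_family("SDR108E")` yields family 108 of the "extended" type, the family that contains the DFR and ANR enzymes.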

#### The SDR108E Family: DFR and ANR

Compared to other families that contain a few genes per species, the SDR108E family contains the largest number of SDR genes; for example, this family includes 24 genes in A. thaliana and 44 genes in O. sativa. Furthermore, the SDR108E family shows the lowest average sequence identity, indicating that the SDR108E family genes are highly diversified (Moummou et al., 2012). The distribution of SDR108E family genes in 10 species indicates that this family has expanded significantly in vascular plants.

In addition to the DFR and ANR enzymes, the SDR108E family contains other enzymes involved in secondary metabolism, including cinnamoyl-CoA reductase (CCR) for lignin biosynthesis, and phenylacetaldehyde reductase for the production of volatile 2-phenylethanol. Furthermore, enzymes involved in phytohormone metabolism, such as phaseic acid reductase for abscisic acid catabolism (Weng et al., 2016) and BEN1 for brassinosteroid homeostasis (Yuan et al., 2007), belong to this family. Each enzyme type forms a distinct cluster in the phylogenetic tree. The CCR and phenylacetaldehyde reductase clusters contain enzymes from P. patens and S. moellendorffii, whereas the clusters of DFRs, ANRs, phaseic acid reductases, and BEN1 enzymes are derived from flowering plants. These data suggest that the DFRs, ANRs, and other enzymes in these clusters appeared more recently than the CCRs and phenylacetaldehyde reductases.

Other SDR families show similar gene expansion patterns to that of SDR108E. These families (SDR110C, SDR114C, SDR65C, and SDR460A) contain genes involved in the biosynthesis of alkaloids, terpenoids, phenylpropanoids, and phytohormones (Moummou et al., 2012). SDR families that are less diversified contain genes involved in primary metabolism, such as lipid and chlorophyll biosynthesis (Moummou et al., 2012).

#### The SDR460A Family: LAR

The SDR460A family is also referred to as the PIP family, named after the first three enzymes (pinoresinol-lariciresinol reductase, isoflavone reductase, and phenylcoumaran benzylic ether reductase) that were discovered to belong to this family (Gang et al., 1999; Min et al., 2003; Wang et al., 2006). In addition, vestitone reductase, eugenol synthase, and isoeugenol synthase also belong to this family (Koeduka et al., 2008). Pinoresinol-lariciresinol reductase and phenylcoumaran benzylic ether reductase function in the lignan biosynthetic pathway; isoflavone reductase and vestitone reductase are involved in isoflavonoid biosynthesis; and eugenol synthase and isoeugenol synthase are involved in the biosynthesis of volatile phenylpropenes. Thus, the SDR460A family members are involved in the biosynthesis of various phenolic compounds. The SDR460A family members ("atypical" type SDRs) are limited to, and have greatly expanded in, vascular plants (Moummou et al., 2012), suggesting that they are needed for vascular plant prosperity.

A phylogenetic analysis of LARs from various plants revealed that the plant LARs can be classified into two clusters: proteins from dicotyledons, and proteins from monocotyledons and gymnosperms (Wang et al., 2018). Therefore, the monocotyledon LARs are more closely related to the gymnosperm LARs than to the dicotyledon LARs.

# FUTURE PERSPECTIVES

Progress in metabolomics technologies, including chemoinformatics, and the abundant genomic information on flavonoid biosynthetic genes have facilitated a fuller understanding of the evolution of flavonoid/phenylpropanoid metabolism in the plant kingdom. In this review, we have focused on the evolution of enzymes involved in the biosynthesis of flavonoid skeleton molecules. The basic structures of flavonoids are formed by a type III PKS, CHS. The broad substrate promiscuity and functional diversity of the type III PKSs may be a driving force for expanding the chemical variety of specialized metabolites. In other specialized metabolic pathways, isomerases (e.g., oxidosqualene synthase) and lyases (e.g., terpene synthase) are also involved in scaffold formation and may contribute to the chemical diversity of secondary metabolites. Modification enzymes such as GTs (glycosyltransferases) and acyltransferases also contribute greatly to the huge diversity of flavonoids and other secondary metabolites. Interestingly, plants have two types of flavonoid GTs: the cytosolic family 1 GTs and the vacuolar glycoside hydrolase family 1 (GH1) (Cao et al., 2017). Acylation is also catalyzed by differentially localized enzymes: cytosolic BAHD acyltransferases and vacuolar SCPL acyltransferases, derived from serine carboxypeptidase (Milkowski and Strack, 2004; Moghe and Last, 2015). It is still unknown why plants have evolved these differentially localized enzymes for the modification of flavonoids and other specialized metabolites. The evolution of the family 1 GTs, the GH1s, and the BAHD acyltransferases in plants has been reviewed elsewhere (St-Pierre and De Luca, 2000; Yu et al., 2009; Tuominen et al., 2011; Yonekura-Sakakibara and Hanada, 2011; Caputi et al., 2012; Moghe and Last, 2015; Cao et al., 2017). Throughout their long history, plants have engineered their metabolic pathways to adapt themselves to their habitats and growth conditions in tissue- and organ-specific manners. A detailed understanding of the evolutionary history of metabolic enzymes involved in biosynthesis, modification, transport, secretion, transcriptional regulation, and chemodiversity will assist us in the engineering of specialized metabolic pathways to produce desirable metabolites with minimal energy expenditures.

# AUTHOR CONTRIBUTIONS

KY-S proposed the concept. KY-S, YH, and RN developed it and wrote the manuscript.

# FUNDING

This work was partially supported by the JSPS KAKENHI program (grant number 17K07460 to KY-S), the Integrated Lipidology Program of RIKEN (YH), and the Project of the NARO Bio-oriented Technology Research Advancement Institution (Research program on development of innovative technology) (RN).

# REFERENCES

between Arabidopsis and rice response to stressors. Front. Plant Sci. 8:350. doi: 10.3389/fpls.2017.00350


of the liverwort Marchantia paleacea. Plant Physiol. Biochem. 125, 95–105. doi: 10.1016/j.plaphy.2018.01.030


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yonekura-Sakakibara, Higashi and Nakabayashi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Atypical Myrosinase as a Mediator of Glucosinolate Functions in Plants

### Ryosuke Sugiyama and Masami Y. Hirai\*

RIKEN Center for Sustainable Resource Science, Yokohama, Japan

Glucosinolates (GLSs) are a well-known class of specialized plant metabolites, distributed mostly in the order Brassicales. A vast research field in basic and applied sciences has grown up around GLSs owing to their presence in important agricultural crops and the model plant Arabidopsis thaliana, and their broad range of bioactivities beneficial to human health. The major purpose of GLSs in plants has been considered their function as a chemical defense against predators. GLSs are physically separated from a specialized class of beta-thioglucosidases called myrosinases, at the tissue level or at the single-cell level. They are brought together as a consequence of tissue damage, primarily triggered by herbivores, and their interaction results in the release of toxic volatile chemicals including isothiocyanates. In addition, recent studies have suggested that plants may adopt other strategies independent of tissue disruption for initiating GLS breakdown to cope with certain biotic/abiotic stresses. This hypothesis has been further supported by the discovery of an atypical class of GLS-hydrolyzing enzymes possessing features that are distinct from those of the classical myrosinases. Nevertheless, little information is available on the physiological importance of atypical myrosinases. In this review, we focus on the broad diversity of the beta-glucosidase subclasses containing known atypical myrosinases in A. thaliana to discuss the hypothesis that numerous members of these subclasses can hydrolyze GLSs to regulate their diverse functions in plants. Also, the increasingly broad functional repertoires of known atypical/classical myrosinases are described with reference to recent findings. Assessment of independent insights gained from A. thaliana with respect to (1) the phenotype of mutants lacking genes in the GLS metabolic/breakdown pathways, (2) fluctuation in GLS contents/metabolism under specific conditions, and (3) the response of plants to exogenous GLSs or their hydrolytic products, will enable us to reconsider the physiological importance of GLS breakdown in particular situations, which is likely to be regulated by specific beta-glucosidases.

Keywords: glucosinolate, myrosinase, beta-glucosidase, metabolism, stress response

#### Edited by:

Jens Rohloff, Norwegian University of Science and Technology, Norway

#### Reviewed by:

Ute Wittstock, Technische Universität Braunschweig, Germany; Verena Jeschke, University of Copenhagen, Denmark

> \*Correspondence: Masami Y. Hirai masami.hirai@riken.jp

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 02 May 2019 Accepted: 18 July 2019 Published: 06 August 2019

### Citation:

Sugiyama R and Hirai MY (2019) Atypical Myrosinase as a Mediator of Glucosinolate Functions in Plants. Front. Plant Sci. 10:1008. doi: 10.3389/fpls.2019.01008

# INTRODUCTION

Over the years, a number of bioactive metabolites have been identified in plants, many of which are utilized as beneficial sources of pharmaceuticals and/or research tools. Glucosinolate (GLS), a class of sulfur-rich natural products mainly produced by the family Brassicaceae, is among the most studied plant metabolites owing to its potential health-related benefits and availability in the model plant Arabidopsis thaliana (Halkier and Gershenzon, 2006; Agerbirk and Olsen, 2012). GLSs impart specific pungency and flavors to Brassicaceae vegetables such as mustard and cabbage (Fahey et al., 2001; Halkier, 2016; Possenti et al., 2017). Moreover, a few GLS compounds such as glucoraphanin (4-methylsulfinyl-n-butyl glucosinolate) are known to produce health-promoting chemicals with diverse bioactivities (Traka, 2016; Banerjee and Paruthy, 2017). Very recently, the enormous body of research activities on GLS has been compiled as two books entitled Glucosinolates with few overlaps (Kopriva, 2016; Mérillon and Ramawat, 2017).

In recent years, chemical ecology, which focuses on gaining an understanding of the physiological functions of specialized metabolites (previously referred to as secondary metabolites) in organisms, has also become a burgeoning scientific field in natural product research. In this context, GLSs have traditionally been considered as defense chemicals deployed against predators. GLSs generally accumulate in specific cells (S-cells), separated from cells containing their hydrolytic enzymes (beta-thioglucosidases called myrosinases) (Halkier, 2016; Wittstock et al., 2016a). In addition, it has been suggested that GLSs and myrosinases could co-exist even within single cells, probably being compartmentalized in different organelles (Koroleva and Cramer, 2011). They are mixed upon tissue damage to release toxic volatiles such as isothiocyanates (ITCs) (**Figure 1**). Specifier proteins and side chain structures play an important role in converting the unstable aglycon to various end products with different bioactivities (Lambrix et al., 2001; Hanschen et al., 2014; Wittstock et al., 2016a,b; Eisenschmidt-Bönn et al., 2019). For instance, simple nitriles are generated in the presence of nitrile specifier proteins, and ITCs possessing a hydroxyl group at position 2 can be further cyclized. This compartmentalization, which enables plants to safely control harmful chemicals, is referred to as the GLS–myrosinase system or "mustard oil bomb" and has long fascinated many plant scientists (Lüthy and Matile, 1984).

Knowledge of the chemical diversity embedded in GLS metabolism and of the molecular mechanisms underlying the GLS–myrosinase system is frequently updated. For example, intensive investigations have been made of GLS biosynthetic genes (Halkier, 2016; Barco and Clay, 2019), end products directed by specifier proteins (Wittstock et al., 2016a), tissue localization via GLS transporters (Jørgensen et al., 2015; Halkier, 2016), differences in GLS contents among species and accessions (Kliebenstein and Cacho, 2016), and proteins that interact with myrosinases to regulate their activity and stability (Bhat and Vyas, 2019; Chen et al., 2019). However, most of the insights that have been gained regarding the molecular basis and physiological importance of GLS breakdown are based on the intercellular and tissue damage-dependent GLS–myrosinase system. On the other hand, there are several reports on fluctuations of endogenous GLS levels even in non-disrupted tissues, caused by pathogen attack or abiotic stress (Martinez-Ballesta et al., 2013; Variyar et al., 2014; Burow, 2016; Pastorczyk and Bednarek, 2016). These observations suggest the existence of different system(s) that regulate GLS breakdown independent of tissue disruption, to cope with such environmental stresses. In connection with the subcellular GLS–myrosinase compartmentalization, additional functions of GLSs not limited to their role as defense chemicals against herbivores have been recognized (Bednarek, 2012b; Katz et al., 2015; Francisco et al., 2016b; Burow and Halkier, 2017). In fact, many of the latest studies have demonstrated broad physiological functions of GLSs (Francisco et al., 2016a; Katz and Chamovitz, 2017; Malinovsky et al., 2017; Nintemann et al., 2018; Urbancsok et al., 2018a,b).

In order to gain a more comprehensive understanding of the possible multi-functionality of GLSs in planta, we should take a global view of the profound diversification embedded in GLSs (**Figure 1**). Nearly 150 GLS compounds have been identified to date, and there are at least 36 GLSs with different side-chain structures in A. thaliana (Brown et al., 2003; Agerbirk et al., 2018). Three subclasses of GLSs (aliphatic, benzenic, and indole GLSs) are biosynthesized from different precursor amino acids with independent regulatory systems by MYB and MYC transcription factors (**Figure 1**) (Frerigmann, 2016). Thus, it is conceivable that each GLS class could participate in distinct biological processes, as indole GLSs are known to play an essential role in plant immunity via coordination with the metabolism of other phytoalexins (Bednarek et al., 2009; Pedras et al., 2011; Klein and Sattely, 2015, 2017; Frerigmann et al., 2016). In addition, different bioactivities of ITCs, dependent on their side-chain structures, are well recognized despite their non-specific nucleophilicity, which may contribute to fitness performance of plants (Burow et al., 2010; Andersson et al., 2015; Urbancsok et al., 2017). Moreover, the production of various end products even from a single GLS species, directed by the specifier proteins, is likely to expand the endogenous biological targets of GLSs in different signaling pathways (Lambrix et al., 2001; Wittstock et al., 2016a,b). In contrast, the genetic and biochemical diversity in myrosinases is less understood, even though a new class of beta-glucosidases capable of hydrolyzing GLSs has been identified. Compared with the well-documented class of myrosinases that are widely found in the order Brassicales, these so-called atypical myrosinases possess unique features with respect to both amino acid sequences and enzymatic profiles. In this review, we therefore focus on the considerable diversity of beta-glucosidases in A. thaliana as a model to discuss their possible contribution to the broad utility of GLSs in plants, even at the subcellular level.

# CLASSICAL VERSUS ATYPICAL MYROSINASES

In A. thaliana, glycoside hydrolase family 1 is composed of 47 BETA-GLUCOSIDASE (BGLU) genes and a BGLU-like gene, AFR2 (Xu et al., 2004). According to the crystal structure of a thioglucosidase from Sinapis alba, a few amino acid residues are conserved in myrosinases among a wide range of GLS-producing plants (Burmeister et al., 1997). For example, Gln and Glu residues within the catalytic site are considered to be essential for cleavage of the thioglucoside moiety. Thus, six genes (BGLU34–BGLU39) named THIOGLUCOSIDE GLUCOHYDROLASE (TGG) had previously been considered as the only class encoding myrosinases in A. thaliana. Myrosinases possessing these amino acid signatures have been found in a wide range of GLS-producing plants (Rask et al., 2000).

In 2009, however, it was revealed that PENETRATION 2 (PEN2)/BGLU26 is capable of hydrolyzing indole GLSs and that generation of the putative degradation products is critical for the plant immune response (Bednarek et al., 2009; Clay et al., 2009). Although the key Gln residue is replaced by Glu in PEN2, the recombinant PEN2 protein clearly showed myrosinase activity against indol-3-ylmethyl glucosinolate (I3G) and its 4-methoxy analog (Bednarek et al., 2009). Moreover, Nakano et al. (2014) demonstrated that PYK10/BGLU23 is a major component of the endoplasmic reticulum (ER) body, an organelle found primarily in the family Brassicaceae, and also functions as a myrosinase against I3G (Nakano et al., 2017). PYK10 also has Glu instead of the key Gln residue found in TGGs. Based on these findings, PEN2 and PYK10 were newly categorized as atypical or EE-type myrosinases, in contrast to TGG1–TGG6, which are referred to as classical or QE-type myrosinases ("EE" and "QE" represent their conserved amino acid residues).
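The QE/EE distinction described above amounts to a lookup on the pair of signature residues. The sketch below is illustrative only: the function name is hypothetical, and it assumes the two residues have already been read off an alignment and are supplied as one-letter codes; it does not locate the positions in a sequence.

```python
def myrosinase_type(residue_pair: str) -> str:
    """Map the two signature residues (one-letter codes) to a myrosinase class.

    "QE" (Gln, Glu) corresponds to classical myrosinases and "EE" (Glu, Glu)
    to atypical myrosinases, following the naming used in the text.
    """
    pair = residue_pair.strip().upper()
    if pair == "QE":
        return "classical (QE-type)"
    if pair == "EE":
        return "atypical (EE-type)"
    return "unclassified"


# Residue pairs as described in the text for selected A. thaliana enzymes.
signatures = {"TGG1": "QE", "PEN2": "EE", "PYK10": "EE"}
classified = {name: myrosinase_type(pair) for name, pair in signatures.items()}
```

Any other residue pair is reported as "unclassified" rather than forced into one of the two classes.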

It should be noted that the two Glu residues identified in PEN2 and PYK10 are conserved among the 16 genes named BGLU18–BGLU33 in A. thaliana (**Table 1**). A phylogenomic analysis of BGLUs from more than 50 plant species revealed that the monophyletic clade composed of these 16 BGLUs is specific for the order Brassicales (Nakano et al., 2017). These BGLU members lack the Gln and other amino acid signatures conserved in the classical myrosinases, whereas additional basic residues oriented to the deduced substrate-binding pocket occur only in this subclass. Detailed amino acid signatures in A. thaliana BGLUs are shown in Nakano et al. (2017), especially in their Figure 5 and Supplementary Figure S11. Considering the distinct amino acid signatures conserved in each BGLU subclass and their frequent emergence across plant species, it is suggested that the QE and EE myrosinases have arisen independently during evolution (Nakano et al., 2017). Not limited to PEN2 and PYK10, interestingly, transcriptional changes and mutations in some of these BGLUs have suggested their relevance in response to specific stresses (**Figure 2**, **Tables 1**, **2**, and **Supplementary Tables S2**, **S3**). It is also to be noted that GLS metabolism is affected independent of tissue disruption under those conditions (**Table 2**). Therefore, other members of this BGLU subclass may have myrosinase activities and regulate different machineries for GLS turnover that are perhaps more specialized in substrate selectivity, tissue localization and developmental stage, rather than the broad-scale chemical defense against herbivores deployed by classical myrosinases.

In this review, we aim to discuss the hypothesis that a wide range of these Brassicales-specific BGLUs can function as myrosinases to regulate the multiple functions of GLSs in planta. Although current insights into this BGLU subclass are highly limited, essentially three types of previous studies could be useful for considering this hypothesis: analyses of (1) the phenotype of mutants lacking genes responsible for the GLS metabolic/breakdown pathway; (2) the changes in GLS levels/metabolism under specific conditions; and (3) the response of plants treated with GLSs or their hydrolytic products (**Table 2**). Here, mainly based on studies in A. thaliana, we consider phenotypic information related to BGLU18–BGLU33, fluctuations in GLS metabolism, and the effects of exogenous GLS breakdown products on plants to review the physiological importance of GLS breakdown in particular situations. In addition to a specific focus on atypical myrosinases, more diverse functions of classical TGGs are described with reference to the most recent insights. Finally, we suggest the types of experiment that are effective in gaining a more complete understanding of the multi-functionality of GLSs in plants.

TABLE 1 | Current insights on BGLU18–BGLU33, putative EE-type myrosinases in Arabidopsis thaliana.

<sup>a</sup>Xu et al. (2004). <sup>b</sup>Glucosinolates are bolded. ABE-GE, abscisic acid glucosyl ester; 4MI3G, 4-methoxyindol-3-ylmethyl glucosinolate; I3G, indol-3-ylmethyl glucosinolate; 4-MUG, 4-methylumbelliferyl-beta-D-O-glucoside. <sup>1</sup>Lee et al. (2006), Nakazaki et al. (2019), Yamada et al. (2011), Ogasawara et al. (2009), Cao et al. (2017), <sup>6</sup>Ahn et al. (2010), Matsushima et al. (2003), Nagano et al. (2008), Schmid et al. (2005), Nakabayashi et al. (2005), Bednarek et al. (2009), Lipka et al. (2005), Hirai et al. (2003, 2004), Nikiforova et al. (2003), Maruyama-Nakashita et al. (2003), Fujiki et al. (2001), Lee J. et al. (2007), Xu et al. (2012).

# A PLANT IMMUNE PATHWAY REGULATED BY PEN2, THE FIRST ATYPICAL MYROSINASE

Here, we briefly describe the general understanding of the PEN2 pathway, as the relevance of PEN2 and indole GLSs in plant immunity has been well documented (Bednarek, 2012a; Johansson et al., 2014; Frerigmann et al., 2016; Pastorczyk and Bednarek, 2016; Xu et al., 2016). PEN2 was first identified as a component of the pre-invasive resistance in A. thaliana against the powdery mildew fungi, Blumeria graminis f. sp. hordei and Erysiphe pisi (Lipka et al., 2005; Bednarek et al., 2009). PEN2 is also responsible for the callose deposition in A. thaliana seedlings induced by fungal pathogens or flg22, a bacterial flagellin-derived peptide (Clay et al., 2009), even though it is suggested that penetration resistance and callose deposition are not directly linked (Lipka et al., 2005; Maeda et al., 2009; Hiruma et al., 2010). These stimuli induce a decrease of indole GLS levels mediated by PEN2 and the accumulation of plausible end products including indol-3-ylmethylamine (I3A), raphanusamic acid (RA), and 4-O-beta-D-glucosyl-indol-3-yl formamide (4OGlcI3F) (Bednarek et al., 2009, 2011; Lu et al., 2015). Notably, hydroxylation of I3G at position 4 mediated by CYP81F2 is a critical step for the PEN2-dependent immune response. This is an interesting example of a GLS with a particular side chain exhibiting a specialized biological function. On the other hand, the unstable indol-3-ylmethyl ITC can serve as an intermediate for the biosynthesis of several phytoalexins (Bednarek et al., 2009; Klein and Sattely, 2017). Thus, the molecular mechanisms and actual bioactive metabolite(s) relevant in this pathway remain to be further investigated.

In a recent study, a co-expression analysis for genes involved in the PEN2 immune system identified GLUTATHIONE S-TRANSFERASE CLASS-TAU MEMBER 13 (GSTU13) as a critical component of that pathway (Piślewska-Bednarek et al., 2018). The gstu13 mutants were more susceptible to several pathogens and were impaired in callose deposition induced by the bacterial flg22 epitope. Furthermore, the formation of pathogen-triggered specific metabolites such as indol-3-ylmethylamine was broadly repressed in these mutants. Therefore, the conjugation of ITCs with glutathione catalyzed by GSTU13 was revealed to be strictly essential for the activation of the indole GLS-related immune system.

# PYK10, BGLU18 AND ER-RETAINED BGLUs COULD MAINTAIN AN INTRACELLULAR GLS–MYROSINASE SYSTEM

In the GLS–myrosinase system, physical separation of glucose-conjugated precursor compounds from their hydrolases is an efficient means to control the bioactivity of these metabolites, and this strategy may function even at the subcellular level. The ER body, a rod-shaped organelle continuous with the ER, is considered to provide such an intracellular compartment; a few BGLUs including PYK10/BGLU23 are significantly enriched in these structures in A. thaliana (Matsushima et al., 2003; Ogasawara et al., 2009). Among the Brassicales-specific BGLUs, BGLU18–BGLU25 commonly have ER-retention signals in their signal peptides and C-terminal regions (Nakano et al., 2014). Indeed, ER bodies that are constitutively present in roots accumulate large amounts of PYK10 and the closest homologs BGLU21 and BGLU22, whereas BGLU18 is the major component of another class of ER body that is induced by wounding or methyl jasmonate treatment (Matsushima et al., 2002, 2003; Nagano et al., 2008; Ogasawara et al., 2009). More detailed information on the physiology and molecular network in ER bodies is available in specific reviews (Yamada et al., 2011; Nakano et al., 2014; Shirakawa and Hara-Nishimura, 2018).

TABLE 2 | Relevance of glucosinolates (GLSs) and the Brassicales-specific beta-glucosidases (BGLUs) in Arabidopsis thaliana under abiotic stress.

<sup>a</sup>AITC, allyl isothiocyanate; PEITC, phenethyl isothiocyanate. <sup>1</sup>Ren et al. (2009), Lee et al. (2006), Zhao et al. (2008), Islam et al. (2009), Khokon et al. (2011), <sup>5</sup>Martinez-Ballesta et al. (2015), Cao et al. (2017), Xu et al. (2012), Falk et al. (2007), Hirai et al. (2003, 2004), Maruyama-Nakashita et al. (2003), Nikiforova et al. (2003), Nour-Eldin et al. (2012), Huseby et al. (2013), Fujiki et al. (2001), Brandt et al. (2018), Haughn et al. (1991), Ludwig-Müller et al. (2000), Hara et al. (2012).

Notably, ER bodies have been observed in only a few families of the order Brassicales, namely, Brassicaceae, Capparaceae, and Cleomaceae. Therefore, substrate(s) of the BGLUs could also be restricted to a narrow range of phylogenetic clades. Although PYK10, BGLU21, and BGLU22 neither have the amino acid signatures conserved in classical myrosinases nor display myrosinase activities toward aliphatic allyl GLS (sinigrin) in vitro (Ahn et al., 2010), previous studies on PEN2 led Nakano and co-authors to hypothesize that PYK10 can hydrolyze indole GLSs. As expected, root protein extracts of the pyk10 and several ER-body mutants in A. thaliana, as well as the recombinant PYK10 protein, showed myrosinase activity against I3G (Nakano et al., 2017). Interestingly, co-expression analysis revealed that PYK10 is more closely related to biosynthetic genes for indole GLSs than to those of coumarin glucosides, putative substrates predicted from in vitro assays (Ahn et al., 2010; Nakano et al., 2017). Therefore, the physiological function of PYK10 in planta is more likely to be associated with GLS metabolism. Furthermore, a suite of informatics analyses performed in Nakano et al. (2017) indicated that a broad class of BGLUs, not limited to the classical myrosinases, may have the potential to hydrolyze GLSs, as described above.
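The logic of such a co-expression comparison can be illustrated with a minimal sketch. It assumes that co-expression is scored as the Pearson correlation between expression profiles across samples, which is one common choice and not necessarily the measure used in the cited studies. The expression values and the two partner gene labels (`indole_GLS_gene`, `coumarin_gene`) are invented purely for illustration and are not measured data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Hypothetical expression values across six samples (invented numbers).
profiles = {
    "PYK10": [8.0, 9.5, 3.0, 2.5, 7.0, 9.0],
    "indole_GLS_gene": [7.5, 9.0, 3.5, 2.0, 6.5, 8.5],
    "coumarin_gene": [4.0, 3.5, 5.0, 4.5, 3.0, 5.5],
}


def rank_partners(query, profiles):
    """Rank all other genes by their correlation with the query gene."""
    scores = {
        name: pearson(profiles[query], profile)
        for name, profile in profiles.items()
        if name != query
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_partners("PYK10", profiles)
for name, score in ranking:
    print(f"{name}\t{score:.3f}")
```

With these toy values the hypothetical indole GLS gene ranks above the hypothetical coumarin gene, which mirrors the type of evidence described above: a candidate substrate class is favored when its biosynthetic genes track the expression of the enzyme more closely.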

The myrosinase activity of PYK10 toward indole GLSs is further supported by a very recent study on a new class of ER bodies, referred to as leaf ER bodies (Nakazaki et al., 2019). In contrast to the aforementioned known classes of ER bodies, leaf ER bodies occur constitutively in a few types of epidermal cells in rosette leaves and accumulate both PYK10 and BGLU18. The pyk10 bglu18 mutant was shown to lack leaf ER bodies and became more susceptible to attack by the terrestrial isopod Armadillidium vulgare compared with the wild-type plants (Nakazaki et al., 2019). In addition, the levels of most endogenous GLS species decrease rapidly in homogenates of the rosette leaves, mainly due to the activities of TGG1 and TGG2, whereas degradation of 4-methoxyindol-3-ylmethyl glucosinolate (4MI3G) was found to be selectively and significantly delayed in the pyk10 bglu18 mutant (Nakazaki et al., 2019). 4MI3G is a key GLS species in the PEN2-dependent immune pathway (Bednarek et al., 2009; Clay et al., 2009). Given that damage to leaves of the tgg1 tgg2 mutant caused by A. vulgare in a feeding assay was comparable to that in the wild type, it has been suggested that leaf ER bodies are involved in the production of the defensive chemicals from 4MI3G that protect A. thaliana leaves against herbivore attack. Since neither pyk10 nor bglu18 single mutants were examined in these experiments, it would be of interest to determine whether BGLU18 can also hydrolyze indole GLSs. In combination with previous findings, these results suggest that ER bodies can provide a further class of GLS–myrosinase compartment at the subcellular level (i.e., GLSs retained in vacuoles and myrosinases retained in the ER bodies) that plays an important role in plant defense against attacks of herbivores and pathogens.

# DO BGLU18 AND BGLU33 HAVE OTHER SUBSTRATES IN ADDITION TO GLUCOSE-CONJUGATED ABSCISIC ACID?

In addition to its potential roles in the ER bodies, BGLU18 is known to participate in abscisic acid (ABA) metabolism. ABA, one of the most important phytohormones active during a plant's life cycle, is involved in a variety of biological processes, including the adaptation to environmental stresses (Sakata et al., 2014; Daszkowska-Golec, 2016). The cellular ABA level is partially regulated via a complex de novo biosynthetic pathway (Nambara and Marion-Poll, 2005). In addition, the discovery of ABA-glucosyltransferase, which generates an ABA glucosyl ester (ABA-GE), led us to hypothesize that release of ABA from the pool of inactive ABA analogs can potentially modulate ABA concentrations more dynamically (Xu et al., 2002). BG1/BGLU18 was reported as the first enzyme that can hydrolyze ABA-GE as part of the drought stress response (Lee et al., 2006). Subsequently, BG2/BGLU33 was identified as a further member of the ABA-GE hydrolases localized in vacuoles and was shown to play an important role under conditions of salt stress (Xu et al., 2012). Although the enzymatic potential of these BGLUs to hydrolyze ABA-GE and their relevance in ABA functions have been well investigated by independent groups (Merilo et al., 2015; Ondzighi-Assoume et al., 2016; Yamashita et al., 2016), it is questionable whether ABA-GE is the sole substrate of these BGLUs under physiological conditions, based on a consideration of the following observations.

In the first place, the response to drought and salinity is subject to complex regulation, not only by ABA but also by other small molecules including GLSs. Their contribution to the drought stress response has been discussed with regard to stomatal closure in guard cells, which is the major response of plants under low water conditions. The signal cascade for stomatal movement is initiated by multiple inputs from plant hormones and several primary/specialized metabolites in response to environmental stresses, which are finally integrated into a single output: the production of reactive oxygen species (ROS) and Ca<sup>2+</sup> oscillation followed by protein phosphorylation (Murata et al., 2015). Hence, depletion of one signal molecule could be compensated by the function of another compound. Notably, application of exogenous allyl ITC or several GLS breakdown products has been shown to induce stomatal closure in A. thaliana leaves (Khokon et al., 2011; Hossain et al., 2013). Furthermore, TGG1 and TGG2 have been reported to be major components of guard cells and shown to be involved in stomatal movement (Zhao et al., 2008; Islam et al., 2009). The stomatal closure induced by allyl ITC and the subsequent ROS production does not require endogenous ABA, but is dependent on methyl jasmonate (Khokon et al., 2011), indicating that ABA and ITC function as independent inputs for stomatal movement. Some researchers have hypothesized that the accumulation of TGGs and GLSs in guard cells represents the evolutionary origin of the GLS–myrosinase system, because stomata can serve as the initial gateway to bacterial invasion (Shirakawa and Hara-Nishimura, 2018). However, whether breakdown of internal GLSs in guard cells occurs under drought conditions to induce stomatal closure and the relevance of BGLU18 in this process are still to be investigated. Involvement of GLSs in the salt stress response has also been suggested based on the findings of several studies that have examined the fluctuation of GLS contents in Brassicaceae plants and the response of A. thaliana mutants lacking aliphatic GLSs under salinity stress (López-Berenguer et al., 2008; Keling and Zhu, 2010; Martinez-Ballesta et al., 2015), even though the detailed mechanisms remain unclear.

Secondly, the EE-type myrosinases tend to have a dual function as S- and O-glucosidases. PYK10 has been reported to hydrolyze coumarin glucosides such as scopolin in vitro (Ahn et al., 2010), whereas co-expression analysis has indicated that indole GLSs are more likely to be the actual substrates in planta (Nakano et al., 2017). PEN2 is also known to have enzymatic potential to catalyze the deglycosylation of 4-methylumbelliferyl-beta-D-O-glucoside, albeit at a lower rate than for the hydrolysis of indole GLSs (Bednarek et al., 2009). Drought stress appears to promote the degradation of a wide range of GLS species not limited to indole GLSs (Ren et al., 2009). One possible reason is that GLSs are hydrolyzed after transportation to the compartments/cells containing classical myrosinases. Alternatively, as is the case for GLS degradation induced by sulfur depletion or prolonged darkness (see the following sections), it is also conceivable that EE-type myrosinases, which may include BGLU18 and BGLU33, could hydrolyze other GLS subclasses.

Thirdly, whereas ABA is one of the indispensable hormone compounds in the plant kingdom, the BGLU subclass containing BGLU18 and BGLU33 is distributed only in the order Brassicales (Xu et al., 2004; Nakano et al., 2017). Notably, BGLU18 is a major and essential component of ER bodies, an organelle observed in only a few families of the order Brassicales, as described above. One possibility is that BGLU18 and BGLU33 may have other substrate(s) restricted to these small evolutionary clades, like PYK10. Another possible explanation is that there are several molecular systems to release ABA from the repository of inactive ABA derivatives, and these BGLUs may belong to one of those conserved only in the Brassicales. As a similar case, in some Brassicaceae plants, the metabolism of indole GLSs is closely related to that of auxin, another essential phytohormone (Malka and Cheng, 2017; Vik et al., 2018). Although similar machineries for dynamic regulation of ABA levels may exist in a wide range of plant families, such enzymes hydrolyzing inactive ABA derivatives have yet to be identified.

Taken together, the aforementioned findings indicate that we should not exclude the hypothesis that BGLU18 and BGLU33 can also function as myrosinases. If these BGLUs were proven to be dual-functional, the regulatory mechanism of these enzymatic activities in planta would inevitably attract greater attention.

# GLS BREAKDOWN TO RECYCLE SULFUR MAY BE MEDIATED BY BGLU28 AND BGLU30

This and the next section discuss the possibility that GLS itself could work as nutrient storage, not only as a precursor of toxic chemicals and signaling molecules. Brassicaceae plants containing GLSs need larger amounts of sulfur than other plants (Castro et al., 2003; Walker and Booth, 2003; Yang et al., 2006). Since GLSs have at least two sulfur atoms in each molecule and represent 10–30% of the total sulfur content in plant organs, these metabolites have been considered a potential source of sulfur for other metabolic processes (Falk et al., 2007). The effects of sulfur supply and depletion on GLS levels in plants have been well studied, including in agricultural practice (Falk et al., 2007; Schnug and Haneklaus, 2016). In A. thaliana, sulfur deficiency strongly induces down-regulation of GLS content as well as expression levels of GLS biosynthetic genes (Hirai et al., 2005; Zhang et al., 2011). To date, a few transcription factors, SULFUR LIMITATION 1 (SLIM1), SULFUR DEFICIENCY-INDUCED 1 (SDI1) and SDI2, have been reported to work as repressors of GLS biosynthesis under low sulfur conditions (Maruyama-Nakashita et al., 2006; Aarabi et al., 2016). In addition, developmental defects of GLS-less seeds as a result of mutations in GLS transporters (gtr1 gtr2) under sulfur deficiency further supported the potential role of GLSs as a sulfur reserve (Nour-Eldin et al., 2012). Nevertheless, most mechanisms underlying GLS turnover in that condition are still not known. It would be desirable to gain more direct evidence for the re-distribution of sulfur atoms from GLSs, e.g., by incorporating isotopes into primary sulfur metabolites from labeled GLSs.

Based on their observed up-regulation, BGLU28 and BGLU30 are suggested to be relevant in the hydrolysis of GLSs caused by sulfur depletion (Hirai et al., 2003, 2004; Maruyama-Nakashita et al., 2003; Nikiforova et al., 2003). BGLU28 and BGLU30 form a sister clade close to those containing PEN2 and PYK10 in the phylogenetic tree of A. thaliana BGLUs (Xu et al., 2004). Although their possible contribution to GLS turnover was suggested more than 15 years ago (Hirai and Saito, 2004), there have been few reports on the physiological functions of these genes. To our knowledge, only two studies (Zhang et al., 2014; Jackson et al., 2015) have monitored BGLU28 promoter activity, using the GUS reporter gene as a marker of low-sulfur response upon plant hormone treatment and mutations in SULTR1;2. In this context, it should be noted that sulfur deficiency primarily affects contents of aliphatic GLSs, whereas PYK10 and PEN2 have been reported to hydrolyze only indole GLSs. If BGLU28 and BGLU30 can indeed hydrolyze aliphatic GLSs, the broad chemodiversity of EE-type myrosinases would undoubtedly gain more recognition. Compared with the TGGs that exhibit low selectivity for GLS species (Zhou et al., 2012), EE-type myrosinases may have narrower substrate specificities and thus regulate only particular biotic/abiotic stress responses. Both the protein functions and physiological roles of these BGLUs need to be further investigated beyond the context of sulfur assimilation.

# ACTIVATION OF GLS TURNOVER IN DARKNESS AND POSSIBLE CONTRIBUTION OF BGLU30 TO RECOVER CARBOHYDRATES

Light is an essential energy source for the development and metabolism of higher plants, and the expression levels of a number of genes are known to be regulated by light; for example, the sulfur assimilation pathway is activated during the light period (Kocsy et al., 1997; Kopriva et al., 1999; Pruneda-Paz and Kay, 2010). Although the photo-regulation of sulfur assimilation has been investigated in different species under a variety of growth conditions (Buwalda et al., 1988; Passera et al., 1989; Lee E.-J. et al., 2007; Lee et al., 2011), there tends to be little consensus regarding its coordination with GLS biosynthesis/catabolism.

The regulation of GLS metabolism by light has been discussed from two aspects, namely, the substantial degradation of GLSs under conditions of prolonged darkness and the diurnal fluctuation of GLS contents. In A. thaliana, there is a marked reduction in the GLS content of leaves under conditions of extended darkness (Huseby et al., 2013; Brandt et al., 2018). Given that inhibition of photosynthesis is a strong stimulus inducing carbohydrate starvation and leaf senescence, GLS degradation may be a mechanism designed to cope with nutrient starvation via the mobilization of D-glucose units from those molecules. In this regard, BGLU30, also referred to as DARK-INDUCIBLE 2 (DIN2) or SENESCENCE-RELATED GENE 2 (SRG2), is known to be significantly up-regulated in response to prolonged darkness, senescence, and sugar starvation (Fujiki et al., 2001; Lee J. et al., 2007). The key factor for the induction of BGLU30 expression appears to be endogenous sugar levels rather than light conditions. A high abundance of BGLU30 transcript was detected in detached leaves even under illumination in the presence of a photosynthesis inhibitor, 3-(3,4-dichlorophenyl)-1,1-dimethylurea, whereas it was barely detected under co-treatment with sucrose (Fujiki et al., 2001). Given that the BGLU30 expression is also induced by sulfur depletion, this enzyme may regulate a release of stored GLSs to overcome nutrient starvation under various conditions. Notably, the expression of BGLU28 is induced neither by prolonged darkness nor by sugar starvation. In addition to their enzymatic functions, the difference in the regulatory systems between these closely related BGLUs is of particular interest.

In concert with the sulfur assimilation pathways, GLS levels have been shown to be higher during the day than at night in A. thaliana (Huseby et al., 2013). Simultaneously, the expression levels of GLS biosynthetic genes as well as the incorporation of inorganic <sup>35</sup>S into GLSs were found to be enhanced by light. Moreover, a further study revealed a diurnal increase in total myrosinase activity and the abundance of TGG1 and TGG2 proteins in A. thaliana seedlings (Brandt et al., 2018). These findings accordingly indicate that GLS metabolism is highly co-regulated with sulfur assimilation in response to the circadian rhythm, even though the physiological importance of this phenomenon remains unclear. Since the correlation between GLS contents and the expression levels of GLS biosynthetic genes was relatively low during the light period (Rosa, 1997; Klein et al., 2006; Schuster et al., 2006), not only de novo biosynthesis but also turnover could regulate the diurnal rhythm of endogenous GLSs. In this context, however, a contribution of BGLU30 to GLS degradation is less likely because an increase of BGLU30 transcripts was observed only after 12 h or longer of dark treatment (Fujiki et al., 2001; Lee J. et al., 2007). In addition to the transcriptomic changes of other BGLUs, post-translational regulation of myrosinase activities, including those of TGG1 and TGG2 (Brandt et al., 2018), should be considered in order to gain a better understanding of the dynamic control of GLS contents over a 24-h period.

# ADDITIONAL ROLES OF CLASSICAL TGGs BEYOND THE "MUSTARD OIL BOMB"

Several new findings on members of the QE-type myrosinases (TGG1-TGG6) have indicated their broader physiological importance in various situations, beyond the classical intercellular "mustard oil bomb" system deployed against predators. As described above, TGG1 and TGG2 might be involved in guard cell ITC production (Zhao et al., 2008; Islam et al., 2009) and the diurnal control of GLS levels in the absence of tissue disruption (Brandt et al., 2018). In this section we discuss two recent studies that have reported the detailed analysis of root-specific TGGs and the function of TGG6, which had hitherto been considered a pseudogene (Fu et al., 2016a,b).

Currently, QE-type myrosinases are classified into two subclasses, namely, Myr I and Myr II (Wang et al., 2009a). Members of subclass Myr I are found in all GLS-containing plants and are typically deposited in myrosin cells, thereby establishing the compartmentalization required for the intercellular GLS–myrosinase system (Rask et al., 2000). Thus, Myr I class myrosinases are considered to be critical for biochemical defense against herbivores. Myr II members differ from those in subclass Myr I with respect to several features, including sequence divergence and gene structure (Wang et al., 2009a; Nong et al., 2010; Fu et al., 2016a). Functional analysis of TGG4 and TGG5, the first examples of the Myr II subfamily to be examined, indicated that their root-specific roles differ from those of leaf-localized Myr I members. A comparison of the enzymatic properties of TGG4 and TGG5 with those of TGG1 using recombinant proteins expressed in Pichia pastoris revealed that TGG4 and TGG5 have higher stability than TGG1 under adverse conditions, such as high temperature, low pH, and excess NaCl (Andersson et al., 2009). In a more recent study, Fu et al. investigated the tissue localization, regulation of root growth, and possible contribution to auxin biosynthesis of TGG4 and TGG5 (Fu et al., 2016b). Analyses of GUS reporter gene expression and myrosinase activities of the single and double KO mutants tended to indicate that TGG5 is more predominant in roots. The defective root elongation under flooded conditions and expression patterns of the auxin-responsive DR5:GUS reporter system in these mutants indicated that TGG4 and TGG5 may contribute to auxin biosynthesis at the root tip by hydrolyzing indole GLSs to form indole-3-acetonitrile, a direct precursor of indole-3-acetic acid, even though their enzymatic activities against indole GLSs remain to be confirmed. Given that the aforementioned experiments were performed under noninvasive conditions, it is conceivable that GLS breakdown by the Myr II myrosinases may be less dependent on tissue damage. Hence, not only EE-type but also classical myrosinases could regulate subcellular GLS–myrosinase systems in addition to the so-called "mustard oil bomb." Our understanding in this regard will be improved by single-cell-level analysis of the specialized compartments in different cell types, as reviewed by Chen et al. (2019) in this issue.

A further surprising finding is that TGG6, which had previously been reported to be a pseudogene but is specifically expressed in pollen (Wang et al., 2009b), is still functional in a number of A. thaliana accessions (Fu et al., 2016a). The authors identified 10 functional alleles of TGG6 from 29 accessions and the recombinant TGG6 derived from Tsu-1 showed a clear myrosinase activity against sinigrin. The predominant expression pattern of functional TGG6 alleles in pollen was relatively similar to that of the non-functional TGG6 in Col-0. Given that an ortholog of TGG6 is predicted to be functional in Arabidopsis lyrata, an outcrossing relative of A. thaliana (Kusaba et al., 2001; Sherman-Broyles et al., 2007; Tang et al., 2007), it is suggested that its ancestral role was the defense of pollen against herbivores. However, subsequent evolutionary acquisition of a self-fertilization system rendered it no longer critical in A. thaliana, thereby resulting in a loss of function in most accessions. A hypothesis proposed based on the findings on TGG6 is that BGLUs with low expression levels in Col-0, the most studied accession of A. thaliana, could still be functional in other accessions. As GLS compositions have become significantly differentiated during evolution even within the same species (Kliebenstein et al., 2001; Edger et al., 2015; Kliebenstein and Cacho, 2016; Barco and Clay, 2019), it is possible that a BGLU plays a critical role in a few accessions but is not important in others. Accession- and species-wide analysis of the same BGLU orthologs may help us to gain a better understanding of the specific functions of these enzymes in planta and how they have acquired these specific roles during the course of evolution.

# CURRENT UNDERSTANDING OF THE OTHER BRASSICALES-SPECIFIC BGLUs

Other than the BGLUs described above, current insights on the Brassicales-specific BGLUs are highly limited. To get an overview of BGLU18–BGLU33, we performed a public data analysis using ePlant<sup>1</sup> (Waese et al., 2017) (**Figure 2**). In many cases, only a few publications were extracted when each BGLU was used as a query (**Figure 2A**). Our lack of knowledge on several BGLUs is probably due to their almost undetectable expression. Using the Plant eFP viewer, we can see that signal levels of these BGLUs in the Affymetrix ATH1 microarray are very low in almost all tissues, except for BGLU18 and PEN2 (**Figure 2B**). Although a low signal does not necessarily mean a low abundance of the actual transcript, it hinders many biological analyses. Some BGLUs instead exhibit very specific expression patterns in particular tissue(s), e.g., BGLU19 in mature seeds (**Figure 2B** and **Supplementary Table S1**). Moreover, expression levels of each BGLU under broad abiotic stresses are summarized (**Figure 2C** and **Supplementary Tables S2**, **S3**). In addition to the known information such as up-regulation of BGLU18 by drought or wounding, we can expect drastic changes in the expression of uncharacterized BGLUs in response to specific stresses, such as the highly increased expression of BGLU24 in roots as a result of osmotic and salt stress. It should also be noted that according to ATTED-II<sup>2</sup> (Obayashi et al., 2018), several BGLUs show high co-expression scores with specifier proteins: BGLU19 with NSP2, BGLU30 with NSP5, and PYK10 with NSP1/NSP3/NSP4 (cross-hybridized to the same probe). Broad end products might be generated even in the atypical myrosinase-mediated GLS breakdown. Hence, the large body of previously collected public data could help us hypothesize a specific relevance of these BGLU(s) in particular developmental stages or abiotic stress responses.

# CONCLUDING REMARKS AND FUTURE PERSPECTIVE

In the past decade, subsequent to the identification of PEN2 in 2009, only PYK10 has been reported as a further member of the EE-type myrosinases. As reviewed here, however, this does not exclude the possible contribution of Brassicales-specific BGLUs to GLS breakdown under specific conditions. In particular, discovery of EE-type myrosinases catalyzing the hydrolysis of aliphatic GLSs in addition to indole GLSs would generate heightened interest amongst researchers in their physiological importance and catalytic mechanisms, compared with the classical myrosinases. In addition, recent studies have demonstrated that classical myrosinases such as TGGs can also participate in non-tissue-disruptive GLS breakdown beyond the well-known "mustard oil bomb" system at the tissue level. Furthermore, we should pay attention to the accession-wide functional differentiation of the same ortholog within a single species, as highlighted in the case of TGG6. For example, substrate specificity may be dependent on the GLS composition of an accession. It may also be possible that a BGLU has evolved to regulate specialized signals initiated by GLS species present in only a few accessions. Addressing the broad distribution and different myrosinase activities of BGLUs in Brassicaceae and closely related families will enable us to elucidate how the multi-functionality of GLSs is controlled in planta, and how the GLS–myrosinase systems have diversified during evolution.

<sup>1</sup>https://bar.utoronto.ca/eplant/

<sup>2</sup>http://atted.jp

In vitro enzymatic assays using recombinant proteins would be helpful in determining the physiological functions of these enzymes in planta. In this regard, Pichia pastoris and tobacco BY-2 cells seem to be the preferable hosts in which to express enzymatically functional A. thaliana myrosinases, according to previous studies (Andersson et al., 2009; Bednarek et al., 2009; Fu et al., 2016a; Nakano et al., 2017). However, we should bear in mind the fact that the myrosinase assay using sinigrin, an easily available substrate, may not identify (and perhaps has not identified) the actual enzymatic potential of candidate BGLUs of interest. As the classical myrosinases tend to hydrolyze a diverse range of GLS structures (Zhou et al., 2012), most studies have examined the "myrosinase activity" using only sinigrin, a GLS with a simple allyl chain. However, the EE-type myrosinases may have a restricted substrate selectivity and require particular conditions for optimal activity, as emphasized by the findings for PEN2 and PYK10, which preferentially hydrolyze indole GLSs within a narrow optimal pH range (Bednarek et al., 2009; Nakano et al., 2017). It is also notable that sinigrin is detected only in certain A. thaliana accessions other than Col-0 (Kliebenstein et al., 2001). Since the side chain structures of GLSs can substantially alter the physicochemical properties of the corresponding degradation products such as ITCs, it would be preferable to examine the activity of myrosinases against a broader range of GLS species to establish the physiological importance of the BGLUs of interest. Recent advances in the methods for extraction of intact GLSs from plant materials and quantitative analysis of GLS contents may contribute to promoting this approach (Bianco et al., 2017; Doheny-Adams et al., 2017).

In addition to the abiotic stresses discussed herein, there are a few physiological conditions that are potentially related to GLS metabolism and catabolism (**Table 2**). For example, pretreatment with phenethyl ITC confers heat stress tolerance on A. thaliana seedlings, probably by up-regulating a suite of heat-shock proteins (Hara et al., 2012; Kissen et al., 2016). In addition, the mechanisms underlying the TGG1- and TGG2-independent decrease in total GLS content during early developmental stages remain to be clarified (Barth and Jander, 2006). Given that up-regulation of particular BGLUs has yet to be observed under these conditions, we should consider the post-translational regulation of myrosinase activities with regard to myrosinase-associated proteins or small molecule elicitors such as ascorbate (Wittstock et al., 2016a; Bhat and Vyas, 2019; Chen et al., 2019). Under non-disruptive conditions, the physiological functions of myrosinases, including TGGs, are probably controlled more strictly and dynamically than previously expected.

# AUTHOR CONTRIBUTIONS

RS and MH prepared the manuscript. RS prepared the figures and tables. MH finalized the manuscript for submission.

# FUNDING

This work was supported by the JSPS KAKENHI Grant-in-Aid for Early-Career Scientists (No. 18K14348 to RS) and the RIKEN Special Postdoctoral Researcher Program (to RS).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.01008/full#supplementary-material

# REFERENCES

expression in response to sulfur nutrition. Plant J. 33, 651–663. doi: 10.1046/j.1365-313x.2003.01658.x

to Magnaporthe oryzae in Arabidopsis thaliana. Mol. Plant Microbe Interact. 22, 1331–1340. doi: 10.1094/MPMI-22-11-1331



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Sugiyama and Hirai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Molecular Origins of Functional Diversity in Benzylisoquinoline Alkaloid Methyltransferases

*Jeremy S. Morris and Peter J. Facchini\**

*Department of Biological Sciences, University of Calgary, Calgary, AB, Canada*

*O*- and *N*-methylations are ubiquitous and recurring features in the biosynthesis of many specialized metabolites. Accordingly, the methyltransferase (MT) enzymes catalyzing these modifications are directly responsible for a substantial fraction of the vast chemodiversity observed in plants. Enabled by DNA sequencing and synthesizing technologies, recent studies have revealed and experimentally validated the trajectories of molecular evolution through which MTs, such as those biosynthesizing caffeine, emerge and shape plant chemistry. Despite these advances, the evolutionary origins of many other alkaloid MTs are still unclear. Focusing on benzylisoquinoline alkaloid (BIA)-producing plants such as opium poppy, we review the functional breadth of BIA *N*- and *O*-MT enzymes and their relationship with the chemical diversity of their host species. Drawing on recent structural studies, we discuss newfound insight regarding the molecular determinants of BIA MT function and highlight key hypotheses to be tested. We explore what is known and suspected concerning the evolutionary histories of BIA MTs and show that substantial advances in this domain are within reach. This new knowledge is expected to greatly enhance our conceptual understanding of the evolutionary origins of specialized metabolism.

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Joerg Ziegler, Leibniz-Institut für Pflanzenbiochemie (IPB), Germany Dietrich Ober, University of Kiel, Germany*

#### *\*Correspondence:*

*Peter J. Facchini pfacchin@ucalgary.ca*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 26 April 2019 Accepted: 30 July 2019 Published: 30 August 2019*

#### *Citation:*

*Morris JS and Facchini PJ (2019) Molecular Origins of Functional Diversity in Benzylisoquinoline Alkaloid Methyltransferases. Front. Plant Sci. 10:1058. doi: 10.3389/fpls.2019.01058*

Keywords: benzylisoquinoline, alkaloid, methyltransferase, specialized metabolism, molecular evolution

# INTRODUCTION

The incredible diversity of plant metabolism has been a topic of fascination for centuries, and yet our appreciation for its scope continues to grow. A recent study examined more than a hundred thousand metabolite–plant species relationships and concluded that each species contains, on average, 4.7 unique metabolites, which sum to an estimate of more than 1 million distinct metabolites across the kingdom (Afendi et al., 2012). An earlier analysis of a smaller database calculated the existence of only 1.6 unique metabolites per plant species, suggesting that our estimates of the breadth of plant metabolism will continue to grow as we collect more data (Shinbo et al., 2006). These numbers are in line with previous estimates ranging from 200,000 to 1,000,000 plant metabolites in total (Saito and Matsuda, 2009).

Alkaloids, broadly defined as low-molecular-weight heterocyclic nitrogenous compounds, are thought to occur in roughly 20% of plant species (Ziegler and Facchini, 2008). At least 12,000 unique molecules of this class are known, which can be classified as protoalkaloids (e.g., mescaline, ephedrine), where the nitrogen is not cyclic; pseudoalkaloids (e.g., steroidal and diterpene alkaloids, caffeine), where the primary biosynthetic origin is not an amino acid; or "true" alkaloids, where most of the molecule, including the heterocyclic nitrogen, is derived from an amino acid precursor (Hegnauer, 1988; Waterman, 1998). This latter group is the most diverse and includes biosynthetic end products derived from phenylalanine, tyrosine, tryptophan, ornithine, arginine, lysine, histidine, and anthranilic acid.

Although the foundation of plant alkaloid chemical diversity begins with the combination and rearrangement of the aforementioned building blocks, each basic carbon skeleton can give rise to a great number of "decorated" variants with various functional group substitutions that alter the molecule's biochemical characteristics. For example, *N*-methylation of xanthine/xanthosine during biosynthesis of caffeine allows for the production of up to seven differentially methylated products (Huang et al., 2016). Similarly, the potential for two *N*-methyl and four *O*-methyl groups on simple benzylisoquinoline alkaloids (BIAs) such as norlaudanosoline makes up to 30 distinct molecules possible. The addition of methyl groups to an alkaloid molecule can have important consequences regarding its chemical properties and thus shift biological activity. Methylation can invert the polarity of an electronegative moiety, shift the molecule's stereoelectronic profile, increase overall hydrophobicity, increase steric bulk, and promote or prevent certain conformations of the molecule (Wessjohann et al., 2014). In the extreme case, methylation of a tertiary amine results in a quaternary ammonium cation, which is substantially more hydrophilic and lipophobic. By contrast, *O*-methylation of the monoterpene indole alkaloid noribogaine results in a much less polar compound (ibogaine), which is readily sequestered to lipophilic compartments of the mammalian brain (Zubaran, 2000). The *O*-methylated molecule displays differential binding to neurotransmitter receptors versus the parent compound, resulting in substantially greater toxicity as measured by the dose required to induce tremors and cerebellar damage. Similarly, the *O*-methylated BIA thebaine is a much stronger stimulant and a much less effective painkiller than morphine, which is fully *O*-demethylated (Navarro and Elliott, 1971).
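The count of seven methylated xanthine products follows from simple combinatorics: with three methylatable ring nitrogens (N1, N3, and N7), there are 2³ − 1 = 7 non-empty methylation patterns. The following minimal Python sketch enumerates them. The function name and data layout are illustrative assumptions and are not taken from the cited studies, and the enumeration ignores any chemical or enzymatic constraint on which patterns actually occur in a given plant.

```python
from itertools import combinations

# Ring nitrogens of xanthine that can carry a methyl group.
N_POSITIONS = (1, 3, 7)

# Trivial names of the corresponding methylxanthines.
COMMON_NAMES = {
    (1,): "1-methylxanthine",
    (3,): "3-methylxanthine",
    (7,): "7-methylxanthine",
    (1, 3): "theophylline",
    (1, 7): "paraxanthine",
    (3, 7): "theobromine",
    (1, 3, 7): "caffeine",
}


def methylated_variants(positions):
    """Return every non-empty subset of positions (one subset per methylated variant)."""
    variants = []
    for k in range(1, len(positions) + 1):
        variants.extend(combinations(positions, k))
    return variants


variants = methylated_variants(N_POSITIONS)
for pattern in variants:
    print(pattern, COMMON_NAMES[pattern])
print(len(variants), "differentially methylated products")
```

The same subset-counting logic applies to any scaffold, but the number of variants that are chemically meaningful depends on which positions are treated as independent.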

# Plant Alkaloid *O*-Methyltransferases

Underlying the massive number of differentially methylated plant alkaloids is a large and heterogeneous group of methyltransferase (MT) enzymes thought to be specialized for various substrates (**Figure 1**). Several *O*-methyltransferase (OMT) enzymes, which participate in the terminal steps of monoterpene indole alkaloid biosynthesis, have been identified and cloned, including an OMT leading to the production of vindoline in *Catharanthus roseus* (Levac et al., 2008), an OMT producing ibogaine in *Tabernanthe iboga* (Farrow et al., 2018) and a 10-hydroxycamptothecin OMT from *Camptotheca acuminata*  (Salim et al., 2018). Three OMTs contributing to the biosynthesis of monoterpene isoquinoline alkaloids such as emetine have been cloned from *Psychotria ipecacuanha* (Nomura and Kutchan, 2010). Studies on the biosynthesis of Amaryllidaceae alkaloids such as galanthamine in *Narcissus* spp. and *Lycoris aurea* allowed for the isolation of two norbelladine 4′OMT enzymes (Kilgore et al., 2014; Sun et al., 2018). Although not traditionally included in most lists of alkaloids due to an unclear biosynthetic origin, the volatile heterocyclic nitrogenous methoxypyrazines, which contribute to the flavor profile of grapes, also require *O*-methylation in their biosynthesis. To date, four *Vitis vinifera* OMTs implicated in this pathway have been cloned (Dunlevy et al., 2010, Dunlevy et al., 2013; Guillaumie et al., 2013). Quite a few OMTs implicated in benzylisoquinoline alkaloid biosynthesis have been cloned, and these will be reviewed in a dedicated section below.

# Plant Alkaloid *N*-Methyltransferases

Particularly well studied are *N*-methyltransferases (NMTs), which catalyze the terminal biosynthetic steps producing xanthine alkaloids (e.g., caffeine) in various plants. Cloned representatives include caffeine synthase, theobromine synthase and 7-methylxanthosine synthase from *Coffea arabica* (Mizuno et al., 2003a; Mizuno et al., 2003b) as well as homologs in *Camellia*, *Theobroma*, *Paullinia*, and *Citrus* (Kato et al., 2000; Yoneyama et al., 2006; Schimpl et al., 2014; Huang et al., 2016). Cloned NMTs contributing to the biosynthesis of monoterpene indole alkaloids include one from *C. roseus* leading to the production of vindoline (Liscombe et al., 2010) and two related picrinine NMTs from Apocynaceae species (Levac et al., 2016). NMTs from less intensively studied protoalkaloid pathways have also been cloned, such as those implicated in the biosynthesis of gramine in *Hordeum vulgare* (Larsson et al., 2006) and ephedrine in *Ephedra sinica* (Morris et al., 2018). The large number of BIA NMTs that have been characterized at the molecular level will be discussed further in the body of this work.

Although most known alkaloid MTs contribute to the final stages of biosynthesis, this is not a firm rule. Putrescine NMT (PMT), which synthesizes *N*-methylputrescine, catalyzes the first step in pathways leading to several alkaloid classes including the pyridines (e.g., nicotine), tropane alkaloids (e.g., scopolamine), and calystegines in various plant species (Biastoff et al., 2009). A large number of PMTs have been cloned, including those from *Nicotiana tabacum*, *Solanum tuberosum*, and various other Solanaceae and Convolvulaceae species (Hibi, 1994; Stenzel et al., 2006; Teuber et al., 2007; Junker et al., 2013). Similarly, an anthranilate NMT diverts metabolic flux away from tryptophan biosynthesis into the acridone alkaloid biosynthetic pathway and has been cloned from *Ruta graveolens* (Rohde et al., 2007).

Despite the ever-growing list of characterized and cloned MTs, there remain many biosynthetic pathways to explore. For example, *O*- and *N*-methylation of phenethylamine alkaloids in many lineages including the Cactaceae (e.g., mescaline), tryptamine alkaloids in *Acacia*, *Citrus*, *Phalaris*, and others (e.g., *N*,*N*-dimethyltryptamine), quinolizidine alkaloids in Fabaceae (e.g., *N*-methylcytisine), and the diverse Amaryllidaceae alkaloids remains understudied (Smith, 1977a; Smith, 1977b; Wink, 1984; Jin and Xu, 2013).

# Caffeine Biosynthesis as a Model of Research Potential

The wealth of cloned alkaloid MTs has proven to be a fertile area in which to examine the relationships between enzyme function and plant biochemistry. Aside from the characterization of natural variants and concomitant identification of sequence–function correlations, modern structural biology and DNA manipulation methods have allowed experimental approaches to directly probe the features controlling enzyme properties. In select cases, the molecular evolutionary trajectories that resulted in extant enzymes and their specific properties have also been elucidated, providing some insight into the origins of specialized biochemical pathways and the exceptional chemodiversity of plants.

Perhaps the best example of the research sequence described above relates to the biosynthesis of caffeine and other xanthine alkaloids. Building on a long history of research using radiolabeled tracers to elucidate the pathway, workers eventually showed unequivocally that caffeine is synthesized from xanthosine *via* a series of *N*-methylation reactions (Suzuki and Takahashi, 1975; Suzuki and Takahashi, 1976; Ashihara et al., 1996; Kato et al., 1996; Mösli Waldhauser et al., 1997a; Mösli Waldhauser et al., 1997b; Kato et al., 1999). Shortly thereafter, cDNAs encoding these enzymes were cloned from *Camellia sinensis* (Kato et al., 2000) and *Coffea arabica* (Ogawa et al., 2001; Uefuji, 2003; Mizuno et al., 2003a, Mizuno et al., 2003b) and later from *Paullinia cupana* (Schimpl et al., 2014), *Theobroma cacao* (Yoneyama et al., 2006), and *Citrus sinensis* (Huang et al., 2016). Examination of their coding sequences showed that these xanthine NMTs belong to the SABATH (salicylic acid, benzoic acid, theobromine) family of methyltransferases, which typically methylate oxygen atoms, suggesting a relatively recent change of function in xanthine alkaloid-producing species. Intriguingly, greater sequence similarity between the functionally distinct MTs within one species (e.g., 80% identity between CaXMT, CaMXMT, and CaDXMT from *C. arabica*) compared to those of analogous function in other plants (e.g., less than 40% identity between CaDXMT and TCS1 from *Camellia sinensis*) led to the hypothesis that xanthine MTs in different lineages have parallel and convergent evolutionary histories.
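The within-species versus between-species comparison above rests on pairwise percent identity computed from aligned sequences. The sketch below shows one common way to compute it, assuming two pre-aligned sequences of equal length with `-` as the gap character. The toy sequences are hypothetical placeholders and not real enzyme sequences, and the choice of denominator (ungapped columns only) is one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignments, NOT real enzyme sequences.
paralog_1 = "MKVLSAGEDT"
paralog_2 = "MKVLSAGQDS"  # same species, different function
ortholog = "MRIL-TGNEA"   # other lineage, analogous function

within = percent_identity(paralog_1, paralog_2)
between = percent_identity(paralog_1, ortholog)
print(f"within species: {within:.1f}%  between lineages: {between:.1f}%")
```

When paralogs within one species are more similar to each other than to functionally equivalent enzymes elsewhere, as in this toy case, independent origins of the function become the simpler explanation.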

With coding sequences available to produce heterologous proteins, the molecular determinants of function were studied (Yoneyama et al., 2006; McCarthy and McCarthy, 2007). Active-site features facilitating binding of xanthine molecules, in general (e.g., hydrophobic pocket), and individual residues implicated in substrate specificity (e.g., Ser316 and Tyr356 in CaXMT hydrogen bonding with xanthine but not methylxanthine) were proposed based on comparative analysis of crystal structures in complex with various substrates. Next, site-directed mutagenesis allowed for experimental validation of these hypotheses (Yoneyama et al., 2006; Huang et al., 2016; Jin et al., 2016).

More recently, whole-genome sequencing of *Coffea*, *Camellia*, and *Theobroma* species began to unveil the genetic mechanisms leading to the evolution of caffeine biosynthesis (Argout et al., 2011; Denoeud et al., 2014; Xia et al., 2017; Wei et al., 2018). Taken together, the studies strongly suggested the occurrence of multiple independent and convergent evolutionary trajectories, in which gene duplication and functional divergence led to caffeine biosynthesis. However, these results did not yet explain how or why this biochemical feature arose so readily in distantly related plants.

Clarification of one evolutionary trajectory leading to caffeine biosynthesis in *Citrus* was recently provided by a paleomolecular biology method known as ancestral enzyme reconstruction (Thornton, 2004; Huang et al., 2016). Barkman and colleagues showed that an ancestral SABATH enzyme was likely exapted to catalyze *N*-methylation of caffeine pathway intermediates and that, following its duplication, a single amino acid substitution in each of the descendant enzymes was sufficient to create a fully functional caffeine biosynthetic pathway. Their results concerning the unusually short mutational distance between SABATH OMT and xanthine NMT functions showed how evolutionary innovation was able to repeatedly converge on caffeine biosynthesis in multiple plant orders.
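To make the idea of ancestral reconstruction concrete, the sketch below infers the candidate ancestral residue at a single alignment column using Fitch parsimony. This is a deliberately simplified stand-in: published reconstructions generally rely on probabilistic models applied to full alignments and phylogenies, and nothing here reproduces the method of the cited work. The tree, taxon names, and residues are hypothetical.

```python
def fitch_states(tree, leaf_states):
    """Bottom-up Fitch pass.

    Returns (candidate state set at this node, minimum substitutions below it).
    A tree is either a taxon name (leaf) or a pair of subtrees (internal node).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_states(left, leaf_states)
    right_set, right_cost = fitch_states(right, leaf_states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra substitution needed.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical four-taxon tree and residues observed at one alignment column.
tree = (("enzA", "enzB"), ("enzC", "enzD"))
column = {"enzA": "S", "enzB": "S", "enzC": "S", "enzD": "T"}

root_set, substitutions = fitch_states(tree, column)
print("candidate ancestral residue(s):", sorted(root_set))
print("minimum substitutions:", substitutions)
```

In this toy case the ancestor is inferred to carry serine, with a single substitution on the lineage leading to the hypothetical `enzD`. A reconstructed sequence assembled column by column in this spirit can then be synthesized and assayed.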

The extent of our knowledge surrounding the evolution of caffeine biosynthesis showcases what can be achieved with modern research tools and paradigms. In the following section, we outline the current state of knowledge regarding *O*- and *N*-methyltransferases involved in BIA biosynthesis and their contribution to host plant chemodiversity. Drawing on structural and functional studies, we outline what is known regarding the molecular determinants of their differing activities and explore what is suspected regarding their evolution. We show that the field is ripe for substantial advances paralleling those obtained with respect to caffeine biosynthesis and suggest key hypotheses and experiments by which they may be tested.

# CONTRIBUTION OF METHYLTRANSFERASES TO BENZYLISOQUINOLINE ALKALOID BIOSYNTHESIS

BIAs have been studied for centuries (Hagel and Facchini, 2013 and references therein), and much of their biosynthesis has been revealed, albeit only in a handful of model systems, which can only approximate the biosynthetic diversity in the thousand or more BIA-producing plant species (Shulgin and Perry, 2002). These discoveries and their historical context have been extensively reviewed elsewhere and will only be summarized here in order to highlight the involvement of methyltransferases. A tremendous number of BIA MTs have been characterized in plant extracts, and although these have contributed greatly to our understanding of BIA biosynthesis, we will focus herein on those that have been cloned and studied at the molecular level. The nine Ranunculales species from which BIA MTs have been characterized at this level are *Papaver somniferum*, *P. bracteatum*, *Glaucium flavum*, *Thalictrum flavum*, *Coptis japonica*, *C. chinensis*, *C. teeta*, *Dactylicapnos scandens*, and *Eschscholzia californica.* Literature reports concerning the isolation and characterization of each of the MTs discussed are cited in **Supplementary Table 1**.

In *P. somniferum*, the most studied BIA model organism, a central pathway for BIA biosynthesis begins with a condensation of two tyrosine derivatives (dopamine and 4-hydroxyphenylacetaldehyde) to form (*S*)-norcoclaurine (**Figure 2**). Acting on this base skeleton, norcoclaurine-6-*O*-methyltransferase (6OMT) transfers a methyl group onto one of the isoquinoline moiety hydroxyl groups to yield (*S*)-coclaurine. Given the core role of this enzyme, it is unsurprising that nine cognate cDNAs from six species have been isolated and shown to encode this activity (**Table 1**; **Supplementary Table 1**) (Morishige et al., 2000; Ounaroon et al., 2003; Facchini and Park, 2003; Samanani, 2005; Tamura et al., 2006; Desgagné-Penix and Facchini, 2012; Chang et al., 2015; Robin et al., 2016; He et al., 2018). Interestingly, several species (e.g., *G. flavum*, *C. chinensis*) seem to express multiple distinct transcripts encoding enzymes with this activity despite sharing only 45–55% amino acid identity (**Supplementary Figure 1**). Downstream in the central pathway, a first *N*-methyl group is installed by coclaurine *N*-methyltransferase (CNMT), which has been cloned from four BIA-producing species (Choi et al., 2002; Facchini and Park, 2003; Samanani, 2005; Minami et al., 2008; Liscombe et al., 2009; Desgagné-Penix and Facchini, 2012; Hagel et al., 2015). Notably, four transcripts encoding enzymes with CNMT activity (59–66% AA identity; **Supplementary Figure 2**) have been cloned from *G. flavum*; however, their individual contributions to biosynthesis in the host plant remain to be assessed. The final step in the central pathway is catalyzed by 3′-hydroxy-*N*-methylcoclaurine 4′-*O*-methyltransferase (4′OMT) and yields the triple-methylated central intermediate (*S*)-reticuline. Corresponding transcripts have been cloned from three species (Morishige et al., 2000; Facchini and Park, 2003; Ounaroon et al., 2003; Ziegler et al., 2005; Desgagné-Penix and Facchini, 2012; Chang et al., 2015). Widespread occurrence of the above three MTs in BIA-producing species is consistent with the current model in which all end product alkaloids derive from reticuline or, less commonly, from upstream central pathway intermediates (Hagel et al., 2015).

The 4′-*O*-methylation also contributes to papaverine biosynthesis *via* a branch that diverges from the central pathway prior to the action of CNMT and, instead, passes through (*S*)-norreticuline. Next, the exceptionally substrate-specific norreticuline 7-*O*-methyltransferase (N7OMT) installs a third *O*-methyl to yield (*S*)-norlaudanine. The single N7OMT cloned to date occurs in *P. somniferum*, which is consistent with the somewhat restricted taxonomic distribution of papaverine (Shulgin and Perry, 2002; Pienkny et al., 2009; Desgagné-Penix and Facchini, 2012). Although the preponderance of evidence at this time supports an *N*-desmethylated biosynthetic scheme for papaverine, it was also proposed that biosynthesis might pass through (*S*)-reticuline. Indeed, a reticuline 7-*O*-methyltransferase (7OMT) has also been cloned from *P. somniferum*, along with eight additional transcripts encoding 7OMT-like enzymes from five species (Ounaroon et al., 2003; Fujii et al., 2007; Dang and Facchini, 2012; Desgagné-Penix and Facchini, 2012; Chang et al., 2015; Purwanto et al., 2017; He et al., 2018). The 3′-*O*-methylation, which completes the series of four methylations required to yield papaverine, may be catalyzed in part *via P. somniferum* SOMT1 despite the fact that its *in vitro* activity substantially favors other substrates (Dang and Facchini, 2012). An additional five transcripts encoding enzymes with comparable 3′OMT activities have been cloned from three species (Fujii et al., 2007; Chang et al., 2015; Purwanto et al., 2017). Aside from papaverine, many 3′- and 7-*O*-methylated simple BIAs (e.g., laudanine, laudanosine) and potential derivatives (e.g., jatorrhizine, tetrahydropalmatine) are known; however, the involvement of any given OMT in their biosynthesis remains speculative due to a lack of *in planta* experimental evidence.

FIGURE 2 | Contributions of *O*- and *N*-methyltransferases to BIA biosynthesis. A central pathway (*dark blue*) leads to the core 1-BIA intermediate (*S*)-reticuline (dashed rectangle) from which various branch pathways diverge, including those leading to aporphines (*red*), pavinans (*purple*), protoberberines (*light blue*), protopines (*pink*), benzo[c]phenanthridines (*green*), and phthalideisoquinolines (*brown*). The bisbenzylisoquinolines (*gray*) are typically produced by dimerization of various 1-BIA intermediates. The branch pathway to papaverine (*yellow*) is unusual in diverging from the central pathway prior to *N*-methylation. The 6OMT (norcoclaurine 6-*O*-methyltransferase), 4′OMT (3′-hydroxy-*N*-methylcoclaurine 4′-*O*-methyltransferase), N7OMT (norreticuline 7-*O*-methyltransferase), 3′OMT (3′-*O*-methyltransferase), SOMT (scoulerine *O*-methyltransferase), 4′AOMT (4′-*O*-desmethyl-3-*O*-acetylpapaveroxine 4′-*O*-methyltransferase), CNMT (coclaurine *N*-methyltransferase), RNMT (reticuline *N*-methyltransferase), PavNMT (pavine *N*-methyltransferase), and TNMT (tetrahydroprotoberberine *N*-methyltransferase) are drawn on branches representing their major or physiologically relevant activities. *O*- and *N*-methyl groups reportedly installed by the enzymes are circled in *red* and *green*, respectively. A detailed BIA biosynthetic pathway is available in Hagel and Facchini (2013).

Starting from (*S*)-reticuline, a short branch yields the aporphine alkaloid (*S*)-magnoflorine, which is taxonomically widespread (e.g., Ranunculales, Laurales, Magnoliales, Sapindales, Piperales) (Shulgin and Perry, 2002). Acting on either reticuline or its aporphine derivative (corytuberine), *P. somniferum* reticuline *N*-methyltransferase (RNMT, named as such to reflect *in vitro* substrate preference; **Table 1**; **Supplementary Table 1**) installs a second methyl group resulting in a quaternary nitrogen atom. To date, three additional transcripts encoding RNMT-like enzymes have been reported in two other BIA-producing plant species, both of which are also known to accumulate (*S*)-magnoflorine and related quaternary alkaloids (Shulgin and Perry, 2002; Liscombe et al., 2009; Hagel et al., 2015; Torres et al., 2016). Of particular note is the enzyme cloned from *T. flavum* and named pavine *N*-methyltransferase (TfPavNMT). Whereas the enzyme efficiently catalyzes the signature RNMT-like activity and might contribute to quaternary aporphine biosynthesis, it also uniquely accepts pavinan alkaloids and is thought to participate in the biosynthesis of *N*-methylescholzidine, an uncommon BIA that accumulates in *Thalictrum* spp. (Liscombe et al., 2009). In *D. scandens*, a corytuberine 7-*O*-methyltransferase (DsC7OMT) was identified, and preliminary analysis linked it to the biosynthesis of isocorydine (He et al., 2017).

Several longer branch pathways employing MT reactions also diverge from (*S*)-reticuline to produce protoberberines (e.g., stylopine), protopines (e.g., protopine), benzo[c]phenanthridines (e.g., sanguinarine), and phthalideisoquinolines (e.g., noscapine) (**Figure 2**). Acting on the first protoberberine intermediate, scoulerine 9-*O*-methyltransferase (SOMT) installs a third *O*-methyl group to yield tetrahydrocolumbamine. In agreement with the relatively common occurrence of protoberberines in BIA-producing plants, known enzymes with SOMT activity are encoded by 12 transcripts isolated from five different species (Takeshita et al., 1995; Morishige et al., 2000; Fujii et al., 2007; Dang and Facchini, 2012; Chang et al., 2015; Purwanto et al., 2017; He et al., 2018). A functionally similar enzyme, which preferentially targets the 2-hydroxyl of quaternary protoberberine columbamine, was cloned only from *C. japonica* (CjCoOMT) (Morishige et al., 2002). En route to noscapine biosynthesis, tetrahydroprotoberberine NMT (TNMT) transfers a methyl group onto a bicyclic nitrogen atom, yielding the quaternary product *N*-methylcanadine. In addition to the canonical representative isolated from *P. somniferum*, six other TNMT-like enzymes have been cloned from four other species (Samanani, 2005; Liscombe and Facchini, 2007; Liscombe et al., 2009; Hagel et al., 2015; Torres et al., 2016). Intriguingly, the final *O*-methylation required to produce noscapine was recently shown to involve a heterodimer composed of PsSOMT2 and either PsSOMT3 or Ps6OMT (Li and Smolke, 2016; Park et al., 2018). No equivalent enzymes are presently known in other species, which is consistent with the lack of reports of noscapine in BIA-producing species outside Papaveraceae (Shulgin and Perry, 2002). Aside from the role outlined above, TNMTs also participate in a separate branch pathway leading first to protopines and then to benzo[c]phenanthridines *via* synthesis of (*S*)-*cis*-*N*-methylstylopine. This product accumulates to a substantial degree in *T. flavum*, which is inconsistent with the relatively modest stylopine NMT activity reported for TfPavNMT (Liscombe et al., 2009). Thus, it seems likely that one of the additional NMT transcripts recently identified in that species' transcriptome encodes an enzyme more dedicated to protoberberine substrates (Hagel et al., 2015). Acting further downstream in the benzo[c]phenanthridine branch pathway, a functionally unique OMT from *E. californica* (EcG11OMT) apparently targets 10-hydroxysanguinarine (Purwanto et al., 2017).

*6OMT, 6-O-methyltransferase; 4′OMT, 4′-O-methyltransferase; N7OMT, norreticuline 7-O-methyltransferase; 7OMT, 7-O-methyltransferase; 3′OMT, 3′-O-methyltransferase; SOMT, scoulerine-O-methyltransferase; 4′AOMT, 4′-O-desmethyl-3-O-acetylpapaveroxine 4′-O-methyltransferase; CNMT, coclaurine N-methyltransferase; RNMT, reticuline N-methyltransferase; TNMT, tetrahydroprotoberberine N-methyltransferase. Ephedra sinica phenylalkylamine N-methyltransferase (EsPaNMT), denoted with an asterisk, has promiscuous activity with various BIAs but does not participate in their biosynthesis in planta. Dactylicapnos scandens corytuberine 7-O-methyltransferase (DsC7OMT) and Eschscholzia californica 10-hydroxysanguinarine O-methyltransferase (EcG11OMT) have unique substrate ranges and are not included in the table.*

Aside from the above, a number of transcripts have been cloned and found to encode enzymes with high homology to BIA MTs but with no discernible activity *in vitro* (Liscombe and Facchini, 2007; Chang et al., 2015; Morris and Facchini, 2016; Purwanto et al., 2017; He et al., 2018). Although these may, in fact, be inactive with respect to BIAs, it remains possible that they have not been assayed under appropriate conditions or with proper substrates. These mysterious transcripts include at least one NMT each from *P. somniferum* and *Arabidopsis*, as well as a large number of OMTs from *G. flavum*, *P. somniferum*, *E. californica*, and *Coptis* spp. In the latter group, substantial homology to other types of plant OMTs makes identification of those targeting BIAs quite challenging.

As revealed in **Table 1**, the majority of BIA MTs are known to catalyze many additional reactions beyond those prototypical conversions represented by their names and position on orderly, linear biosynthetic pathways as traditionally drawn (**Figure 2**). Moreover, the targeted nature of *in vitro* biochemical characterization means that all reports necessarily underestimate the catalytic range of BIA MTs. The lack of specificity reported for BIA MTs occurs at two levels: substrate promiscuity, wherein the enzyme can methylate a number of different molecules, and product promiscuity, wherein a single substrate is methylated one or more times at various positions to yield different products (O'Brien and Herschlag, 1999; Hult and Berglund, 2007). On the other hand, BIA MTs do not show catalytic promiscuity, which is the ability to carry out distinct types of chemical transformations. Thus far, BIA OMTs and NMTs have only been shown to catalyze *O*- and *N*-methylation, respectively, unlike certain SABATH MTs that target both *O* and *N* atoms (Huang et al., 2016). As shown for PsRNMT with respect to magnoflorine biosynthesis, the most substantial activity of an enzyme *in vitro* may not correlate with its function *in planta* (Morris and Facchini, 2016). This widespread promiscuity has led to an appreciation for the existence of multidimensional "metabolic grids", which diversify the potential routes by which a plant may make any given end product. Experimental evidence for major, minor, or even "silent" routes in BIA biosynthesis has been given by gene knockdown experiments in whole plants and cell cultures (Fujii et al., 2007; Desgagné-Penix and Facchini, 2012).
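The contrast between a linear pathway and a metabolic grid can be illustrated by counting the orders in which a fixed set of modifications may be installed. The sketch below is a hypothetical abstraction: the step names are placeholders, each step is treated as independent, and no kinetic or chemical constraint (such as a prerequisite hydroxylation) is modeled.

```python
from itertools import permutations

# Hypothetical modification steps; placeholders rather than a real pathway.
STEPS = ("6-O-methylation", "N-methylation", "4'-O-methylation")


def routes(steps, allowed):
    """Return every ordering of steps in which each step is permitted
    given the set of steps already completed."""
    valid = []
    for order in permutations(steps):
        done = frozenset()
        feasible = True
        for step in order:
            if not allowed(step, done):
                feasible = False
                break
            done = done | {step}
        if feasible:
            valid.append(order)
    return valid


def promiscuous(step, done):
    """Enzymes accept any intermediate, so every step is always permitted."""
    return True


def strict(step, done):
    """Each enzyme accepts only the product of the preceding steps in STEPS order."""
    return done == frozenset(STEPS[:STEPS.index(step)])


grid_routes = routes(STEPS, promiscuous)
linear_routes = routes(STEPS, strict)
print(len(grid_routes), "routes with promiscuous enzymes")
print(len(linear_routes), "route with strictly specific enzymes")
```

With three independent steps, full promiscuity allows 3! = 6 routes to the same end product, whereas strict specificity leaves one. Knockdown of a single enzyme would then block the only route in the linear case but only a subset of routes in the grid.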

# MOLECULAR AND STRUCTURAL DETERMINANTS OF FUNCTION

# *O*-Methyltransferases

To date, structures have been reported for *T. flavum* norcoclaurine 6OMT (Tf6OMT; PDB 5ICE) and *P. somniferum* scoulerine *O*-methyltransferase 1 (PsSOMT1; PDB 6I6K) (Robin et al., 2016; Cabry et al., 2019). While these structures have allowed for the generation of compelling hypotheses concerning substrate binding and catalysis, relatively little experimental work (e.g., site-directed mutagenesis) is presently available in support of their validity. The overall structures, each composed of an N-terminal dimerization domain linked to a C-terminal substrate binding domain, are consistent with those previously reported for SAM-dependent OMTs in plants such as *Medicago sativa* caffeic acid OMT (MsCOMT; PDB 1KYZ), *M. truncatula* isoflavonoid OMT (MsIOMT; PDB 1FP2), and *M. sativa* chalcone OMT (MsChOMT; PDB 1FP1) (Zubieta et al., 2001; Zubieta, 2002).

### Dimerization

BIA OMTs form dimers in solution as well as in all obtained crystal structures. Dimerization occurs *via* a substantial (150 of 350 residues in Tf6OMT) domain composed primarily of intertwined helices (**Figure 3**). Most of these helices interact with those of the other monomer, resulting in burial of approximately 22% of total protein surface area. Across BIA OMTs, the entire dimerization domain shows relatively modest conservation. Nevertheless, two leucine residues (Leu28, Leu73) show perfect identity, and an additional four hydrophobic residues (Leu34, Ile40, Leu52, Leu66) show strong conservation (**Supplementary Figure 3**). However, their contributions to the dimer interface do not appear to be substantial, suggesting that they may be more important in maintaining secondary and tertiary structures of each monomer (PDBePISA) (Krissinel and Henrick, 2007). Surprisingly, gel filtration analysis of CjSOMT led the authors to report the existence of a trimer in solution (Morishige et al., 2000). Examination of the CjSOMT sequence does reveal a 20-amino acid extension of the N-terminus relative to Tf6OMT. However, it is unclear how this feature could so drastically alter how the monomers associate.

Although heterologously expressed BIA OMTs have generally been studied under the assumption that they form homodimers *in planta*, several studies suggest that dimerization of genetically distinct monomers (i.e., heterodimerization) may play a crucial role in alkaloid biosynthesis (Frick and Kutchan, 1999; Li and Smolke, 2016; Park et al., 2018). Recently, the missing methylation step in *P. somniferum* noscapine biosynthesis was shown to involve a heterodimer of PsSOMT2 and either PsSOMT3 or Ps6OMT (Park et al., 2018). The discovery of an OMT heterodimer that catalyzes a reaction not performed by either homodimer reveals an unexpected diversification strategy by which additional OMT activities can exist without the genomic and metabolic burden of maintaining and expressing dedicated OMT genes. Although the extent of heterodimerization in plant specialized metabolism remains to be established, it is intriguing to consider that many more OMT activities might be linked to heterodimers in the future.

Several helices of the dimerization domain also contribute to the active site, which is found at the interface between the first monomer's C-terminal domain, two helices of its N-terminal domain, and two helices belonging to the N-terminal domain of the second monomer (**Figure 3**). Together, these four helices form a hydrophobic "back wall" at the active site, but no direct interactions with the substrate have been reported. The lack of clear interactions is surprising given several pieces of evidence showing that the N-terminal domains of both monomers make contributions to substrate selectivity. For example, a chimeric enzyme fusing the Cj6OMT dimerization domain to the C-terminal domain of Cj4′OMT displayed substrate- and regiospecificity most similar to Cj6OMT (Morishige et al., 2010). Further shuffling of the two polypeptide sequences suggested that the determinants of function exist between the 34th and 125th amino acids in Cj6OMT, which is a region that includes the two helices mentioned above, which contribute to a monomer's own active site, but excludes the two helices, which contribute to the second monomer's active site. Other lines of evidence suggest that the identity of the second monomer, potentially mediated by the aforementioned two helices, alters substrate specificity in the first monomer's active site. It was shown that although the PsSOMT2 monomer contains the catalytic machinery necessary for turnover in the PsSOMT2:PsSOMT3 heterodimer, substitution of PsSOMT3 with either PsSOMT2 or PsN7OMT abolished activity (Park et al., 2018). Structural and functional mutagenesis studies should help clarify the origin of these indirect effects.

### SAM Binding

Most of the BIA OMT polypeptide (200 of 350 residues in Tf6OMT) forms a *C*-terminal domain composed of alternating alpha helices and beta sheets which together create a Rossmann fold classically associated with nucleotide binding (**Figure 3**). SAM binding occurs *via* a series of motifs, which are highly conserved with other plant SAM-dependent MTs (Kozbial and Mushegian, 2005; Gana et al., 2013). The residues within these motifs, which directly interact with SAM, are almost perfectly conserved in known BIA OMTs. However, three SOMTs display conservative (e.g., Asp to Glu) or semi-conservative (e.g., Asp to Gln) substitutions at two positions (**Supplementary Figure 3**). Unfortunately, neither of these substitutions occurs in the recently crystallized PsSOMT1, leaving the question of how these might alter OMT–SAM interactions unresolved. Comparison of apoenzyme and enzyme–substrate complexes further reveals that the residue equivalent to Thr170 in Tf6OMT (Ser211 in PsSOMT1) interacts with the co-substrate upon binding and contributes to a substantial conformational change (16° hinge movement) of the enzyme likely to be crucial for catalysis (Robin et al., 2016). Intriguingly, binding of SAH was sufficient to induce the closing movement in Tf6OMT, whereas the equivalent movement was only observed after binding of both SAH and the BIA substrate in PsSOMT1 (Cabry et al., 2019). Despite this minor difference, the available evidence suggests that SAM binding occurs in a very similar manner for all BIA OMTs.

### BIA Binding

Binding of the alkaloid substrate occurs at a location proximal to the SAM binding site and *via* residues overlapping the aforementioned conserved SAM-binding motifs (**Figure 3**; **Supplementary Figure 3**). Notably, the relative orientation of the bound BIA molecule is flipped between Tf6OMT and PsSOMT1, such that isoquinoline and benzyl moieties or their derivatives in protoberberines (**Supplementary Figure 4**; **Supplementary Figure 5**) generally make reciprocal interactions in either enzyme (Robin et al., 2016; Cabry et al., 2019). A histidine residue (His256 in Tf6OMT), which hydrogen bonds with the target hydroxyl group, is perfectly conserved across BIA OMTs. In Tf6OMT, two closely adjacent residues (Asp257 and Cys253) also interact with the target hydroxyl group and likely fine tune its position and reactivity. On the other hand, the equivalent residues in PsSOMT1 (Asp297 and Trp293) were interpreted as forming a channel through which the SAM methyl group is directed toward the acceptor hydroxyl. Although Asp297 is not proposed to be catalytic *per se*, alanine substitution at this position sharply reduced the activity of PsSOMT1 (Cabry et al., 2019). Precise substrate positioning within the active site is also affected by hydrogen bonds formed between nontargeted moieties and residues equivalent to Gly165 (main chain carbonyl) and Asp169 in Tf6OMT. Whereas the former shows strong semi-conservation (i.e., Gly or Ala), the latter is much more variable while maintaining conservation within certain BIA OMT subtypes (e.g., Glu in 4′OMTs, Asp in 6/7OMTs, His/Phe/Gly in SOMTs). In PsSOMT1, the equivalent residue (Phe210) interacts with a substrate aromatic ring rather than a hydroxyl group.

As seen in the Tf6OMT structure with bound (*S*)-norlaudanosoline, only a small adjustment in the angle of the substrate is required to place the 7-hydroxyl in an alignment productive for methyl transfer and, in fact, 7-*O*-methylation activity has not been ruled out for Tf6OMT (**Figure 3**) (Robin et al., 2016). Conversely, PsSOMT1 has been shown to catalyze both 9- and 2-*O*-methylation of (*S*)-scoulerine despite this requiring the substrate to bind in two completely different orientations (Dang and Facchini, 2012). In the absence of crystal structures revealing the details of such alternative binding modes, biophysical modeling supported by mutagenesis studies may be the only practical method by which to understand the determinants of regio-specificity in BIA OMTs.

In Tf6OMT, Asp306 makes a particularly important contribution to alkaloid binding *via* a hydrogen bond with the nitrogen atom (Robin et al., 2016). A close examination of sequence variation among known BIA OMTs at this position reveals a possible explanation for selectivity concerning *N*-methylation status (i.e., un-methylated secondary or mono-methylated tertiary). Whereas PsN7OMT has an aspartic acid residue capable of hydrogen bonding with the secondary nitrogen atom of (*S*)-norreticuline, Ps7OMT and Ec7OMT have leucine and glycine residues instead (**Supplementary Figure 3**). These smaller and uncharged residues might be expected to alleviate steric hindrance, which would occur with the additional methyl group present on tertiary BIAs like (*S*)-reticuline, thus facilitating binding and catalysis. Although the residue equivalent to Asp306 shows strong conservation among BIA OMTs targeting substrates with the simple 1-BIA scaffold (i.e., 6OMT, N7OMT, 4′OMT, etc.), the equivalent is typically hydrophobic (e.g., Leu or Ala) in SOMT-like enzymes. Examination of the PsSOMT1 structures reveals that this substitution pattern likely relates to the "flipped" substrate-binding mode described above. Rather than being positioned near the nitrogen atom, the equivalent residue (Leu346) simply forms a hydrophobic interaction with carbon atoms in one of the adjacent rings.

A number of more generic interactions are also involved in OMT-BIA binding. Two almost perfectly conserved methionine residues (Met166 and Met307 in Tf6OMT) support and position the rings of the isoquinoline moiety *via* sulfur–aromatic interactions (Reid et al., 1985; Robin et al., 2016). The equivalent residues in PsSOMT1 (Met207 and Leu347) instead sandwich the aromatic ring, which derives from the benzyl moiety in 1-BIAs (**Supplementary Figure 3**; **Supplementary Figure 4**) (Cabry et al., 2019). Similarly, two well-conserved aromatic residues (Phe162 and Trp149 in Tf6OMT, Phe190 and Phe203 in PsSOMT1) provide aromatic interactions with the isoquinoline moiety in Tf6OMT but with the benzyl moiety derivative in PsSOMT1. An additional hydrophobic residue (Ile114 in Tf6OMT, Thr157 in PsSOMT1), which helps position the benzyl or isoquinoline moieties, shows much less conservation. Intriguingly, the enzymes that catalyze *O*-methylation of the benzyl moiety (e.g., Cj4′OMT, Tf4′OMT, Ps4′OMT2) are substituted with a methionine at this position. Given the observation that two methionine residues help position the isoquinoline moiety in Tf6OMT, it is tempting to speculate that an equivalent interaction takes place in 4′OMTs albeit with the BIA molecule positioned such that the isoquinoline and benzyl moieties swap places. This hypothesis is supported by the "flipped" BIA binding pose of PsSOMT1, which reportedly *O*-methylates the benzyl moiety of certain 1-BIAs (Dang and Facchini, 2012). Crystal structures of 4′OMTs will be helpful in testing this hypothesis. More generally, the size and shape of the substrate pocket are thought to control selectivity at a coarse level (e.g., between BIAs, chalcones, or isoflavones). In particular, the bulky Phe156 in PsSOMT1 was recently proposed to act as a "gatekeeper" residue preventing, *via* steric hindrance, binding of substrates with large groups opposite the target hydroxyl (Cabry et al., 2019).
In Tf6OMT, a much smaller residue (Thr113) shapes the binding pocket such that the bulkier isoquinoline moiety can be accommodated. This model is an important first step in establishing a unified framework by which to understand and predict substrate specificity of plant OMTs *a priori*.

### Catalysis

The catalytic mechanism of BIA OMTs is thought to be conserved with other plant SAM-dependent OMTs (Robin et al., 2016; Cabry et al., 2019). Briefly, a histidine residue (His256 in Tf6OMT, His296 in PsSOMT1) acts as a general base and deprotonates the target hydroxyl group. Subsequently, the newly generated oxyanion carries out a nucleophilic attack on the labile methyl group of SAM, which results in its transfer. The significance of the histidine residue in BIA OMTs was confirmed by targeted mutagenesis experiments, which almost entirely abolished catalytic activity in PsSOMT1 and other *P. somniferum* OMTs (Park et al., 2018; Cabry et al., 2019). An adjacent residue (Asp257 or Asp297), discussed above in the context of substrate binding, can also be thought of as participating in catalysis. In PsSOMT1, substitution of this residue with an alanine yielded an enzyme with roughly 2% activity, leading the workers to conclude that it has an important but non-essential role (Cabry et al., 2019). Structural analysis further implicates a glutamic acid residue (Glu356 in PsSOMT1, Glu315 in Tf6OMT) in hydrogen bonding with the catalytic histidine to promote its necessary basicity. Although the equivalent residue was not discussed with respect to Tf6OMT, examination of the published structures suggests that such an interaction is also present. In fact, perfect conservation of this residue suggests that this aspect of catalysis is maintained in all known BIA OMTs (**Supplementary Figure 3**).

Feedback inhibition is known to be a significant feature affecting BIA OMT activity (Sato et al., 1994; Robin et al., 2016). As shown for Tf6OMT, pathway end products can compete for the active site and thus slow or prevent catalysis. In the case of inhibition by sanguinarine, binding occurs in a position that partially overlaps with that of a productive substrate as described above. Several of the generic interactions (e.g., those with Met166 and Ile114) are preserved, while one of the hydrogen bonding residues (Asp169) interacts with a different O atom. As a result, the planar sanguinarine molecule binds in a position rotated roughly 90° on two axes relative to the productive BIA substrate and forms several new aromatic and hydrogen bonding interactions. Comparison of enzyme–inhibitor interactions to those with a productive substrate can provide rational targets for mutagenesis (i.e., residues interacting with the inhibitor but not the productive substrate), thus potentially allowing for the engineering of feedback-insensitive OMT enzymes, which would be highly useful in biotechnological applications. Unfortunately, the specifics of these interactions are likely to vary significantly from one enzyme–inhibitor pair to the next, thus limiting our ability to generalize from structures already reported in the literature. Although additional crystal structures with bound inhibitors are the gold standard, biophysical modeling approaches (e.g., docking) could also provide some insight on shorter timescales.

# *N*-Methyltransferases

Compared to the OMTs, BIA NMTs have received substantial attention in terms of structure–function investigations despite the much smaller number of functionally characterized representatives. An initial investigation was reported for *T. flavum* pavine NMT (PDB 5KOK) and, recently, for *C. japonica* coclaurine NMT (PDB 6GKV) (Torres et al., 2016; Bennett et al., 2018). A third report concerning the tetrahydroprotoberberine NMT from *G. flavum* was accepted for publication during the preparation of this manuscript (PDB 6P3O) (Lang et al., 2019). Taken together, these three studies cover much of the functional range reported to date for BIA NMTs and reveal many of the molecular determinants of function.

The overall structures (**Figure 4**; **Supplementary Figure 6**), which include a canonical Rossmann SAM-binding domain as well as a C-terminal substrate-binding domain, are consistent with those reported for other SAM-dependent NMTs including *Plasmodium* phosphoethanolamine NMT (PfPMT; PDB 3UJA) and *Mycobacterium tuberculosis* cyclopropane synthase (MtPcaA; PDB 1KPH) (Huang et al., 2002; Lee et al., 2012). Although we annotate the BIA NMT polypeptide sequences with distinct SAM and BIA substrate-binding regions for simplicity, interacting residues are not strictly found within these regions, and domains become more evident in the tertiary structure. In addition to the typical NMT domains, the BIA NMTs also contain an N-terminal extension composed of three helices, which wrap around the substrate-binding domain and contribute to a homodimerization interface. This N-terminal extension also contributes to positioning a loop and helix proposed to gate the active site.

### Dimerization

As with the OMTs, BIA NMTs form dimers both in solution and under crystallization conditions (Torres et al., 2016; Bennett et al., 2018). However, the extent of the dimerization interface is substantially less and corresponds to only 6.5% and 6.9% of surface area in TfPavNMT and CjCNMT, respectively. Although buried surface area is similar in GfTNMT, the occurrence of several salt bridges renders the dimerization substantially more favorable, suggesting that the importance of dimerization may vary across BIA NMTs. In all cases, dimerization occurs at the "rear" of the monomer with respect to the substrate-binding pocket and *via* somewhat conserved residues located primarily in two helices and two beta sheets distributed between the N-terminal extension and C-terminal BIA-binding domain. Notably, reciprocal interactions between tri-lysine motifs present in the C-terminus of all BIA NMTs also contribute to the dimerization interface. In Arabidopsis and other eukaryotes, such a motif has been shown to result in retrograde trafficking (ER retention) of membrane-bound proteins *via* interaction with COPI proteins (Wang et al., 2011). However, given that BIA NMTs have been experimentally shown to localize to the cytosol, it appears that this motif may have a more general utility in enabling protein–protein interaction (Hagel and Facchini, 2012). Neither the *in planta* occurrence nor the functional significance of homo- or hetero-dimerization has been verified for BIA NMTs. However, it has been speculated that small hinge movements in one monomer might be transferred *via* the dimerization domain to the other monomer, resulting in a cooperativity effect. Nevertheless, in the absence of close interactions between the dimerization domain and catalytic site, it would be surprising to discover consequences as substantial as those discussed for the OMTs.

### SAM Binding

The Rossmann fold SAM-binding domains of various BIA NMTs are structurally quite similar, yielding RMSD values of ~0.4–0.5 Å when aligned to each other (Torres et al., 2016; Lang et al., 2019). Binding of the cosubstrate occurs *via* sequence motifs largely conserved with other plant SAM-dependent MTs (Kozbial and Mushegian, 2005; Gana et al., 2013). In the available crystal structures, up to 13 direct or water-mediated hydrogen bonds appear to position the cosubstrate (**Figure 4**). The implicated residues are almost perfectly conserved in all cloned BIA NMTs, with the exception of one nonconservative substitution (Gln to His) in PsRNMT (**Supplementary Figure 6**) (Torres et al., 2016; Bennett et al., 2018; Lang et al., 2019). Comparison of the apoenzyme and binary complexes (e.g., TfPavNMT *versus* TfPavNMT + SAH) revealed only a minor hinge movement of domains relative to that seen in the OMTs. Instead, active site closure seems to depend on the gate-like 70s loop, which becomes more ordered upon cosubstrate binding. Sequence identity in these features strongly suggests that the mechanisms of SAM binding are conserved for all known BIA NMTs.
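
For readers less familiar with the RMSD metric quoted above, the following is a minimal sketch of how it is computed for two coordinate sets that have already been superposed. The function name `rmsd` and the three-atom coordinate lists are hypothetical and purely illustrative; they are not taken from any PDB entry, and the superposition procedure used in the cited studies is not reproduced here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in the input units, e.g. angstroms)
    between two equally sized, pre-superposed sets of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical C-alpha positions in angstroms, for illustration only.
domain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
domain_b = [(0.3, 0.0, 0.0), (3.8, 0.4, 0.0), (7.6, 0.0, 0.5)]

print(round(rmsd(domain_a, domain_b), 2))  # 0.41
```

In this toy case the atoms deviate by 0.3 to 0.5 Å individually, giving an RMSD of about 0.41 Å, which is the same order as the values reported for the BIA NMT SAM-binding domains.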

### BIA Binding

As expected, given the diversity of substrates turned over by various BIA NMTs, the largely helical BIA binding domain shows substantially less conservation than the SAM-binding domain. Although crystal forms binding the enzymes' preferred or physiological substrates (e.g., reticuline or pavine for TfPavNMT, coclaurine for CjCNMT, stylopine for GfTNMT) have been elusive, structures with bound analogs (e.g., *N*-methylheliamine, tetrahydropapaverine) nevertheless provide some insight into substrate recognition and catalysis. Fortuitously, a GfTNMT crystal was obtained in complex with the endogenous product (*S*)-*cis*-*N*-methylstylopine (SMS), providing the most reliable view to date of biologically meaningful interactions. In all BIA NMTs, the binding pocket is lined with many hydrophobic residues and apparently gated by a loop, which becomes ordered and partially helical upon substrate binding. While a limited number of hydrogen bonds are recognizable in the GfTNMT crystal structure, as a general rule, it appears that substrate recognition and binding depend primarily on steric effects, van der Waals interactions, and aromatic interactions (**Figure 4**) (Torres et al., 2016; Lang et al., 2019).

The availability of structures for three BIA NMTs reveals that the key substrate-binding interactions differ somewhat across functional subtypes. In TfPavNMT, 13 residues line the binding pocket, and the most significant substrate interactions involve Phe96, His232, Phe292, and Phe330. In CjCNMT, three residues lining one side of the binding pocket (Tyr328, Trp329, Phe332) were probed by site-directed mutagenesis and shown to significantly impact function (Bennett et al., 2018). Characterization of the mutant enzymes' kinetics with respect to isoquinoline and benzylisoquinoline substrates revealed that Trp329 primarily interacts with the benzyl moiety, whereas Phe332 primarily interacts with the isoquinoline moiety. In GfTNMT, comparable results were obtained in which substitution of the equivalent residues (Ile329 and Phe332) with alanine resulted in mutant enzymes with roughly 10% activity (Lang et al., 2019). While GfTNMT's lack of activity with isoquinoline substrates precluded a comparative kinetic analysis such as was carried out for CjCNMT, examination of the GfTNMT crystal structure allowed the authors to suggest that Phe332 is better positioned to interact with the benzyl moiety and associated C9/C10 methylenedioxy bridge of SMS.

FIGURE 4 | Structure and major active site interactions of representative BIA *N*-methyltransferases. (A) Key features of BIA NMTs are labeled according to their relative location along the *T. flavum* pavine *N*-methyltransferase (TfPavNMT) polypeptide with residues denoted by ovals and larger features by rectangles. (B) Crystal structures of TfPavNMT with bound SAH and (*R,S*)-tetrahydropapaverine are reported by Torres et al. (2016) (PDB 5KOK), (E) *C. japonica* coclaurine NMT with bound SAH and *N*-methylheliamine is reported by Bennett et al. (2018) (PDB 6GKV), and (H) *G. flavum* tetrahydroprotoberberine NMT with bound SAH and (*S*)-*cis*-*N*-methylstylopine is reported by Lang et al. (2019) (PDB 6P3O). Domains correspond to those shown in (A) and Supplementary Figure 6. The alkaloid substrate is shown in *cyan*, and the cosubstrate SAH is shown in *orange.* Major active site interactions in (C, D) TfPavNMT, (F, G) CjCNMT, and (I, J) GfTNMT were visualized using PoseView and modified where necessary to reflect literature reports. Dashed lines indicate hydrogen bonds, and green lines indicate hydrophobic or aromatic interactions. Dashed lines drawn in *gray* indicate multiple alternative potential interactions. (A) Drawn using Illustrator for Biological Sequences (Liu et al., 2015).

In contrast to the other structurally characterized BIA NMTs, GfTNMT does appear to form a limited set of hydrogen bonds with the BIA molecule's functional groups. Whereas the side chains of residues Gln98 and Tyr81 interact directly with the oxygen atoms of the C9/C10 methylenedioxy bridge in SMS, residues Phe332 (main chain carbonyl) and Gln339 (side chain carboxamide) interact with the C2/C3 oxygen atoms *via* a network of water-mediated hydrogen bonds. Substitution of Tyr81 with either phenylalanine or alanine resulted in comparable mutant enzymes displaying 10–20% activity, indicating that the hydroxyl group of the tyrosine side chain is crucial to the residue's function (Lang et al., 2019). While the involvement of water molecules in substrate binding was only explicitly reported for GfTNMT, careful examination of the TfPavNMT and CjCNMT crystal structures reveals the existence of several well-ordered water molecules within potential hydrogen bonding distance of the BIA molecule.

Interestingly, the size of the substrate-binding pocket in TfPavNMT is somewhat larger than that reported for the other BIA NMTs (Torres et al., 2016). Enlargement of the cavity apparently results from many small contributions, including individual amino acid substitutions (e.g., A204), rotation of side chains and slight repositioning of secondary structural elements (e.g., helix α4, 240s loop). Although two separate molecules of a 1-BIA substrate were bound in the reported structure, it was suggested that BIA dimers (bisbenzylisoquinolines) common in *Thalictrum* spp. might bind to the active site *in planta.* While bis-BIAs might conceivably be substrates for *N*-methylation, their status as pathway end products makes it tempting to speculate that they could instead act as feedback inhibitors in a manner analogous to that described for sanguinarine and Tf6OMT. Alternatively, binding of a second BIA monomer at the adjacent (non-productive) location in the active site might cause either substrate or product inhibition and thus regulate NMT activity and adjust pathway flux. Given that binding of a second 1-BIA molecule was proposed to be inhibitory in TfPavNMT, it is reasonable to suspect that the active site might be structured differently when a single substrate molecule is bound. One likely possibility is a tighter interaction with the 70s loop "gate", which is notably displaced in the available TfPavNMT structure relative to CjCNMT.

In sum, the substrate-binding interactions presently described in the literature vary quite a bit despite substantial conservation of active site residues and domain structure. Given that crystal structures are only available for one representative of each BIA NMT subtype (e.g., TNMT, CNMT, RNMT/PavNMT), it is not presently clear whether the details of substrate binding are conserved across functionally analogous enzymes from different species (e.g., TfCNMT, CjCNMT, PsCNMT, GfNMT1). However, comparably low levels of sequence conservation (i.e., 45–80% identity) make it reasonable to suspect that the specific interactions might be variable. Accordingly, attempts to generalize from available structural information may be misleading, and additional studies are warranted.

### 70s Loop "Gate"

The 70s loop or active site "gate" mentioned above is a particularly intriguing feature that is not well understood. This region of the polypeptide undergoes a transition from disordered to ordered form (including structuring of helix α4) upon SAM and BIA binding, yet no direct interactions are made with the substrate (Torres et al., 2016; Bennett et al., 2018). Nevertheless, patterns of sequence conservation within BIA NMT subtypes (e.g., CNMTs *versus* TNMTs) strongly suggest that the gate contributes to functional differences (**Supplementary Figure 6**). To date, no mutagenesis studies have examined the functional consequences of these variable residues.

On the other hand, two residues adjacent to the 70s loop that show perfect conservation among the BIA NMTs have been mutagenized with dramatic outcomes. Unexpectedly, replacement of Glu80 in TfPavNMT with alanine resulted in a substantial increase in activity, which was especially notable for the non-endogenous substrate (*R,S*)-tetrahydropapaverine (Torres et al., 2016). In GfTNMT, mutagenesis of the equivalent Glu82 led to a decrease in all activities. However, the effect was comparable in the sense that the mutant enzyme's substrate preference was shifted in favor of scoulerine, which is not considered to be an endogenous substrate for TNMT *in planta*. Examination of crystal structures shows that Glu80/Glu82 hydrogen bonds with an adjacent helix and suggests that the residue might contribute to substrate selectivity by anchoring the 70s loop. The second adjacent residue, Tyr79/Tyr81, is less consistently positioned across BIA NMTs, and mutagenesis also had variable effects depending on the enzyme–substrate pair. In CjCNMT and GfTNMT, the tyrosyl side chain points into the active site and might directly interact with the BIA amino group or benzyl moiety functional groups (**Figure 4**). GfTNMT mutants in which Tyr81 was replaced with alanine, phenylalanine, or arginine all showed substantially reduced activities, indicating that a rather specific interaction (likely involving a hydrogen bond) takes place in the wildtype enzyme. In TfPavNMT, the Tyr79 side chain is rotated approximately 45° with reference to the other BIA NMT structures and does not directly interact with the substrate. Interestingly, replacement of this residue with alanine almost entirely abolished activity with a potential endogenous substrate ((*S*)-reticuline), but activity with the non-endogenous (*R,S*)-tetrahydropapaverine was greatly enhanced. Together, these pieces of evidence suggest that Tyr79/Tyr81 also contributes to selectivity for particular substrates.
Thus, it is clear that the 70s loop "gate" and adjacent residues have important consequences regarding BIA NMT activity. While it appears that the outcome of these contributions is substrate selectivity, the precise mechanisms by which this occurs remain to be elucidated.

### Catalysis

The mechanisms of catalysis in BIA NMTs are not yet entirely resolved, but appear similar in some ways to those described above for BIA OMTs (Torres et al., 2016; Bennett et al., 2018; Lang et al., 2019). Perfectly conserved glutamic acid and histidine residues are implicated in the methyl transfer reaction, although their contributions may vary from one enzyme–substrate pair to the next. The histidine residue is proposed to act as a general base, which deprotonates the substrate nitrogen and thus activates it to carry out nucleophilic attack on the labile methyl group of SAM (Bennett et al., 2018). In support of an important role for the histidine residue, substitution of His206 in TfPavNMT and His208 in CjCNMT with alanine resulted in mutant enzymes with sharply reduced activities. Nevertheless, activity was not entirely abolished, and so it appears that, unlike in BIA OMTs, deprotonation by a general base mechanism is not strictly necessary for catalysis.

Several proposals have been put forth concerning the function of the adjacent glutamic acid residue. In analogy to the catalytic dyad of BIA OMTs (i.e., histidine and aspartic acid), the glutamic acid carboxyl group might hydrogen bond with the histidine imidazole moiety and thus promote the basicity required for substrate deprotonation. However, given that mutagenesis of Glu207 in CjCNMT had only a minor effect on activity (~35% decrease), it was proposed that the backbone carbonyl of another highly conserved residue (Thr261 in CjCNMT) might be the catalytic partner. Another possible mechanism for the glutamic acid is hydrogen bonding with the BIA nitrogen atom to improve its position and orientation relative to the incoming methyl group. Similarly, the glutamic acid might hydrogen bond with the methyl group, itself, in a manner that stabilizes the reaction intermediate, as has been proposed for rat glycine NMT on the basis of computational modeling (Świderek et al., 2018). Aside from hydrogen bond-mediated interactions, it has also been proposed that the reaction intermediate could be stabilized by electrostatic interactions with the negatively charged glutamic acid side chain. Experimental results, in which double mutants (e.g., TfPavNMT-Glu205Ala-His206Ala) show an additive effect, are consistent with any of these alternative models in which the key active site residues make independent contributions to catalysis.
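
One common way to make "independent contributions" quantitative is a double-mutant cycle, in which relative activities are converted to apparent changes in activation free energy and the double mutant is compared with the sum of the single mutants. The sketch below illustrates that bookkeeping only. It assumes that relative activity approximates a ratio of rate constants, that the temperature is 298.15 K, and that additivity is meant in the free-energy sense; the source does not state which sense was intended. The activity values and the names `ddg_activation` and `coupling_energy` are hypothetical and are not data from the cited studies.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1
T = 298.15         # assumed temperature in K


def ddg_activation(relative_activity):
    """Apparent change in activation free energy (kcal/mol) for a mutant
    retaining the given fraction of wild-type activity."""
    if relative_activity <= 0:
        raise ValueError("relative activity must be positive")
    return -R_KCAL * T * math.log(relative_activity)


def coupling_energy(single_1, single_2, double):
    """Double-mutant-cycle coupling energy (kcal/mol).
    A value near zero is what independent contributions would give."""
    return ddg_activation(double) - (
        ddg_activation(single_1) + ddg_activation(single_2)
    )


# Hypothetical relative activities (fraction of wild type), illustration only.
mut_a, mut_b = 0.10, 0.20
independent_double = mut_a * mut_b  # expected double-mutant activity if independent

print(round(ddg_activation(independent_double), 2))
print(round(coupling_energy(mut_a, mut_b, independent_double), 6))
```

Under these assumptions, a coupling energy that departs clearly from zero would indicate that the two residues act cooperatively rather than independently.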

Notably, a general base mechanism may not be applicable to catalysis of all substrates accepted by BIA NMTs. For example, the tertiary amine in GfTNMT substrate stylopine has a calculated pKa of approximately 5.3, which indicates that it would be almost entirely deprotonated under physiological pH conditions, and thus, proton abstraction would be unnecessary to allow methyl transfer (www.chemicalize.org). Mutagenesis of His208 and Glu207 in GfTNMT had effects comparable to those seen in other BIA NMTs (for which substrates almost certainly require deprotonation), indicating that these residues are important for catalysis even when a general base mechanism cannot be invoked.
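
The statement that stylopine would be almost entirely deprotonated follows from the Henderson–Hasselbalch relationship. As a minimal sketch, the snippet below computes the fraction of the amine present as the neutral free base, using the calculated pKa of 5.3 quoted above and an assumed pH of 7.4; the pH value is our assumption for illustration and the actual pH of the relevant cellular compartment may differ.

```python
def fraction_deprotonated(pka, ph):
    """Fraction of an amine present as the neutral free base at a given pH,
    where pka refers to its protonated (conjugate acid) form."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


STYLOPINE_PKA = 5.3  # calculated value quoted in the text
ASSUMED_PH = 7.4     # assumption: a near-neutral physiological pH

print(f"{fraction_deprotonated(STYLOPINE_PKA, ASSUMED_PH):.3f}")  # 0.992
```

With these inputs roughly 99% of the substrate is already in the free base form, consistent with the argument that proton abstraction should not be required for methyl transfer to stylopine.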

A fourth residue showing intriguing patterns of conservation between BIA NMT subtypes has also been shown to interact with the substrate (Glu204 in CjCNMT) (Bennett et al., 2018). Due to the side chain's proximity to the alkaloid nitrogen atom and the substantial detrimental effect of alanine substitution on catalysis, Micklefield and colleagues proposed that Glu204 hydrogen bonds with, and helps fine-tune the position and reactivity of, the target nitrogen atom. Interestingly, alanine substitution of Glu204 in GfTNMT had a comparable effect on activity with stylopine despite this molecule's nitrogen atom being unable to form an equivalent hydrogen bond concurrent with methyl transfer. Accordingly, in GfTNMT, it appears more likely that Glu204 interacts with the transition state methyl group or contributes an electrostatic effect. Given that the equivalent residue in TfPavNMT is an alanine, it is clear that the details of this interaction are not conserved between BIA NMT functional subtypes. Intriguingly, RNMT-like enzymes, which efficiently catalyze *N*-methylation of tertiary BIAs (e.g., reticuline), have a glycine at this position (Morris and Facchini, 2016). The absence of a side chain likely alleviates steric hindrance, which would be expected to occur between the *N*-methyl group and the glutamic acid present in CNMTs. Mutational studies, as well as investigation of natural variation of this position in diverse BIA NMTs, should allow for this hypothesis to be tested.

Although the above proposals likely explain a significant portion of BIA NMT catalytic power, experimental results clearly show that no residue identified to date (Glu204, Glu207, His208, and equivalents) is strictly necessary for methyl transfer to occur. This is in notable contrast to mutagenesis studies on BIA OMTs in which substitution of the general base histidine precludes all activity (Park et al., 2018; Cabry et al., 2019). In fact, even the simultaneous replacement of up to three putative catalytic residues with alanine produced mutant TfPavNMT and GfTNMT enzymes with detectable activities (Torres et al., 2016; Lang et al., 2019). While it remains possible that a general base mechanism is at play in certain reactions, the aforementioned work strongly suggests that rate enhancement in BIA NMTs results from multiple modest contributions. Mechanisms consistent with this interpretation include electrostatic or hydrogen bond-mediated stabilization of the reaction intermediate, and those which invoke "compression, proximity, orientation and desolvation" effects (Zubieta et al., 2003). While the identification and mutagenesis of additional catalytic residues may assist in resolving this question, biophysical modeling may be more likely to provide a conclusive answer.

# EVOLUTION OF DIVERSITY IN BIA METHYLTRANSFERASES

Compared to our detailed understanding of the evolutionary trajectories of MT activities implicated in caffeine biosynthesis, relatively little equivalent knowledge is presently available concerning the BIA MTs. Nevertheless, the recent rapid increase in availability of sequence, structure, and function information for these important enzymes foreshadows a commensurate leap forward in our understanding. Given the substantially more complex pathways involved in BIA biosynthesis, understanding the evolution of even a subset of the enzymes will greatly enhance our conceptual grasp of how the tremendous chemodiversity in plants originates.

# *O*-Methyltransferases

The importance of *O*-methylation in various aspects of plant metabolism (e.g., lignin, phenylpropanoid, phytoalexin, and phytohormone biosynthesis) makes it unsurprising that the classification and evolutionary relatedness of plant OMTs have received substantial attention. Setting aside the highly sequence-divergent SABATH MTs, the remainder of plant OMTs are generally understood to fall within two major groups (Lam et al., 2007). The first (class I) contains mostly enzymes that target hydroxycinnamoyl-CoA esters and participate in lignin biosynthesis, as well as carboxylic acid OMTs involved in plant hormone and scent metabolism, and the second (class II) contains a much more variable set of enzymes including those involved in phenylpropanoid and alkaloid biosynthesis. The first group is thought to have evolved as a result of pressures relating to the colonization of terrestrial habitats, whereas the second group likely arose later in response to a more diverse set of evolutionary forces. Although a focused and up-to-date phylogenetic study of plant OMTs is overdue, basic analyses reported in conjunction with the isolation of new alkaloid OMTs have consistently placed the BIA OMTs within class II (Morishige et al., 2002; Morishige et al., 2010; Nomura and Kutchan, 2010; Dang and Facchini, 2012; Salim et al., 2018). While a complete examination of known plant OMTs is outside the scope of this review, we performed a limited analysis of representative plant OMTs, BIA OMTs, and other recently reported alkaloid OMTs to place them in the context of the phylogeny reported by Dayanandan and colleagues (**Figure 5**) (Lam et al., 2007). Consistent with the previous work, our results place all cloned BIA OMTs within a well-supported clade corresponding to class II. We recovered four subclades with topology and bootstrap support similar to those reported previously. 
A large majority of BIA OMTs fell within subclade II-D, which was sister to subclade II-C in which most other alkaloid OMTs and flavonoid OMTs clustered. The functionally validated BIA OMTs in subclade II-D appear to be monophyletic, and most prefer substrates of the simple 1-BIA structure (**Supplementary Table 1**). In addition, the heterodimer-forming OMTs participating in noscapine biosynthesis are within this clade, with the two functionally interchangeable monomers (PsOMT3 and Ps6OMT) clustering together. A single putative BIA OMT, for which preliminary reports indicate a unique ability to methylate the free hydroxyl group of 10-hydroxydihydrosanguinarine, falls within subclade II-C in a position distant from the other BIA OMTs and apparently more closely related to other alkaloid OMTs (Purwanto, 2017). Many of the remaining BIA OMTs belong to the very well-supported subclade II-A, which also included hydroxycinnamic acid OMTs involved in phenylpropanoid biosynthesis. All of the BIA OMTs in this subclade are functionally characterized as preferring or accepting protoberberine substrates. The smallest and least well-supported subclade (II-B) contained a few remaining BIA OMTs that mostly target the 7-hydroxyl of simple 1-BIAs, along with OMTs involved in flavonoid and methoxypyrazine biosynthesis.

As suggested in previous publications, the presence of several well-separated clades makes it reasonable to suspect that BIA OMTs have evolved repeatedly in plants. However, it should be noted that the various clades of BIA OMTs show distinct substrate- and regio-specificity and thus cannot be said to have converged on precisely the same function (**Figure 5**). Although the most commonly accepted explanation at present is repeated evolution, it remains possible that an ancestral OMT was broadly promiscuous and had activity with BIAs, which was lost in most non-BIA OMT lineages. While this may be less likely given the multitude of non-BIA plant OMTs in existence, certain pieces of evidence suggest that it is a question worth exploring. Phenylpropanoid OMTs cloned from *T. tuberosum* were shown to also accept benzylisoquinoline substrates and, similarly, enzymes implicated in the biosynthesis of ipecac alkaloids in *Psychotria ipecacuanha* catalyze methylation of

FIGURE 5 | Rooted neighbor-joining phylogenetic tree of BIA *O*-methyltransferases and representative plant *O*-methyltransferases labeled with reported *in vitro* substrate range. Frequencies shown at each node represent the percentage of 500 bootstrapped replicate trees in which the associated taxa clustered together. The consensus tree is shown without branch lengths. The analysis involved 77 sequences, and positions with less than 50% coverage were discarded, resulting in a final dataset with 353 positions. The tree was rooted with sequences from bacteria and animals. The analysis was conducted in MEGA7 (Kumar et al., 2016). Representative plant OMT sequences and clade assignments were adapted from Lam et al. (2007). Plant OMTs implicated in the biosynthesis of alkaloids other than BIAs are indicated by white boxes, whereas those that nevertheless accept BIA substrates are indicated with *asterisks*. BIA OMTs are labeled with colored circles representing their reported *in vitro* activities, in order from strongest to weakest (red, scoulerine-*O*-methylation; blue, 4′-*O*-methylation; green, 6-*O*-methylation; yellow, 7-*O*-methylation; black, 3′-*O*-methylation; white, other). Clades with recognizable majority activities are shaded with rectangles in the corresponding color. Genbank accession numbers for BIA OMTs are provided in Supplementary Table 1. Other alkaloid OMT accession numbers are: NpsN4′OMT (KJ584561), IpeOMT1 (AB527082), IpeOMT2 (AB527083), IpeOMT3 (AB527084), Ca10OMT (MG996006), Cro16OMT (EF444544), TibN10OMT (MH454075), VviOMT1 (KC533529), VviOMT2 (KC533535), VviOMT3 (KC517470), and VviOMT4 (KC517475).

BIAs with surprising efficiency (Frick and Kutchan, 1999; Nomura and Kutchan, 2010). In fact, the highly divergent and distantly related rat liver catechol OMT is also known to accept BIA substrates (Meyerson et al., 1979). Conversely, none of the BIA OMTs assayed with phenylpropanoid substrates have been shown to accept them (Morishige et al., 2002; Ounaroon et al., 2003; Pienkny et al., 2009), perhaps indicating that such a function is the more derived character. It is intriguing to consider that our present understanding of most plant OMTs as relatively specialized for one class of substrate or another might result in part from the practical challenges of assaying for many diverse activities when a new enzyme is discovered. Resolving this important question will require that, going forward, workers begin to routinely examine plant OMTs with respect to a wider range of potential and even physiologically unlikely substrates.

Superimposition of reported *in vitro* activities over apparent phylogeny reveals conservation of function in some BIA OMT clades but not others (**Figure 5**). The existence of one well-separated subclade containing only enzymes that preferentially methylate protoberberines such as (*S*)-scoulerine suggests that most SOMTs share a monophyletic origin. In all but one case, members of this subclade have also been shown to methylate simple 1-BIA substrates to a lesser degree. While this might be interpreted as maintenance of an ancestral enzyme feature, structural comparison of the flexible 1-BIAs to the rather rigid protoberberines suggests that the former can readily adopt a conformation mimicking the latter, thus potentially explaining the functional overlap from a strictly structural point of view. It is interesting to note that the hydroxyl groups methylated by SOMT-like enzymes in simple 1-BIAs (i.e., 7, 3′) correspond to those methylated in protoberberines (i.e., 2, 9) (**Supplementary Figure 4**). Enzymes with SOMT-like activity are also present in other clades. In particular, the *C. japonica* enzyme shown to prefer columbamine over scoulerine may have evolved independently. Clades with majority preferences for 6-, 7-, or 4′-*O*-methylation of simple 1-BIAs show substantially more functional diversity between members. Although clades showing a preference for 6- or 7-*O*-methylation are recognizable, most enzymes catalyze both reactions. As mentioned above with reference to the Tf6OMT structure, only a minor adjustment of binding angle is necessary to position either the 6- or 7-hydroxyl in a productive alignment for methyl transfer (**Figure 3**). Despite their present functional overlap, the distinct clades suggest that 6/7-*O*-methylation evolved in two independent lineages. Like the SOMTs, enzymes catalyzing 4′-*O*-methylation primarily fall within a single clade indicative of monophyletic origin. 
Weak 4′ OMT activities also reported for one 6OMT and one 7OMT enzyme may reflect the ability of a simple 1-BIA to bind in a "flipped" orientation as hypothesized in the structural section above. Notably, a clade corresponding primarily to 3′-*O*-methylation is not evident and such enzymes are present in most clades. Although the above evidence suggests that specialization for various BIA OMT activities occurred in several independent lineages, the sporadic occurrence of corresponding but weaker activities in other clades suggests the possibility that an ancestral BIA OMT was highly promiscuous and perhaps able to catalyze the full range of methylations with lower efficiency. Resurrection and functional characterization of ancestral BIA OMTs should help test this hypothesis in the near future.

# *N*-Methyltransferases

*N*-methyltransferases in plants have received less attention overall, and evolutionary relationships are still unclear. It is presently thought that they are polyphyletic in origin and, in fact, that several distinct NMTs may have existed in the last universal common ancestor of all extant life (Anantharaman et al., 2002). Other than the BIA NMTs, major small molecule NMT families in plants include the putrescine NMTs, phosphoethanolamine NMTs, xanthine (SABATH) NMTs, and tocopherol C-methyltransferase-like NMTs (Hibi, 1994; Kato et al., 2000; Nuccio et al., 2000; Liscombe et al., 2010). Given that a comprehensive classification and phylogenetic analysis of plant NMTs is far outside the scope of this review, only the BIA NMT-like enzymes will be considered below.

Phylogenetic analyses carried out in conjunction with the isolation of new BIA NMTs have generally agreed upon the existence of several clades roughly corresponding to subtypes (i.e., CNMT-, RNMT-, or TNMT-like; **Figure 6**) (Liscombe and Facchini, 2007; Liscombe et al., 2009; Morris and Facchini, 2016). However, the detailed topology of BIA NMT gene trees has varied substantially from one report to the next and all should be interpreted with caution. Recently, an analysis of

in positions reflecting their *in vitro* activities' correspondence to one or more of the three major canonical BIA NMT activities as reported in the literature (CNMT, coclaurine *N*-methyltransferase; RNMT, reticuline *N*-methyltransferase; TNMT, tetrahydroprotoberberine *N*-methyltransferase). A representative substrate is shown for each group [CNMT, (*S*)-Coclaurine; TNMT, (*S*)-Scoulerine; RNMT, (*S*)-Reticuline]. Genbank accession numbers and activity details are provided in Supplementary Table 1.

more than 90 putative BIA NMTs from Ranunculales recovered four well-supported clades, three of which correspond to the known BIA NMTs and were experimentally validated as loosely predicting function (Hagel et al., 2015). Superimposition of reported *in vitro* activities over a BIA NMT phylogeny clarifies this idea (**Figure 7**). In particular, a well-supported monophyletic group of enzymes shown to almost exclusively catalyze *N*-methylation of protoberberine substrates is evident. However, enzymes with weak TNMT-like activities are present in other clades and may have evolved this function independently. Enzymes primarily accepting BIAs with tertiary nitrogen atoms other than protoberberines (i.e., RNMT-like)

FIGURE 7 | Rooted neighbor-joining phylogenetic tree of BIA *N*-methyltransferases labeled with reported *in vitro* substrate range. The optimal tree is drawn to scale with branch length in units of substitutions per site. Frequencies shown at each node represent the percentage of 500 bootstrapped replicate trees in which the associated taxa clustered together. The analysis involved 22 sequences, and positions with less than 50% coverage were discarded, resulting in a final dataset with 358 positions. The tree was rooted with distantly related plant putrescine and phosphoethanolamine *N*-methyltransferase sequences. The analysis was conducted in MEGA7 (Kumar et al., 2016). BIA NMTs are labeled with colored circles representing their reported *in vitro* activities, in order from strongest to weakest (green, protoberberine *N*-methylation; blue, 2° 1-BIA *N*-methylation; red, 3° 1-BIA *N*-methylation; purple, pavinan *N*-methylation; yellow, aporphine *N*-methylation; black, phthalideisoquinoline *N*-methylation; white, *N*-methylation of other alkaloids including isoquinolines). Clades with recognizable majority activities are shaded with rectangles in the corresponding color. The BIA NMT-like *E. sinica* phenylalkylamine NMT (EsPaNMT) implicated in ephedrine biosynthesis is indicated with a white box. Genbank accession numbers for BIA NMTs are provided in Supplementary Table 1. Accession numbers for other sequences are: *Datura stramonium* putrescine NMT (DaPNMT; CAE47481); *Nicotiana sylvestris* putrescine NMT (NsPNMT; BAA74544); *Arabidopsis thaliana* phosphoethanolamine NMT (AtPEANMT; NP\_188427); *Spinacia oleracea* phosphoethanolamine NMT (SoPEANMT; Q9M571); *Coccomyxa subellipsoidea* NMT (CsubNMT; XP\_005645141); *Chlamydomonas reinhardtii* NMT (CreiNMT; XP\_001695187).

also form a single clade. Notably, members of this clade accept BIAs with a broad range of carbon skeletons, which includes aporphines, pavinans, and phthalideisoquinolines. On the other hand, the cluster of CNMT-like enzymes that preferentially target 1-BIA substrates with secondary nitrogen atoms is reported to have a more restricted substrate range. Although not evident in the phylogenetic analysis presented here, other reports have consistently indicated that CNMT enzymes are more ancestral. Later evolution of the RNMTs and TNMTs is consistent with the cumulative hypothesis, in which enzymes operating further downstream in biosynthetic pathways are recruited later (Granick, 1957). Conclusive statements regarding the evolutionary history of this enzyme family await careful phylogenetic study, ideally including sequences obtained from many species beyond those typically used as model systems for BIA biosynthesis and supported by functional characterization of resurrected ancestral enzymes.
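The figure legends describe two preprocessing steps that are easy to misread: alignment positions with less than 50% coverage were discarded, and node support is the percentage of 500 bootstrap replicates in which a grouping recurs. The following is a minimal sketch of the first step and of how one bootstrap pseudo-alignment is drawn. It is not the MEGA7 implementation; the toy sequences, function names, and gap characters are assumptions made for illustration, and tree inference on each replicate is deliberately left out.

```python
import random


def filter_by_coverage(alignment, min_coverage=0.5, gap_chars="-?"):
    """Keep only columns in which at least `min_coverage` of sequences have a residue."""
    n_seqs = len(alignment)
    length = len(next(iter(alignment.values())))
    keep = [
        i
        for i in range(length)
        if sum(seq[i] not in gap_chars for seq in alignment.values()) / n_seqs
        >= min_coverage
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement to build one pseudo-alignment."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


# Hypothetical toy alignment; the third column has only 25% coverage.
toy = {
    "seqA": "MK-LV",
    "seqB": "MR-LV",
    "seqC": "M--LI",
    "seqD": "MKALV",
}

filtered = filter_by_coverage(toy)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_replicate(filtered, rng) for _ in range(500)]
print(filtered)
```

In a full analysis, a tree would be inferred from each of the 500 replicates, and the support value at a node would be the percentage of those trees that contain the same grouping of taxa.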

Intriguingly, BLAST searches of publicly available nucleotide sequence databases (NCBI NR, OneKP) reveal that transcripts encoding BIA NMT-like proteins (40–70% amino acid identity; **Supplementary Data Sheet 1**) are present in a wide range of flowering plants as well as algae, mosses, gymnosperms, and gnetophytes (Matasci et al., 2014). To the best of our knowledge, all but one of the cloned and functionally characterized members of this large NMT family belong to the Ranunculales order and are implicated in BIA biosynthesis. Given their apparently ancient origin and widespread occurrence, including in species not known to produce alkaloids of any sort, the functional significance and maintenance of BIA NMT-like genes through many millions of years of plant evolution is a fascinating mystery.
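A survey of this kind can be approximated by filtering BLAST hits on percent identity. The sketch below assumes the hits were exported in the standard 12-column tabular format (`-outfmt 6`) and keeps those within the 40–70% identity window quoted above. The hit names, the example rows, and the minimum alignment length are hypothetical, and this is not the procedure used to build the supplementary data sheet.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, min_identity=40.0, max_identity=70.0, min_length=200):
    """Return (subject, identity, length) for hits inside the identity window.

    `min_length` is an illustrative assumption used to drop short local matches.
    """
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    hits = []
    for row in reader:
        pident = float(row["pident"])
        length = int(row["length"])
        if min_identity <= pident <= max_identity and length >= min_length:
            hits.append((row["sseqid"], pident, length))
    return hits


# Hypothetical hits for illustration only.
rows = [
    ("query_NMT", "hit_moss", "45.1", "330", "181", "0", "10", "339", "1", "330", "1e-90", "280"),
    ("query_NMT", "hit_short", "55.0", "80", "36", "0", "1", "80", "1", "80", "1e-10", "60"),
    ("query_NMT", "hit_close", "92.3", "350", "27", "0", "1", "350", "1", "350", "0.0", "650"),
]
example = "\n".join("\t".join(r) for r in rows)

hits = filter_hits(io.StringIO(example))
print(hits)
```

Only the first hypothetical hit passes: the second is too short and the third is above the identity window, as would be expected for a close ortholog rather than a distant NMT-like homolog.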

Several pieces of evidence point to the possibility of a relatively ancient origin for BIA biosynthesis, which likely included the activity of NMTs. The sporadic but widespread distribution of BIA biosynthesis in eudicots, along with detection of the "gateway" norcoclaurine synthase (NCS) activity in a broad range of plants, supports a proposal that the evolution of BIA biosynthesis may have a monophyletic history in angiosperms (Liscombe et al., 2005). In fact, the occurrence of several BIAs in *Gnetum* species (e.g., 8-benzylberbine), and of NCS activity in *Ephedra distachya*, further suggests that the evolutionary origin of BIA biosynthesis may have been at least as ancient as the divergence of the Gnetophytes (Xu and Lin, 1999; Rochfort et al., 2005; Martin et al., 2011). Although no BIA biosynthetic studies have been completed for these plants in particular, the production of similar BIAs in other species is known to require a CNMT (Hagel and Facchini, 2013). Interestingly, a recent investigation of *Ephedra sinica* identified a BIA NMT-like enzyme (Phenylalkylamine NMT; EsPaNMT) implicated in the biosynthesis of ephedrine, which also has promiscuous activity on several other alkaloids including 1-BIAs (Morris et al., 2018). This observation supports the notion that the ancestor to all extant BIA NMT-like enzymes may have had a very broad range of activities, which was refined and subfunctionalized in certain lineages where BIA biosynthesis provided a selective advantage. Of course, an alternative hypothesis is that EsPaNMT and the BIA NMTs were recruited independently from a functionally distinct ancestral lineage and simply converged on BIA NMT function. In any case, maintenance of BIA NMT-like genes in plants over evolutionary timescales implies that they must function in some useful role. 
It will be interesting to discover in the coming years whether the annotation of this family as BIA NMTs is simply a historical accident or an accurate representation of their broader roles in plants.

# FORCES AND MECHANISMS SHAPING BIA METHYLTRANSFERASE EVOLUTION

As for most specialized metabolites, the forces driving the evolution and maintenance of BIA biosynthesis are not yet fully understood (Weng, 2014). Generally, BIAs are assumed to provide defensive advantages *via* antiherbivore and antimicrobial properties (Hagel and Facchini, 2013). In the case of cultivated BIA-producing varieties, such as *P. somniferum*, artificial selection for the presence of psychoactive morphinans may also have played a minor role in recent times. However, potent biological activities are only firmly demonstrated for a small fraction of BIAs (e.g., berberine, sanguinarine, magnoflorine, morphine), and thus, straightforward adaptive evolution does not comfortably explain the tremendous chemical diversity observed across BIA-producing species and individuals. When attempting to justify the presence of apparently useless, yet metabolically costly, BIAs in a given plant, it is important to appreciate that the biochemical snapshot we obtain in the present day results from a complex evolutionary history spanning innumerable shifts in herbivore and pathogen challenges. Accordingly, biosynthesis of some functionless BIAs may have resulted from pressures no longer present in the environment. Alternatively, it has also been proposed that BIA metabolic diversity may be a useful trait in its own right (Facchini et al., 2004). That is, the production of a large and dynamic repertoire of potential defense molecules, which varies from individual to individual, may represent a form of "diversified bet hedging", which can ensure the survival of at least some members when a lineage is suddenly faced with novel challenges. The canonical example of this type of coping strategy is seed germination timing, but variation in BIA profile could also conceivably fit the theoretical criteria (i.e., improved long-term evolutionary success despite reduced mean fitness, *via* a reduction in detrimental temporal fitness variance) (Childs et al., 2010). 
As reviewed in the preceding sections, the BIA MTs show substantial promiscuity and, perhaps more than any other class of enzyme, greatly expand and diversify the pool of BIAs that are produced.
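The theoretical criterion for bet hedging mentioned above (higher long-term success despite lower mean fitness, achieved by reducing temporal variance) can be illustrated with a small numerical example. Long-term growth of a lineage across generations tracks the geometric mean of per-generation fitness, which penalizes variance. The fitness values below are invented for illustration and do not come from any BIA study.

```python
import math
import statistics


def geometric_mean(values):
    """Geometric mean, the quantity that tracks multiplicative growth over generations."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-generation fitness in alternating "good" and "bad" years.
specialist = [2.0, 0.1]  # thrives under familiar challenges, collapses under novel ones
hedger = [1.2, 0.8]      # pays a cost in good years, buffered in bad years

for name, fitness in (("specialist", specialist), ("hedger", hedger)):
    print(
        name,
        "arithmetic mean = %.3f" % statistics.mean(fitness),
        "variance = %.3f" % statistics.pvariance(fitness),
        "geometric mean = %.3f" % geometric_mean(fitness),
    )
```

Under these assumed values the hedger has the lower arithmetic mean fitness (1.0 versus 1.05) but a much higher geometric mean (about 0.98 versus about 0.45), which is the signature of a bet-hedging strategy.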

Duplication followed by sub- or neo-functionalization is thought to be a crucial mechanism underlying the diversification of most eukaryotic gene families, and this also applies to the BIA MTs (Taylor and Raes, 2004). In *P. somniferum*, Graham and colleagues identified a genomic region on which many genes required for noscapine biosynthesis were clustered, including three BIA OMT genes corresponding to PsSOMT2, PsSOMT3, and PsSOMT1, various other BIA pathway genes and many transposable elements (Winzer et al., 2012). Based on sequence homology and intron–exon structure, these OMT genes were suggested to have arisen *via* tandem gene duplication, potentially followed by transposon-mediated cluster rearrangement. More recently, a *P. somniferum* whole genome was reported, which provides further support for a history of MT gene duplication (Guo et al., 2018). The authors described a relatively recent whole genome duplication as well as more ancient segmental duplications likely to have resulted in new BIA MT gene copies. Aside from MTs in the noscapine cluster reported previously, at least seven additional MT genes are present in their assembly (NCBI BioProject PRJNA435796). Notably, two copies of genes encoding Ps6OMT tightly linked to PsCYP80B1 (*N*-methylcoclaurine 3′-hydroxylase) exist on two separate contigs, suggesting the occurrence of either dispersed duplication or tandem duplication followed by genomic rearrangement. In addition, two copies of genes encoding PsN7OMT are identifiable. Given that both 6OMT and N7OMT are necessary for the biosynthesis of papaverine (**Figure 2**), which is particularly abundant in *P. somniferum*, it appears that increasing gene dosage is one important mechanism enhancing the contribution of key MTs to BIA biosynthesis. In contrast to the OMTs, only single functional copies of genes encoding PsCNMT, PsTNMT, and PsRNMT are evident in the genome. 
Although not tightly linked, *CNMT* and *TNMT* are located in the same region (roughly 10 Mbp apart) of one chromosome. Aside from sub- and neofunctionalization, duplicated genes may often become inactive or pseudogenized. For example, three pseudogene copies of *TNMT*, within ~30 kb of each other, are reportedly linked to the noscapine cluster (Winzer et al., 2012). In addition, examination of the *P. somniferum* genome suggests that such a fate is quite common for duplicated BIA MTs. BLAST searches of the published assembly reveal many putative pseudogenes, corresponding in particular to *4*′*OMT2*, *TNMT*, and *RNMT*. These are generally found in tight clusters indicative of tandem duplication. Although outside the scope of this review, it is clear that linkage of MT genes with those encoding upstream and downstream enzymes is an important contributor to the biosynthesis of BIAs. In addition to making co-inheritance of a useful group of alleles more likely, clustering probably facilitates coordinated gene expression *via* chromatin remodeling. Given the substantial amount of clustering evident in the *P. somniferum* genome, it will be interesting to discover whether similar structures exist in other BIA-producing species and, if so, whether clustering is an ancestral feature or yet another example of convergence.
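One simple way to flag candidate tandem arrays of the kind described above is to group annotated gene copies that lie on the same contig within a chosen distance of one another. The sketch below does this using gene start coordinates only. The gene names, contig names, and coordinates are hypothetical, and the 30 kb gap is an assumption borrowed from the spacing quoted for the *TNMT* pseudogenes, not a threshold used in the cited genome studies.

```python
from itertools import groupby


def find_clusters(genes, max_gap=30_000, min_size=2):
    """Group genes on the same contig whose consecutive starts are within `max_gap` bp.

    `genes` is a list of (name, contig, start) tuples.
    """
    clusters = []
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    for _, group in groupby(ordered, key=lambda g: g[1]):
        current = []
        for gene in group:
            # Close the running cluster when the next gene is too far away.
            if current and gene[2] - current[-1][2] > max_gap:
                if len(current) >= min_size:
                    clusters.append(current)
                current = []
            current.append(gene)
        if len(current) >= min_size:
            clusters.append(current)
    return clusters


# Hypothetical annotations for illustration only.
genes = [
    ("TNMT_ps1", "contigA", 100_000),
    ("TNMT_ps2", "contigA", 112_000),
    ("TNMT_ps3", "contigA", 125_000),
    ("CNMT", "contigA", 10_500_000),
    ("N7OMT_a", "contigB", 50_000),
    ("N7OMT_b", "contigC", 75_000),
]

clusters = find_clusters(genes)
print([[name for name, _, _ in cluster] for cluster in clusters])
```

With these assumed coordinates, the three closely spaced copies form one cluster, while the distant gene on the same contig and the copies on separate contigs are left ungrouped, mirroring the distinction drawn above between tandem and dispersed duplicates.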

# FUTURE DIRECTIONS

In the preceding sections, we reviewed the wealth of information presently available concerning BIA MT structure, function, and relationship with host plant chemodiversity. Although many important insights have been obtained in recent years, much remains to be done if the sea of information is to yield more widely applicable knowledge and conceptual understanding useful to the field of plant biochemistry as a whole.

In spite of the fact that the role of MTs in the central BIA pathway (i.e., leading to core intermediate (*S*)-reticuline) is firmly established in model species such as *P. somniferum* and other members of Ranunculales, it would be worthwhile to verify that this knowledge is applicable in more distantly related BIA-producing plants such as the Piperales, Cornales, Laurales, Sapindales, and Proteales (Shulgin and Perry, 2002; Liscombe et al., 2005). Similarly, the role of MTs in the many "unusual" BIA branch pathways has not been investigated. This includes pathways biosynthesizing the rhoeadine alkaloids present in *P. rhoeas* (Rönsch, 1986), benzylprotoberberines (e.g., Latifolian A) occurring in *Gnetum latifolium* (Rochfort et al., 2005), hexahydrobenzophenanthridines (e.g., Corygaline A) occurring in *Corydalis bungeana* (Gao et al., 2018), as well as the dimeric and trimeric BIAs reported in many species (Schiff, 1991). Biosynthesis of aporphine alkaloids in *Nelumbo nucifera* (Proteales) has recently received some attention; however, most analyses assumed that central BIA biosynthesis is the same as in Ranunculales and, furthermore, that *N. nucifera* OMT and NMT homologs catalyze the same reactions as reported in other species (Menéndez-Perdomo and Facchini, 2018). Going forward, it would be valuable to carry out in these species the same types of studies as were used to firmly establish the routes of biosynthesis in Ranunculales (e.g., labeled tracer feeding, detection of intermediates, and activities). Furthermore, it is crucial to experimentally validate the function of putative BIA MTs when they are discovered. Whereas heterologous expression and *in vitro* assays can readily be applied to proteins originating from these other species, *in planta* approaches (e.g., virus-induced gene silencing, CRISPR-mediated knockout) have been used primarily in *P. somniferum*, and substantial method development may be required before these tools can be brought to bear in a wider context. Taken together, the above experiments would shed light on the question of whether BIA biosynthesis truly is monophyletic and as widely conserved as generally assumed or whether different enzymes and pathways have evolved to converge on BIA biosynthesis by different means.

Despite recent publications that have improved the situation, structures of BIA OMTs accepting a full range of BIA scaffolds (e.g., phthalideisoquinolines) or displaying alternate regiospecificity (e.g., 4′OMT) are missing from the literature. Comparison of these with existing structures would suggest how OMTs discriminate between highly similar molecules, and targeted mutagenesis would then allow for experimental validation of these hypotheses. Alternatively, a focused analysis of multiple functionally analogous BIA OMTs from distinct phylogenetic lineages (e.g., Tf6OMT, Ct6/7OMT, Cc6OMT1) would reveal to what extent the mechanisms of substrate binding and catalysis are conserved. Although these types of comparative studies have been limited by the recalcitrance of plant enzymes to crystallization, modern protein engineering methods such as Surface Entropy Reduction can help overcome these challenges (Cooper et al., 2007; Cabry et al., 2019). From a biotechnological standpoint, crystal structures or docking studies with a wider range of inhibitor molecules (e.g., pathway intermediates or end products) would be valuable in pointing the way to engineering feedback-insensitive variants desirable for industrial applications. Although BIA NMTs are relatively well covered in terms of available structures, certain features are still mysterious. Catalysis is still not fully understood, and this issue is compounded by difficulties in obtaining crystal structures with the enzyme's "true" substrate bound. One promising approach to this problem involves the use of a reactive SAM analog (*S*-adenosyl-vinthionine) to form a bisubstrate adduct *in situ*, which remains trapped in the active site of the crystallized NMT (Qu et al., 2016). Another open question, potentially explored *via* biophysical modeling, is how these enzymes' binding pockets successfully discriminate between rather similar BIAs despite seemingly forming very few specific interactions. 
Similarly, the unique *N*-terminal extension and active site "gate", which might contribute to substrate selectivity in BIA NMTs, await careful study. In these cases, domain swap or deletion experiments should yield useful information on their function.

Of particular interest is the biochemical and physiological significance of MT dimerization. Structural elucidation of both functional and non-functional heterodimers (e.g., PsSOMT2:PsSOMT3 vs. PsSOMT2:PsN7OMT) might reveal the subtly different interactions that prevent or allow catalysis on certain substrates. However, targeted mutagenesis will undoubtedly be required to verify such hypotheses. Although only one heterodimer is known to be physiologically relevant at present, this is likely to change with more study. Combinatorial expression of BIA MTs in heterologous systems containing reconstituted BIA pathways is a powerful system with which to search for such interactions. However, it will also be crucial to validate that these heterodimers form in plants and make meaningful contributions to biosynthetic capacity. Ideally, this will be done with a combination of *ex vivo* (e.g., pull down, enzyme assay) and *in vivo* (e.g., FRET, gene knockout) methods. Of course, while considering the occurrence of heterodimerization, it will be crucial to also consider higher-order interactions with other proteins and enzymes that might form BIA metabolons.

Recent interest in understanding BIA biosynthesis in a wider range of plants should soon provide a more diverse set of BIA MTs to study. Modern computational resources and algorithms should allow for robust analysis of all these sequences, resulting in reliable phylogenies clarifying their interrelationships. As DNA synthesis costs continue to decrease, resurrection of ancestral enzymes should become routine and will allow us to answer long-standing questions about MT evolution. For example, it will be fascinating to discover what functions the ancestral class II OMT may have had and what trajectories led to the extant functional diversity. Similarly, it should be possible to test the long-standing hypothesis that extant BIA NMTs diverged from a CNMT-like ancestor, and whether neo-functionalization or, rather, sub-functionalization then came into play. Along with the analysis of transcripts and encoded enzymes, genome structure will undoubtedly contribute to understanding the mechanics of MT evolution. Although the recently published *P. somniferum* genome has begun to shed such light, the significance of certain features (e.g., clustering) would be more evident if the genomes of additional *Papaver* species, more distantly related BIA producers, and closely related non-producers were available. Comparative genomics should reveal the timing of gene duplications and suggest how selection and drift contributed to the present complement of BIA MTs.

Ultimately, a complete understanding of the determinants of BIA MT function and the evolutionary trajectories that led to the formation of specific enzymes will reveal an important part of how the exquisite BIA biosynthetic pathways came to be. In combination with existing knowledge regarding caffeine biosynthesis and, eventually, with knowledge concerning the many other alkaloid pathways, these discoveries will allow us to reach satisfactory answers to the long-standing questions of how and why plant-specialized metabolism achieves such tremendous diversity.

# AUTHOR CONTRIBUTIONS

JM wrote the manuscript and created the figures. PF edited the final draft of the manuscript.

# FUNDING

JM is the recipient of a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarship. This work was funded by a Natural Sciences and Engineering Research Council of Canada Discovery Grant to PF.

# ACKNOWLEDGMENTS

We would like to thank Dr. Samuel Yeaman and Dr. Qiushi Li for their assistance with visualization and BLAST annotation of the published *Papaver somniferum* genome.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.01058/full#supplementary-material

SUPPLEMENTARY DATASHEET 1 | A selection of BIA NMT-like proteins sharing 40–70% amino acid sequence identity.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Morris and Facchini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Corrigendum: Molecular Origins of Functional Diversity in Benzylisoquinoline Alkaloid Methyltransferases

#### Approved by:

Frontiers Editorial Office, Frontiers Media SA, Switzerland

#### \*Correspondence:

Peter J. Facchini pfacchin@ucalgary.ca

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 15 October 2019 Accepted: 16 October 2019 Published: 08 November 2019

### Citation:

Morris JS and Facchini PJ (2019) Corrigendum: Molecular Origins of Functional Diversity in Benzylisoquinoline Alkaloid Methyltransferases. Front. Plant Sci. 10:1436. doi: 10.3389/fpls.2019.01436

Keywords: Benzylisoquinoline, Alkaloid, Methyltransferase, specialized metabolism, molecular evolution

## **A Corrigendum On**

*Jeremy S. Morris and Peter J. Facchini\**

Department of Biological Sciences, University of Calgary, Calgary, AB, Canada

**Molecular Origins of Functional Diversity in Benzylisoquinoline Alkaloid Methyltransferases** *By Morris JS and Facchini PJ (2019). Front. Plant Sci. 10:1058. doi: 10.3389/fpls.2019.01058*

In the original article, the author names were ordered incorrectly as "Peter J. Facchini and Jeremy S. Morris". The correct order is "Jeremy S. Morris and Peter J. Facchini."

The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way. The original article has been updated.

*Copyright © 2019 Morris and Facchini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Biosynthesis and Emission of Stress-Induced Volatile Terpenes in Roots and Leaves of Switchgrass (*Panicum virgatum* L.)

*Andrew Muchlinski1†, Xinlu Chen2†, John T. Lovell3, Tobias G. Köllner4, Kyle A. Pelot5, Philipp Zerbe5, Meredith Ruggiero1, LeMar Callaway III1, Suzanne Laliberte1, Feng Chen2\* and Dorothea Tholl1\**

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Takao Koeduka, Yamaguchi University, Japan Yasuhiro Higashi, RIKEN Yokohama, Japan*

#### *\*Correspondence:*

*Feng Chen fengc@utk.edu Dorothea Tholl tholl@vt.edu*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 21 April 2019 Accepted: 21 August 2019 Published: 19 September 2019*

#### *Citation:*

*Muchlinski A, Chen X, Lovell J, Köllner TG, Pelot KA, Zerbe P, Ruggiero M, Callaway L III, Laliberte S, Chen F and Tholl D (2019) Biosynthesis and Emission of Stress-Induced Volatile Terpenes in Roots and Leaves of Switchgrass (Panicum virgatum L.). Front. Plant Sci. 10:1144. doi: 10.3389/fpls.2019.01144*

*1 Department of Biological Sciences, Virginia Tech, Blacksburg, VA, United States, 2 Department of Plant Sciences, University of Tennessee, Knoxville, TN, United States, 3 Genome Sequencing Center, Hudson Alpha Institute for Biotechnology, Huntsville, AL, United States, 4 Department of Biochemistry, Max Planck Institute for Chemical Ecology, Jena, Germany, 5 Department of Plant Biology, University of California, Davis, Davis, CA, United States*

Switchgrass (*Panicum virgatum* L.), a perennial C4 grass, represents an important species in natural and anthropogenic grasslands of North America. Its resilience to abiotic and biotic stress has made switchgrass a preferred bioenergy crop. However, little is known about the mechanisms of resistance of switchgrass against pathogens and herbivores. Volatile compounds such as terpenes have important activities in plant direct and indirect defense. Here, we show that switchgrass leaves emit blends of monoterpenes and sesquiterpenes upon feeding by the generalist insect herbivore *Spodoptera frugiperda* (fall armyworm) and in a systemic response to the treatment of roots with defense hormones. Belowground application of methyl jasmonate also induced the release of volatile terpenes from roots. To correlate the emission of terpenes with the expression and activity of their corresponding biosynthetic genes, we identified a gene family of 44 monoterpene and sesquiterpene synthases (mono- and sesqui-TPSs) of the type-a, type-b, type-g, and type-e subfamilies, of which 32 TPSs were found to be functionally active *in vitro*. The TPS genes are distributed over the K and N subgenomes with clusters occurring on several chromosomes. Synteny analysis revealed syntenic networks for approximately 30–40% of the switchgrass TPS genes in the genomes of *Panicum hallii*, *Setaria italica*, and *Sorghum bicolor*, suggesting shared TPS ancestry in the common progenitor of these grass lineages. Eighteen switchgrass TPS genes were substantially induced upon insect and hormone treatment and the enzymatic products of nine of these genes correlated with compounds of the induced volatile blends. In accordance with the emission of volatiles, TPS gene expression was induced systemically in response to belowground treatment, whereas this response was not observed upon aboveground feeding of *S. frugiperda*. Our results demonstrate complex above and belowground responses of induced volatile terpene metabolism in switchgrass and provide a framework for more detailed investigations of the function of terpenes in stress resistance in this monocot crop.

Keywords: switchgrass, terpene synthase, volatile, herbivory, defense

# INTRODUCTION

Switchgrass (*Panicum virgatum* L., Poaceae) is a native warm-season C4 perennial grass common to natural and anthropogenic grasslands in North America. Characteristic of the Tallgrass Prairie, switchgrass is considered an important species for sustaining natural prairie biodiversity (Sanderson et al., 2006). Used mostly for forage since the 1950s, more intensive breeding of switchgrass began over 20 years ago to develop the species as an herbaceous model species for biofuel feedstock development (Casler et al., 2011). Major advantages for cultivating switchgrass are its resilience to extreme weather conditions, capability of growing on marginal soils, and a high cellulosic content (Vogel, 2004). Switchgrass also exhibits considerable resistance to pests and diseases (Parrish and Fike, 2005). With an increase in cultivation, growing interest has focused on elucidating the resistance mechanisms of switchgrass as well as engineering more resistant varieties. However, surprisingly little is still known about the modes of pathogen and pest defense in this species.

Plants deploy a biosynthetic and structurally diverse mosaic of specialized or secondary metabolites for chemical defense (Dudareva et al., 2004). Terpenes constitute the majority of such metabolites with important defensive activities. For instance, nonvolatile triterpenes are potent growth inhibitors of fungal pathogens (Osbourn, 1996). By contrast, low molecular weight 10-carbon monoterpenes and 15-carbon sesquiterpenes are emitted by plants as volatile compounds and serve important roles in direct defenses against pathogens and herbivores or function indirectly by the attraction of parasitoids or intra- and interplant priming (Turlings et al., 1990; Dicke, 1994; Kost and Heil, 2006; Köllner et al., 2008a; Huang et al., 2012; Vaughan et al., 2013; Erb et al., 2015).

The formation of terpenes in plants is catalyzed by enzymes of the terpene synthase superfamily (TPSs). TPS enzymes convert 10- and 15-carbon *cis*- or *trans*-isoprenyl diphosphates such as geranyl diphosphate (GDP), neryl diphosphate, and farnesyl diphosphate [(*E*,*E*)-FDP or (*Z*,*Z*)-FDP] into monoterpenes or sesquiterpenes, respectively (Tholl and Lee, 2011). TPS genes often undergo species-specific divergence and duplications resulting in terpene metabolic plasticity and adaptations (Pichersky and Gang, 2000). The structural diversity and biosynthetic evolution of terpene secondary metabolites have been studied extensively in crops including grasses such as maize, rice, and sorghum (Chen et al., 2011; Boutanaev et al., 2015; Block et al., 2019). Terpene-related defenses have been well described in these monocot crops and reveal diverse chemical mechanisms for resistance against above- and belowground stressors. For example, the sesquiterpene (*E*)-β-caryophyllene, one of the major volatile organic compounds (VOCs) released by maize leaves and roots, is involved in indirect defense by attracting parasitoids of herbivores and entomopathogenic nematodes (Turlings et al., 1990; Rasmann et al., 2005; Köllner et al., 2008a). Monoterpenes have also been implicated in defensive roles; for example, linalool confers resistance against rice bacterial blight caused by *Xanthomonas oryzae* (Taniguchi et al., 2014). More recently, a rice (*S*)-limonene synthase (*OsTPS19*) was shown to be involved in direct defense against the blast fungus *Magnaporthe oryzae* (Chen et al., 2018).

In contrast to these findings in highly domesticated grasses, the biosynthesis and dynamics of terpenes in switchgrass have not been fully investigated, in part because of its complex genetic background. Lowland ecotypes are allotetraploid (2*n* = 4*x* = 36), while upland cultivars are frequently octoploid (2*n* = 8*x* = 72). Recent transcriptional analysis of defense responses to green bug herbivory (*Schizaphis graminum*, Aphididae) in switchgrass leaves revealed a global transcriptional remodeling resulting in increased reactive oxygen species production and upregulation of genes with predicted terpene synthase function (Donze-Reiner et al., 2017). Moreover, the presence of a few triterpene saponins (C30) (Lee et al., 2009) and the synthesis of diterpenes (C20) related to abiotic stress have been described (Pelot et al., 2018). However, no prior studies have investigated the formation and function of volatile terpenes in this grass. Therefore, we sought to identify and characterize TPS genes from the switchgrass genome and correlate stress-induced terpene synthases with compound production in roots and leaves. Particular focus was placed on TPSs that were readily inducible when challenged with a generalist herbivore and the defense-related phytohormones methyl jasmonate (MeJA) and salicylic acid (SA) with the future goal to investigate these genes for their broad defensive functions against pathogens and herbivores. Results from this study provide further insight into the genetic organization of terpene metabolism in switchgrass and illustrate the metabolic potential of terpene-related defenses in perennial polyploid grasses.

# MATERIALS AND METHODS

# Plant Materials

Seeds from the lowland allotetraploid switchgrass cv. Alamo were purchased from Bamert Seed Company (Muleshoe, TX) and used throughout this study. The seeds were sowed into potting substrate in 200-ml aluminum cans or 2.5″ pots and grown for 5 weeks at 26°C (16 h day) and 24°C (8 h night) in a Percival growth chamber. After germination, 15 seedlings were selected in each can or pot and grown for 5 weeks.

# Plant Treatments

Five-week-old seedlings were treated with larvae of *S. frugiperda* as described by Zhuang et al. (2012) with some modifications. Cans with 15 seedlings were each placed into a collection chamber, and 10 second instar larvae were released inside the chamber for overnight feeding. For treatment with MeJA and SA (Sigma-Aldrich), 25 ml of MeJA (0.1, 1, or 5 mM) or SA (5 mM) dissolved in ethanol was added per can or pot as a soil drench and left for 24 h. For physical wounding, a surgical scalpel was used to wound leaves and stems. Untreated plants and mock-treated plants (ethanol only) were used as controls. Three replicates were performed for each treatment.

# Volatile Collection and Identification

Volatiles emitted from leaves of the treated switchgrass and control plants placed in glass chambers were collected with an open headspace sampling system (Analytical Research Systems, Gainesville, FL, USA) in the light from 9:00 am to 1:00 pm. Fall armyworm (FAW) larvae were removed before volatile trapping. The volatiles were collected with volatile collection traps (Porapak-Q, http://www.volatilecollectiontrap.com/) and eluted with 100 µl methylene chloride containing 0.003% nonyl acetate (v/v). The collected volatiles were analyzed on a Shimadzu 17A gas chromatograph coupled to a Shimadzu QP5050A mass spectrometer (http://www.shimadzu.com). Statistical analysis of leaf volatile data was done in R (v3.5.0) using ANOVA and *post hoc* Tukey–Kramer honestly significant difference comparisons, with significance assessed at alpha ≤ 0.05.
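The ANOVA step can be illustrated with a minimal sketch. The authors ran their analysis in R; the Python below is a stand-in that computes only the one-way ANOVA F statistic, without a p-value and without the Tukey–Kramer comparisons. The emission values and treatment labels are hypothetical and are not taken from the study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of replicate values per treatment.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)          # number of treatments
    n = len(all_values)      # total number of observations

    # Variation of treatment means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of replicates around their own treatment mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical emission values (arbitrary units), three replicates each
emissions = {
    "control": [1.0, 2.0, 3.0],
    "wounding": [2.0, 3.0, 4.0],
    "FAW": [7.0, 8.0, 9.0],
}
f_stat, df_b, df_w = one_way_anova_f(list(emissions.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F relative to the F distribution with the stated degrees of freedom indicates that at least one treatment mean differs; pairwise *post hoc* tests then identify which ones.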

Root volatiles were analyzed by automated solid-phase microextraction (SPME, AOC-5000 Shimadzu) through adsorption in the headspace with a 100-µm polydimethylsiloxane (Supelco) SPME fiber and thermal desorption for gas chromatography–mass spectrometry (GC-MS) analysis. Root tissue (1 g fresh weight) was detached from plants and placed in a 20-ml screw-capped vial containing 2 ml distilled water and 20 ng of the volatile internal standard 1-bromodecane. The SPME fiber was placed into the headspace of the vial and incubated for 30 min at room temperature for volatile collection. Collected volatiles were thermally desorbed for 4 min and analyzed using a gas chromatograph (240°C injector port) coupled with a quadrupole mass spectrometer (GC-MS-QP2010S, Shimadzu). Extracts were separated with a 2:1 split on a 30 m × 0.25 mm i.d. × 0.25 μm film thickness Zebron capillary column (Phenomenex) using helium as the carrier gas (1.4 ml min−1 flowrate) and a temperature gradient of 5°C min−1 from 40°C (hold 2 min) to 220°C. Compound identification was based on similarity to library matches (NIST, Wiley), authentic standards (Sigma-Aldrich, (*E*)-β-caryophyllene, germacrene-D), and comparison to Opopanax essential oil (Floracopeia, δ-cadinene, α-humulene). Relative abundance was determined by normalization of the analyte peak area to the peak area of the internal standard and dividing by gram fresh weight.
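The normalization described in the last sentence can be sketched as a single function. The function name and the peak areas are hypothetical; only the arithmetic (analyte area divided by internal standard area, then by fresh weight) follows the text.

```python
def relative_abundance(analyte_area, internal_standard_area, fresh_weight_g):
    """Normalize an analyte peak area to the internal standard and tissue mass.

    Returns a unitless ratio per gram fresh weight.
    """
    if internal_standard_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    if fresh_weight_g <= 0:
        raise ValueError("fresh weight must be positive")
    return analyte_area / internal_standard_area / fresh_weight_g


# Hypothetical peak areas for one compound in a 1 g root sample
example = relative_abundance(
    analyte_area=450_000,
    internal_standard_area=150_000,  # 1-bromodecane peak in the same run
    fresh_weight_g=1.0,
)
print(example)
```

Because the internal standard is added at a fixed amount per vial, this ratio corrects for run-to-run variation in fiber adsorption and injection, so values are comparable between samples but are not absolute quantities.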

# Identification of TPS Genes From the Switchgrass Genome and Phylogeny Reconstruction

Putative switchgrass TPS genes were retrieved from Phytozome (www.phytozome.jgi.doe.gov) through an annotation-based keyword search of genome versions v.1 and v.4. In addition, RNA-seq data kindly provided by the Noble Foundation (https://www.noble.org) for above- and belowground tissues were assembled *de novo* using Trinity (Grabherr et al., 2011). Assembled transcriptomes were queried with a representative switchgrass TPS sequence (*PvTPS01*) using the National Center for Biotechnology Information's TBLASTX. Resulting BLAST hits were manually curated for putative functionality based on length and presence of the conserved aspartate-rich motif (DDxxD) necessary for ionization of the prenyl diphosphate substrate (Class I TPSs). Class I and II diterpene synthases identified in this study were not further pursued based on previous reporting by Pelot et al. (2018). Gene models were refined further by comparing transcripts to genome sequences available in Phytozome. Putative N-terminal plastidic transit peptides were predicted using multiple sequence alignments and analysis of each sequence with the transit peptide prediction software ChloroP (Emanuelsson et al., 1999). Phylogeny reconstruction was based on protein sequence alignments, which were performed using MAFFT (Katoh et al., 2002). Maximum likelihood trees were then built from MAFFT alignments using PhyML (Guindon et al., 2010) with 500 bootstrap replicates as previously described (Pelot et al., 2018). Final phylogeny annotation and design were performed in Interactive Tree of Life (Letunic and Bork, 2007). Heat map analysis was based on publicly available expression data at http://www.phytozome.net/ following previously described methods (Pelot et al., 2018).
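The curation criteria named above (sequence length and presence of the DDxxD motif) were applied manually by the authors. As a rough illustration only, a first-pass automated screen could look like the sketch below. The minimum length of 400 residues and the example sequences are hypothetical; the text does not state a length cutoff.

```python
import re

# DDxxD: two aspartates, any two residues, then an aspartate.
# The lookahead lets overlapping occurrences be reported as well.
DDXXD = re.compile(r"(?=DD..D)")


def find_ddxxd(protein_seq):
    """Return 0-based start positions of every DDxxD motif in a protein sequence."""
    return [m.start() for m in DDXXD.finditer(protein_seq.upper())]


def passes_screen(protein_seq, min_length=400):
    """Hypothetical first-pass filter: long enough and contains a DDxxD motif."""
    return len(protein_seq) >= min_length and bool(find_ddxxd(protein_seq))


# Toy sequences (not real switchgrass proteins)
print(find_ddxxd("MAAADDIYDKK"))   # motif starts at the first D of "DDIYD"
print(find_ddxxd("MAAAKK"))        # no motif present
```

A screen like this only flags candidates; truncated gene models or sequences with insertions and deletions would still need the manual alignment-based curation the authors describe.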

# Synteny Analysis and Identification of Orthologous TPS Genes

*P. virgatum* (v4.1), *Setaria italica* (v2.2), *Sorghum bicolor* (v3.1), and *P. hallii* var. *hallii* (v2.1) genome annotations were downloaded from Phytozome (phytozome.jgi.doe.gov). Syntenic blocks were generated following Lovell et al. (2018) *via* the GENESPACE pipeline. Orthofinder was run on synteny-constrained BLASTp results to build orthologous gene networks.

# Gene Expression Analysis

Total RNA was isolated from switchgrass leaves and roots using the RNeasy Plant Mini Kit according to the manufacturer's protocol (http://www.qiagen.com). Complementary DNA was synthesized using the GE Healthcare first-strand synthesis kit according to the manufacturer's protocol (http://www.gelifesciences.com). Gene expression analysis was carried out using quantitative reverse transcription PCR (RT-qPCR) as described previously (Chen et al., 2018). Sequences of primers used for RT-qPCR are listed in **Supplementary Table 3**.

# Protein Expression in *E. coli* and Terpene Synthase Activity Assay

Full-length and truncated genes (predicted transit peptide removed) were synthesized and cloned (*Nde*I) into the pET-28b(+) prokaryotic expression vector. Constructs were transformed into *Escherichia coli* BL21-CodonPlus(DE3) cells (Stratagene) and grown at 37°C in 100 ml Luria–Bertani media supplemented with 50 µM kanamycin until an optical density at 600 nm (OD600) of 0.5–0.7. Protein production was then induced with 0.5 mM isopropyl 1-thio-β-D-galactopyranoside, and cells were incubated with shaking at 18°C for 16 h. Recombinant protein extraction and partial purification were performed as described by Tholl et al. (2005), with the modification that N-terminal His-tags were implemented for partial purification. Enzyme reactions (125 µl total volume) were prepared in a 10-ml screw cap vial (Supelco) by combining partially purified protein with 20 mM MgCl2 and 60 µM commercially available prenyl diphosphate substrates GDP and (*E,E*)-FDP (Echelon Biosciences). Assay mixtures were incubated for 5 min at 30°C in the presence of a 100-µm polydimethylsiloxane fiber (Supelco). Collected volatiles were thermally desorbed for 4 min and analyzed using a gas chromatograph (240°C injector port) coupled with a quadrupole mass spectrometer (GC-MS-QP2010S, Shimadzu). Extracts were separated with a 5:1 split under the same conditions described above. Compound identification, in addition to those compounds described above, was based on similarity to library matches (NIST, Wiley, copaene, cycloisosativene, β-elemene, α-patchoulene, α-selinene, valencene), authentic standards (Sigma-Aldrich, borneol, 1,8-cineole, geraniol, limonene, linalool, α-pinene, sabinene, α-terpineol, α-terpinolene), and comparison to Opopanax oil [Floracopeia, β-bisabolene, (*E*)-γ-bisabolene, γ-curcumene, (*E*)-β-farnesene, sabinene, α-santalene].

# RESULTS

# Emission of Volatile Terpenes From Leaves in Response to Insect Feeding

To assess whether switchgrass leaves emit volatile compounds upon aboveground herbivory, emissions from switchgrass plants (cv. Alamo) damaged by larvae of *S. frugiperda* (FAW) were collected by open headspace sampling and analyzed by GC-MS. We found that FAW treatment induced the emission of nine terpene compounds, which were not detected in plants that only received physical wounding or remained untreated (**Figure 1** and **Supplementary Table 1**). Among the released compounds, the sesquiterpenes (*E*)-β-caryophyllene and (*E*)-β-farnesene were strongly induced by herbivore damage, accounting for ~17 and ~26%, respectively, of the total volatile organic compound emission (**Figure 1** and **Supplementary Table 1**). Emission rates of (*E*)-β-caryophyllene were ~500 ng/h g FW. Additional major compounds induced by FAW included the monoterpene (*E*)-β-ocimene, the homoterpene (*E*)-DMNT, and the sesquiterpenes β-elemene, α-bergamotene, α-humulene, and β-copaene (**Figure 1** and **Supplementary Table 1**).

# Emission of Volatile Terpenes From Roots and Leaves Upon Belowground Treatment With Methyl Jasmonate or Salicylic Acid

We further determined whether emissions of volatile compounds from switchgrass roots could be induced by root treatment with phytohormones mimicking herbivory or pathogen infection. Different concentrations of MeJA were tested (0.1, 1, and 5 mM) by watering plants directly with each solution. Because of the volatility of MeJA, we expected that the compound diffused further into the substrate at a lower concentration. Volatiles were collected from detached roots using SPME and analyzed by GC-MS. Concentrations of 1 and 5 mM MeJA caused a similar relative release of sesquiterpene compounds from the root tissue (shown for 5 mM treatment; **Figure 2**), while no volatiles were induced upon treatment with 0.1 mM MeJA. Of the seven identified compounds, (*E*)-β-caryophyllene was the most abundant (~43% of total), while cycloisosativene, β-elemene, α-humulene, α-selinene, germacrene D, and δ-cadinene were present at low levels. We also applied SA at a concentration of 5 mM; however, no release of sesquiterpenes was observed from root tissue. We further found two monoterpenoids, camphor and borneol, to be released from root tissue of untreated plants. Emissions of these compounds were reduced by MeJA and SA treatments, although this was not found to be statistically significant based on comparisons of the means (ANOVA, *p* > 0.05, **Supplementary Figure S1**).

13: β-copaene. IS: internal standard nonyl acetate.

We also tested whether a drench with MeJA and SA at 5 mM could induce volatile emissions in aboveground tissues. Treatment with MeJA strongly induced volatile emission from leaves compared to mock controls, with 13 compounds identified (**Figure 1** and **Supplementary Table 1**). Major induced compounds were (*E*)-β-caryophyllene and β-elemene accounting for ~38 and ~17%, respectively, of total volatile emissions (**Figure 1** and **Supplementary Table 1**). Emission rates of (*E*)-β-caryophyllene were approximately 1,500 ng/h g FW. Other minor compounds included limonene, (*E*)-β-ocimene, (*E*)-DMNT, α-ylangene, α-bergamotene, α-humulene, (*E*)-β-farnesene, and β-copaene (**Figure 1** and **Supplementary Table 1**). Two additional putative sesquiterpenes were also emitted; however, these compounds could not be further identified based on available standards. Treatment with SA induced the emission of four terpene compounds with (*E*)-β-farnesene accounting for 83% of total emissions. Trace amounts of limonene, (*E*)-β-caryophyllene, and α-bergamotene were detected, which were not observed in the untreated controls (**Figure 1** and **Supplementary Table 1**).
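The reported (*E*)-β-caryophyllene emission rates and percentage shares together imply an approximate total emission rate for each treatment. The sketch below is a back-of-envelope estimate from the rounded figures quoted in the text (about 1,500 ng/h g FW at ~38% for MeJA-treated leaves, about 500 ng/h g FW at ~17% for FAW-damaged leaves); it assumes the percentages refer to the same emission basis as the rates, and the exact values are in **Supplementary Table 1**.

```python
def implied_total(rate_ng_per_h_g, share_percent):
    """Total emission implied by one compound's rate and its share of the blend."""
    if not 0 < share_percent <= 100:
        raise ValueError("share must be in (0, 100]")
    return rate_ng_per_h_g / (share_percent / 100.0)


# Rounded values quoted in the text for (E)-beta-caryophyllene
meja_total = implied_total(1500, 38)   # leaves after MeJA root drench
faw_total = implied_total(500, 17)     # leaves after FAW feeding

# beta-elemene at ~17% of the MeJA-induced blend
elemene_meja = meja_total * 0.17

print(f"MeJA total ~{meja_total:.0f} ng/h g FW")
print(f"FAW total  ~{faw_total:.0f} ng/h g FW")
print(f"beta-elemene (MeJA) ~{elemene_meja:.0f} ng/h g FW")
```

With these inputs the implied totals are on the order of 3,900 ng/h g FW for MeJA-treated plants and 2,900 ng/h g FW for FAW-damaged plants, so the MeJA drench raised total leaf emission only modestly while shifting the blend strongly toward (*E*)-β-caryophyllene.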

# Genome-Wide Identification of Putative Terpene Synthases in Switchgrass

Based on the inducible emission of diverse volatile terpenes from switchgrass roots and leaves, we sought to identify the TPS genes responsible for their formation. Following a genome-wide search of the switchgrass draft genome v.1, we originally identified 144 putative TPS gene models. Of these putative gene models, 108 were confirmed in the draft genome v.4, with 74 putative full-length mono-, sesqui-, and di-TPS genes identified. Manual sequence curation through multiple sequence alignments and comparison to genomic and transcriptomic data resulted in the identification of 44 putative full-length mono- and sesqui-TPS genes (**Table 1**, **Supplementary Table 2**, **Supplementary Figure S2**). Identified di-TPS genes (30 in total) were previously reported and therefore not included in this study (Pelot et al., 2018).

Alignment and phylogenetic analysis of amino acid sequences from the mono- and sesqui-TPSs together with select TPSs from maize, rice, sorghum, tomato, and snapdragon showed that 35 members belong to the TPS type-a clade (**Figure 3**). In addition, five proteins aligned to the TPS-g subfamily and three clustered in the TPS-b subfamily. Only *Pv*TPS15 (TPS-e) was predicted to be involved in volatile formation outside of the TPS-a, TPS-b, and TPS-g subfamilies. Like in other plant TPS proteins, switchgrass TPSs of the TPS-a, TPS-b, TPS-e, and TPS-g subfamilies carry the conserved aspartate-rich "DDXXD" motif and the less conserved "NSE/DTE" motif in the C-terminal α-domain (Chen et al., 2011).

When we examined the relative chromosomal position of the identified TPS genes, we found that 22 genes are distributed across the nine chromosomes in the switchgrass subgenome K with highest abundance of genes occurring on chromosomes 1K, 6K, and 9K (**Figure 4** and **Table 1**). In subgenome N, we identified the relative location of 20 genes with highest abundance on chromosomes 1N, 6N, and 9N (**Figure 4** and **Table 1**). Several genes are positioned in loose gene clusters throughout the genome (**Figure 4** and **Table 1**). The relative positions of *PvTPS02* and *PvTPS07* could not be determined based on incomplete genomic data (**Table 1**).
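The gene counts reported in this section can be cross-checked against one another. The sketch below simply encodes the numbers stated in the text and verifies that each breakdown sums to the 44 mono- and sesqui-TPS genes; the helper name is hypothetical.

```python
TOTAL_MONO_SESQUI = 44

# Full-length TPS genes in draft genome v.4, and the di-TPSs excluded here
full_length_v4 = 74
di_tps = 30

# Subfamily assignment from the phylogenetic analysis
subfamily_counts = {"TPS-a": 35, "TPS-g": 5, "TPS-b": 3, "TPS-e": 1}

# Chromosomal placement; PvTPS02 and PvTPS07 could not be positioned
location_counts = {"subgenome K": 22, "subgenome N": 20, "unplaced": 2}


def check_partition(counts, expected_total):
    """Return the sum of a breakdown, raising if it does not match the total."""
    total = sum(counts.values())
    if total != expected_total:
        raise ValueError(f"breakdown sums to {total}, expected {expected_total}")
    return total


print(full_length_v4 - di_tps)
print(check_partition(subfamily_counts, TOTAL_MONO_SESQUI))
print(check_partition(location_counts, TOTAL_MONO_SESQUI))
```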

TABLE 1 | Identified mono- and sesqui-terpene synthase (mono- and sesqui-TPS) gene models in the switchgrass (cv. "AP13") genome in the order of chromosomal localization.


*Genomic coordinates were determined based on draft genome data available in Phytozome (https://phytozome.jgi.doe.gov/).*

Investigation of syntenic orthologous genes between the two switchgrass subgenomes identified networks between 8 genes on subgenome K and 10 genes on subgenome N (including one putative mono- or di-TPS, 3NG211100, and two putative di-TPSs, 3KG400900 and 3NG171200) (**Figure 4**, **Supplementary Figure S3**, **Supplementary Table 2**). Comparisons between the genomes of switchgrass and sorghum showed that 13 switchgrass TPS loci have syntenic orthologs on 6 of the 10 sorghum chromosomes (**Figure 4**, **Supplementary Figure S3**, **Supplementary Table 2**). Several of these switchgrass TPS genes also occur in syntenic gene networks with genomes of the more closely related grasses *Setaria italica* and *Panicum hallii* suggesting conserved genomic regions in TPS gene evolution in these species.

# Biochemical Characterization of Monoterpene and Sesquiterpene Synthases From Switchgrass

To determine the *in vitro* function of the 44 identified TPS genes, open reading frames were synthesized and cloned into the bacterial pET28b expression vector. The recombinant proteins were expressed in *E. coli* and protein lysates tested for TPS activity with GDP and (*E,E*)-FDP as substrates. We expected many TPSs in the subfamily-a (**Figure 3**) to function as sesqui-TPSs. Indeed, 19 recombinant TPS proteins in this family produced one or more sesquiterpene olefins, among them (*E*)-β-caryophyllene, (*E*)-β-farnesene, and other common plant sesquiterpenes (**Figures 3** and **5**). With the exception of *Pv*TPS83, none of these proteins carried a plastidial transit peptide, indicating that they are likely to function in the cytosolic compartment. *Pv*TPS02 was the only TPS protein found in the g-subfamily to exhibit sesquiterpene synthase activity *in vitro*. However, since a plastidial targeting sequence typical of subtype-g TPSs has been predicted for this protein, its function as a sesqui-TPS *in vivo* might be limited.

Twelve TPS proteins distributed over the TPS-a, TPS-b, TPS-g, and TPS-e subfamilies functioned as monoterpene synthases *in vitro* (**Figures 3** and **6**). *Pv*TPS04 produced a mixture of monoterpenes from GDP with α-terpinolene and borneol as major products (**Figure 6**). *Pv*TPS36 and *Pv*TPS56 converted GDP into multiple volatile products with predominantly limonene and α-terpineol as the major products, respectively (**Figure 6**). The remaining enzymes produced either linalool (*Pv*TPS12, *Pv*TPS13, *Pv*TPS15, *Pv*TPS27, *Pv*TPS52, and *Pv*TPS71) or geraniol (*Pv*TPS53 and *Pv*TPS101) (**Figure 6**). *Pv*TPS13 and *Pv*TPS15 also converted (*E,E*)-FDP into nerolidol (**Supplementary Figure S4**); however, this activity might be limited *in vivo* because of the predicted plastidial localization of these proteins. In contrast, no plastidial transit peptides were predicted for *Pv*TPS12, *Pv*TPS56, *Pv*TPS71, and *Pv*TPS101, which calls into question their function as monoterpene synthases *in vivo*.

Only trace amounts of compounds were detected for recombinant proteins encoded by *PvTPS07*, *PvTPS62*, *PvTPS81*, and *PvTPS106*. In addition, no substantial enzymatic activity was found for eight proteins (*Pv*TPS18, *Pv*TPS26, *Pv*TPS28, *Pv*TPS33, *Pv*TPS54, *Pv*TPS73, *Pv*TPS85, and *Pv*TPS104), which is in accordance with the presence of several deletions and/or insertions in the open reading frames of the corresponding genes (**Supplementary Figure S2**). Sequence truncations were furthermore found at the N- and C-terminus of the functionally active *Pv*TPS09 and *Pv*TPS02 proteins, respectively (**Supplementary Figure S2**).
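The activity results can be tallied to confirm that they account for all 44 genes and for the 32 functionally active enzymes reported in the abstract. The sketch below encodes only the counts and gene names given in this section. Note that the monoterpene synthase paragraph states twelve enzymes but names eleven explicitly, so the list below has eleven entries.

```python
# Counts stated in the text
sesqui_tps_a = 19          # TPS-a proteins producing sesquiterpenes
sesqui_tps_g = 1           # PvTPS02
mono_stated = 12           # monoterpene synthases, as stated

# Enzymes named explicitly as monoterpene synthases
mono_named = [
    "PvTPS04", "PvTPS36", "PvTPS56",                       # mixed products
    "PvTPS12", "PvTPS13", "PvTPS15",
    "PvTPS27", "PvTPS52", "PvTPS71",                       # linalool
    "PvTPS53", "PvTPS101",                                 # geraniol
]

trace_only = ["PvTPS07", "PvTPS62", "PvTPS81", "PvTPS106"]
no_activity = [
    "PvTPS18", "PvTPS26", "PvTPS28", "PvTPS33",
    "PvTPS54", "PvTPS73", "PvTPS85", "PvTPS104",
]

active = sesqui_tps_a + sesqui_tps_g + mono_stated
not_active = len(trace_only) + len(no_activity)

print(active, not_active, active + not_active)
print(len(mono_named))
```

The totals agree with the 32 active and 12 functionally inactive genes mentioned elsewhere in the article.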

# Expression Analysis of PvTPSs in Different Tissues and Upon Treatment With FAW, MeJA, and SA

Global expression patterns for all 44 TPS genes were analyzed by hierarchical cluster analysis based on publicly available data (https://phytozome.jgi.doe.gov/). We found specific patterns of transcript abundance in vascular tissue, leaf blade, and sheath tissues as well as roots and germinating seeds (**Figure 7**). Transcripts included those of the 12 genes that lack *in vitro* functional activity. There was little overlap in expression between above- and belowground tissues, indicating gene-specific adaptations in these tissues. Despite the observed transcriptional patterns, we were unable, with the exception of borneol, to detect volatile terpenes in leaves and roots of the Alamo cultivar under constitutive conditions.

To determine whether correlations between transcript abundance and volatile terpene products could be established in response to treatment with FAW, MeJA, and SA, we selected multiple TPS genes for expression analysis by quantitative RT-PCR (**Figure 8A**). In leaves, substantial induction at the transcript level (>10-fold) following herbivory was observed for 12 TPS genes (*PvTPS01*, *PvTPS04*, *PvTPS05*, *PvTPS06*, *PvTPS08*, *PvTPS11*, *PvTPS14*, *PvTPS16*, *PvTPS19*, *PvTPS36*, *PvTPS53*, and *PvTPS56*), of which 10 genes and *PvTPS12* and *PvTPS15* were equally or more highly induced upon root treatment with MeJA (**Figure 8A**). SA-induced expression exceeding that in response to FAW and MeJA treatment was observed for *PvTPS04*, *PvTPS13*, *PvTPS16*, and *PvTPS53*. Highest induction of *TPS* transcript levels in roots was found for 11 genes in response to the application of MeJA (*PvTPS05*, *PvTPS06*, *PvTPS10*, *PvTPS11*, *PvTPS14*, *PvTPS17*, *PvTPS19*, *PvTPS20*, *PvTPS36*, *PvTPS53*, and *PvTPS56*) or both MeJA and SA (*PvTPS53*) (**Figure 8A**). For *PvTPS02*, *PvTPS03*, *PvTPS07*, and *PvTPS09*, induced transcript levels were lower than 10-fold in leaves and/or roots upon any of the treatments (**Supplementary Figure S5**).
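The 10-fold induction threshold used above can be illustrated with a short calculation. This section does not state how fold changes were derived from the RT-qPCR data, so the sketch below assumes the widely used 2^-ΔΔCt approach with a single reference gene. The function names and all Ct values are hypothetical and serve only as an illustration.

```python
def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression by the 2^-ddCt method (an assumed method, not stated in the text)."""
    # Normalize the target gene to the reference gene within each condition.
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    # A lower Ct means more transcript, hence the negative exponent.
    return 2 ** -(delta_treated - delta_control)


def is_substantially_induced(fc, threshold=10.0):
    """Apply the >10-fold criterion described in the text."""
    return fc > threshold


# Hypothetical Ct values for one gene in treated versus control leaves.
example_fc = fold_change(20.0, 18.0, 24.0, 18.0)
example_induced = is_substantially_induced(example_fc)
```

With these made-up values, the target gene crosses the detection threshold four cycles earlier after treatment, once normalized, which corresponds to a 16-fold induction and therefore passes the 10-fold criterion.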

For nine TPS genes, we were able to identify their enzymatic products as components of the induced volatile blends of leaves and/or roots: The genes encoding (*E*)-β-caryophyllene synthases (*PvTPS11*, *PvTPS14*, and *PvTPS19*) showed highest transcript abundance in leaves and roots upon treatment with MeJA (**Figure 8A**). (*E*)-β-Caryophyllene emissions from both tissues are most likely associated with the activity of these TPSs. Expression of *PvTPS16*, whose recombinant protein produced (*E*)-β-farnesene, was strongly induced by SA treatment in leaves and is likely to be associated with (*E*)-β-farnesene emission from this tissue (**Figures 1** and **8A**). The gene encoding *Pv*TPS05, which was found to produce α-selinene volatiles, was most strongly expressed in roots upon application of MeJA, matching the detection of this compound from root tissue (**Figures 2** and **8A**). *Pv*TPS06 and *Pv*TPS09 both catalyze the formation of δ-cadinene, which was emitted from roots upon MeJA treatment (**Figures 2** and **5**). Only *PvTPS06* was highly induced by MeJA, indicating its likely function *in planta* (**Figure 8A**). Moreover, transcript levels of *PvTPS36* were substantially induced in leaves in response to MeJA application, although emission of one of the primary products of the *Pv*TPS36 enzyme, limonene, occurred only at low levels (**Figures 1** and **8A**). Interestingly, terpene products (cycloisosativene, borneol) associated with two genes (*PvTPS01* and *PvTPS04*), which showed highest expression in leaves upon FAW and/or phytohormone treatment, could only be detected in roots (**Figures 2** and **8A**, **Supplementary Figure S1**).

Some TPS genes with lower levels of induction may contribute to the emission of particular terpenes [e.g., the (*E*)-β-farnesene synthase gene *PvTPS02*]. Compounds produced *in vitro* by several other TPS enzymes could not be detected or occurred only at trace levels in leaves and roots despite a strong induction of their corresponding genes. For example, 1,8-cineole produced by *Pv*TPS08 was only detected in trace amounts in root tissues. Linalool, the single product of enzymes encoded by *PvTPS12*, *PvTPS13*, and *PvTPS15* (**Figure 6**), was detected in emissions from neither leaves nor roots and may be further metabolized upon stress treatment. Other TPSs for which no associations could be established between their enzymatic products and volatile emissions include *Pv*TPS03 [(*E*)-γ-bisabolene synthase], *Pv*TPS10 (α-patchoulene synthase), *Pv*TPS17 and *Pv*TPS20 [(*E*)-β-bisabolene synthases], and *Pv*TPS53 (geraniol synthase). Emission of germacrene D from roots may be associated with *PvTPS55*, the expression of which was not determined.

Figure legend abbreviations: R, root; L, leaf; SA, salicylic acid; FAW, fall armyworm; MeJA, methyl jasmonate.

# DISCUSSION

The switchgrass genome contains a large family of 44 predicted full-length mono- and sesqui-TPS genes, of which 32 genes encode functionally active proteins. Sesqui-TPSs belonging to the type-a subfamily make up the majority of this TPS group, while only a few mono-TPSs have emerged in the type-a clade or are distributed over the type-b, type-g, and type-e/f clades (**Figure 3**). Similar distributions have been shown to occur in the TPS families of rice and sorghum (Chen et al., 2011). Expansions of type-a clades are also common in dicots, although typically a higher proportion of mono-TPSs can be found in the type-b and type-g clades of dicot species (Chen et al., 2011; Kulheim et al., 2015).

The switchgrass mono- and sesqui-TPS family is almost twice as large as the number of characterized or predicted proteins with mono- or sesqui-TPS activity in maize (Springer et al., 2018). Polyploidy likely contributed to the expansion of the switchgrass TPS gene family, which is in agreement with studies by Hofberger et al. (2015) demonstrating the role of polyploidy events in the diversification and expansion of terpene secondary metabolism. Gene duplication through polyploidization generates gene redundancy, eventually increasing functional divergence and allowing species adaptation (Wendel, 2000). As an allotetraploid, switchgrass evolved from two diploid ancestors, giving rise to two complete subgenomes (N and K) and functional divergence of TPS genes. In *P. hallii*, a diploid relative of switchgrass, ~32 putative full-length TPS genes are annotated (https://phytozome.jgi.doe.gov/), indicating that polyploidization of switchgrass more than doubled the number of TPS genes. Polyploidy events in domesticated grasses may not always result in large TPS gene families, as has been suggested for wheat (Schmelz et al., 2014). However, in switchgrass, obligate outcrossing and limited breeding have maintained massive phenotypic and adaptive polymorphisms (Casler et al., 2007), in line with a higher level of diversification in TPS genes. Nevertheless, one-third of the TPS genes we characterized appear to be functionally inactive, while several other TPSs might have limited *in vivo* activity due to their subcellular localization, suggesting inactivation and loss of *in vivo* function for a substantial fraction of the gene family.

A comparison between the switchgrass subgenomes found that only 35 or 50% of the TPS genes on subgenome K and N, respectively, have syntenic orthologs on the other subgenome. This limited synteny indicates subgenome divergence in TPS gene organization. Syntenic regions include TPS genes with identical functions [*PvTPS14* and *PvTPS19*, (*E*)-β-caryophyllene synthases; *PvTPS17* and *PvTPS20*, (*E*)-β-bisabolene synthases], while other orthologs adopted different functional activities. Further comparison with the genome of the closely related diploid species *P. hallii* revealed syntenic orthologs for more than 15 switchgrass TPS genes on 6 of the 9 *P. hallii* chromosomes. Corresponding syntenic orthologs could also be identified for several of these genes on the genomes of the close relative *S. italica* and of *S. bicolor*. These findings are consistent with the observed collinearity between the switchgrass, *Setaria*, and sorghum genomes (Casler et al., 2011) and suggest the presence of ancestral TPS genes in the common progenitor of sorghum and switchgrass more than 20 million years ago. Syntenic regions on the sorghum genome include a cluster of TPS genes on chromosome 7, which was found to encode insect-induced sesquiterpene synthases and shares (*E*)-β-farnesene synthase activity (*Sorbic.007G055600*, *PvTPS109*) (Zhuang et al., 2012).

Most mono- and sesqui-TPS genes of switchgrass exhibit tissue-specific expression patterns (**Figure 7**). With the exception of the root-accumulated monoterpene borneol, the products associated with these TPSs could not be found in leaves and roots under constitutive conditions and became in part only detectable in response to stress treatment when gene expression was induced. It is possible that under nontreatment conditions enzyme activity or substrate levels are too low to result in detectable amounts of product. In roots, microbial activity may also metabolize terpene compounds as has been shown in vetiver grass (Del Giudice et al., 2008). It is also possible that the enzymatic products are further metabolized to nonvolatile derivatives. For example, β-macrocarpene, a volatile sesquiterpene olefin produced by two maize terpene synthases, is not detected in volatile blends because of its conversion to nonvolatile acid derivatives called zealexins, which function as pathogen-induced phytoalexins (Köllner et al., 2008b; Huffaker et al., 2011). In another study in maize, Ding et al. (2017) found that the volatile sesquiterpene β-selinene is a direct precursor of β-costic acid, a nonvolatile antibiotic acid derivative. Based on these findings, it is possible that α-selinene made by TPS05 in switchgrass roots serves as a precursor of α-costic acid that may exhibit similar functions in antimicrobial defense. Future analyses should be performed to identify possible oxygenated downstream derivatives of switchgrass TPS products.

Twelve TPS genes were found to be induced in switchgrass leaves upon feeding by FAW larvae. At least half of these genes are likely to contribute to the production of the volatile terpenes released upon FAW feeding based on the activity of their corresponding enzymes. The majority of the FAW-induced genes also responded to belowground treatment with MeJA, and two genes were induced by root treatment with SA indicating bottom–up systemic responses in *de novo* terpene biosynthesis (**Figure 8B**). While these effects are likely to be less pronounced with the application of lower concentrations of MeJA and SA or in response to actual root herbivory or pathogen infection, several studies have reported similar root-induced systemic responses in the metabolism of terpenoids and other secondary metabolites in photosynthetic tissues (Bezemer et al., 2003; Bezemer et al., 2004; Rasmann and Turlings, 2007; Erb et al., 2008; Kaplan et al., 2008). By contrast, much weaker systemic effects have been observed on root defensive metabolites including terpenes in maize upon shoot treatments or foliar feeding (Bezemer et al., 2003; Bezemer et al., 2004; Rasmann and Turlings, 2007; Erb et al., 2008; Kaplan et al., 2008). Our findings support this notion since FAW feeding did not cause a major increase in TPS gene expression in switchgrass roots and only a local treatment with MeJA could elicit such a response (**Figure 8B**).

The terpene olefins released by switchgrass leaves and roots upon insect or hormone treatment are frequently found in stress-induced volatile blends of other monocots and dicots (Unsicker et al., 2009; Massalha et al., 2017). While determining the function of these compounds is beyond the scope of this study, we assume that they play roles in direct and indirect defenses similar to those described previously in maize, rice, or other plants (Degenhardt et al., 2009; Hare and Sun, 2011; Taniguchi et al., 2014; Chen et al., 2018). A common constituent of herbivore-induced volatile blends in many plants including grasses is (*E*)-β-caryophyllene (Köllner et al., 2008a). This sesquiterpene, when released from damaged leaves of maize and rice plants, has been implicated in recruiting parasitoids of herbivores (Cheng et al., 2007; Köllner et al., 2008a; Yuan et al., 2008). We identified three (*E*)-β-caryophyllene synthase genes (*PvTPS11*, *PvTPS14*, and *PvTPS19*) (**Figure 5**), all of which are located on chromosome 2 and induced upon FAW feeding and treatment with MeJA. By contrast, in maize, rice, and sorghum, only single genes (*ZmTPS23*, Os08g04500, *SbTPS4*) have been associated with the synthesis of (*E*)-β-caryophyllene upon herbivore feeding (Köllner et al., 2008a; Zhuang et al., 2012; Chen et al., 2014). In MeJA-treated root tissue, *PvTPS14* was found to be induced approximately fourfold higher than *PvTPS11* and *PvTPS19* and may contribute to the emission of (*E*)-β-caryophyllene belowground. Induced root expression of (*E*)-β-caryophyllene synthases is common among grasses and has been implicated in the recruitment of entomopathogenic nematodes for indirect defense against belowground herbivory (Rasmann et al., 2005).

(*E*)-β-Farnesene is another sesquiterpene that is released by many plant species and plays, among other volatiles, a role in indirect defense in maize (Schnee et al., 2006; Degenhardt, 2009). We found four TPS genes that encode functionally active (*E*)-β-farnesene synthases (**Figure 5**). However, only *PvTPS02* expression correlated with compound emission as a result of herbivore damage (**Figure 1** and **Supplementary Figure S5**). Another gene, *PvTPS16*, was highly expressed in leaves following SA treatment and strongly correlated with (*E*)-β-farnesene emission under this condition (**Figures 1** and **8A**). Despite limited and controversial evidence (Gibson and Pickett, 1983; Kunert et al., 2010), this response could potentially affect aphids, since (*E*)-β-farnesene serves as an alarm pheromone for many aphid taxa (Bowers et al., 1977; Pickett, 1983) and aphids are known to elicit both SA- and JA-dependent signaling pathways (Moran et al., 2002). A recent study by Donze-Reiner et al. (2017) found several TPS genes to be induced upon feeding by the grain aphid *S. graminum*. However, none of the (*E*)-β-farnesene synthase genes was among those induced by *S. graminum*, indicating that their expression might be suppressed. Instead, genes induced by aphid feeding included the (*E*)-β-bisabolene synthases *PvTPS17* and *PvTPS20* among other genes in the type-a family and genes in the type-c and type-e/f families, which have in part been characterized as diterpene synthases (Pelot et al., 2018). Whether these terpene compounds are produced upon *S. graminum* feeding is currently unknown.

We found only two monoterpenes (limonene and β-ocimene) to be emitted at low levels from treated switchgrass leaves (**Figure 1**). Except for *PvTPS36*, which was induced in leaves by MeJA treatment and makes limonene as an enzymatic product (**Figures 6** and **8A**), no terpene products of the other induced mono-TPS genes could be detected, possibly because of the reasons addressed earlier. Interestingly, enzymatic products of two TPSs, the cycloisosativene synthase *Pv*TPS01 and the borneol synthase *Pv*TPS04, could only be observed in emissions from roots, although the corresponding genes were most highly expressed in leaves upon FAW, MeJA, or SA treatment (**Figures 1** and **8A**, **Supplementary Figure S1**). Whether the absence of the compounds in leaf tissue is due to limited enzymatic activity, metabolization of the product, or transport from shoots to roots remains to be determined.

In summary, our study has provided a genetic road map for investigating the biosynthesis and function of volatile terpenoids in switchgrass. We have shown that the switchgrass genome contains an extended family of mono- and sesqui-TPS genes, several of which share syntenic orthologs in other grasses, exhibit tissue-specific expression, and respond to herbivory and phytohormone treatment above- and belowground. The volatiles associated with these genes and possibly their nonvolatile derivatives may exhibit functions in above- and belowground direct and indirect defense similar to those described for maize and other grasses. Further studies involving the generation of switchgrass mutants will evaluate these ecological roles in greater detail.

# DATA AVAILABILITY

The datasets generated for this study can be found in Phytozome, https://phytozome.jgi.doe.gov/pz/portal.html.

# AUTHOR CONTRIBUTIONS

AM, XC, FC, and DT designed the study. AM, XC, TK, KP, and PZ performed bioinformatic analyses and gene annotation. JL performed synteny analyses. AM, XC, MR, LC, and SL performed enzyme characterizations. AM and XC performed RNA extraction, RT-qPCR, and stress treatments. AM and XC performed volatile profiling. AM, XC, FC, and DT wrote the manuscript. All authors reviewed, read, and approved the manuscript before submission.

# FUNDING

This work was supported by a Community Science Program grant (WIP 2568) of the Department of Energy Joint Genome Institute and funds from the Translational Plant Sciences Program at Virginia Tech. The work conducted by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231.

# ACKNOWLEDGMENTS

We would like to thank Dr. Nan Zhao for his assistance with volatile profiling and Dr. Qidong Jia for his assistance with sequence analysis. We thank Debbie Wiley for assistance with plant maintenance. We are grateful to Dr. Gerald A. Tuskan (Center for Bioenergy Innovation, Oak Ridge National Laboratory) and Dr. Yuhong Tang of the Noble Foundation for the availability and access to switchgrass RNA-seq data sets. We thank Jim Tokuhisa for scientific advice. We thank the US Department of Energy Joint Genome Institute and collaborators for prepublication access to the *Panicum virgatum* V1.1 and V4.1 genome sequence.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.01144/full#supplementary-material

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Muchlinski, Chen, Lovell, Köllner, Pelot, Zerbe, Ruggiero, Callaway, Laliberte, Chen and Tholl. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Spatio-Temporal Metabolite and Elemental Profiling of Salt Stressed Barley Seeds During Initial Stages of Germination by MALDI-MSI and µ-XRF Spectrometry

*Sneha Gupta1, Thusitha Rupasinghe2, Damien L. Callahan3, Siria H. A. Natera2, Penelope M. C. Smith4, Camilla B. Hill5, Ute Roessner2 and Berin A. Boughton1,2\**

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Robert David Hall, Wageningen University & Research, Netherlands; Yozo Okazaki, Mie University, Japan*

> *\*Correspondence: Berin A. Boughton baboug@unimelb.edu.au*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 05 April 2019 Accepted: 21 August 2019 Published: 25 September 2019*

#### *Citation:*

*Gupta S, Rupasinghe T, Callahan DL, Natera SHA, Smith PMC, Hill CB, Roessner U and Boughton BA (2019) Spatio-Temporal Metabolite and Elemental Profiling of Salt Stressed Barley Seeds During Initial Stages of Germination by MALDI-MSI and µ-XRF Spectrometry. Front. Plant Sci. 10:1139. doi: 10.3389/fpls.2019.01139*

*1 School of BioSciences, University of Melbourne, Parkville, VIC, Australia, 2 Metabolomics Australia, School of BioSciences, University of Melbourne, Parkville, VIC, Australia, 3 School of Life and Environmental Sciences, Deakin University, Burwood, VIC, Australia, 4 AgriBio, Centre for AgriBiosciences, Department of Animal, Plant and Soil Sciences, School of Life Sciences, La Trobe University, Bundoora, VIC, Australia, 5 School of Veterinary and Life Sciences, Murdoch University, Murdoch, WA, Australia*

Seed germination is the essential first step in crop establishment, and can be severely affected by salinity stress which can inhibit essential metabolic processes during the germination process. Salt stress during seed germination can trigger lipid-dependent signalling cascades that activate plant adaptation processes, lead to changes in membrane fluidity to help resist the stress, and cause secondary metabolite responses due to increased oxidative stress. In germinating barley (*Hordeum vulgare*), knowledge of the changes in spatial distribution of lipids and other small molecules at a cellular level in response to salt stress is limited. In this study, mass spectrometry imaging (MSI), liquid chromatography quadrupole time-of-flight mass spectrometry (LC-QToF-MS), inductively coupled plasma mass spectrometry (ICP-MS), and X-ray fluorescence (XRF) were used to determine the spatial distribution of metabolites, lipids and a range of elements, such as K+ and Na+, in seeds of two barley genotypes with contrasting germination phenology (Australian barley varieties Mundah and Keel). We detected and tentatively identified more than 200 lipid species belonging to seven major lipid classes (fatty acyls, glycerolipids, glycerophospholipids, sphingolipids, prenol lipids, sterol lipids, and polyketides) that differed in their spatial distribution based on genotype (Mundah or Keel), time post-imbibition (0 to 72 h), or treatment (control or salt). We found a tentative flavonoid was discriminant in post-imbibed Mundah embryos under saline conditions, and a delayed flavonoid response in Keel relative to Mundah. We further employed MSI-MS/MS and LC-QToF-MS/MS to explore the identity of the discriminant flavonoid and study the temporal pattern in five additional barley genotypes. ICP-MS was used to quantify the elemental composition of both Mundah and Keel seeds, showing a significant increase in Na+ in salt treated samples. Spatial mapping of elements using µ-XRF localized the elements within the seeds. This study integrates data obtained from three mass spectrometry platforms together with µ-XRF to yield information on the localization of lipids, metabolites and elements, improving our understanding of the germination process under salt stress at a molecular level.

Keywords: salinity, barley, germination, metabolomics, mass spectrometry imaging, lipids, MALDI

# INTRODUCTION

Barley is a model organism for investigating the cereal germination process (Gorzolka et al., 2016), which commences with the uptake of water by the quiescent dry seed and finishes with the emergence of the radicle through the seed coat. Seed germination requires sufficient moisture and is also affected by temperature. Salinity affects normal seed germination through osmotic stress (Bliss et al., 1986), ion toxicity (Hampson and Simpson, 1990), or a combination of both (Huang and Steveninck, 1988). High intracellular concentrations of both Na+ and Cl− can inhibit the metabolism of dividing and expanding cells (Neumann, 1997), restrict mobilization, and hinder seedling emergence (Marques et al., 2013; Alencar et al., 2015). These conditions result in retarded (Zhang et al., 2010) or delayed (Ashraf et al., 2003) germination. Barley genotypes can be classified as either tolerant or sensitive to salt stress depending upon their genetic diversity and the ability to germinate and survive under these conditions (Shelden and Roessner, 2013; Shelden et al., 2013; Shelden et al., 2016). The molecular changes in barley seed during germination that may play a role in tolerance or sensitivity are important in determining a plant's overall response to salinity.

Mature seeds contain a variety of compounds such as proteins, carbohydrates, lipids, vitamins, and phenolics. These compounds provide essential nutrients for the seed to germinate and develop into a mature plant, and are synthesized, packaged and stored in specific tissues that are chemically and morphologically distinct from each other (Fulcher, 1982). For example, a reserve of carbohydrates (starch) is found in the endosperm of cereal seed whereas a reserve of lipids and proteins is found in the scutellum.

Lipids are vital and abundant cellular constituents responsible for the structure and function of cell membranes, and act as an energy store to allow metabolism to continue during abiotic stress. Recent research has shown that lipids are involved as signal mediators in the initiation of defence reactions (Sarabia et al., 2018). Lipids are also involved in processes to mitigate the effects of stress in plant cells (Hasanuzzaman et al., 2013; Okazaki and Saito, 2014) for example through remodelling of glycerolipid levels to maintain membrane integrity and optimal fluidity (Sarabia et al., 2018). Plants also produce a vast array of secondary metabolites such as carotenoids, flavonoids, and coumestans (Gry et al., 2007). These compounds have long been known to protect plants from different biotic and abiotic stresses, function as signal molecules and act as antimicrobial agents (Bartwal et al., 2013).

Although changes in profiles of protein (Yang et al., 2007), transcripts (Nakabayashi et al., 2005; Reyes and Chua, 2007), and metabolites (Fait et al., 2006; Howell et al., 2009) during germination are well documented, there is currently little information on molecular changes during seed germination in response to abiotic stress including salinity. With the current changes in the global climate (Karl et al., 2009), understanding these responses to stress will be more important than ever. Information on how an embryo mobilizes its internal reserves during the early stages of germination in saline conditions can provide insights into the metabolic process of germination; and consequently help us understand how certain crop genotypes germinate better under adverse environmental conditions (Achakzai, 2014).

To understand the role and function of metabolites and lipids within an organism at the tissue level, spatial information about target compounds is required (Li et al., 2008). Many studies analyze whole tissue responses (Fiehn, 2002; Sumner et al., 2003; Weckwerth, 2003; Cevallos-Cevallos et al., 2009; Lee et al., 2010), and so information about the localization of small molecules in cell types within the tissue is lost.

Mass Spectrometry Imaging (MSI) provides a platform to detect and localize intra-tissue variation and the spatial distribution of metabolites at the cellular and sub-cellular levels at high resolution (Lee et al., 2012; Boughton et al., 2016). MSI enables visualization of the distribution of individual biomolecules in a tissue section without requiring staining or complicated pretreatment (Cornett et al., 2007). This method of direct analysis of a tissue section (Zaima et al., 2009) allows detection of a wide range of endogenous molecules such as lipids, peptides, and secondary metabolites (Boughton et al., 2016; Hansen et al., 2018; Heskes et al., 2018). In plants, MSI has been used to target small molecules in different plant species and tissues. Recent examples include *Arabidopsis* leaves (Shroff et al., 2008), Medicago roots (Ye et al., 2013), wheat seeds and stems (Burrell et al., 2006; Robinson et al., 2007), soya leaves (Mullen et al., 2005), rice seeds (Zaima et al., 2010a), different tissues of Eucalyptus species (Hansen et al., 2018; Jakobsen Neilson et al., 2019), and barley roots (Sarabia et al., 2018). Novel metabolites were discovered using Matrix Assisted Laser Desorption Ionization-Mass Spectrometry Imaging (MALDI-MSI) (Gorzolka et al., 2014), further highlighting the benefit of obtaining spatially resolved information.

In this study, we used a combination of four high-end analytical techniques, namely MALDI-MSI, liquid chromatography-tandem mass spectrometry (LC-QToF-MS/MS), inductively coupled plasma mass spectrometry (ICP-MS), and micro X-ray fluorescence (µ-XRF), to give new insights into the metabolism of metabolites including lipids, as well as elements including Na+ and K+. These analytical platforms were used to analyze early stages of germination in two barley genotypes that differ in their germination phenology (Mundah, early germinating, and Keel, late germinating) under salt stress. Barley was chosen as it is not only an agriculturally and industrially important crop, but also exhibits a typical cereal germination process, allowing the results to be applied to other cereal crops.

This combined approach provides unique insights into the spatially resolved coordination of metabolic processes that are sequestered among the germinating embryo axis, the lipid rich scutellum, the nutritive endosperm, the digestive enzyme secreting aleurone and the outer pericarp cell layers during the early stages of seed germination. LC-QToF-MS provided information on the distribution of a specific flavonoid in five additional barley genotypes to determine potential roles in response to salt stress. On-tissue analysis using MSI-MS/MS further confirmed the identity of a tentative flavonoid resulting in nine fragment ions. Techniques such as ICP-MS and µ-XRF were used to provide data on element levels that are dynamically mobilized during barley seed germination.

# MATERIAL AND METHODS

# Plant Material

Seven varieties of barley were selected based on their importance for crop production, commercially relevant traits, and known difference in germination phenology and salinity tolerance (Tavakkoli et al., 2012; Kamboj et al., 2015; Sarabia et al., 2018; Yu et al., 2018). This selection comprised two Australian feed barley cultivars (Mundah and Keel), one Australian food cultivar (Hindmarsh), and three malting cultivars (Gairdner, Vlamingh, and Clipper), as well as one North African landrace (Sahara). All seeds were sourced from The University of Adelaide, Australia.

# Seed Germination and Sample Collection

The germination of seven barley varieties was compared under normal and saline conditions. Twenty seeds from each variety were sterilized in 70% ethanol for 1 min. They were then rinsed 4–5 times in sterile 18.2 MΩ deionized water before being treated with 1.0% (v/v) bleach for 10 min, followed by thorough washing in sterile 18.2 MΩ deionized water 6–7 times. Seeds were then imbibed overnight (~16 h) in sterile 18.2 MΩ deionized water with continuous aeration before being transferred to 1% agar plates containing a range of salt (NaCl) concentrations: 0 (control), 50, 100, 125, 150, 200, and 250 mM. The plates were then sealed with parafilm, wrapped in aluminium foil and kept at 4°C for 48 h before being transferred to a growth chamber maintained at 17°C constant temperature with no light. Plates were scored for the number of seeds germinating at different times after transfer to the growth chamber (48, 72, 96, and 120 h post-imbibition). The germination percentage was calculated using the following equation: Germination (%) = (G/X) × 100, where G = number of seeds germinated and X = number of seeds sown.
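As a minimal illustration of the germination equation above, the following sketch computes the percentage for each scoring time point. The function name and the germinated-seed counts are hypothetical; only the formula, the 20 seeds per variety, and the scoring times come from the text.

```python
def germination_percentage(germinated, sown):
    """Germination (%) = (G / X) * 100, with G seeds germinated out of X sown."""
    if sown <= 0:
        raise ValueError("number of seeds sown must be positive")
    if not 0 <= germinated <= sown:
        raise ValueError("germinated count must lie between 0 and seeds sown")
    return germinated / sown * 100


# Hypothetical counts for one plate of 20 seeds, keyed by hours post-imbibition.
SEEDS_SOWN = 20
scores = {48: 5, 72: 12, 96: 17, 120: 19}
percentages = {
    hours: germination_percentage(count, SEEDS_SOWN)
    for hours, count in scores.items()
}
```

Validating the counts before dividing guards against scoring errors, such as recording more germinated seeds than were sown on a plate.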

Seeds from barley varieties with the highest and lowest germination frequencies under salinity, respectively, were used in a second experiment to produce germinating seeds for MALDI-MSI, ICP-MS, and µ-XRF. The seeds were sterilized as described above. After 16 h of overnight aeration, seeds were harvested for the zero hour time point (0 h) and the remaining seeds were aseptically transferred to petri dishes with and without saline agar. These dishes were kept in the dark at 4°C for 48 h and then transferred into a growth chamber maintained at a constant temperature of 17°C with no light. Germinating seeds were harvested at 8, 16, 24, 48, or 72 h of growth so that seeds at different developmental stages could be analyzed. These seeds were immediately frozen in Super Cryo Embedding Medium (SCEM, Section Lab) using liquid nitrogen and stored at −80°C in 50 ml conical centrifuge tubes (Thermo Fisher Scientific, USA) until subjected to cryosectioning.

# Chemicals

Solvents were purchased from Merck Millipore (Bayswater, VIC, Australia). Chemicals including 2,5-dihydroxy benzoic acid (DHB), elemental red phosphorus and Supra pure® nitric acid (70%) were purchased from Sigma-Aldrich (Castle Hill, NSW, Australia). Embedding and freezing supplies including a cryofilm fitting tool set (2.0 cm), embedding container (1.5 cm × 2.0 cm), embedding medium (SCEM), and cryofilm 2C(9) (2.0 cm in width) were purchased from Section-Lab Co. Ltd. (Tokyo, Japan). Sectioning supplies including Menzel-Gläser Superfrost Ultra Plus Glass slides, Optimal Cutting Temperature (O.C.T.) compound and Feather® C35 tungsten microtome blades were purchased from Grale HDS (Ringwood, Australia). Lysing Matrix Tubes with 0.5 g Lysing Matrix D (1.4 mm ceramic spheres) were purchased from MP Biomedicals (Seven Hills, NSW, Australia). Elemental standards were purchased from PerkinElmer (Melbourne, VIC, Australia). Aqua Regia (concentrated hydrochloric acid and concentrated nitric acid in the ratio 4:1) was prepared fresh immediately prior to use.

# MALDI-MSI: Sample Preparation and Imaging

For MALDI-MSI, samples from six time points during germination (**Supplementary Figure S1**) of the two barley varieties were analyzed. Prior to sectioning, frozen seed samples were kept in a Reichert Jung 2800 FRIGOCUT M cryostat at −20°C for 30 min. The frozen sample was then attached to a cold specimen disk using Optimal Cutting Temperature (OCT) compound and trimmed with a C35 tungsten microtome blade (Feather®). Longitudinal sections of 14 µm were stabilized with cryofilm 2C(9) (Section Lab, Japan), attached to a glass slide with electrically conducting double-sided tape according to Kawamoto's film method with slight modifications (Kawamoto, 2003; Zaima et al., 2010b; Boughton et al., 2016), and immediately freeze dried in a Christ Alpha 1–4 LD plus Freeze Drier (John Morris Scientific, Australia) under vacuum (1.0 mbar) for 20 min. This step was important to eliminate water from the seed sections and to facilitate their manipulation and storage at room temperature before further analysis. The sections were chosen from the centre of the seed so that the embryo, endosperm and aleurone layer were included in all sections. To account for biological variation, at least three different seeds from each time point and treatment were analyzed. These were treated as technical and biological replicates.

2,5-dihydroxy benzoic acid (DHB) was used as sublimation matrix for positive mode analyses. For this study, the optimal conditions were 0.11 Torr pressure during sublimation, a hot plate temperature of 140°C, 300 mg DHB as matrix added to the sublimator, and a sublimation period of 4 min.

MS imaging of seed sections was performed on a Bruker SolariX 7T hybrid ESI-MALDI-FT-ICR mass spectrometer equipped with a Smartbeam II laser (Nd:YAG laser 355 nm) operating at 2 kHz, using the "minimum" focus setting and smart walk enabled (Bruker, Bremen, Germany). For every run, laser fluence was optimized to obtain the best signal-to-noise (S/N) ratio. The MS was operated in the positive ionization mode across the mass range *m/z* 150–2,800. MSI measurements were acquired with a raster of 75 × 75 µm. The measurement of the spots was carried out in random order to eliminate influences of measurement order. An external calibration (quadratic equation) was carried out using elemental red phosphorus obtaining a minimum of seven points of calibration over the mass range. A mass accuracy better than 4 ppm was obtained across a tissue section image. Co-registration between each MSI data set and its optical image was performed with flexImaging software v.4.1 (Bruker Daltonics, Bremen, Germany).

MS/MS analyses of selected ions, previously identified through SCiLS Lab (SCiLS Lab 2017a, SCiLS GmbH) analysis, were conducted in positive mode using a laser power of 40%. A total of 5,000 laser shots per sample were fired randomly across the relevant tissue region. The isolation window was set to 5 *m/z* with the collision RF amplitude set to 1,200.0 Vpp and collision voltage set to 20 V.

# MALDI-MSI: Analysis, Data Processing, Detection and Annotations

The raw MALDI-MSI datasets were analyzed using flexImaging software v.4.1 (Bruker Daltonics, Bremen, Germany) and normalized to the Root Mean Square (RMS). Any *m/z* peaks that showed spatial distribution across the tissue sections were manually annotated and used to create a list of spatially distributed potential mass features. The generated list was further analyzed to identify protonated ([M+H]+), sodiated ([M+Na]+) and potassiated ([M+K]+) adducts of the mass features that displayed spatial distribution, prior to tentative identification of potential metabolites by accurate precursor mass search (<5 ppm) against the LIPID MAPS database (Fahy et al., 2007). Average MS spectra were also exported from flexImaging and imported into SimLipid v.5.50 (Premier Biosoft International, Palo Alto, CA, USA) for accurate mass annotation. Tentative annotations were performed to level two based on accurate mass match and spectral libraries (Sumner et al., 2007). All potential lipids and metabolites detected with multiple adducts were collapsed into a single ID for identification and analysis purposes.
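The adduct and accurate-mass matching step can be illustrated with a minimal sketch. It is not the flexImaging, SimLipid or LIPID MAPS implementation; the function names are hypothetical, the adduct mass shifts are standard monoisotopic values, and the neutral mass of LPC(16:0) is computed here from its formula (C24H50NO7P) rather than taken from the paper.

```python
# Monoisotopic mass shifts (Da) for singly charged positive-ion adducts.
# Standard values: a proton, and Na or K minus one electron.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_adducts(observed_mz, neutral_mass, tolerance_ppm=5.0):
    """Return {adduct: ppm error} for every adduct within the tolerance."""
    hits = {}
    for adduct, shift in ADDUCT_SHIFTS.items():
        error = ppm_error(observed_mz, neutral_mass + shift)
        if abs(error) < tolerance_ppm:
            hits[adduct] = error
    return hits


# Illustrative check with the m/z 496.338 feature annotated as LPC(16:0).
# Neutral monoisotopic mass of C24H50NO7P (assumption: computed from the formula).
LPC_16_0_MASS = 495.3325
hits = match_adducts(496.338, LPC_16_0_MASS)
for adduct, error in hits.items():
    print(f"{adduct}: {error:+.2f} ppm")
```

Under these assumptions only the protonated adduct falls inside the 5 ppm window. Features that match the same neutral mass through different adducts would share one ID, which is the collapsing step described above.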

Annotated *m/z* features were imported from flexImaging into SCiLS Lab software (SCiLS Lab 2017a, SCiLS GmbH, Bremen, Germany; http://scils.de/) and further used to identify discriminant features between control sections and treatments. Sub regions were created to analyze discriminant lipids in embryo and endosperm regions. Normalization of data was performed according to the Total Ion Count (TIC) of all spectra prior to data analysis. Next, discriminative *m/z* values were identified from all individual spectra with a minimal interval width of ±5 mDa between control and salt-treated sections. A Receiver Operating Characteristic (ROC) curve was used to quantify how well the *m/z* values discriminated between the two treatments (control and salt). Strict ROC Area Under the Curve (AUC) values of ≥0.8 and ≤0.2 were set as an additional criterion to the Student's *t*-test *p*-value of <0.005 for peaks considered as statistically significant for salt vs control and control vs salt, respectively (Rauser et al., 2010). An intensity box plot was also generated, depicting the intensities of a given *m/z* interval in the visible regions through their quartiles.
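The AUC criterion can be sketched as follows. This is a generic rank-based AUC, not the SCiLS Lab routine: the function names and the intensity values are made up for illustration, and the additional *t*-test criterion is not included.

```python
def roc_auc(salt_intensities, control_intensities):
    """AUC as the probability that a randomly chosen salt spectrum has a
    higher intensity than a randomly chosen control spectrum
    (ties count as one half)."""
    wins = 0.0
    for s in salt_intensities:
        for c in control_intensities:
            if s > c:
                wins += 1.0
            elif s == c:
                wins += 0.5
    return wins / (len(salt_intensities) * len(control_intensities))


def classify(auc, upper=0.8, lower=0.2):
    """Apply the AUC thresholds described in the text."""
    if auc >= upper:
        return "higher in salt"
    if auc <= lower:
        return "higher in control"
    return "not discriminant"


# Hypothetical normalized intensities of one m/z interval.
salt = [5.1, 6.3, 7.0, 6.8, 5.9]
control = [2.0, 3.1, 5.5, 2.7, 3.3]

auc = roc_auc(salt, control)
print(round(auc, 2), classify(auc))
```

An AUC near 1 means the feature is almost always more intense in salt-treated spectra, an AUC near 0 means the reverse, and values around 0.5 carry no discriminating information.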

Peak lists were then exported and reimported into flexImaging to verify and confirm the spatial distribution of identified discriminant peaks. After analysis, a curated list of spatially distributed peaks was generated for control vs salt-treated images. For MS/MS analysis, raw data was analyzed on Bruker Compass DataAnalysis v5.0 using the deconvolution and manual annotation tools.

# ICP-MS: Sample Preparation and Analysis

Dry seeds were ground to a fine powder using a Tube Mill control (IKA Mills, Selangor, Malaysia) at 25,000 rpm (30,500 × g) for 4 min. The ground samples were accurately weighed (20 mg) into 10 ml Falcon tubes (Sarstedt, Australia). These samples were then digested in 300 µl acid solution (Aqua Regia) by heating for 1.5 h at 80°C. After digestion and cooling to room temperature, deionized water was added to make the final volume to 10 ml. The tubes were then centrifuged at 5,000 rpm (6,100 × g) for 5 min. The supernatant was used for ICP-MS analysis.

A modified method from Callahan et al. (2016) was used for this study. An ICP-MS instrument (NexION 350X, PerkinElmer, USA) was used to measure the concentration of the following elements: P, S, K, Ca, Mg, Mn, Co, Ni, Cu, Zn, and Fe. The internal standards Sc (200 ppb) and Rh (20 ppb) in 1% Aqua Regia were used for correction. The internal standard was mixed with the sample stream *via* a T-piece in a 1:1 ratio giving a final acid concentration of 2% at the source. Two sets of calibration standards were prepared. Set 1 contained P, Na, K, and Ca at 500, 1,000, and 5,000 ppb. The second set contained the other elements at 0.1, 1, 10, 50, 100, and 500 ppb. The mass spectrometer was operated in kinetic energy discrimination mode (KED) with 50 ms dwell times, 20 sweeps, one reading and three replicates. The plasma source conditions were: nebulizer gas flow 1.02 L min−1, auxiliary gas flow 1.2 L min−1, plasma gas flow 15 L min−1, ICP RF power 1,500 W.

Data were processed using Syngistix (PerkinElmer) software. Signal responses were normalized to the rhodium (Rh) internal standard, sample weights, and dilution factors.
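The normalization can be illustrated with a short sketch. It is not the Syngistix procedure: the function names and the example numbers are hypothetical, and any extra dilution from mixing the sample with the internal standard stream is ignored.

```python
def internal_standard_corrected(analyte_signal, rh_signal, rh_reference_signal):
    """Scale the analyte signal by the drift observed in the Rh internal standard."""
    return analyte_signal * (rh_reference_signal / rh_signal)


def concentration_mg_per_g(solution_ppb, final_volume_ml, sample_mass_mg):
    """Convert a digest concentration (ug/L) to mg of element per g of sample."""
    micrograms_in_digest = solution_ppb * (final_volume_ml / 1000.0)
    return micrograms_in_digest / sample_mass_mg  # ug/mg is numerically mg/g


# Hypothetical example: Rh reads 10% low, so the analyte signal is scaled up.
corrected = internal_standard_corrected(900.0, 9000.0, 10000.0)

# 20 mg of seed powder digested and made up to 10 ml, as in the sample preparation;
# the 1,000 ppb reading is an invented value.
na_mg_per_g = concentration_mg_per_g(1000.0, 10.0, 20.0)
print(round(corrected, 1), na_mg_per_g)
```

With these inputs, a digest reading of 1,000 ppb corresponds to 0.5 mg g−1 in the seed powder.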

# µ-X-ray Fluorescence Analysis

An M4 Tornado µ-XRF spectrometer (Bruker, Bremen, Germany) equipped with a Rh anode side window X-ray tube with a polycapillary lens was used to obtain elemental distribution maps in this study. The step size was set to 40 µm (for whole seed) combined with high excitation intensity. Direct analysis was performed on seed sections prepared as described in Section 3.4 (without matrix application). The sections were placed on the µ-XRF platform under 20 mbar vacuum, with a voltage of 50 kV and an anode current of 199 µA (Sarabia et al., 2018). A detailed summary of the instrumental conditions and operating parameters used to analyze barley seeds in this study is given in **Table S1**. Spectra acquisition and 2D elemental maps were obtained using the Esprit software (Bruker, Berlin, Germany).

# LC-QToF-MS: Sample Preparation and Analysis for Flavonoid Detection

Seeds were germinated and harvested as described above (Section 2.2) followed by immediate freeze drying in a Christ Alpha 1–4 LD plus Freeze Drier (John Morris Scientific, Australia) under vacuum (1.0 mbar) for 48 h. This facilitated fine powdering of seeds prior to extraction. A modified method from Routaboul et al. (2006) was used for extracting flavonoids. Briefly, flavonoid extracts were obtained from 30 mg dried seed powder. Powder was transferred into cryo-mill tubes along with 1 ml acetonitrile/water (75:25; v/v) containing the following internal standards: 13C6-Sorbitol (0.02 mg/ml) and 13C5 15N–Valine (0.02 mg/ml). The samples were homogenized using a cryomill (Bertin Technologies; 6,000 rpm (7,320 × g) for 3 repetitions of 45 s with a 45 s interval between repetitions) at −10°C. The mixture was then sonicated for 20 min followed by centrifugation at 13,000 rpm (15,900 × g) for 10 min at room temperature. The supernatant was separated and transferred into a fresh 2 ml tube. The remaining pellet was further extracted with 1 ml acetonitrile/water (75:25; v/v) on a thermomixer, shaking overnight (500 rpm, 4°C). The solution was centrifuged, and the two extracts were pooled and dried in a SpeedVac (Martin Christ Gefriertrocknungsanlagen GmbH, Germany) without heating.

Dried flavonoid extracts were resuspended in 200 µl acetonitrile/water (75:25; v/v), and 5 µl (for MS analysis) or 10 µl (for MS/MS analysis) aliquots were injected onto a Poroshell 120, HILIC-Z, 2.1 × 100 mm (2.7 µm particle size) column (Agilent Technologies, Santa Clara, CA, USA) at 25°C, and analyzed on an LC system coupled to an Agilent Technologies (Santa Clara, CA, USA) 6545B ESI-QToF-MS. Flavonoids were eluted at 0.250 ml/min over 18 min with a binary gradient of ACN-water (90:10, v/v) containing 20 mM ammonium formate and formic acid, and water containing 20 mM ammonium formate and formic acid, as described by Routaboul et al. (2006). MS data were acquired with an ESI source in positive ion mode. The MS data presented correspond to ten pooled biological replicates for each treatment group. MS/MS was performed to obtain the MS2 spectra of the identified compound.

# LC-QToF-MS: Data Processing and Analysis

LC-MS data was processed using MassHunter Qualitative Analysis and MassHunter Quantitative Workstation v.B.08.00 software (Agilent Technologies, Santa Clara, CA, USA) for flavonoid identification. Peak spacing tolerance was set to 0.5 *m/z* with a retention time window of 0.1 s. The acquired data was exported, normalized to sample weight, and log transformed. Univariate statistical analysis (Student's *t*-test) was carried out using Minitab software. Fold changes were determined between control and treated samples of seven genotypes.
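The processing chain (weight normalization, log transformation, *t*-test, fold change) can be sketched with the Python standard library. This is an illustration, not the MassHunter or Minitab workflow: the peak areas are invented, the base of the logarithm is not stated in the text (log10 is assumed here), and only the *t* statistic and degrees of freedom are computed, leaving the *p*-value to a statistics package.

```python
import math
from statistics import mean, variance


def normalize(peak_areas, sample_weights_mg):
    """Divide each peak area by the weight of the extracted sample."""
    return [a / w for a, w in zip(peak_areas, sample_weights_mg)]


def student_t(group_a, group_b):
    """Two-sample Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    dof = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / dof
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, dof


# Hypothetical peak areas for one compound, 30 mg of powder per extract.
weights = [30.0, 30.0, 30.0]
control = normalize([3000.0, 3300.0, 2700.0], weights)
salt = normalize([9000.0, 9900.0, 8100.0], weights)

# Fold change on the weight-normalized scale, t-test on the log scale.
fold = mean(salt) / mean(control)
t_stat, dof = student_t([math.log10(x) for x in salt],
                        [math.log10(x) for x in control])
print(f"fold change {fold:.1f}, t = {t_stat:.2f}, df = {dof}")
```

Testing on the log scale is a common choice because it makes multiplicative differences between treatments additive.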

# RESULTS

# Overview of Experimental Workflow

A schematic of the experimental workflow is provided in **Figure 1**. MALDI-MSI analysis was performed for seeds grown under control and saline conditions and harvested at six different time points, namely 0, 8, 16, 24, 48, and 72 h post imbibition (**Figure S1**). Sublimation of sectioned tissue with DHB matrix was used for MALDI-MSI positive mode ionization to image the spatial distribution of a wide range of lipids (*m/z* 150–2,800) in the seed sections. ICP-MS (for total elemental analysis) and µ-XRF spectrometry (to obtain the spatial information of targeted elements) were also performed for three time points (0, 24, and 48 h post imbibition).

# Germination Assay: Germination and Seedling Growth Under Salt Stress

In the initial germination assay, seeds from seven barley genotypes (Vlamingh, Mundah, Clipper, Sahara, Hindmarsh, Gairdner, and Keel) were germinated in petri dishes containing different salt concentrations (0–250 mM NaCl) and observed for germination from 0 to 120 h post imbibition (**Table S2**). Seeds were considered to be germinated once emergence of the cotyledon was visible. There was a clear difference in the germination frequency of the seeds germinating in the presence of salt. Keel showed the greatest inhibition (only 65% germinated at the highest salt concentration) while all Mundah and Vlamingh seeds germinated. Other genotypes showed germination frequencies between 75 and 90%. The most tolerant genotypes were viable under saline conditions for at least 7 days. Based on this assay, two Australian barley feed varieties were chosen to investigate the metabolic changes that occur during germination under salinity: Mundah, a very early spring variety from Western Australia with very high early vigor (highest germination rate under salinity), and Keel, an early spring variety from South Australia with moderate early vigor (lowest germination rate under salinity) (Long et al., 2003).

# MALDI-MSI of Lipids in Barley Seeds During Germination

To identify lipids and other small molecules, the distribution of these compounds at all post imbibition time points in control and saline conditions was compared in both genotypes. Lipids within longitudinal sections of mature barley seeds were profiled by MALDI-MSI at six time points post imbibition (**Figure S1**). Signals derived from the DHB matrix were excluded from further analyses. A total of 774 peaks in the average MS spectra were found to have a differential spatial distribution in the samples. An example of an average mass spectrum between *m/z* 780–900 acquired from a Mundah seed at 24 h post imbibition is presented in **Figure 2**, illustrating a specific peak that can be selected to display the distribution of a corresponding lipid across the section. Many low abundance lipids were not reproducibly detected across all sections due to analytical and biological variations and thus, only the presence and tissue distribution of the tentatively identified lipids observed across three biological replicates are discussed (**Figure S2**).

All peaks showing spatial distribution in the seed sections were selected from flexImaging (**Tables S3 A–D**). Mass precursor ion search (<5 ppm) of the LIPID MAPS database (Fahy et al., 2007) was performed to provide tentative annotations of the features detected and resulted in between 246 and 319 tentatively assigned mass features for each sample. Some of these mass features corresponded to [M+H]+, [M+Na]+, and [M+K]+ ion adducts of the same lipid and were observed in all samples. Two hundred and sixty-nine lipid species were identified in both the Mundah control and salt treated seeds. Three hundred and nineteen lipid species were identified in Keel control seeds but only two hundred and forty-six in salt treated Keel seeds. Lipids were annotated to the sum level only using the Liebisch notation (Liebisch et al., 2013). In-source fragmentation of more complex lipid species, such as Phosphatidylcholine (PC), during detection may lead to incorrect identification of fragments as belonging to other lipid species such as Phosphatidic Acid (PA), Fatty acyls (FA) and Lysophosphatidylcholines (LPC) (Wang, 1999), and this was taken into account during annotation of the lipid species. For example, it was demonstrated that PC, PA, and LPC species had different spatial distributions in seeds (**Figure 3**), suggesting that intact lipid species are being detected and not fragments of other lipids.

Tentatively annotated lipid species with several ion adducts, but the same masses, were collapsed into a single ID. The returned matches were filtered to remove tentative annotations of mammalian origin. As a result, 234 lipid species were tentatively identified in Mundah control (**Table S4A**), 202 in salt treated Mundah seeds (**Table S4B**), 155 in Keel control (**Table S4C**) and 145 in salt treated Keel seeds (**Table S4D**). The tentatively annotated lipids were categorized into seven major lipid classes: Fatty acyls (FA), Glycerolipids (GL), Glycerophospholipids (GP), Sphingolipids (SP), Prenol lipids (PR), Sterol lipids (ST), and Polyketides (PK). The percentage of each lipid class in both genotypes is given in **Table 1**. This was solely based on the number of lipid species matched with the LIPID MAPS database. The distribution of lipid classes at all timepoints in both genotypes is shown in **Supplementary Figure S3**. For both genotypes, the number of lipids identified is smaller in the seeds germinated in the presence of salt than in the control, but in Mundah, the difference between the two treatments is much smaller. The number of lipids identified in each class decreased in Keel after salt treatment, whereas in Mundah, only GL, GP, and ST decreased.

TABLE 1 | Percentage of individual lipid classes in two treatments of Mundah and Keel. The values are total percentages of annotated lipids at all timepoints.


In Mundah seeds, GP was the major class of lipid altered (43% in Mundah control and 40.6% in Mundah salt). In the GP class, PA, Phosphatidylethanolamine (PE), Phosphoglycerols (PG), and PC were tentatively identified as the major lipid subclasses. The GL class decreased in saline conditions but in control conditions, TAGs were detected in high number. The percentage of ST and GP decreased under salinity, but the sterol/phospholipid ratio was unaffected by salt (0.25% in control and 0.24% in salt).

GP is also the major class altered in Keel seeds (26.3% in Keel control and 34.1% in Keel salt). It is worth noting that the increased number of GP lipids detected in Keel salt treated seeds is likely due to different adducts being detected for the same lipid. For example, in the GP class, LPC(16:0), LPC(18:0), and PC(36:5) were found with [M+H]+ and [M+K]+ ion adducts and PC(34:2) with [M+Na]+ and [M+K]+ ion adducts. The percentage of ST decreased significantly under salt stress in Keel with a slight decrease in GP under salt stress and hence the sterol/phospholipid ratio was affected by salt in Keel seeds (0.25% in control and 0.12% in salt). It is noted that PC lipid species of the 36:n family were found to be differentially distributed at all timepoints in both genotypes under control and salt conditions.

FIGURE 3 | Different lipid species displaying different spatial distribution. (A) scanned image of Mundah seed at 48 h in control conditions; (B) scanned image of Mundah seed at 48 h in salt conditions; A-I and B-I: *m/z* 496.338 LPC(16:0) showing spatial distribution only in endosperm of both seeds with different intensities in control and salt conditions. A-II and B-II: *m/z* 757.574 PA(40:2) showing spatial distribution in entire seed with different intensities in control and salt treated seed; A-III and B-III: *m/z* 820.530 PC(36:4) showing spatial distribution in the aleurone layer and embryo of control and salt treated seeds. Intensity scale for both images was set to 0-50%. Data was normalized using Root Mean Square.

Gorzolka et al. (2016) performed MSI to localize metabolites during the first seven days of germination in barley seeds and showed the distribution of 11 lipids in barley seeds at different stages of germination and different regions of seeds. In our study, we confirmed the presence of 10 out of 11 lipids from Gorzolka et al. (2016) in Mundah control, 9 in Mundah salt, 4 in Keel control and 9 in Keel salt treated seeds (listed in **Table S5**). Time-dependent changes were also observed in the lipid profile of the endosperm early in germination in Gorzolka et al. (2016). Mono- and diacyl PCs (*m/z* 496, 520, and 758) were highly abundant, and these lipids changed their distribution at later germination stages (Gorzolka et al., 2016). In the current study, the *m/z* 496 and 758 ions could be detected in both genotypes under both treatments. However, they were not consistently observed across all time points. In contrast, the *m/z* 520 ion was only detected in salt treated Keel seeds. The distribution pattern for these three PCs is given in **Table S6**.

# Lipids Detected Immediately After Imbibition and After 8 h of Salt Stress

Comparison between the annotated lipid classes at 0 h and 8 h post-imbibition in control Mundah seed sections showed 26 lipid species present at both time points whereas only 16 lipid species were found in common when 0-hour and 8-hour salt treated seeds were compared (**Tables S7A, B**). In Keel, 30 lipid species were detected in both 0-hour and 8-hour control sections whereas 20 lipid species were found in common between 0-hour control and 8-hour salt treated seed sections (**Tables S7C, D**). After 8 h of salt stress, 65 lipid species were differentially accumulated compared to untreated Mundah seeds at 8 h post imbibition (**Table S8**). The major lipid class detected in 8 h salt treated seeds in Mundah was GP. In contrast, 38 lipids were detected after 8 h of salt stress when compared to the 8 h untreated seeds in Keel (**Table S9**). The major lipid class detected in Keel after exposure to salinity stress was GL.

# Spatial Segmentation of the MALDI-MSI Data Using an Unsupervised Clustering Approach Using SCiLS Lab

Relatively few lipid species changed their spatial distribution in Keel compared to Mundah seeds. In Mundah, out of 225 lipid species that were found to be spatially distributed in salt treated seeds at all time points, 71 (31.5%) were salinity treatment specific. In Keel, out of 163 lipid species spatially distributed in salt seeds at all time points, 48 (29.4%) were specific to salt stress based on LIPID MAPS annotations.

SCiLS Lab was used to determine the discriminative features for both treatments in Mundah and Keel. A univariate measure was used to assess the quality of discrimination for imported *m/z* values for masses created in flexImaging, which quantified how well the *m/z* values discriminated between two treatments. A threshold for Area Under ROC (Receiver Operating Characteristic) Curve (AUC) was used to determine the discriminating masses in both treatments for both genotypes. A perfect discrimination was accepted where AUC value was equal to or close to 1 for salt treated seeds and close to or equal to zero for control seeds. The threshold for this analysis was set to 0.8 for salt treated seeds and 0.2 for control seeds. Subregions were created in SCiLS Lab to find *m/z* values that discriminate between the two treatments for embryo and endosperm in both genotypes. An example of a subregion created for Mundah seeds (*m/z* 615.494) and its corresponding ROC plot for DAG (34:2) with AUC value of 0.087 is given in **Figure S4**. The intensity box plot chart that describes a single *m/z* interval and a single region was also used.

Further discriminant lipids were detected at all timepoints using the above-mentioned threshold values in both genotypes. Two subregions were created — embryo and endosperm. In Mundah embryo, lipids discriminant to control seeds (AUC close to zero) increased in number over time whereas in Keel embryo, the number of discriminant lipids with AUC close to zero decreased over time. In Mundah endosperm, 45 lipids were found to be discriminant between control and salt treated seeds with only nine lipids showing AUC value between 0.8–1.0. In Keel endosperm, only four lipids were discriminant with AUC values between 0–0.2 (**Table S10**).

In Mundah embryos, a tentative flavonoid, *m/z* 365.102 [M+H]+ (AUC for Mundah 16 h, 0.867; Mundah 24 h, 0.949; Mundah 48 h, 0.907) was found to be strongly discriminant at 16, 24, and 48 h (**Figures 4A–C**, **A'–C'**) whereas the same mass was observed in Keel embryos at 72 h only (AUC 0.946) (**Figures 4D**, **D'**). Signal intensity was scaled to optimize visualisation of the respective ion. The intensity box plots and ROC plots for the above-mentioned timepoints show the discrimination capabilities of the given *m/z* for the two regions (control and salt) in Mundah at 16, 24, and 48 h (**Figure S5A**) and Keel at 72 h (**Figure S5B**), respectively. No common lipids were found in Mundah embryos over all time points. In Keel embryos, diacylglycerols (*m/z* 602.523) and a fatty alcohol (*m/z* 576.508) discriminated between control and salt treatments at 16 and 24 h with AUC values of 0.19 and 0.13, respectively. In Mundah endosperms, it is worth noting that fatty acyls (*m/z* 339.288, [M+Na]+) were found in salt treated seeds at 48 h with an AUC of 0.88 whereas at 72 h, they were found at high intensity in control seeds with an AUC value of 0.17. No discriminant lipids were found in Keel endosperms. Although several lipid species were found to be distinct at 8 h salt as compared to 8 h control as mentioned in Section 3.4, no discriminant lipids between these two treatments at 8 h were found in either genotype.

# Confirmation of Tentative Flavonoid Using MS/MS

# MALDI-MS/MS

To probe the identity of the tentative flavonoid (*m/z* 365.102 [M+H]+), on-tissue MALDI-MS/MS spectra were generated and the identified fragments were recorded. A 5 *m/z* window centred around *m/z* 365.1 was isolated with no collision voltage applied. **Figure 5** shows the two distinct peaks with very close masses obtained from MALDI-MS: *m/z* 365.1024, which was matched in the LipidMaps and METLIN (http://metlin.scripps.edu) databases to a flavonoid, and *m/z* 365.1058, which was assigned as a dihexose sugar (Calvano et al., 2017). Subsequent MS/MS using the same isolation window and application of 20 V collision energy led to the observation of nine product ions (**Figure S6**) that were matched with the *in silico* predicted spectra of a flavonoid, Gancaonin F (METLIN database) (Smith et al., 2005). Corresponding ions associated with the fragmentation of a dihexose sugar were also observed.
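How close the two precursor peaks are can be made concrete with a small calculation. This is a rough illustration only: the function names are hypothetical, and the simple ratio m/Δm is a lower bound for telling the peak centres apart rather than a formal instrument specification.

```python
def ppm_difference(mz_a, mz_b):
    """Separation of two peaks in parts per million of their mean m/z."""
    mid = (mz_a + mz_b) / 2.0
    return abs(mz_a - mz_b) / mid * 1e6


def required_resolving_power(mz_a, mz_b):
    """Rough resolving power (m / delta m) needed to tell two peaks apart."""
    mid = (mz_a + mz_b) / 2.0
    return mid / abs(mz_a - mz_b)


# The two precursor peaks reported for the isolated window around m/z 365.1.
flavonoid_mz = 365.1024
dihexose_mz = 365.1058

separation_ppm = ppm_difference(flavonoid_mz, dihexose_mz)
resolving_power = required_resolving_power(flavonoid_mz, dihexose_mz)
print(f"{separation_ppm:.1f} ppm apart, resolving power of about {resolving_power:,.0f}")
```

The two peaks are about 9 ppm apart, which corresponds to a resolving power on the order of 100,000 at this *m/z*. Both ions fall inside the 5 *m/z* isolation window, consistent with fragments of both compounds appearing in the MS/MS spectrum.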

# LC-MS/MS

LC-MS/MS was performed on seed extracts to provide further evidence for annotation of the possible flavonoid and dihexose sugar. Both MS and MS/MS were conducted with the protonated precursor ion *m/z* 365.1 isolated for MS/MS. The extracted compound chromatogram (ECC) at the MS level showed two separate peaks with retention times of 6.71 and 4.49 min. MS/MS using 40 eV and 20 eV, respectively, showed distinct differences in the corresponding spectra between the two compounds. Inset **Figure 5A** displays the mass spectrum for one compound with seven fragments, of which three fragments (*m/z* 347, 323, and 307) match closely with the *in silico* predicted spectra of Gancaonin F (METLIN database) and the MALDI-MS/MS data obtained from on-tissue analysis in this study. Inset **Figure 5B** shows the mass spectrum of a compound with two fragments (*m/z* 203 and 185) that match exactly to dihexose sugar fragments (Calvano et al., 2017). The mass spectrum along with the database search showed a difference of less than 4 ppm between the experimental and database masses identified using LC-MS/MS. **Table 2** shows the observed fragments of the tentative flavonoid detected using MALDI-MS/MS and LC-MS/MS that matched with the *in silico* predicted spectra of Gancaonin F.

# Level of Tentative Flavonoid Increases in Response to Salinity Stress in Four Out of Seven Barley Genotypes

The response of the tentative flavonoid was further elucidated in seven barley genotypes (Mundah, Keel, Vlamingh, Gairdner, Hindmarsh, Clipper, and Sahara) at five timepoints (8, 16, 24, 48, and 72 h) in control and saline conditions (250 mM NaCl) using LC-QToF-MS. The aim was to understand the differential expression of the detected tentative flavonoid in germinating seeds of a barley variety panel consisting of a small but representative set of seven barley genotypes. **Figure S7** shows the differential change in the tentative flavonoid over time in the seven genotypes. Hindmarsh, Clipper, and Gairdner showed no significant difference over time in either condition. In Sahara, the highest increase was observed at 16 h under saline conditions. In Keel, a significant increase was observed at 72 h under saline conditions as compared to control seeds. A similar pattern was observed in Vlamingh, where a significant increase in the tentative flavonoid was seen at 72 h in salt treated seeds as compared to control seeds. In Mundah salt treated seeds, there was a decrease in the tentative flavonoid at 72 h. **Table S11** shows fold changes in signal response of the flavonoid for the seven genotypes over time.

giving seven fragments. Three fragments (marked with asterisks) match with fragments obtained from MALDI-MS/MS given in Figure S6. Inset (B) displays the mass spectrum for a dihexose sugar analyzed using LC-MS/MS with retention time of 4.49 min and displaying two fragments *m/z* 203.052 and 185.041.

TABLE 2 | Theoretical and observed *m/z* for the tentative flavonoid, Gancaonin F from MALDI-MS/MS and LC-MS/MS.


*In silico fragments are obtained from METLIN (http://metlin.scripps.edu).*

*\*mass error observed from LC-MS/MS.*

Under saline conditions, among all genotypes, Vlamingh (+6.3-fold) and Keel (+14.04-fold) showed the highest increase after 72 h salt stress (*p* < 0.001). However, in Mundah the highest increase was observed at 48 h (+9.0-fold) after salt stress with a small decrease at 72 h (+8.9-fold) after salt stress.

# Elemental Concentrations in Seeds

The elemental composition of whole barley seeds was analyzed to determine the effect of 24 and 48 h of salt stress (250 mM NaCl) on the content of eight macro- and micronutrients of both barley genotypes. **Table 3** summarizes the concentrations of elements in control and salt treated seeds and **Figure 6** shows the change in Na+ concentrations in both genotypes after 24 h in control and salt treated seeds. Of all the elements detected, only Na+ content was significantly altered (p < 0.001) after exposure to salt. In Mundah, Na+ concentrations were 0.5 ± 0.12 mg g−1 in control and 3.1 ± 0.48 mg g−1 after salt treatment for 24 h and 0.5 ± 0.06 mg g−1 in control and 4.7 ± 0.06 mg g−1 after salt treatment for 48 h. In Keel, Na+ concentrations were 0.4 ± 0.02 mg g−1 in control and 2.7 ± 0.21 mg g−1 after salt treatment for 24 h and 0.4 ± 0.06 mg g−1 in control and 4.2 ± 0.11 mg g−1 after salt treatment for 48 h. This indicated a significant increase in Na+ content in both Mundah (+9.4-fold) and Keel (+10.1-fold) after 48 h salt stress.

TABLE 3 | ICP-MS of whole barley seed (Mundah and Keel) and its main tissue fractions (mg g−1 dry weight). Each tissue type was weighed and analyzed separately by ICP-MS.

*Data are presented as mean ± standard error, n = 3. Scale in mg g−1 DW.*

*\*Values with significant differences (p < 0.001).*

between two treatments at individual timepoints (*\*P* < 0.001).
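The fold changes in Na+ content at 48 h can be re-derived from the mean concentrations quoted in the text. This is a plain arithmetic sketch with hypothetical variable names; because it uses the rounded means, it need not reproduce the reported fold values exactly, which were presumably calculated from unrounded data.

```python
# Mean Na+ concentrations (mg per g dry weight) at 48 h, as quoted in the text.
na_48h = {
    "Mundah": {"control": 0.5, "salt": 4.7},
    "Keel": {"control": 0.4, "salt": 4.2},
}


def fold_change(salt, control):
    """Ratio of the salt-treated mean to the control mean."""
    return salt / control


folds = {genotype: fold_change(v["salt"], v["control"])
         for genotype, v in na_48h.items()}

for genotype, fold in folds.items():
    print(f"{genotype}: {fold:.1f}-fold")
```

From the rounded means this gives about 9.4-fold for Mundah and about 10.5-fold for Keel, close to the reported values.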

The K+ content in seeds decreased slightly after salt treatment but this was not statistically significant. None of the other six elements (P, Mg, Ca, Mn, Fe, and Zn) showed a statistically significant difference in seeds after salt stress, as shown in **Table 3**. Since the overall concentration of most elements did not change during germination or in response to salt treatment, we examined whether the treatments caused changes in the localization of the elements (S, P, K, and Cl) using µ-XRF analysis on seed sections from the 24 and 48 h time points. In Mundah seeds, S accumulated in the entire seed irrespective of treatment. P was distributed in all seeds at moderate concentrations and was highly abundant in the aleurone layer. K was found in higher concentrations in control seeds than in salt treated seeds at all time points, with high abundance observed in the aleurone layer. These results complement the data obtained from ICP-MS. We were unable to observe Na+ due to instrumental limitations. However, Cl− was used as an indirect measure of the Na+ distribution in seeds (Sarabia et al., 2018). Cl− was highly abundant in salt treated seed sections with a uniform distribution correlating with the increased Na+ content as detected by ICP-MS (**Figure S8**). In Keel, S accumulated across the entire seed similarly to Mundah but was found to be more abundant in sections from salt-treated seeds obtained at 0 and 48 h post imbibition. P was also found to be highly accumulated in the aleurone layer of all sections and with lower abundance in the endosperm. K+ was abundant in samples from the 0 h time point, followed by a less intense distribution over time in both treatments. Cl− was prominently abundant in salt treated seed sections, correlating with the increased Na+ content detected by ICP-MS (**Figure S9**).

# DISCUSSION

In recent decades, salinization of agricultural land has resulted in lower crop productivity. Germination of seeds is directly affected by salinity either by the resulting osmotic stress or by toxic effects of the sodium and chloride ions themselves. In particular, salinity perturbs plant hormone balance (Khan and Rizvi, 1994) and reduces the utilization of seed reserves (Ahmad, 1992). In the past, metabolomics studies have mainly focused on the germination processes in rice and very little has been published on barley or on the effects of saline conditions on the germination process. In addition, these published studies used techniques that do not provide spatial information on metabolic processes. Gorzolka et al. (2016) previously studied the changes in metabolite profiles in barley seeds after 5 days of germination using MALDI-MSI, but changes in the metabolite profiles of barley seeds during the early stages of germination (0 to 72 h post imbibition) were not explored.

The process of uptake of water by a mature dry seed, which occurs in the initial hour of reintroduction of water to the dry seed, is triphasic (Bewley, 1997) and results in the resumption of metabolic activity. These metabolic changes are extremely important to successful germination and, as such, it is vital to document and understand what changes occur (Bewley and Black, 1994; Bewley, 1997; Bewley, 2001). We first conducted a germination assay to establish the germination efficiency of seven barley genotypes under salinity stress. From these results, we selected two genotypes with contrasting germination phenology and salinity tolerance for this study: Mundah [early germinating and salinity tolerant, in agreement with Cao et al. (2017)] and Keel (late germinating and sensitive to salinity).

Following the germination test, ICP-MS and µ-XRF were combined to profile the levels and measure the spatial distribution of elements across six time points in the two selected barley genotypes (Mundah and Keel) under control and saline conditions. Munns and Tester (2008) reported that salt stress results in the accumulation of Na+ in plants and has an adverse effect on K+ concentration, which is detrimental to plant growth and development. Various physiological studies have demonstrated that Na+ toxicity is caused by the ability of Na+ to compete for K+ binding sites and thereby disrupt K+ homeostasis (Volkov and Amtmann, 2006; Shabala and Cuin, 2008; Hasegawa, 2013). Hence, to avoid an ion homeostasis disorder under saline conditions, plant cells need to maintain a low Na+ concentration and a concurrent high K+ concentration in the cytosol, where enzymes for metabolism are located (Zhu, 2003; Pardo et al., 2006). In this study, we observed a substantial increase in Na+ content and a non-significant decrease in K+ content in germinating seeds of both genotypes under saline conditions.

It is apparent from this result that accumulation of excessive amounts of Na+ in germinating seeds is likely to be responsible for the inhibition of germination under salt stress in Keel as compared to Mundah, which was still able to germinate under these conditions. This view is supported by a previous report where increased accumulation of Na+ in wheat seeds under salinity stress inhibited the rate of germination (Begum et al., 1992). The K+/Na+ ratio decreased to a greater extent in both Mundah and Keel after 48 h than after 24 h of salt stress. This result is in alignment with the previously reported reduction of the K+/Na+ ratio in barley roots after salt stress (Shelden and Roessner, 2013). Salinity tolerance also depends on limiting Na+ accumulation and maintaining K+ content in the cytosol in order to preserve ion homeostasis (Flowers et al., 1977; Hasegawa et al., 2000). Using µ-XRF, Cl− was observed to increase after salt stress but no change in K+ was observed. The increased Cl− distribution obtained from µ-XRF analysis is an indirect measure of Na+, as Cl− is likely associated with Na+ as a counterion (**Figures S8**–**9**). The prevailing higher concentrations of these elements after salt stress are likely to affect metabolism and further impact the observed levels of lipids and other metabolites in a germinating seed. It is worth noting that the elemental composition in the embryo region of both genotypes changes after salt stress. Embryo size was also observed to increase under salt treatment in both genotypes. This is potentially due to the role of osmoregulators that exhibit more abscisic acid (ABA) under salt stress. An increase in the production of ABA induces the retention of more elements in those regions after salt treatment (Sun et al., 2015).

Lipids are one of the major components stored exclusively in seed embryos and aleurone layer. They are vital and abundant cellular constituents responsible for cell membrane functionality as well as acting as an energy store to allow metabolism to continue during abiotic stress. The content of lipids in most cereals is relatively low (about 3%) compared to starch and protein. However, their contribution toward the nutritional value, as well as storage stability, of cereal-based food is important (Liu, 2011). Several studies have reported on the content and composition of lipids in whole grains (Price and Parsons, 1975; Welch, 1975; Zhou et al., 1998; Osman et al., 2000; Mehmood et al., 2008) without investigating spatial distributions. In this study, we investigated the effect of salt on the germination of two barley genotypes and on lipid composition and distribution using MALDI-MSI. High quality images taken at a 75 µm raster step size allowed observation of the spatial distribution of a high number of lipids and metabolites in barley seeds. The choice of matrix (DHB) also allowed tentative identification, and profiling, curation and annotation of a large number of lipids. This allowed observation of major lipid species showing differential spatial distributions across seed sections at all time points under control and saline conditions. Despite the higher proportion of lipid species in the embryo [20% lipids (Price and Parsons, 1979; Newman and Newman, 2008)], the largest number of lipids were detected in the endosperm region and not in the embryo. This could be due to the tissue type, as on-tissue extraction of ions is highly influenced by the properties of the tissue. For example, ion abundances can influence the adduct formation, and highly abundant, co-localized lipids or lipids bound to tissue structures can lead to ion suppression (Gorzolka et al., 2016). 
It could also be due to the thin tissue content of the embryo at these developmental stages which is highly hydrated (less dense) whereas the endosperm is still very dense.

Lipids play vital roles in providing structural integrity to cell membranes and are involved in signal transduction and membrane trafficking (Krauß and Haucke, 2007). In seeds, lipids serve as a reserve energy source and act as emulsifiers of fat substrates for lipases (Ory et al., 1967). Various environmental stresses including salinity have been shown to be responsible for changes in lipid metabolism and composition (Wang, 2002; Meijer and Munnik, 2003; Mansour, 2013; Natera et al., 2016; Sarabia et al., 2018; Yu et al., 2018). Lipids are rich in olefinic bonds which are susceptible to oxidative attack (Singh and Sinha, 2002; Parida and Das, 2005) from reactive oxygen species (ROSs) present in membranes during salinity stress (Esfandiari et al., 2007). The numbers of annotated species from the GP and ST classes decreased with increasing salinity in both genotypes. This is in line with findings by Wu et al. (2005), who studied changes in lipid composition following salt stress in *Spartina patens,* a member of the Poaceae family. The decrease in the number of lipid species in Keel could also be due to peroxidation of lipids as mentioned by Pérez‐López et al. (2009).

Among the detected lipid classes in this study, phospholipids were the major lipid species tentatively annotated in both genotypes. This was indirectly supported by µ-XRF data where the distribution of P was found across the entire seed of both genotypes (**Figures S8**–**9**), correlating with regions where higher numbers of phospholipids were also observed using MSI. The presence of P in the aleurone layer in the µ-XRF data also correlates with the MALDI-MSI data, where a higher number of lipids was observed over time. Phospholipid metabolism is associated with a large number of biological molecules and plays a crucial role in various signal transduction pathways in higher plants. In response to various biotic and abiotic stresses (Munnik et al., 1998; Pappan and Wang, 1999), phospholipase D (PLD) catalyzes the hydrolysis of structural phospholipids, PC and other phospholipids, resulting in the formation of PA. Ritchie and Gilroy (1998) reported that PA is an important mediator of abscisic acid (ABA) signal transduction in barley aleurone protoplasts. Levels of PA were observed to increase slightly with increasing salinity in both genotypes. Salinity stress-dependent activation of PA could be due to activation of PLD, which hydrolyses structural lipids such as PC and PE (Meijer et al., 2002). Increased levels of PA are also proposed to facilitate vesicular trafficking and to recruit target proteins to the membrane, influencing their activity (Testerink and Munnik, 2005; Wang, 2005; Roth, 2008).

High salt concentrations in the growing medium usually lead to an increase in the plasma membrane sterol content (López-Pérez et al., 2009). This increase facilitates hydrophobic interactions among the acyl chains, producing greater order in the membrane, and consequently, a more rigid membrane (Silva et al., 2011). In this study, the percentage of sterols decreased under salt stress in Mundah and Keel seeds. A seed usually has a pair of growing tips (apical meristems) that develop into stem (hypocotyl) and root (radicle). Thus, the decrease in sterols in seeds of both genotypes could indicate a need for mobilizing lipids into meristematic tissue, providing nutrients during germination to support seedling growth during the early stages of development. It was also observed that the sterol/phospholipid ratio was greatly affected after salt stress in Keel. Similar results were found in halophyte Brassicaceae species where a significant decrease in sterol lipids was observed after salt stress (Chalbi et al., 2015). However, the sterol/phospholipid ratio was unaffected by salt in Mundah, which aligns with Wu et al. (2005), where the changes in lipid composition after salt stress in *Spartina patens* were studied.

Although phosphatidylcholine (PC) levels remained unchanged in Mundah seeds after salt treatment and increased in Keel after exposure to salinity, PC lipid species of 36:n were found at all time points in both genotypes under control and saline conditions. Similar results were found in the embryonic axis of *Camelina* seeds detected by MALDI-MSI under normal conditions (Horn et al., 2013; Horn and Chapman, 2014). Increased levels of PCs in response to high salinity were also found by Pical et al. (1999) in *Arabidopsis thaliana*. In plants, PCs are generally produced through a combination of the cytidine diphosphate-choline (CDP-Cho) pathway and the methylation pathway (Tasseva et al., 2004). Tasseva et al. (2004) also suggested that enhanced synthesis of PCs is due to an accelerated CDP-Cho pathway. The accumulation of betaine in response to salt has been widely recognized in various plants. *In vitro* studies have shown a positive correlation between the accumulation of betaine and the acquisition of tolerance to salt in maize (*Zea mays*) (Rhodes et al., 1989) and barley (Strange, 1993; Kishitani et al., 1994). Hitz et al. (1981) demonstrated the use of phosphocholine for glycinebetaine synthesis in barley, while Mcdonnell and Jones (1988) also showed that the choline required for oxidation to glycinebetaine is produced *via* turnover of PC. In our study, increased levels of PC over time could possibly result in the synthesis of glycinebetaine, helping seeds to cope under saline conditions. However, this particular mechanism needs further investigation.

High levels of Na+ also cause secondary responses in plants due to increased oxidative stress. Cellular damage from oxidative stress within plant cells (Apel and Hirt, 2004) induces production of reactive oxygen species (ROSs) (Jaleel et al., 2007; Ashraf, 2009). ROSs derived from molecular oxygen can accumulate in the plant cell and cause oxidative damage in cellular components, including proteins and lipids. To prevent the potential cytotoxic effects of ROSs, the stimulation of antioxidant systems can assist in plant protection from oxidative stress (Grace and Logan, 2000; Chutipaijit et al., 2009). Polyphenolic compounds such as flavonoids play an important role in stopping the propagation of oxidative chain reactions (Ksouri et al., 2007). Their biosynthesis is generally stimulated in response to various abiotic stresses, including salinity (Zhou et al., 2018). Flavonoids are also potential inhibitors of the enzyme lipoxygenase, which converts polyunsaturated fatty acids to oxygen-containing derivatives (Nijveldt et al., 2001). These compounds accumulate in plant tissues, protecting them from the damaging effects of salt stress and inhibiting lipid peroxidation (Potapovich and Kostyuk, 2003). In our MSI results, a tentative flavonoid (*m/z* 365.102 [M+H]+) was found to be discriminant in Mundah and Keel embryos after exposure to salinity. Analysis using LC-MS/MS and MALDI-MS/MS confirmed the compound to be a flavonoid, tentatively annotated as Gancaonin F. Flavonoid profiling data from LC-QToF-MS supported our MALDI-MSI data (refer to **Figure S7** and **Table S11**). This is in alignment with Chutipaijit et al. (2009), who described a significant increase in total flavonoid content in salt-stressed rice seedlings, and Taïbi et al. (2016), where a significant increase in flavonoid content was seen in *Phaseolus vulgaris* under salt stress.

Under normal conditions, the quiescent embryo is able to germinate after imbibition and breaking dormancy (Mcdonald and Copeland, 2012). However, in salt stress conditions, the radicle has to avoid oxidative damage for successful germination. To detoxify or scavenge the severe effects of stress, the presence of antioxidants in embryos may play a crucial role in protecting the emerging radicle and supporting successful germination. This is also supported by Shirley (1998), who described the accumulation of flavonoids in the embryos of all plant species. Our detection of a tentative flavonoid in the embryos of Mundah and Keel supports the view that flavonoids play potential roles in the protection of seeds under salt stress. Keel displayed a significantly delayed flavonoid response relative to Mundah as shown in the LC-MS data, where the highest X-fold increase in Keel was observed at 72 h. This reflects its lesser ability to deal with salt, leading to poorer germination efficiency. However, in Mundah, flavonoid content was highest at 48 h, displaying a faster response to salinity that correlates with the higher germination efficiency. These results indicate a possible advantage for the Mundah genotype, making it a better germinator under salt stress. The more gradual increase in the number of lipids detected over time by MALDI-MSI in Keel compared to Mundah under salt conditions also supports our finding that Keel germinates slowly.

Analysis of the flavonoid profile of the five other barley genotypes showed varied and contrasting responses under salt stress over time. Vlamingh, which showed a germination efficiency between that of Keel and Mundah, also showed an increase in the Gancaonin F flavonoid under salt stress to a level between that of Keel and Mundah. These observations point to Gancaonin F providing some advantage to specific genotypes in the response to salt stress through possible antioxidant scavenging of ROSs and protection of the vulnerable embryo during germination. The other four genotypes showed a variety of differing patterns in Gancaonin F profiles. There are a multitude of possible response mechanisms that the germinating seed is able to employ, and the varying responses observed are most likely due to the wide genetic diversity of the selected barley varieties.

# CONCLUSIONS

This study investigated the changes in levels and distribution of metabolites and elements in germinating seeds of Mundah and Keel, two barley genotypes that have contrasting germination rates in response to salt stress. To compare and contrast the lipid and metabolite profiles of these varieties during the early stages of seed germination under control and saline conditions, several orthogonal approaches were combined: MALDI-MSI, elemental composition analysis using ICP-MS, spatial distribution analysis using µ-XRF spectrometry, and confirmation of compounds using LC-QToF-MS. Use of discriminative MALDI-MSI analysis as an exploratory and qualitative technique allowed visualization of larger lipid differences within different seed structures as well as detection of a tentative flavonoid that showed discriminant behaviour between the two genotypes at different time points. Further analysis across genotypes to determine the roles of flavonoids in barley germination under salt stress conditions is required. However, these initial results indicate flavonoids as strong candidate metabolite biomarkers for detecting salinity-tolerant varieties.

# DATA AVAILABILITY

This manuscript contains previously unpublished data. The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

# AUTHOR CONTRIBUTIONS

SG carried out the experiments, data analysis and interpretation of the results. SG wrote the manuscript with support from BB, SN, PS, CH, and UR. LC-MS was conducted and analyzed by SG and TR. ICP-MS was conducted and analyzed by SG and DC. µ-XRF imaging was conducted by BB. Both BB and UR supervised the project. All authors provided critical feedback, helped shape the research and authorized the final manuscript.

# FUNDING

This project and UR were funded through an Australian Research Council (ARC) Future Fellowship program (Grant Number FT130100326).

# ACKNOWLEDGMENTS

The authors would like to thank The University of Adelaide for providing the barley seeds and Jens Burgmann (Bruker Pty Ltd, Mining and Geo Sciences, Darra, Qld) for conducting µ-XRF imaging of seed sections. LC-MS and MALDI-MSI were carried out at Metabolomics Australia, which is supported by funds from the Australian Government's National Collaborative Research Infrastructure Scheme (NCRIS) administered through Bioplatforms Australia (BPA) Ltd.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.01139/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Gupta, Rupasinghe, Callahan, Natera, Smith, Hill, Roessner and Boughton. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Terpene Synthases as Metabolic Gatekeepers in the Evolution of Plant Terpenoid Chemical Diversity

*Prema S. Karunanithi and Philipp Zerbe\**

*Department of Plant Biology, University of California Davis, Davis, CA, United States*

Terpenoids comprise tens of thousands of small molecule natural products that are widely distributed across all domains of life. Plants produce by far the largest array of terpenoids with various roles in development and chemical ecology. Driven by selective pressure to adapt to their specific ecological niche, individual species form only a fraction of the myriad plant terpenoids, typically representing unique metabolite blends. Terpene synthase (TPS) enzymes are the gatekeepers in generating terpenoid diversity by catalyzing complex carbocation-driven cyclization, rearrangement, and elimination reactions that enable the transformation of a few acyclic prenyl diphosphate substrates into a vast chemical library of hydrocarbon and, for a few enzymes, oxygenated terpene scaffolds. The seven currently defined clades (a-h) forming the plant TPS family evolved from ancestral triterpene synthase- and prenyl transferase–type enzymes through repeated events of gene duplication and subsequent loss, gain, or fusion of protein domains and further functional diversification. Lineage-specific expansion of these TPS clades led to variable family sizes that may range from a single TPS gene to families of more than 100 members that may further function as part of modular metabolic networks to maximize the number of possible products. Accompanying gene family expansion, the TPS family shows a profound functional plasticity, where minor active site alterations can dramatically impact product outcome, thus enabling the emergence of new functions with minimal investment in evolving new enzymes. This article reviews current knowledge on the functional diversity and molecular evolution of the plant TPS family that underlies the chemical diversity of bioactive terpenoids across the plant kingdom.

Keywords: terpenoids, terpene synthases, plant specialized metabolism, plant chemical diversity, terpenoid biosynthesis, natural products

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Dae-Kyun Ro, University of Calgary, Canada; Joseph Chappell, University of Kentucky, United States*

> *\*Correspondence: Philipp Zerbe pzerbe@ucdavis.edu*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

> *Received: 14 June 2019 Accepted: 26 August 2019 Published: 01 October 2019*

#### *Citation:*

*Karunanithi PS and Zerbe P (2019) Terpene Synthases as Metabolic Gatekeepers in the Evolution of Plant Terpenoid Chemical Diversity. Front. Plant Sci. 10:1166. doi: 10.3389/fpls.2019.01166*

# INTRODUCTION

Among the wealth of small molecule natural products, terpenoids (also referred to as isoprenoids) form an especially diversified and evolutionary ancient superfamily, which likely emerged alongside the formation of primitive membranes at the very origins of cellular life (Ourisson and Nakatani, 1994). Ubiquitous presence of terpenoids in membranes supports this hypothesis and suggests that ancient archaebacterial diphytanylglycerol ether membrane components, polyprenols, and derived steranes and sterols represent early terpenoid predecessors (Ourisson and Nakatani, 1994; Rohmer and Bisseret, 1994; Van De Vossenberg et al., 1998; Matsumi et al., 2011). From this origin, the staggering diversity of the terpenome has arisen, comprising more than 80,000 compounds (Christianson, 2018) that are widespread across living organisms, including archaea (Matsumi et al., 2011), bacteria (Yamada et al., 2012), fungi (Schmidt-Dannert, 2015), social amoeba (Chen et al., 2016; Chen et al., 2019), marine organisms (Gross and König, 2006), insects (Beran et al., 2016; Lancaster et al., 2018), and plants (Gershenzon and Dudareva, 2007; Tholl, 2015). Plants are the champions of producing different terpenoid structures (Tholl, 2015). This includes a few isoprenoid derivatives with essential roles in plant growth and development such as gibberellins, brassinosteroids, carotenoids, and chlorophylls (Pallardy, 2008; Tripathy and Pattanayak, 2012). Conversely, the vast majority of plant terpenoids represent specialized metabolites that are dedicated to mediating interorganismal interactions or environmental defense and adaptation (Gershenzon and Dudareva, 2007; Tholl, 2015). For example, many terpenoids exhibit potent toxicity and serve as core components of chemical defenses against herbivores, insect pests, and microbial pathogens (Keeling and Bohlmann, 2006a; Vaughan et al., 2013; Schmelz et al., 2014). In addition, functions in allelopathic interactions and roles in abiotic stress responses have been reported (Lopez et al., 2008; Kato-Noguchi and Peters, 2013; Vaughan et al., 2015). Terpenoid bioactivities in cooperative interactions are equally diverse, including various volatile terpenoids essential for attracting pollinators and seed dispersers, as well as in mediating plant–plant and plant–microbe interactions that impact plant fitness (Dudareva and Pichersky, 2000; Pichersky and Gershenzon, 2002; Heil and Ton, 2008; Agrawal and Heil, 2012).
Driven by selective pressures to adapt to the biotic and abiotic environments of the ecological niche occupied by individual plant species, specialized terpenoid metabolism has undergone an expansive evolutionary divergence, resulting in often lineage-specific pathways and products (Chen et al., 2011; Tholl, 2015; Zerbe and Bohlmann, 2015). Biosynthesis and accumulation of these compounds are also typically restricted to only a subset of organs, tissues, or developmental stages and may be tightly regulated by internal or external stimuli, granting plants the ability to fine-tune the deployment of terpenoids for mediating dynamic interactions with the environment (Keeling and Bohlmann, 2006a; Tholl, 2006; Schmelz et al., 2014). Owing to their diverse bioactivities, terpenoid-forming plants and their products have a long history of exploitation for human benefit. Historically, large-scale extraction of terpenoid resins from coniferous trees has been a resource for producing "turpentine" (giving the metabolite class its name) and continues to be of economic relevance for the manufacture of biopolymers and inks (Bohlmann and Keeling, 2008). Other uses of plant terpenoids span various industrial sectors, including flavors and fragrances (Lange et al., 2011; Schalk et al., 2012; Philippe et al., 2014; Celedon and Bohlmann, 2016), pharmaceuticals and cosmetics (Paddon et al., 2013; Pateraki et al., 2017; Booth and Bohlmann, 2019; Zager et al., 2019), biofuels (Peralta-Yahya et al., 2012; D'espaux et al., 2015), and natural rubber (Oh et al., 2000; Cornish and Xie, 2012; Qu et al., 2015).

The biological and economic relevance of terpenoids has fostered long-standing efforts in understanding the metabolic enzymes that generate terpenoid chemical diversity. Following common metabolic patterns of scaffold-forming and tailoring reactions in specialized metabolism (Anarat-Cappillino and Sattely, 2014), terpenoid biosynthesis proceeds through conversion of central 5-carbon isoprenoid precursors into a range of core scaffolds that are then functionally elaborated to generate the diversity of terpenoid bioactivities (Davis and Croteau, 2000; Chen et al., 2011; Hamberger and Bak, 2013; Zerbe and Bohlmann, 2015) (**Figure 1**). Functionally distinct enzyme families of terpene synthase (TPS) and cytochrome P450 monooxygenase (P450) enzymes are the major drivers of scaffold formation and functional modifications, respectively (Peters, 2010; Chen et al., 2011; Nelson and Werck-Reichhart, 2011; Zerbe and Bohlmann, 2015; Banerjee and Hamberger, 2018; Bathe and Tissier, 2019) (**Figure 1**). In particular, TPSs serve as the gatekeepers of species-specific terpenoid pathways, catalyzing stereo-specific carbocation cascades that transform a handful of common prenyl diphosphate substrates into the core scaffolds of numerous structurally distinct terpenoid groups. Recent years have witnessed groundbreaking advances in genomics and biochemical tools that have enabled the discovery of TPS and P450 enzymes at an unprecedented scale and can be combined with versatile metabolic engineering approaches toward producing a broader range of terpenoid bioproducts (Keasling, 2012; Kitaoka et al., 2015; Mafu and Zerbe, 2018). 
Building on comprehensive reviews on terpenoid biological function, regulation, and biochemistry (Dudareva and Pichersky, 2000; Tholl, 2006; Gershenzon and Dudareva, 2007; Hamberger and Bak, 2013; Lange and Turner, 2013; Schmelz et al., 2014; Tholl, 2015), this review focuses on recent advances in the knowledge of terpenoid biosynthesis and the evolutionary divergence of the TPS family.

# METABOLIC ORIGIN OF TERPENOID PRECURSORS

# Biosynthesis of C5 Isoprenoid Building Blocks

The metabolic origin of all terpenoids centers around the assembly of multiples of the common C5 isoprenoid precursor isopentenyl diphosphate (IPP) and its double-bond isomer dimethylallyl diphosphate (DMAPP) (Lange et al., 2000; Christianson, 2008). Unlike most microbial organisms, plants utilize two distinct pathways for producing these building blocks: the acetyl-CoA-derived cytosolic mevalonate (MVA) pathway and the pyruvate-derived plastidial 2-*C*-methyl-D-erythritol-4-phosphate (MEP) pathway (McGarvey and Croteau, 1995; Hemmerlin et al., 2012) (**Figure 1**). Presence of the MVA pathway in archaea (albeit with some species being devoid of some pathway enzymes) and phylogenetic relatedness of MVA pathway genes across multiple taxa provide evidence that the MVA pathway represents the ancestral isoprenoid-metabolic route that was present in the last common ancestor and has been vertically transmitted to the descendants (Lombard and Moreira, 2011; Matsumi et al., 2011). By contrast, the plastidial MEP pathway was likely acquired through horizontal gene transfer from different bacterial progenitors such as cyanobacteria and various proteobacteria (Lange et al., 2000). The metabolic expense of retaining two IPP/DMAPP-metabolic pathways in plants holds apparent advantages by enabling broader ability to evolve specialized terpenoid pathways and better control of compartment-specific isoprenoid pools toward MEP-derived mono- and di-terpenoids, carotenoids, plastoquinones and chlorophyll in plastids and MVA-derived sesquiterpenoids, sterols, brassinosteroids, and triterpenoids. This physical separation of downstream pathways has been supported, for example, by genome-wide co-expression studies in *Arabidopsis* that showed minimal interaction between MVA and MEP genes (Wille et al., 2004; Vranova et al., 2013; Rodriguez-Concepcion and Boronat, 2015).
In addition, presently known pathway connections appear to be largely negative in nature, where transcriptional activation of MEP genes correlates with the repression of MVA genes and *vice versa* (Ghassemian et al., 2006; Rodriguez-Concepcion and Boronat, 2015). On the other hand, metabolic compensation, for example, of cytosolic sterol biosynthesis through the MEP pathway has been described (Hemmerlin et al., 2003; Laule et al., 2003). Indeed, cross-talk between both pathways *via* exchange of IPP, DMAPP, and C10–15 prenyl diphosphate intermediates has been demonstrated in several species (Hemmerlin et al., 2003; Laule et al., 2003; Opitz et al., 2014; Mendoza-Poudereux et al., 2015), indicating that the metabolic fate of MEP- and MVA-derived IPP and DMAPP is not as clear cut. For example, isotope-labeling studies demonstrated incorporation of MEP-derived IPP/DMAPP into both mono- and sesqui-terpenoids in snapdragon (*Antirrhinum majus*) and carrot (*Daucus carota*) (Dudareva et al., 2005; Hampel et al., 2005). Similar work in cotton (*Gossypium hirsutum*) showed contribution of the MVA pathway to C10–C40 terpenoid biosynthesis (Opitz et al., 2014).
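The C5 bookkeeping described above can be made concrete with a short sketch. It is illustrative only: the class and function names are hypothetical, the carbon counts are the standard ones for each terpenoid class, and the pathway labels follow the simplified compartment assignment given in the text, which the cross-talk just described shows is not absolute.

```python
from dataclasses import dataclass

C5_UNIT = 5  # carbon atoms contributed by each IPP/DMAPP building block


@dataclass(frozen=True)
class TerpenoidClass:
    name: str
    c5_units: int
    pathway: str  # predominant precursor pathway according to the text

    @property
    def carbons(self) -> int:
        """Carbon count of the scaffold, as a multiple of the C5 unit."""
        return self.c5_units * C5_UNIT


# Simplified assignment; exchange of intermediates between pathways is ignored.
CLASSES = [
    TerpenoidClass("monoterpenoid", 2, "MEP (plastid)"),
    TerpenoidClass("sesquiterpenoid", 3, "MVA (cytosol)"),
    TerpenoidClass("diterpenoid", 4, "MEP (plastid)"),
    TerpenoidClass("triterpenoid", 6, "MVA (cytosol)"),
    TerpenoidClass("carotenoid (tetraterpenoid)", 8, "MEP (plastid)"),
]


def classify(carbons: int) -> TerpenoidClass:
    """Look up the terpenoid class for a scaffold with the given carbon count."""
    if carbons % C5_UNIT != 0:
        raise ValueError("scaffold carbon count must be a multiple of 5")
    for terpenoid_class in CLASSES:
        if terpenoid_class.carbons == carbons:
            return terpenoid_class
    raise KeyError(f"no class listed for C{carbons}")


for terpenoid_class in CLASSES:
    print(f"C{terpenoid_class.carbons}: {terpenoid_class.name}, {terpenoid_class.pathway}")
```

The lookup only covers the classes named in this section; tailored or degraded terpenoids whose carbon count is no longer a multiple of five fall outside this simple scheme.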

How the cross-membrane exchange of MEP and MVA intermediates is coordinated also requires further investigation. Transport of IPP and GPP across the plastidial membrane has been observed in isolated plastids (Soler et al., 1993; Bick and Lange, 2003; Flugge and Gao, 2005), but transporters or alternate transfer mechanisms are thus far unknown (Pick and Weber, 2014). Recent studies further illustrated that terpenoid biosynthesis *via* the MVA and MEP pathways is not solely routed through IPP and DMAPP but can involve a pool of the respective isopentenyl and dimethylallyl monophosphates, IP and DMAP (Henry et al., 2015; Henry et al., 2018). The IP and IPP pools are controlled by two enzyme families, IP kinases and Nudix hydrolases, that catalyze the phosphorylation of IP and the hydrolysis of IPP, respectively (Henry et al., 2015; Henry et al., 2018) (**Figure 1**). IP kinases were first discovered in archaea and Chloroflexi as an alternate pathway for isoprenoid biosynthesis (Dellas et al., 2013). More recently, IP kinase homologs were shown to be widely distributed in plant genomes, where they occur alongside the complete set of MVA and MEP pathway genes (Vannice et al., 2014). In *Arabidopsis*, IP kinase was shown to localize to the cytosol and to regulate the formation of both MVA- and MEP-derived terpenoids based on reverse genetic studies (Henry et al., 2015). *At*IPK knockout in *Arabidopsis* using T-DNA insertion lines caused a significant decrease in the levels of sterols (37–50%) and sesquiterpenes (25–31%). Conversely, overexpression of *At*IPK in transgenic *Nicotiana tabacum* led to a 3-fold and 2-fold increase in sesquiterpenes and monoterpenes, respectively. Further efforts to understand the formation of IP/DMAP in plants identified a role of Nudix hydrolases, a superfamily of two-domain hydrolases/peptidases broadly found in bacteria, animals, and plants (Bessman et al., 1996; Ogawa et al., 2005; Magnard et al., 2015; Henry et al., 2018).
*In vitro* and genetic studies of the two cytosolic Nudix hydrolases in the *Arabidopsis* genome, *At*Nudx1 and *At*Nudx3, demonstrated their efficiency in dephosphorylating IPP and DMAPP (Henry et al., 2018). *Arabidopsis* T-DNA insertion gene knock-down and knock-out lines of *At*Nudx1 and *At*Nudx3 showed increased production of sesquiterpenes (28–60%), monoterpenes (148–503%), and sterols (~50%), whereas overexpression of these enzymes in *N. tabacum* resulted in decreased production of monoterpenes (~50%) and sesquiterpenes (57–88%). Although understanding the broader relevance of IP kinase and Nudix hydrolase genes in plant terpenoid metabolism requires further studies, these collective findings highlight the potential of these pathway reactions to function as additional regulatory mechanisms for balancing the IP/DMAP and IPP/DMAPP pools in the biosynthesis of terpenoids and other isoprenoids (Henry et al., 2015; Henry et al., 2018). Given the dramatic impact of modulating IP kinase and Nudix hydrolase gene expression on pathway productivity, combined tailoring of these pathway nodes holds promise for advanced terpenoid pathway engineering.

# Biosynthesis of Prenyl Diphosphate Precursors

Downstream of IPP and DMAPP biosynthesis, prenyl transferases (PTs) catalyze the sequential condensation of isoprenoid units *via* ionization of the allylic diphosphate ester and subsequent rearrangement of the resulting carbocation to generate prenyl diphosphate metabolites of distinct chain length that serve as universal terpenoid precursors (Nagel et al., 2019) (**Figure 1**). Head-to-tail condensation (C4–C1 alkylation) reactions lead to C10 (geranyl diphosphate, GPP), C15 (farnesyl diphosphate, FPP), and C20 (geranylgeranyl diphosphate, GGPP) intermediates as precursors in mono-, sesqui-, and di-terpenoid metabolism, respectively (**Figure 2A**). Notably, dimerization has been shown to be a major factor impacting PT activity and product specificity. For example, GPP synthases from *Mentha piperita*, *A. majus*, and *Clarkia breweri* require formation of a heterodimer of a small and a large subunit for their enzyme function (Burke et al., 1999; Tholl et al., 2004). Interaction of GPP small subunits with GGPP synthases from *Abies grandis* and *Taxus canadensis* was further shown to modify GGPP synthase product specificity in favor of forming shorter C10 chains (Burke and Croteau, 2002). Similarly, interaction of a *cis*-PT with an unusual *cis*-PT-like scaffolding enzyme was shown to be a key function in rubber biosynthesis in lettuce (*Lactuca sativa*) (Qu et al., 2015). As alternative routes to the common head-to-tail condensation reactions, C30 and C40 prenyl diphosphates are formed *via* head-to-head condensation of FPP or GGPP through the activity of squalene synthases or phytoene synthases en route to triterpenoids and carotenoids, respectively (Christianson, 2008). Similarly, catalysis of non-head-to-tail or irregular C1′-2-3 isoprenoid condensation can occur as exemplified by a PT

FIGURE 1 | Schematic overview of major terpenoid biosynthetic pathways. All terpenoids are derived from two isomeric 5-carbon precursors, isopentenyl diphosphate (IPP) and dimethylallyl diphosphate (DMAPP). In turn, IPP and DMAPP are formed *via* two pathways, the cytosolic mevalonate (MVA) pathway originating from acetyl-CoA and the pyruvate and glyceraldehyde-3-phosphate (G3P)–derived 2-C-methyl-D-erythritol-4-phosphate (MEP) pathway located in the plastids. However, active transfers of IPP, DMAPP, GPP, and FPP across the plastidial membrane enable some degree of pathway cross-talk. In addition, interconversion of IPP and DMAPP with their respective monophosphate forms IP and DMAP by IP kinase (IPK) and Nudix hydrolase enzymes can impact pathway flux in terpenoid metabolism. Except for isoprene and hemiterpene (C5) biosynthesis, condensation of IPP and DMAPP units generates prenyl diphosphate intermediates of different chain lengths. Condensation of IPP and DMAPP yields geranyl diphosphate (GPP) as the precursor to monoterpenoids (C10), fusing GPP with an additional IPP affords the sesquiterpenoid (C15) precursor farnesyl diphosphate (FPP), and fusing FPP with IPP generates geranylgeranyl diphosphate (GGPP) en route to diterpenoids (C20). Furthermore, condensation of two FPP or two GGPP molecules forms the central substrates of triterpenoids (C30) and carotenoids (C40), respectively. Terpene synthases (TPS) are key gatekeepers in the biosynthesis of C10–C20 terpenoids, catalyzing the committed scaffold-forming conversion of the respective prenyl diphosphate substrates into a range of hydrocarbon or oxygenated structures. These TPS products can then undergo various oxygenations through the activity of cytochrome P450 monooxygenases (P450), followed by further possible functional decorations, ultimately giving rise to more than 80,000 distinct natural products.
AACT, acetoacetyl-CoA thiolase; CMK, 4-diphosphocytidyl-2-C-methyl-D-erythritol kinase; DXR, 1-deoxy-D-xylulose 5-phosphate reductase; DXS, 1-deoxy-D-xylulose 5-phosphate synthase; HDR, (E)-4-hydroxy-3-methyl-but-2-enyl diphosphate reductase; HDS, (E)-4-hydroxy-3-methyl-but-2-enyl diphosphate synthase; HMGR, 3-hydroxy-3-methylglutaryl-CoA reductase; HMGS, 3-hydroxy-3-methylglutaryl-CoA synthase; IDI, isopentenyl diphosphate isomerase; MCT, MEP cytidyltransferase; MDD, mevalonate-5-diphosphate decarboxylase; MDS, 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase; MK, mevalonate kinase; P450, cytochrome P450-dependent monooxygenase; PHY, phytoene synthase; PMK, phosphomevalonate kinase; PT, prenyl transferase; SCS, squalene synthase; SQE, squalene epoxidase; TPS, terpene synthase; TTS, triterpene synthase.

rearrangement into bicyclic prenyl diphosphates of *ent*-copalyl diphosphate (*ent*-CPP) (B) and related scaffolds of distinct stereochemistry and hydroxylation (C). (D–E) Class I diTPS-catalyzed conversion of bicyclic prenyl diphosphate intermediates *via* ionization of the diphosphate moiety and subsequent cyclization and rearrangement through various 1,2-hydride and methyl migrations to form, for example, *ent*-kaurene (D) and a range of other labdane diterpene scaffolds (E).

from *Chrysanthemum cinerariaefolium* that forms the irregular monoterpene chrysanthemyl diphosphate (Rivera et al., 2001). Recent years further revealed the biosynthetic enzymes forming plant prenyl diphosphate products of other chain lengths, including C25 intermediates en route to the rare group of defensive sesterterpenoids in *Brassicaceae* and a few other species (Luo et al., 2010; Nagel et al., 2015; Huang et al., 2017; Chen et al., 2019). *Arabidopsis* studies further illustrated a *trans*-type polyprenyl diphosphate synthase with the capacity to produce products of variable chain length (C25–C45) (Hsieh et al., 2011).
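
The chain lengths named in this section follow from simple carbon bookkeeping, which can be sketched as below. This is a minimal illustration only: the function names are hypothetical, and the sketch assumes one DMAPP starter extended by successive IPP units (head-to-tail) or two identical precursors joined (head-to-head), ignoring stereochemistry, irregular condensations, and enzyme specificity.

```python
# Illustrative carbon bookkeeping for prenyl diphosphate assembly.
# Function names are hypothetical; only the C5 unit size and the two
# condensation modes are taken from the text.

C5_UNIT = 5  # carbons in IPP or DMAPP


def head_to_tail(n_ipp_additions: int) -> int:
    """Carbon count after adding n IPP units to one DMAPP starter."""
    if n_ipp_additions < 1:
        raise ValueError("at least one IPP addition is required")
    return C5_UNIT * (1 + n_ipp_additions)


def head_to_head(precursor_carbons: int) -> int:
    """Carbon count after joining two identical prenyl diphosphates."""
    return 2 * precursor_carbons


gpp = head_to_tail(1)    # C10, monoterpenoid precursor
fpp = head_to_tail(2)    # C15, sesquiterpenoid precursor
ggpp = head_to_tail(3)   # C20, diterpenoid precursor
c25 = head_to_tail(4)    # C25, sesterterpenoid precursor
c30 = head_to_head(fpp)  # C30, en route to triterpenoids
c40 = head_to_head(ggpp) # C40, en route to carotenoids

print(gpp, fpp, ggpp, c25, c30, c40)
```

The same arithmetic covers the C25–C45 range of the *trans*-type polyprenyl diphosphate synthase mentioned above, which corresponds to four to eight IPP additions in this simplified model.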

In addition to product chain length, PT enzymes can be distinguished based on the *cis*- or *trans*-C–C double-bond configuration of their product (Liang et al., 2002). Although PTs utilize the same isoprenoid substrates, *cis-* and *trans*-type PTs differ in their protein structure and signature motifs that control catalytic specificity (Liang et al., 2002). Similar to class I TPS enzymes (described below), *trans*-PTs feature two aspartate-rich motifs, FARM (DDx2/4D) and SARM (DDx2D), which are critical for substrate binding, whereas *cis*-PTs lack these motifs, and substrate binding is controlled by Asp and Glu residues broadly distributed within the active site (Liang et al., 2002). Medium- and long-chain (≥C30) PTs more commonly produce *cis*-prenyl diphosphate compounds, most prominently represented by *cis*-PTs of rubber biosynthesis (Asawatreratanakul et al., 2003; Cornish and Xie, 2012; Chakrabarty et al., 2015; Qu et al., 2015). In contrast, the majority of short-chain (C10–25) prenyl diphosphates occur in the *trans* configuration. However, a number of *cis*-PTs have been identified in certain species that produce intermediates featuring *cis-* and *trans-*configured double bonds. For example, short-chain *cis*-PTs were identified in tomato (*Solanum* sp.) (Sallaud et al., 2009; Schilmiller et al., 2009; Akhtar et al., 2013; Matsuba et al., 2015). In the wild tomato species *Solanum habrochaites*, an FPP synthase, zFPS, was demonstrated to form the *cisoid* FPP form *Z,Z*-FPP, which is further converted by a *Z,Z*-FPP-specific TPS (SBS) to form (+)-α-santalene, (+)-endo-β-bergamotene, and (−)-endo-α-bergamotene (Sallaud et al., 2009). Notably, zFPS localizes to the chloroplast, contrasting with the commonly cytosolic localization of *trans*-FPP synthases (Sallaud et al., 2009; Schilmiller et al., 2009; Akhtar et al., 2013).
Further gene discovery studies in cultivated tomato (*Solanum lycopersicum*) identified neryl-diphosphate synthase 1 (NDPS1) catalyzing the formation of a *cis*-neryl diphosphate (NPP) as a precursor for a range of monoterpenoids in addition to the canonical *trans*-substrate GPP (Schilmiller et al., 2009; Gutensohn et al., 2013). More recently, a diterpenoid-metabolic cluster was reported in cultivated tomato that included an unusual *cis*-PT (CPT2) that produces the *cisoid* C20 GGPP variant *Z,Z,Z*-nerylneryl diphosphate (NNPP) (Matsuba et al., 2015). NNPP conversion by a TPS (TPS21) and a P450 (CYP71BN1) located within this cluster yielded the unusual diterpenoid lycosantalonol (Matsuba et al., 2015). Combinatorial functional analysis of class II TPSs known to convert *transoid E*,*E*,*E*-GGPP showed a broad capacity to also convert NNPP (Jia and Peters, 2017; Pelot et al., 2019). Together, the identification of *cis*-PT enzymes in an increasing number of species and the capacity of several TPSs to convert both *transoid* and *cisoid* prenyl diphosphate substrates may suggest that *cis*-prenyl diphosphate–derived terpenoids are more widely distributed than previously assumed.
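
The motif shorthand used for *trans*-PTs above (DDx2/4D for FARM, DDx2D for SARM, with x standing for any residue) can be read as a pattern search. The sketch below is illustrative only: the helper name `find_motif` and the toy peptide are invented for this example and do not correspond to any real prenyl transferase, and a pattern hit is no substitute for structural or biochemical evidence.

```python
import re

# Patterns transcribed from the motif shorthand in the text ('x' = any residue).
# Lookahead groups are used so that overlapping hits are also reported.
FARM = re.compile(r"(?=(DD(?:[A-Z]{2}|[A-Z]{4})D))")  # DDx2/4D
SARM = re.compile(r"(?=(DD[A-Z]{2}D))")               # DDx2D


def find_motif(pattern, sequence):
    """Return (0-based start, matched text) for every hit in the sequence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


# Made-up toy peptide, NOT a real prenyl transferase sequence.
toy = "MKLDDAIFSDGSAAQWLDDYVDRK"

farm_hits = find_motif(FARM, toy)
sarm_hits = find_motif(SARM, toy)
print(farm_hits)
print(sarm_hits)
```

Because the x2 variant of the FARM pattern is textually identical to the SARM pattern, a match alone cannot say which of the two motifs it represents; in this toy example the second hit satisfies both patterns, so sequence position or alignment context would be needed to tell them apart.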

# EVOLUTION OF TERPENE SYNTHASES DRIVES TERPENOID CHEMICAL DIVERSITY

The family of TPS enzymes governs the committed scaffold-forming C–C bonding and hybridization reactions in the biosynthesis of terpenoid chemical diversity from a handful of acyclic and achiral C5n prenyl diphosphate substrates. At the core of TPS product specificity is the intramolecular rearrangement of highly reactive carbocation intermediates. Although with far greater variation, these reactions are mechanistically analogous to those observed in PT enzymes (**Figure 2**), which suggested an evolutionary relationship between both enzyme families (Gao et al., 2012). Structural studies further strengthened this hypothesis by illustrating that these enzymes share a common TPS fold comprised of variations of three conserved helical domains, α, βα, or γβα (**Figure 3**). Differences in the functionality of these domains distinguish two major classes of TPSs: class II TPSs generate the initial carbocation intermediate *via* substrate protonation and catalyze scaffold rearrangements without cleavage of the diphosphate ester bond, whereas class I TPSs utilize ionization of the diphosphate moiety to form the

intermediary carbocation (Davis and Croteau, 2000; Peters, 2010) (**Figures 2B**, **C**). The class II (βγ) domain adopts a characteristic double α-barrel structure that likely evolved from bacterial class II TPSs, which in turn are related to ancestral class II triterpene synthases such as squalene-hopene cyclase (Cao et al., 2010; Gao et al., 2012; Christianson, 2018) (**Figure 3**). Crystallization of the class II *ent*-copalyl diphosphate (CPP) synthase from *Streptomyces platensis* empirically demonstrated this typical α-barrel βγ-domain structure featuring a Dx4E motif closely related to the DxDD signature motif critical for the activity of plant class II diTPSs (Rudolf et al., 2016). Class I activity occurs in the α-helical α-domain (**Figure 3**), predecessors of which will have been ancestral bacterial class I PT and TPS enzymes, as exemplified by the crystal structure of the class I diTPS *ent*-kaurene synthase from *Bradyrhizobium japonicum* that illustrates the presence of the characteristic α-domain fold along with the signature catalytic DDxxD motif of class I TPSs (Liu et al., 2014). Such consecutively acting *ent*-CPP and *ent*-kaurene synthases are indeed broadly distributed in plant-associated bacteria, including symbiotic rhizobia such as α-, β-, and γ-proteobacteria (*Rhizobiales*) and some phytopathogens such as species of *Xanthomonas* and *Erwinia* (Morrone et al., 2009; Nagel and Peters, 2017; Nagel et al., 2018). Ancestral *ent*-CPP and *ent*-kaurene synthases are core enzymes in the formation of bioactive gibberellin (GA) phytohormones, and all bacterial *ent*-CPP and *ent*-kaurene synthases characterized so far function as part of GA-biosynthetic operons, albeit with some end product variation ranging from GA precursors to bioactive GA4 (Nagel and Peters, 2017).
Based on the wide distribution of GA-biosynthetic gene clusters in bacteria, it has been suggested that plants acquired the ability to form GAs through ancient events of horizontal gene transfer with soil bacteria (Cao et al., 2010; Smanski et al., 2012), thus providing a selective advantage for phytohormone biosynthesis to control growth and development, as well as a genetic reservoir for the evolution of specialized diterpene metabolites as discussed below.
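
As a rough illustration of how the two signature motifs named above separate the TPS classes, the following sketch labels a sequence by the presence of DxDD (class II) and DDxxD (class I). It is a deliberately naive heuristic under stated assumptions: the function name `classify_tps` and the toy sequences are hypothetical, and real class assignment rests on domain architecture, structure, and enzyme assays rather than on motif presence alone.

```python
import re

# Signature motifs as written in the text ('x' = any residue).
CLASS_II_MOTIF = re.compile(r"D[A-Z]DD")     # DxDD, class II active site
CLASS_I_MOTIF = re.compile(r"DD[A-Z]{2}D")   # DDxxD, class I active site


def classify_tps(sequence: str) -> str:
    """Naive motif-based label; a sketch, not an annotation method."""
    seq = sequence.upper()
    has_class_ii = CLASS_II_MOTIF.search(seq) is not None
    has_class_i = CLASS_I_MOTIF.search(seq) is not None
    if has_class_ii and has_class_i:
        return "class II/I (putatively bifunctional)"
    if has_class_ii:
        return "class II"
    if has_class_i:
        return "class I"
    return "no signature motif found"


# Made-up toy sequences, NOT real enzymes.
toy_class_ii = "MSTLDIDDTAMK"
toy_class_i = "MAVDDLYDKQ"
toy_fused = toy_class_ii + toy_class_i

for name, seq in [("II", toy_class_ii), ("I", toy_class_i), ("fused", toy_fused)]:
    print(name, "->", classify_tps(seq))
```

A motif can be present yet catalytically inactive, so a "class II/I" label from such a scan indicates only that both motifs occur in the sequence.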

# Distribution of Plant Microbial-Like Terpene Synthases (MTPSLs)

Recent studies revealed that some plant species retained a previously unknown class of microbial-like TPSs, termed microbial TPS-like (MTPSL). A family of MTPSL was first discovered by Li et al. in the *Selaginella moellendorffii* genome, where they co-occur with classical plant TPSs (Li et al., 2012). The 48 MTPSLs identified in *S. moellendorffii* are phylogenetically more closely related to bacterial and fungal TPS-like sequences and differ from classical plant TPSs on the basis of several key features. Firstly, MTPSLs show a distinct gene structure with a higher variability of introns (0–7) as compared to 12–14 introns in classical plant TPSs (Trapp and Croteau, 2001; Li et al., 2012). Secondly, MTPSLs adopt a single domain α-fold closely resembling the structure of microbial α-domain enzymes rather than βα-domain plant TPSs (Li et al., 2012). Lastly, alongside the common class I DDx2D signature motif, MTPSLs contain additional DDx3D and DDx3 motifs suggesting a distinct evolutionary origin (Li et al., 2012). Despite their structural distinctiveness, *in vitro* enzyme assays demonstrated that

MTPSLs form common mono- and sesqui-terpene products, including linalool, germacrene D, and nerolidol that naturally occur in *S. moellendorffii* (Li et al., 2012). Following the discovery of MTPSLs in *S. moellendorffii*, members of this TPS class were also identified in the liverwort *Marchantia polymorpha*, the hornwort *Anthoceros punctatus*, the moss *Sphagnum lescurii*, and the monilophyte ferns *Pityrogramma trifoliata* and *Woodsia scopulina*, suggesting a broader distribution across evolutionarily older plant lineages (Jia et al., 2016b; Jia et al., 2018b; Xiong et al., 2018) (**Figure 4**). Functionally active MTPSLs were also identified in the genomes of some red algae (*Rhodophyta*) such as *Laurencia pacifica*, *Porphyridium purpureum*, and *Erythrolobus australicus* (Kersten et al., 2017; Wei et al., 2018). However, expansive genomics studies across several hundred species suggest that MTPSLs are absent in seed plants and green algae (Jia et al., 2016b). Close phylogenetic relationships of MTPSLs with fungal or bacterial TPSs support the evolution of MTPSL genes through multiple horizontal gene transfer events between plants, bacteria, and fungi after the split from the last common ancestor with the green algae lineage (Jia et al., 2016b; Kersten et al., 2017; Jia et al., 2018b; Wei et al., 2018). It can be speculated that the loss of MTPSLs in seed plant lineages is due to the emergence of sesqui-TPS and mono-TPS functions derived from the ancestral bifunctional diTPSs (Jia et al., 2018b).

# Emergence and Diversification of Bifunctional Terpene Synthases

A hallmark event in TPS evolution was the fusion of βγ- and α-domain enzyme classes that gave rise to bifunctional class II/I diTPSs with a βγα-domain architecture (**Figure 3**), providing an apparent evolutionary advantage of improved metabolite channeling of reactive prenyl diphosphate intermediates. Such bifunctional diTPSs have been identified in fungi, mosses, and gymnosperms (Toyomasu et al., 2000; Keeling and Bohlmann, 2006b; Hayashi et al., 2006; Mafu et al., 2011; Zhou et al., 2012a; Fischer et al., 2015; Kumar et al., 2016) but are absent in angiosperms (**Figure 4**). Whether such domain fusions occurred in the bacterial donor or after transfer of monofunctional diTPS genes remains to be resolved. Moreover, the relevant horizontal gene transfer events likely included only a subset of genes rather than entire operons. For example, the bryophyte *Physcomitrella patens* contains a single prototypical class II/I diTPS producing *ent*-kaurene and *ent*-hydroxy-kaurene *via ent*-CPP as an intermediate (Hayashi et al., 2006). However, *P. patens* lacks additional genes required for producing bioactive GAs and instead forms the GA intermediate *ent*-kaurenoic acid that functions as a growth and developmental regulator (Hayashi et al., 2006). The presence of ancestral bifunctional diTPSs involved in the biosynthesis of GA-related compounds in *P. patens* and related land plants (Hayashi et al., 2006), but not algae (Lohr et al., 2012; Jia et al., 2018b; Wei et al., 2018), places the evolutionary origin of plant diterpene metabolism with the emergence of nonseed plant lineages approximately 450 million years ago. Presumably in the same time period, ancestral *ent*-(OH)-kaurene-producing bifunctional diTPSs underwent neo-functionalization toward the biosynthesis of diterpenoids with specialized functions in a number of species.
The moss *Hypnum plumaeforme* contains a bifunctional diTPS that forms *syn*-pimara-7,15-diene, a diterpenoid also present in rice as precursor of anti-microbial and allelopathic momilactones (Wilderman et al., 2004; Kato-Noguchi and Peters, 2013; Schmelz et al., 2014; Okada et al., 2016). Similarly, two class II/I diTPSs have been identified in *S. moellendorffii* that produce miltiradiene *via* the enantiomer of *ent*-CPP (9*R*,10*R*-CPP), namely, normal (9*S*,10*S*) CPP or (+)-CPP, as an intermediate (*Sm*MDS) and labda-7,13*E*-dien-15-ol (*Sm*CPS/KSL1) *via* labda-15-en-8-ol diphosphate (LPP) (Mafu et al., 2011; Sugai et al., 2011). Although the physiological relevance of these diterpenoids remains elusive, it is plausible that these compounds or derivatives thereof function in disease and pest defense, considering the bioactivity of closely related metabolites in other plant species (Ma et al., 2012; Helmstädter, 2013). In contrast, the role of specialized bifunctional class II/I diTPSs of the gymnosperm-specific TPS-d clade is well established, where these enzymes form the abietane- and pimarane-type labdane scaffolds in the biosynthesis of diterpene resin acids (DRAs) that serve as a durable defense against insect pests and associated fungal pathogens (Keeling and Bohlmann, 2006a; Bohlmann, 2011). Bifunctional abietane- and pimarane-type diTPSs were identified in *Ginkgo biloba* (Schepmann et al., 2001), species of fir (*Abies*) (Peters et al., 2000; Zerbe et al., 2012b), spruce (*Picea*) (Martin et al., 2004), and pine (*Pinus*) (Ro and Bohlmann, 2006; Hall et al., 2013). Notably, all enzymes identified so far utilize the enantiomer of *ent*-CPP (9*R*,10*R*-CPP), namely, normal (9*S*,10*S*) CPP or (+)-CPP, as an intermediate en route to their individual diterpene products. In addition, some species feature diTPSs that have undergone additional neo-functionalization.
For example, balsam fir (*Abies balsamea*) contains a *cis*-abienol synthase (AbCAS) that catalyzes conversion of GGPP into *cis*-abienol *via* the C-8 hydroxylated CPP intermediate LPP also observed in *S. moellendorffii* (Mafu et al., 2011; Zerbe et al., 2012b).

The bifunctional diTPSs involved in diterpenoid metabolism of mosses, lycophytes, and gymnosperms are strikingly similar to those identified in fungi, especially in *Ascomycota* and some *Basidiomycota* species where they function in pathogenic or plant growth–promoting pathways (Bomke and Tudzynski, 2009; Quin et al., 2014). A bifunctional diTPS with *ent*-CPP/*ent*-kaurene synthase activity was first identified as a part of a GA-biosynthetic gene cluster in *Gibberella fujikuroi* (genus *Fusarium*), the causal agent of bakanae disease in rice (*Oryza sativa*) (Tudzynski and Holter, 1998; Toyomasu et al., 2000; Tudzynski, 2005). Beyond GA-biosynthetic diTPSs, certain fungal species also contain enzymes for producing specialized diterpenoids, as exemplified by the bifunctional aphidicolan-16β-ol synthase producing a key precursor to aphidicolin toxins in the pathogenic fungus *Phoma betae* (Toyomasu et al., 2004) or the *ent*-CPP/*ent*-kaurene synthase homologs PaDC1/2 involved in the biosynthesis of phyllocladan-triol in *Phomopsis amygdali* (Toyomasu et al., 2008). Recent studies proposed horizontal gene transfer from a plant to an ancestral *Ascomycota* fungus as the primary mechanism underlying the emergence of fungal class II/I diTPSs (Fischer et al., 2015). This hypothesis is supported by several lines of evidence, including the abundance of mutualistic plant-fungal interactions often involving species containing TPS genes such as *Fusarium*, the presence of diTPSs in only some *Ascomycota* and *Basidiomycota* species, and the lack of correlation between the phylogenetic relationships of fungal diTPSs and the fungal species containing these genes (Fischer et al., 2015).
Interestingly, domain fusions between diterpenoid-biosynthetic enzymes are not limited to bifunctional class II/I diTPSs in fungi and other species but also include other chimeric enzymes that, for example, represent fusions of PT and TPS domains (Minami et al., 2018; Mitsuhashi and Abe, 2018), such as the *P. amygdali* fusicoccadiene synthase that contains an N-terminal class I TPS domain and a C-terminal PT domain and is involved in the biosynthesis of fusicoccin toxins (Toyomasu et al., 2007).

# Functional Radiation of Monofunctional Terpene Synthases

The absence of bifunctional class II/I diTPSs in angiosperms (Chen et al., 2011) highlights another milestone in the expansion of the TPS family: the duplication and sub-functionalization of ancestral bifunctional γβα-domain diTPSs, with one descendant retaining class II (*ent*-CPP synthase) activity in the βγ-domain and the other copy acting as a monofunctional class I diTPS (*ent*-kaurene synthase) with a functional α-domain (**Figures 2B**, **C** and **3**). Early examples of such monofunctional enzymes have been described in *S. moellendorffii*, the liverwort *M. polymorpha* (Li et al., 2012; Kumar et al., 2016), and gymnosperms (Keeling et al., 2010) (**Figure 4**). While ancient vascular plants such as *S. moellendorffii* did not yet use these enzymes for producing bioactive GAs (Aya et al., 2011), monofunctional *ent*-CPP and *ent*-kaurene synthase activities in GA biosynthesis are conserved across vascular plants (Keeling et al., 2010; Chen et al., 2011) (**Figures 2B**, **D**). In addition to their critical role in phytohormone metabolism, monofunctional *ent*-CPP and *ent*-kaurene synthases will have served as a major genetic reservoir for the lineage-specific expansion of functionally diverse class II and class I TPSs across the plant kingdom (Zi et al., 2014) (**Figure 4**).

Derived from ancestral *ent*-CPP synthases, the TPS-c clade of class II diTPSs has undergone a relatively moderate diversification with known enzymes differing predominantly in their expression patterns and product-specificity toward a range of alternate stereo- and double-bond isomers and hydroxylated intermediates (Peters, 2010; Zerbe and Bohlmann, 2015) (**Figures 2B**, **C**). Most prominently, the *ent*-CPP enantiomer (+)-CPP, first identified as an intermediate of gymnosperm class

di-TPSs, and bifunctional class II/I diTPSs; TPS-c, monofunctional class II diTPSs; TPS-e/f, monofunctional class I diTPSs; TPS-g, monofunctional class I mono-, sesqui-, and di-TPSs; TPS-a, monofunctional class I sesqui- and di-TPSs; TPS-b, monofunctional class I mono-TPSs.

II/I diTPSs (Peters et al., 2000; Martin et al., 2004), is also a core precursor to many labdane-related specialized diterpenoids in various *Lamiaceae* species (Ma et al., 2012; Brückner et al., 2014; Gao et al., 2014; Zerbe et al., 2014; Božić et al., 2015; Cui et al., 2015; Ignea et al., 2016; Scheler et al., 2016) and some *Poaceous* crops such as maize (*Zea mays*) and wheat (*Triticum aestivum*) (Wu et al., 2012; Mafu et al., 2018). Class II diTPSs forming the alternate stereoisomer *syn*-CPP (9*S*,10*R*-CPP) appear to be of narrower taxonomic distribution with current examples limited to some *Poaceous* grasses (Otomo et al., 2004; Xu et al., 2004; Pelot et al., 2018). While *ent*-CPP, (+)-CPP, and *syn*-CPP are the most commonly occurring labdane diterpene precursors, variable series of 1,2-methyl and/or hydride migrations prior to carbocation neutralization can result in other isomeric structures (Peters, 2010). Examples include clerodienyl diphosphate synthases identified in phylogenetically distant plants such as the *Lamiaceae Salvia divinorum*, the *Poaceous* grass switchgrass (*Panicum virgatum*), and *Celastraceae Tripterygium wilfordii* (Pelot et al., 2016; Chen et al., 2017; Hansen et al., 2017a; Pelot et al., 2018); 7,13-CPP synthases described in *S. moellendorffii* and the *Asteraceae Grindelia robusta* (Mafu et al., 2011; Zerbe et al., 2015); and most recently, 8,13-CPP synthases in maize and switchgrass (Murphy et al., 2018; Pelot et al., 2018). In addition to variations in the scaffold rearrangement, a few class II diTPSs evolved the ability to terminate the carbocation *via* oxygenation rather than deprotonation, a function already present in the *A. balsamea cis*-abienol synthase that forms LPP as an intermediate (Zerbe et al., 2012b). Here, oxygenation commonly occurs at the C-8 position to yield LPP with several such enzymes known (Falara et al., 2010; Caniard et al., 2012; Zerbe et al., 2013; Pelot et al., 2017). 
A C-9-oxygenated product has also been observed, likely formed *via* an alternate 1,2-hydride shift between C-8 and the neighboring methine group of the labda-13-en-8-yl diphosphate carbocation prior to water quenching (Zerbe et al., 2014; Heskes et al., 2018).

While the functional diversification of class II diTPSs has been largely limited to variations in product specificity, the family of class I diTPSs has seen a vast expansion and functional radiation through which the large clades of class I diTPSs (gymnosperm TPS-d, angiosperm TPS-e/f), and ultimately, sesqui-TPSs (gymnosperm TPS-d, angiosperm TPS-a) and mono-TPSs (gymnosperm TPS-d, angiosperm TPS-b) have arisen (Chen et al., 2011; Gao et al., 2012) (**Figure 3**). In angiosperms, the predominant blueprint for this class I TPS expansion will have been the repeated duplication and functionalization of ancestral monofunctional *ent*-kaurene synthases within the TPS-e/f clade (Peters, 2010; Zi et al., 2014; Zerbe and Bohlmann, 2015). Beyond the divergence of enzyme products, many specialized class I diTPSs exhibit broad promiscuity for converting different class II diTPS intermediates into the large class of labdane-related diterpenoids (**Figure 2E**). This biochemical capacity enables modular pathway networks where functionally distinct enzymes can act in different combinations to generate a wider spectrum of possible products. It thus appears that dividing class II and class I activities into two monofunctional enzymes has provided an evolutionary advantage over the improved intermediate channeling in bifunctional diTPSs that emerged through the ancestral domain fusion events. Numerous examples of species-specific modular diterpenoid-metabolic networks have been described, including the biosynthesis of stress-defensive diterpenoid networks in several *Poaceous* crops such as wheat, rice, and maize (Xu et al., 2007a; Morrone et al., 2011; Zhou et al., 2012b; Zerbe et al., 2014; Cui et al., 2015; Fu et al., 2016; Mafu et al., 2018), as well as specialized diterpenoid metabolism in species of *Salvia* and other *Lamiaceae* (Brückner et al., 2014; Zerbe et al., 2014; Cui et al., 2015; Heskes et al., 2018).
Further studies on how interconnected pathway branches are regulated will be essential to better understand how metabolic flux is coordinated between general and specialized pathways, as well as specialized pathway branches sharing key intermediates. Unlike angiosperms where such modular pathways of pairwise-acting monofunctional class II and class I diTPSs are the major metabolic strategy (Peters, 2010; Morrone et al., 2011; Mafu et al., 2015; Zerbe and Bohlmann, 2015), specialized diterpenoid metabolism in gymnosperms largely relies on ancestral bifunctional enzymes. However, the existence of modular pathways also in conifers was suggested by the discovery of monofunctional class I diTPSs in jack pine (*Pinus banksiana*) and lodgepole pine (*Pinus contorta*) that derived from bifunctional progenitors and are capable of utilizing the (+)-CPP intermediate produced by bifunctional class II/I enzymes to form a set of pimarane-type labdanes (Hall et al., 2013). Likewise, two groups of predictably monofunctional class I diTPSs that likely evolved from both TPS-d and TPS-e/f diTPSs were recently identified in Western red cedar (*Thuja plicata*) (Shalev et al., 2018). Biochemical analysis of these enzymes may shed light on the distribution of modular class II-class I diTPS reactions in gymnosperm specialized diterpenoid metabolism.
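
The combinatorial logic behind such modular networks can be made concrete with a small sketch. The class II intermediates listed are those named in the text, but the class I enzyme labels and the acceptance table are hypothetical placeholders chosen purely for illustration; they do not describe any characterized enzyme.

```python
from itertools import product

# Class II diTPS products named in the text.
class_ii_intermediates = ["ent-CPP", "(+)-CPP", "syn-CPP", "LPP"]

# Hypothetical class I enzymes and the intermediates each is ASSUMED to accept.
accepts = {
    "classI_A": {"ent-CPP"},                    # narrow specificity
    "classI_B": {"ent-CPP", "(+)-CPP", "LPP"},  # promiscuous
    "classI_C": {"(+)-CPP", "syn-CPP"},
}


def feasible_routes(intermediates, acceptance):
    """List every (intermediate, class I enzyme) pairing the table allows."""
    return [
        (intermediate, enzyme)
        for intermediate, enzyme in product(intermediates, acceptance)
        if intermediate in acceptance[enzyme]
    ]


routes = feasible_routes(class_ii_intermediates, accepts)
upper_bound = len(class_ii_intermediates) * len(accepts)

print(f"{len(routes)} feasible routes out of {upper_bound} possible pairings")
for intermediate, enzyme in routes:
    print(f"  GGPP -> {intermediate} -> product of {enzyme}")
```

In this toy model, a fused bifunctional enzyme would contribute a single fixed route, whereas each additional monofunctional enzyme can add several routes at once, which is the combinatorial advantage the paragraph above describes.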

This expansive diversification of class I diTPSs resulted in a multitude of enzymes with altered substrate/product specificity and/or gene expression patterns as a critical contribution to the emergence of species-specific specialized functions (Peters, 2010; Zerbe and Bohlmann, 2015). Mostly, these catalytic alterations resulted in variations of the carbocation rearrangement to yield different pimarane, abietane, clerodane, kaurane, dolabradane, and related labdane scaffolds (**Figure 2E**). While the majority of these diTPS products are hydrocarbon scaffolds, a few class I diTPSs were discovered in recent years that terminate the intermediary carbocation not by deprotonation but *via* hydroxylation in a manner analogous to class II diTPSs that produce C-8 or C-9 hydroxylated prenyl diphosphate products (Peters, 2010; Zerbe and Bohlmann, 2015). Although examples are presently rare, such enzymes seem to be widely distributed across different plant families, including *P. patens* class II/I 16α-hydroxy-kaurane synthase (Hayashi et al., 2006), *S. moellendorffii* labda-7,13*E*-dien-15-ol synthase (Mafu et al., 2011), 13-hydroxy-8(14)-abietene synthases in several gymnosperm species (Keeling et al., 2011), 16α-hydroxy-*ent*-kaurane synthases from *T. wilfordii* (Hansen et al., 2017a) and *Populus trichocarpa* (Irmisch et al., 2015), nezukol synthase from *Isodon rubescens* (Pelot et al., 2017), and *Salvia sclarea* sclareol synthase (Caniard et al., 2012). The regio-specific hydroxylation reactions catalyzed by these diTPSs suggest that the ability of TPSs to ligate a water molecule or hydroxyl group in the nonpolar active site for coordinated carbocation quenching emerged multiple times independently in terpenoid evolution. Notably, sclareol synthase belongs to a recently discovered group of βα-bi-domain class I diTPSs that have undergone loss of the γ-domain (**Figure 3**). 
Members of this diTPS group form a separate branch in the TPS-e/f clade and have so far been identified in *Poaceae* and *Lamiaceae* species (**Figure 4**), where they almost invariably form specialized terpenoid products and exhibit broad substrate promiscuity, in some cases, such as wheat (*T. aestivum*) KSL5 and maize TPS1, spanning mono-, sesqui-, and di-terpenoid products (Hillwig et al., 2011; Caniard et al., 2012; Zerbe et al., 2013; Zerbe et al., 2014; Fu et al., 2016; Jia et al., 2016a; Pelot et al., 2016; Pelot et al., 2017; Pelot et al., 2018). Close phylogenetic relationships to TPS-e/f diTPSs combined with partial *ent*-kaurene synthase activity of a few enzymes suggest that these βα-domain diTPSs derived more recently from γβα-domain class I enzymes. These βα-bi-domain diTPSs resemble the likely progenitors of not only specialized class I diTPSs but also the vast classes of modern βα-domain mono- and sesqui-TPSs. Here, loss of the γ-domain appears to have been accompanied by various active site modifications toward converting the shorter C10 and C15 substrates and catalyzing manifold distinct scaffold rearrangements and, in the case of the sesqui-TPS family, alteration of enzyme subcellular localization through loss of the N-terminal plastidial transit peptide. Unlike the typically mid-sized diTPS families of 2–30 members, mono- and sesqui-TPS families have undergone a far greater expansion that resulted in diverse families of, for example, 69 TPSs in grape (*Vitis vinifera*) (Martin et al., 2010) and 113 TPS genes in *Eucalyptus* (Külheim et al., 2015), enabling these species to produce an astounding variety of smaller and more volatile terpenoids.

In addition to diTPSs converting class II enzyme products, class I diTPSs with the ability to directly convert GGPP as a substrate emerged multiple times during terpenoid evolution and are abundant in a few plant families (**Figure 4**). In gymnosperms, known GGPP-converting monofunctional class I diTPSs are currently limited to taxadiene synthases in species of yew (*Taxus* spp.) that produce the taxane scaffold in the biosynthesis of the chemotherapeutic agent Taxol (Williams et al., 2000) and pseudolaratriene synthase from golden larch (*Pseudolarix amabilis*) that forms an unusual 5–7-ring scaffold en route to the bioactive anti-cancer compound pseudolaric acid B (Mafu et al., 2017) (**Figure 5**). These enzymes adopt the characteristic γβα-domain architecture of gymnosperm diTPSs, and phylogenetic analyses clearly indicate that they represent descendants of bifunctional class II/I diTPSs rather than mono- or sesqui-TPSs of the TPS-d clade. In angiosperms, the evolutionary path leading to GGPP-converting diTPSs appears to be more diverse and resulted in enzymes that can produce linear, unusual polycyclic, and macrocyclic scaffolds (Kirby et al., 2010; Vaughan et al., 2013; Falara et al., 2014; King et al., 2014; Luo et al., 2016) (**Figure 5**).


Among these class I diTPSs, geranyllinalool synthases (GLSs) are γβα-domain diTPSs that uniquely catalyze the ionization of the GGPP diphosphate ester bond without subsequent cyclization (**Figure 5**), resulting in an acyclic geranyllinalool intermediate in the biosynthesis of the homoterpenoid (*E*,*E*)-4,8,12-trimethyltrideca-1,3,7,11-tetraene (TMTT) with anti-herbivory activity (Falara et al., 2014). Intriguingly, most of the known GLSs form an ancient branch within the TPS-e/f clade of kaurene(-like) synthases, whereas enzymes from *Fabaceae* species fall into the more divergent TPS-g clade mostly comprised of unusual mono- and sesqui-TPSs (Chen et al., 2011; Falara et al., 2014). Although currently known GLSs are restricted to angiosperm species, phylogenetic studies suggest that TPS-e/f GLSs represent an ancient diTPS family that arose from a common ancestor predating the split of the gymnosperm and angiosperm lineages (Falara et al., 2014). Conversely, *Fabaceae*-specific GLSs of the TPS-g family are likely the result of more recent independent evolutionary events. Similar to GLSs, class I enzymes converting GGPP into non-labdane poly- or macro-cyclic scaffolds have been identified in a few angiosperm families (**Figure 4**), including casbene synthases and related macrocyclases in *Euphorbiaceae* (Mau and West, 1994; Kirby et al., 2010; King et al., 2014; Luo et al., 2016), cembratrienol synthases in *Solanaceae* species (Ennajdaoui et al., 2010), *Arabidopsis* rhizathalene synthase (Vaughan et al., 2013), and most recently, 11-hydroxy vulgarisane synthase from the *Lamiaceae Prunella vulgaris* (Johnson et al., 2019b) (**Figure 5**). All known members of this group represent βα-bi-domain class I diTPSs and do not belong to the broad TPS-e/f family of class I diTPSs but instead form a distinct branch in the TPS-a clade of sesqui-TPSs (Kirby et al., 2010; King et al., 2014; Johnson et al., 2019b). 
Two possible evolutionary routes toward these more unusual diTPSs can be envisioned: evolution from TPS-e/f type diTPSs through neo-functionalization and loss of the plastidial transit peptide, or divergence from sesqui-TPS progenitors, which would have involved the re-acquisition of a transit peptide that was previously lost in the evolution of the cytosolic sesqui-TPS family. The latter hypothesis is supported by the close phylogenetic relationship of TPS-a diTPSs and sesqui-TPSs of the TPS-a clade (King et al., 2014; Luo et al., 2016; Johnson et al., 2019b). Such re-evolution events toward diterpenoid-producing TPSs appear not to be restricted to the TPS-a clade, since a monofunctional, miltiradiene-producing class I diTPS was identified in *T. wilfordii* that clusters with the TPS-b clade of mono-TPSs rather than with other miltiradiene synthases of the TPS-e/f clade (Hansen et al., 2017a). These collective findings support a highly branched rather than linear evolutionary diversification of TPS functions, with many such bifurcations still unknown.

# CATALYTIC PLASTICITY OF PLANT TERPENE SYNTHASES

The rapid evolutionary divergence of terpenoid metabolism is aided by the extensive functional plasticity of TPSs. Hence, there has been a long-standing interest in deciphering the mechanisms underlying TPS catalysis and substrate/product specificity. Although TPS structural studies have provided important insight into TPS catalysis, successful crystallization of plant TPSs currently remains limited to a handful of examples, including *Arabidopsis ent*-CPP synthase (class II diTPS) (Köksal et al., 2014), *Taxus brevifolia* taxadiene synthase (class I diTPS) (Köksal et al., 2011), *A. grandis* abietadiene synthase (class II/I diTPS) (Zhou et al., 2012a), and a few mono- and sesqui-TPS enzymes (Starks, 1997; Whittington et al., 2002; Hyatt et al., 2007; Shishova et al., 2008; Gennadios et al., 2009; McAndrew et al., 2011). The conserved nature of the TPS fold further has enabled numerous homology-based structure-function studies that provided a deeper insight into the relative ease of functional change in TPS enzymes, where as little as a single residue mutation can alter the active site contour that largely determines product specificity (for detailed reviews see also Gao et al., 2012; Christianson, 2017).

Early work on the class II/I abietadiene synthase from *A. grandis* demonstrated that class II catalysis uses a general acid-base mechanism to bring about the cyclo-isomerization of GGPP, whereby the acid function is provided by the middle aspartate of the conserved DxDD motif (Peters and Croteau, 2002; Prisic et al., 2007; Peters, 2010). Molecular-level insight into this mechanism was recently gained by solving the crystal structure of *Arabidopsis ent*-CPP synthase at high resolution (Köksal et al., 2014). In this study, Köksal and coworkers elegantly demonstrated that proton transfer with this conserved aspartate is enabled by hydrogen-bonded proton wires that link the active site to the bulk solvent (Köksal et al., 2014). Accompanying mutagenesis studies identified the relevant catalytic base in *Arabidopsis ent*-CPP synthase as a water molecule that, in turn, is coordinated by a dyad of histidine and asparagine residues conserved in *ent*-CPP synthases (Mann et al., 2010; Potter et al., 2014) (**Figure 6A**). Alanine substitution of these residues resulted in water quenching rather than deprotonation of the labd-enyl carbocation intermediate to yield *ent*-8-hydroxy-CPP products (Potter et al., 2014; Mafu et al., 2015). The widely conserved relevance of this catalytic dyad has been further supported by site-directed mutagenesis of analogous residues in the bifunctional *ent*-CPP/*ent*-kaurene synthases from *P. patens* and the fungus *Fusarium fujikuroi*, which in both cases resulted in the formation of *ent*-LPP as the class II product (Kawaide et al., 2011; Mafu et al., 2015). Likewise, in the class II/I diTPS abietadiene synthase from *A. grandis*, substitution of a tyrosine corresponding to the catalytic histidine and of a nearby histidine led to the redirection of class II specificity from (+)-CPP to 8α-hydroxy-CPP (Criswell et al., 2012). 
In addition to their critical role in *ent*-CPP synthase activity, residues in the position of this dyad have been shown to impact product specificity in several class II diTPSs. For example, substitution of the *Arabidopsis ent*-CPP synthase histidine with phenylalanine or tyrosine rather than alanine redirected product outcome toward the formation of clerodienyl diphosphate (Potter et al., 2016b) (**Figure 6A**). Strikingly, reciprocal mutagenesis of the corresponding active site positions in recently discovered clerodienyl diphosphate synthases from *S. divinorum* and *T. wilfordii* blocked the series of migration reactions required for forming the clerodane scaffold and resulted in premature deprotonation to form *ent*-CPP (Pelot et al., 2016; Hansen et al., 2017b). Further supporting this evidence, a recent study by Schulte et al. showed that the catalytic dyad in a conserved clade of *Lamiaceae* (+)-CPP synthases is represented by a hydrogen-bonded histidine-tyrosine pair (Schulte et al., 2018). Mutagenesis, especially of the tyrosine position, in *Salvia miltiorrhiza* (+)-CPP synthase as well as other *Lamiaceae* class II diTPSs showed a dramatic impact on product outcome. Likewise, alanine substitution of a corresponding phenylalanine residue in an 8,13-CPP synthase from switchgrass (PvCPS3) resulted in both positional isomers and hydroxylated forms of the native 8,13-CPP product (Pelot et al., 2018), thus highlighting the relevance of these active site positions in specialized class II diTPSs across diverse plant species.

FIGURE 6 | Examples of active site residues with impact on class II or class I TPS product specificity. (Center) Structure of *Abies grandis* abietadiene synthase (PDB 3S9V; Zhou et al., 2012a), which adopts the prototypical α-helical TPS fold with variations in three domains: γ (orange), β (blue), and α (red). Relative locations of the highlighted active site residues are indicated as A–F. (A) A widely conserved His-Asn dyad is critical for stereo-specificity of *Arabidopsis ent*-CPP synthase (PDB 4LIX; Köksal et al., 2014) and other *ent*-CPP synthases (Mann et al., 2010; Potter et al., 2014; Mafu et al., 2015). Substitution of this dyad can result in the formation of the alternate clerodienyl diphosphate product (Potter et al., 2014; Pelot et al., 2016). (B) His501 in the rice *syn*-CPP synthase OsCPS4 is critical for the stereo-specific formation of *syn*-CPP and is conserved in known *syn*-CPP synthases but not in class II diTPSs producing alternate CPP stereoisomers (Potter et al., 2016a; Pelot et al., 2018). (C) A conserved Met residue in *ent*-kaurene synthases from *Picea glauca* and other species was shown to control *ent*-kaurene formation (Xu et al., 2007b; Zerbe et al., 2012a). (D) Mutational studies of corresponding Ser-Ile-Ala-Leu and Ser-Ile-Ser-Leu motifs located at the hinge region of helix G1/2 of the *P. abies* levopimaradiene/abietadiene and isopimaradiene synthases showed their critical role in producing abietane or pimarane scaffolds (Keeling et al., 2008). (E) Three residues were identified in the active site of *Artemisia annua* β-farnesene synthase, reciprocal exchange of which for the corresponding residues in *A. annua* amorphadiene synthase controls activation (Tyr402), reversion (Tyr430), and restoration (Val476) of cyclization capacity (Salmon et al., 2015). (F) Two residues, Trp324 and His579, were shown in limonene synthase of *Mentha spicata* to control the reaction cascade toward the natural product 4(*S*)-limonene (Srividya et al., 2015). 
The signature catalytic motifs of the class II (DxDD, green) and class I (DDx2D, cyan; NSE/DTE, magenta) active sites are highlighted. Protein abbreviations: AtECPS, *Arabidopsis thaliana ent*-copalyl diphosphate (CPP) synthase; OsCPS4, *Oryza sativa syn*-CPP synthase; PvCPS3, *Panicum virgatum* 8,13-CPP synthase; SmCPS, *Salvia miltiorrhiza* (+)-CPP synthase; MvCPS1, *Marrubium vulgare* peregrinol diphosphate synthase; SdCPS2, *Salvia divinorum* clerodienyl diphosphate (KPP) synthase; PpCPS/KS, *Physcomitrella patens* CPP/*ent*-kaurene synthase; AgAS, *Abies grandis* abietadiene synthase; PgEKS, *Picea glauca ent*-kaurene synthase; PtTPS19, *Populus trichocarpa ent*-kaurene synthase; PaLAS, *Picea abies* levopimaradiene/abietadiene synthase; PaISO, *P. abies* isopimaradiene synthase; AaBFS, *Artemisia annua* β-farnesene synthase; AaADS, *A. annua* amorphadiene synthase.

Another key position contributing to product specificity in class II diTPSs was identified as a histidine (His501) residue in the rice *syn*-CPP synthase OsCPS4, where mutagenesis to aspartate or phenylalanine resulted in additional scaffold rearrangements to form *syn*-halimadienyl diphosphate (Potter et al., 2016a) (**Figure 6B**). A later study on the product specificity of the functionally unique peregrinol diphosphate synthase from *Marrubium vulgare* (MvCPS1) showed that substitution of the corresponding phenylalanine residue and a proximal tryptophan in MvCPS1 also redirected product outcome to yield a halimadienyl diphosphate scaffold (Mafu et al., 2016). These structure-function studies in conjunction with the conservation of the relevant histidine residue in known *syn*-CPP synthases, but not functionally distinct class II diTPSs (**Figure 6B**), support the relevance of this position for controlling biosynthesis of the *syn*-CPP stereoisomer (Potter et al., 2016a). Given the relatively smaller functional range of plant class II diTPSs, knowledge of active site determinants controlling product specificity can facilitate sequence-based prediction of class II diTPS functions as additional residues and functionally distinct enzymes are identified.

By comparison to class II diTPSs, functional annotation of class I TPSs is inherently more complex, due to the larger size and functional diversity of the class I TPS family spanning diterpenoid as well as mono- and sesqui-terpenoid-producing enzymes. However, numerous structure-guided functional studies have provided a deeper understanding of active site determinants that control the fate of intermediary carbocations derived from ionization of the respective linear or bicyclic prenyl diphosphate substrates. Mutational studies of the *ent*-CPP/*ent*-kaurene synthases from *P. patens* and the liverwort *Jungermannia subulata* that produce *ent*-kaurene and 16α-hydroxy-*ent*-kaurane identified an alanine residue that, when substituted with methionine or phenylalanine, blocked formation of 16α-hydroxy-*ent*-kaurane in favor of *ent*-kaurene (Kawaide et al., 2011). Similar studies on *ent*-kaurene synthases from spruce (*Picea glauca*) and poplar (*P. trichocarpa*) showed that mutagenesis of the corresponding methionine residues in these enzymes had the reciprocal effect of redirecting product specificity toward 16α-hydroxy-*ent*-kaurane instead of *ent*-kaurene (Zerbe et al., 2012a; Irmisch et al., 2015) (**Figure 6C**), thus suggesting a possible contribution of mutations at this position to the evolution of dedicated *ent*-kaurene synthases.

Another key active site segment that impacts class I TPS product specificity is a small hinge region between helices G1 and G2 (**Figure 6D**). This helix break is already present in ancestral squalene synthases, and recent structural studies of a bacterial hedycaryol sesqui-TPS illustrated a role of this helix break in generating a negative electrostatic potential that contributes to carbocation stabilization during catalysis (Pandit et al., 2000; Baer et al., 2014). For example, mutational analysis of a pair of paralogous class II/I diTPSs from Norway spruce (*Picea abies*) illustrated that reciprocal exchange of a largely conserved SIAL/SISL motif located at this hinge region resulted in the complete interconversion of the respective abietadiene and isopimaradiene synthase activities (**Figure 6D**) (Keeling et al., 2010). A similar scenario was observed in several *ent*-kaurene synthases, where mutagenesis of a conserved isoleucine residue at this helix break impeded formation of the tetracyclic kaurane structure and instead yielded tricyclic *ent*-pimaradiene scaffolds, as demonstrated in enzymes from rice, *Arabidopsis*, spruce, and *P. patens* (Wilderman and Peters, 2007; Xu et al., 2007b; Zerbe et al., 2012a). Reciprocal mutagenesis of a corresponding threonine residue in the specialized rice class I diTPS OsKSL5 shifted catalysis from forming *ent*-pimaradiene to producing *ent*-isokaurene and other tetracyclic scaffolds, further supporting the role of residues in this position in controlling the fate of the intermediary *ent*-pimarenyl carbocation, likely through electrostatic stabilization by a coordinated water or hydroxyl group (Jia et al., 2017).

More recently, mutational analysis of sclareol synthase from *S. sclarea* (Caniard et al., 2012; Schalk et al., 2012) identified a single asparagine residue, Asn431, located at the helix G break that impacts stereochemical control of product outcome. Switching Asn431 to glutamine reprogrammed the hydroxylation at C-13 from forming the native product 13*R*-sclareol to selectively producing its stereoisomer 13*S*-sclareol, thus highlighting the critical role of this amino acid in stereospecific water addition (Jia et al., 2018a). Collectively, these studies support a possibly critical role of the helix G hinge region in the catalytic control of product specificity in distinct diTPSs and likely other class I TPS enzymes derived from ancestral *ent*-kaurene synthases.

Numerous studies have also illuminated enzyme-specific active site residues with impact on class I catalysis in mono- and sesqui-TPSs. For example, alanine substitution of Asn338 in the *Salvia fruticosa* 1,8-cineole synthase generated an enlarged active site contour and redirected catalysis to effective conversion of the C15 FPP substrate to yield α-bergamotene, β-farnesene, and related sesquiterpenoid products, highlighting how minor alterations in the active site cavity can enable the accommodation of substrates of different chain lengths (Kampranis et al., 2007). In addition, a triad of active site residues impacting TPS capacity for generating cyclic products was discovered using comparative studies of *Artemisia annua* amorphadiene synthase and β-farnesene synthase that produce contrasting cyclic and linear products, respectively (Salmon et al., 2015). Large-scale site-directed mutagenesis studies of active site residues that differ between the two enzymes revealed two central residue switches that activate cyclization in β-farnesene synthase (Tyr402Leu) or revert cyclization in the Tyr402Leu mutant (Val476Gly) (**Figure 6E**). Interestingly, a third mutation (Tyr430Ala) restored cyclization activity in the Val476Gly mutant background, illustrating that the ability to form a cyclic product is controlled by combinatorial effects of these active site positions. A growing body of knowledge exists on active site residues that contribute to different rearrangements of the initial cyclic carbocation intermediates in mono- and sesqui-terpenoid biosynthesis. For example, structure-guided mutagenesis of *Mentha spicata* limonene synthase, a key enzyme in menthol production (Lange et al., 2011), identified two amino acids, His579 and Trp324, substitution of which led to premature neutralization of the carbocation intermediate to form both linear and cyclic monoterpenoids, including myrcene, linalool, and terpineol (Srividya et al., 2015) (**Figure 6F**). 
Similarly, reciprocal mutagenesis analyses of two Sitka spruce (*Picea sitchensis*) mono-TPSs associated with tree resistance against white pine weevil, 3-carene synthase and sabinene synthase (Hall et al., 2011), revealed that two corresponding residues located near the helix G break, Leu596 in 3-carene synthase and Phe596 in sabinene synthase, are critical for rearranging the central α-terpinyl carbocation toward 3-carene and sabinene, respectively (Roach et al., 2014).
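
The combinatorial (context-dependent) control of cyclization described above for *A. annua* β-farnesene synthase can be made explicit with a small lookup sketch. The table below only transcribes the four outcomes stated in the text (Salmon et al., 2015); the function and variable names are illustrative assumptions, and combinations not described in the text are deliberately left undefined rather than predicted.

```python
# Cyclization outcomes for Artemisia annua beta-farnesene synthase variants,
# transcribed from the description in the text (Salmon et al., 2015).
# True = cyclic product formed, False = linear product only.
CYCLIZATION = {
    frozenset(): False,  # wild-type enzyme forms the linear product
    frozenset({"Tyr402Leu"}): True,  # cyclization activated
    frozenset({"Tyr402Leu", "Val476Gly"}): False,  # cyclization reverted
    frozenset({"Tyr402Leu", "Val476Gly", "Tyr430Ala"}): True,  # cyclization restored
}


def forms_cyclic_product(mutations):
    """Look up the described outcome; raise for combinations not covered in the text."""
    genotype = frozenset(mutations)
    if genotype not in CYCLIZATION:
        raise KeyError(f"no outcome described for {sorted(genotype)}")
    return CYCLIZATION[genotype]


def context_dependent_mutations(table):
    """Return mutations found in both cyclizing and non-cyclizing genotypes."""
    outcomes = {}
    for genotype, cyclic in table.items():
        for mutation in genotype:
            outcomes.setdefault(mutation, set()).add(cyclic)
    return {mutation for mutation, seen in outcomes.items() if len(seen) == 2}


if __name__ == "__main__":
    print(forms_cyclic_product(["Tyr402Leu", "Val476Gly"]))
    print(sorted(context_dependent_mutations(CYCLIZATION)))
```

In this table, Tyr402Leu and Val476Gly each occur in both cyclizing and non-cyclizing backgrounds, so neither position alone determines the outcome, which is the sense in which the effect is combinatorial.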

The above examples and numerous related structure-function studies not covered within the scope of this review provide a mere glimpse into the plasticity of TPS catalysis, which relies on a largely non-polar active site with various possible carbocation rearrangements that enable the formation of myriad terpenoid structures with minimal investment in evolving new enzymes. However, despite these advances, our understanding of TPS mechanisms remains incomplete, thus limiting the ability to apply such knowledge for predicting the complex carbocation cascades underlying TPS activity and engineering desired enzyme functions. For instance, the taxonomic rather than functional relatedness of plant TPSs limits the use of phylogenetic analyses for functional prediction (Chen et al., 2011; Zerbe and Bohlmann, 2015). Moreover, product re-direction through TPS mutagenesis as discussed above can be accompanied by a decrease in overall enzyme activity or additional byproducts resulting from a loss of steric control in the active site (Peters and Croteau, 2002; Pelot et al., 2016; Mafu et al., 2017). In this context, combining TPS structural analysis with quantum chemical calculations and molecular dynamics modeling approaches is emerging as a powerful toolkit to examine and predict TPS-mediated reaction cascades, as discussed in more detail in several recent expert reviews (Tantillo, 2010; Tantillo, 2011; Gao et al., 2012; Major et al., 2014). For example, computational quantum chemical analyses have provided deeper insight into the inherent energy states driving terpene carbocation rearrangements and offer tools for predicting terpene pathways, as shown for the often multi-product reactions catalyzed by sesquiterpene synthases (Isegawa et al., 2014). 
Likewise, structural studies combined with molecular dynamics modeling of TPSs have been successfully employed to predict the chemical space of possible carbocation rearrangements in mono-, di-, and tri-TPSs (Tian et al., 2014; Tian et al., 2016; Driller et al., 2018). Specifically, modeling of the catalytically relevant closed conformer of taxadiene synthase enabled important insight into the yet incompletely understood conformational changes of class I TPSs that contribute to the enzymes' control over product outcome (Schrepfer et al., 2016). In addition, detailed insights into how individual active site residues impact taxadiene synthase catalysis were revealed using a combined quantum mechanics and free energy simulation approach (Ansbacher et al., 2018). Current challenges for such computational approaches, such as predicting the role of water in the active site and the termination of the carbocation *via* deprotonation or water capture, require further attention but can likely be addressed with increasing computing resources and available structural information on a broader range of TPSs.

# FUNCTIONAL ELABORATION OF THE TERPENE SCAFFOLD

The vast majority of terpenoids feature multiple functional decorations of the TPS-derived hydrocarbon scaffold that critically contribute to the diverse bioactivities of the metabolite class (Pateraki et al., 2015; Bathe and Tissier, 2019). These tailoring reactions almost invariably are initiated by position-specific oxygenations. Although these reactions can be facilitated by TPSs as outlined above, the vast majority of terpene functional modifications are controlled by the large family of cytochrome P450 monooxygenases that function as versatile catalysts for a variety of monooxygenation reactions, as well as phenol-coupling reactions, oxidative rearrangements, and oxidative C–C bond cleavage in some cases (Mizutani and Sato, 2011; Banerjee and Hamberger, 2018). Given the vast diversity of P450-controlled metabolic bifurcations, their roles in terpenoid metabolism will be only briefly discussed here. For a more expansive overview, we refer the reader to a selection of expert reviews (Nelson and Werck-Reichhart, 2011; Pateraki et al., 2015; Banerjee and Hamberger, 2018; Bathe and Tissier, 2019).

The P450 superfamily has expanded far beyond the mid-sized TPS families observed in most plants studied thus far and comprises on average more than 200 genes in an individual plant genome with various functions in both general and specialized metabolism. Among the 127 currently defined plant P450 families, only a handful have been shown to play major roles in terpenoid metabolism. Within the CYP85 clan, members of the CYP88A subfamily serve as *ent*-kaurenoic acid oxidases in GA biosynthesis (Nelson and Werck-Reichhart, 2011), whereas CYP725 and CYP720B enzymes are specific to gymnosperm species and catalyze hydroxylation and carboxylation reactions in the formation of Taxol in species of *Taxus* and DRAs in *Pinaceae* species, respectively (Ro et al., 2005; Ro and Bohlmann, 2006; Rontein et al., 2008; Hamberger et al., 2011; Guerra-Bubb et al., 2012). More prominently, multiple families within the large CYP71 clan contribute to the various functional modifications of C10–C20 terpenoids (Hamberger and Bak, 2013). This includes members of the CYP701A subfamily that act as *ent*-kaurene oxidases in GA metabolism and, in several species, have been recruited through gene duplication and neo-functionalization for the formation of defensive specialized diterpenoids, as exemplified in *Arabidopsis*, maize, and rice (Morrone et al., 2010; Wang et al., 2012b; Mafu et al., 2018). However, the majority of terpenoid-modifying P450s fall into the vast CYP71 and CYP76 families, with numerous such enzymes having been characterized. Both P450 families are presumably evolutionarily younger, with the CYP76 family first occurring in cycads and *Ginkgo*, whereas the CYP71 family seemingly emerged with the onset of angiosperm evolution but is absent in nonseed plants (Nelson and Werck-Reichhart, 2011). 
Members of both families predominantly function as position-specific hydroxylases that catalyze (poly-)oxygenations of various mono-, sesqui-, and di-terpenoid scaffolds (Swaminathan et al., 2009; Ikezawa et al., 2011; Wu et al., 2011; Wang et al., 2012a; Diaz-Chavez et al., 2013; Guo et al., 2013; Ignea et al., 2016; Mao et al., 2016; Scheler et al., 2016; Mafu et al., 2018), but alternate functions such as diterpenoid epoxidation and the formation of furan rings in mono- and di-terpenoid metabolism have also been demonstrated (Bertea et al., 2001; Heskes et al., 2018; Mafu et al., 2018). Notably, the first three-dimensional structure for a membrane-bound plant P450 (*S. miltiorrhiza* CYP76AH1) has been reported (Gu et al., 2019), providing resources to gain deeper mechanistic insight into the activity of diterpenoid-metabolic P450s. In addition, recent P450 characterization studies expanded terpenoid-metabolic functions to other P450 families such as the gymnosperm-specific CYP750 family with a (+)-sabinene-3-oxidase (CYP750B1) from Western red cedar potentially involved in producing the anti-herbivory monoterpenoid thujone (Gesell et al., 2015), as well as members of the CYP726A subfamily from castor bean (*Ricinus communis*) that catalyze epoxidation and oxidation reactions converting macrocyclic casbene and neocembrene scaffolds in *Euphorbiaceae* species (King et al., 2014; Luo et al., 2016).

In addition to, and often subsequent to, the activity of P450s in the functional elaboration of terpene scaffolds, several other enzyme families contribute to the biosynthesis of bioactive terpenoids. These include, but are not limited to, 2-oxoglutarate/Fe(II)-dependent dioxygenases (2-ODDs) (Farrow and Facchini, 2014), for example, in GA phytohormone metabolism, as well as members of the often large methyl-, glycosyl-, and acetyl-transferase families (Bathe and Tissier, 2019).

# CONCLUDING REMARKS

Continued investigation of the evolutionary divergence and function of the TPS family will provide important knowledge of the still incompletely understood roles of terpenoids in mediating defensive and cooperative interactions with other organisms and the environment at large (Tholl, 2015). However, to address knowledge gaps and experimental limitations, research in several areas will be particularly important. Advances in the computational annotation and biochemical characterization of TPSs and P450 enzymes must continue in order to fully capitalize on rapidly expanding sequence resources across a broad range of reference and non-model species. Here, the application of combinatorial functional studies in both microbial and plant host systems has proven to be a powerful tool to analyze modular terpenoid-metabolic networks comprised of multiple functionally distinct enzymes (Zerbe et al., 2013; Kitaoka et al., 2015; Andersen-Ranberg et al., 2016; Johnson et al., 2019a). Along with more efficient identification of new enzyme functions, continued structure-function studies will provide a deeper understanding of the functional diversity and molecular evolution of species-specific enzymes and pathways. Likewise, advanced quantum and molecular mechanics approaches for protein modeling and carbocation docking can utilize deeper structural insight to improve the precision of TPS functional prediction (Isegawa et al., 2014; Chow et al., 2015; Tian et al., 2016; O'Brien et al., 2018). Knowledge of terpenoid-metabolic genes, enzymes, and pathways will increasingly enable the investigation of terpenoid physiological functions *in planta* and under various environmental conditions. To this end, gene editing and transformation techniques applicable to a broader range of model and non-model species that produce species-specific blends of bioactive terpenoids will be critical (Wurtzel and Kutchan, 2016). 
Together, advanced genomic and biochemical tools and a deeper understanding of terpenoid biosynthesis and function have tremendous potential for harnessing the natural diversity of plant terpenoids, for example, for improving crop resistance and other quality traits and for developing advanced protein and pathway engineering strategies for producing known and novel bioproducts.

# AUTHOR CONTRIBUTIONS

PK and PZ jointly wrote the manuscript.

# FUNDING

Work in the laboratory of the authors has been funded by the NSF Plant-Biotic Interactions Program (grant# 1758976 to PZ), by the DOE Joint Genome Institute Community Science Program (grant# CSP2568 to PZ), and by the DOE Early Career Research Program (grant# DE-SC0019178 to PZ).

# REFERENCES


Bohlmann, J., and Keeling, C. I. (2008). Terpenoid biomaterials. *Plant J.* 54, 656– 669. doi: 10.1111/j.1365-313X.2008.03449.x


synthases of the TPS-d subfamily. *Plant Physiol.* 135, 1908–1927. doi: 10.1104/ pp.104.042028


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Karunanithi and Zerbe. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Taxonomically Informed Scoring Enhances Confidence in Natural Products Annotation

*Adriano Rutz1†, Miwa Dounoue-Kubo1,2†, Simon Ollivier1, Jonathan Bisson3,4, Mohsen Bagheri1,5, Tongchai Saesong6, Samad Nejad Ebrahimi5, Kornkanok Ingkaninan6, Jean-Luc Wolfender1\* and Pierre-Marie Allard1\**

*1 Institute of Pharmaceutical Sciences of Western Switzerland (ISPSO), University of Geneva, Centre Médical Universitaire (CMU), Geneva, Switzerland, 2 Faculty of Pharmaceutical Sciences, Tokushima Bunri University, Tokushima, Japan, 3 Center for Natural Product Technologies, Program for Collaborative Research in the Pharmaceutical Sciences (PCRPS), University of Illinois at Chicago, Chicago, IL, United States, 4 Department of Pharmaceutical Sciences, College of Pharmacy, University of Illinois at Chicago, Chicago, IL, United States, 5 Department of Phytochemistry, Medicinal Plants and Drugs Research Institute, Shahid Beheshti University, G.C., Evin, Tehran, Iran, 6 Department of Pharmaceutical Chemistry and Pharmacognosy, Faculty of Pharmaceutical Sciences and Center of Excellence for Innovation in Chemistry, Naresuan University, Phitsanulok, Thailand*

#### *Edited by:*

*Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan*

#### *Reviewed by:*

*Hiroshi Tsugawa, RIKEN, Japan Tobias Kind, University of California, Davis, United States*

#### *\*Correspondence:*

*Jean-Luc Wolfender jean-luc.wolfender@unige.ch Pierre-Marie Allard pierre-marie.allard@unige.ch*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science*

*Received: 14 July 2019 Accepted: 24 September 2019 Published: 25 October 2019*

#### *Citation:*

*Rutz A, Dounoue-Kubo M, Ollivier S, Bisson J, Bagheri M, Saesong T, Ebrahimi SN, Ingkaninan K, Wolfender J-L and Allard P-M (2019) Taxonomically Informed Scoring Enhances Confidence in Natural Products Annotation. Front. Plant Sci. 10:1329. doi: 10.3389/fpls.2019.01329*

Mass spectrometry (MS) offers unrivalled sensitivity for the metabolite profiling of complex biological matrices encountered in natural products (NP) research. The massive and complex sets of spectral data generated by such platforms require computational approaches for their interpretation. Within such approaches, computational metabolite annotation automatically links spectral data to candidate structures *via* a score, which is usually established between the acquired data and experimental or theoretical spectral databases (DB). This process leads to various candidate structures for each MS feature. However, at this stage, obtaining a high annotation confidence level remains a challenge, notably due to the extensive chemodiversity of specialized metabolomes. The design of a metascore is a way to capture complementary experimental attributes and improve the annotation process. Here, we show that integrating the taxonomic position of the biological source of the analyzed samples and candidate structures enhances confidence in metabolite annotation. A script is proposed to automatically input such information at various granularity levels (species, genus, and family) and complement the score obtained between experimental spectral data and the output of available computational metabolite annotation tools (ISDB-DNP, MS-Finder, Sirius). In all cases, the consideration of the taxonomic distance allowed an efficient re-ranking of the candidate structures, leading to a systematic enhancement of the recall and precision rates of the tools (1.5- to 7-fold increase in the F1 score). Our results clearly demonstrate the importance of considering taxonomic information in the process of specialized metabolite annotation. This requires access to structural data systematically documented with biological origin, both for new and previously reported NPs. In this respect, the establishment of an open structural DB of specialized metabolites and their associated metadata, particularly biological sources, is timely and critical for the NP research community.

Keywords: metabolite annotation, chemotaxonomy, scoring system, natural products, computational metabolomics, taxonomic distance, specialized metabolome

# INTRODUCTION

Specialized metabolites define the chemical signature of a living organism. Plants, sponges and corals, but also microorganisms (bacteria and fungi), are known to biosynthesize a wealth of such chemicals, which can play a role as defense or communication agents (Brunetti et al., 2018). Throughout history, humans have relied on plant-derived products for a variety of purposes: housing, feeding, clothing and, especially, medication. In fact, our therapeutic arsenal is deeply dependent on the chemistry of natural products (NPs), whether they are used in mixtures, in purified form or for hemi-synthetic drug development. After a period of disregard by the pharmaceutical industry, NPs are now the object of renewed interest, partly because of the promises of the latest technological developments (Shen, 2015). Developments in metabolite profiling by mass spectrometry (MS) grant access to large volumes of high-quality spectral data from minimal amounts of sample, and appropriate data analysis workflows allow such data to be mined efficiently (Wolfender et al., 2019). Initiatives such as the Global Natural Products Social (GNPS) molecular networking (MN) project offer both a living MS repository and the possibility to establish MN organizing MS data (Wang et al., 2016). However, despite such advancements, *metabolite identification* remains a major challenge for both NP research and metabolomics (Kind et al., 2018). Metabolite identification of a novel compound requires physical isolation of the analyte followed by complete NMR acquisition and three-dimensional structural establishment *via* X-ray diffraction or chiroptical techniques. For previously described compounds, metabolite identification implies complete matching of physicochemical properties between the analyte and a standard compound (including chiroptical properties). Metabolite identification is thus a tedious and labor-intensive process, which should ideally be reserved for the description of novel metabolites. Any less complete process should be defined as *metabolite annotation*. By definition, metabolite annotation can be applied at a higher throughput and offers an effective proxy for the chemical characterization of complex matrices. This process includes dereplication (the annotation of previously described molecules prior to any physical isolation process) and allows isolation and metabolite identification efforts to be focused on potentially novel compounds only (Gaudêncio and Pereira, 2015).

Given its sensitivity, selectivity and structural determination potential, MS is a tool of choice for metabolite annotation in complex mixtures. Various computational MS solutions have been developed to link experimental spectra to chemical structures. They can be classified into experimental rule-based strategies [MassHunter, Agilent Technologies], combinatorial fragmentation strategies [MetFrag, (Ruttkies et al., 2016)], machine learning based approaches using stochastic Markov modelling [CFM-ID, (Allen et al., 2014; Djoumbou-Feunang et al., 2019)] or predicting fragmentation trees [Sirius (Böcker et al., 2009; Dührkop et al., 2019)]. Computationally demanding ab initio calculations, modeling the gas-phase fragmentation process, have also been proposed (Bauer and Grimme, 2016). The output of such tools is, in general, a list of candidate molecules ranked according to a score. Such a score can be based on a single measure (e.g. spectral similarity in CFM-annotate) (Allen et al., 2014) or integrate combined parameters (MS-Finder, Sirius) (Tsugawa et al., 2016; Dührkop et al., 2019; Tsugawa et al., 2019). The rationale behind comprehensive scoring systems is that orthogonal information (not directly related to spectral comparison) should further strengthen the metabolite annotation process. This has been illustrated in the past by using the number of literature references related to a candidate structure and basic retention time scoring based on logP in MetFrag 2.2 (Ruttkies et al., 2016). Recently, the integration of retention order prediction into an MS/MS prediction tool provided increased performance in metabolite annotation (Bach et al., 2018). Another example is the Network Annotation Propagation (NAP) approach, which takes advantage of the topology of a MN to re-rank annotated candidates within a cluster where structural consistency is expected (da Silva et al., 2018). In our view, increased confidence in specialized metabolite annotation can be achieved by the establishment of a metascoring system capturing the similarity of diverse attributes shared by the queried analytes and candidate structures (Allard et al., 2017). Such a metascore could, for example, consider 1) spectral similarity, 2) taxonomic distance between the producer of the candidate compound and the annotated biological matrix, 3) structural consistency within a cluster and 4) physico-chemical consistency. A conceptual overview of such a metascore is illustrated in **Figure 1**. To the best of our knowledge, the automated inclusion of the taxonomic dimension within a scoring system has not been considered in current metabolite annotation strategies.

The central hypothesis of this work is directly inferred from the characteristics of the specialized metabolome. Unlike primary metabolites, which are mostly ubiquitous compounds central to organism functioning, specialized metabolites are, by definition, strongly linked to the taxonomic position of the producing organism. It thus appears desirable to consider taxonomic information when describing the chemistry of an organism. A *taxonomic filtering* process could be used to limit a database (DB) to compounds previously isolated from organisms situated within a given taxonomic distance from the biological source of the analyte to annotate. However, results of chemotaxonomic studies also highlight the presence of broadly distributed metabolites. For example, liriodenine (MUMCCPUVOAUBAN-UHFFFAOYSA-N) is a widely distributed alkaloid produced by more than 50 distinct biological sources; it is found in over 30 genera belonging to 13 botanical families. Convergent biosynthetic pathways offer intriguing examples of unrelated species, shaped by evolution, that end up producing similar classes of compounds (Pichersky and Lewinsohn, 2011). To annotate such compounds, a *taxonomically informed scoring* that considers both spectral similarity and taxonomic information, while conserving the independence of the individual resulting scores, appears to be a better solution than a basic filtering process.

In this study, we propose such a taxonomically informed scoring system and benchmark the impact of considering taxonomic distance on a set of 2,107 identified molecules using three different computational mass spectrometry metabolite annotation tools (ISDB-DNP, MS-Finder and Sirius).

# RESULTS

# Conception of the Taxonomically Informed Scoring System

The constituents of specialized metabolomes, as expression products of the genome, should reflect the taxonomic position of the producing organisms. The initial hypothesis of this work is that *the attribution of a score reflecting the taxonomic distance between the biological source of the queried analyte and that of the candidate structures is a valuable input for a metabolite annotation process.*

*Taxonomically informed scoring* is proposed to complement the initial score (S1 in **Figure 1**) attributed to candidate structures by existing metabolite annotation tools. To this end, the initial score is first normalized. Then, scores inversely proportional to the taxonomic distance (family < genus < species) are attributed when an exact match is observed between biological source denominations at the different taxa levels. The score corresponding to the shortest taxonomic distance is then added to the initial score. Candidates are finally re-ranked according to the newly complemented score. In this study, no phylogenetic distances within taxa (e.g. family, genus or species) were considered due to high computational requirements, but the development of such an approach would be of interest. The general outline of the *taxonomically informed scoring* system is presented in **Figure 2**.

In order to apply the *taxonomically informed scoring* in a generic manner, the initial scores given by the metabolite annotation tools were rescaled to obtain values ranging from 0 (worst candidate) to 1 (best candidate). The scores given according to the taxonomic distance between the biological source of the queried spectra and that of the candidate compounds were integrated into the final score by a sum. This choice preserves the independence of the individual components of the metascore (see **Figure 1**). Since the boundaries of the candidates' normalized score in a given dataset are defined (0 to 1), the minimal taxonomic score that allows the worst candidate (normalized score of 0) to reach the first position after *taxonomically informed scoring* is 1. Following our initial hypothesis, a score of 1 was thus given if a match between biological sources was found at the family level. If the best candidate (normalized score of 1) also matches at the family level (score of 1), its total is 2, so a score of at least 2 is needed for the worst candidate not to be ranked below it. A score of 2 was thus given if a match between biological sources was found at the genus level. Following the same logic, a score of 3 was given for matches between biological sources at the species level.
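The additive scheme described above can be written compactly. The symbols $s_i$, $\hat{s}_i$, $t_i$ and $S_i$ are introduced here for illustration only, and min-max rescaling is assumed as the way the initial scores are brought to the 0 to 1 range.

$$
\hat{s}_i = \frac{s_i - s_{\min}}{s_{\max} - s_{\min}} \in [0, 1], \qquad
t_i =
\begin{cases}
3 & \text{match at the species level} \\
2 & \text{match at the genus level only} \\
1 & \text{match at the family level only} \\
0 & \text{no match}
\end{cases}, \qquad
S_i = \hat{s}_i + t_i
$$

Under these assumptions, for a candidate $i$ with $t_i = k$ and a candidate $j$ with $t_j \le k - 1$,

$$
S_i \ge k = (k - 1) + 1 \ge t_j + \hat{s}_j = S_j ,
$$

so a taxonomically closer candidate is never ranked below a more distant one, with equality only in the boundary case where the closer candidate has the worst spectral score and the more distant one has the best.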

# Benchmarking the Influence of Taxonomically Informed Scoring in Metabolite Annotation

# Establishment of a Benchmarking Dataset

In order to establish the importance of considering taxonomic information in metabolite annotation, an experimental reference dataset comprising molecular structures, their MS/MS spectra acquired under various experimental conditions and their biological sources, in the form of a fully resolved taxonomic hierarchy, was needed. This dataset, hereafter called the *benchmarking dataset*, was built by combining a curated structural/biological sources dataset (obtained from the Dictionary of Natural Products (DNP)) and a curated structural/spectral dataset (obtained from GNPS libraries). The steps followed for the establishment of the benchmarking dataset are detailed below and summarized in **Supplementary Figure S3**.

**Figure 2** (caption, partial): … genus and species level, when available. A score, inversely proportional to the taxonomic distance between the biological source of the standard compound and that of the candidate compounds, is given when the biological source of the candidate structures matches the biological source of the standard at the family, genus and species level, respectively. The maximum score for each candidate is then added to its spectral score to yield a complemented spectral score. Finally, candidates are re-ranked according to the complemented spectral score.

# *Structural and Biological Sources Dataset*

The prerequisite for applying a *taxonomically informed scoring* in a metabolite annotation process is to have the biological source information of *1)* the queried MS/MS spectra and *2)* the candidate structures. To the best of our knowledge, there is currently no freely available database (DB) compiling NP structures and their biological sources down to the species level. This study uses the DNP, which is commercially available and allows export of structures and biological sources as associated metadata. A curation process using the Global Names index kept the biological sources that resolved against the Catalogue of Life and resulted in 219,800 entries with accepted scientific names and a full, homogeneous taxonomy up to the kingdom level. For example, the entry initially corresponding to pulsaquinone, "Constit. of the roots of *Pulsatilla koreana*", is converted to Plantae | Tracheophyta | Magnoliopsida | Ranunculales | Ranunculaceae | *Pulsatilla* | *Pulsatilla cernua* in the curated DB. See Material and Methods and **Supplementary Figure S3** for details concerning the curation process.
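To make the curated format concrete, the following minimal Python sketch splits a pipe-delimited lineage such as the pulsaquinone example into named ranks. The rank names and the `parse_lineage` helper are assumptions made for illustration; the seven fields are read as kingdom to species, and the actual curation relied on the Global Names index rather than on string handling of this kind.

```python
# Assumed order of the seven fields in a curated lineage string.
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def parse_lineage(lineage: str) -> dict:
    """Split a 'A | B | ... | G' lineage into a rank -> name mapping."""
    # Strip surrounding whitespace and the markdown italics markers.
    names = [part.strip().strip("*").strip() for part in lineage.split("|")]
    if len(names) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} ranks, got {len(names)}")
    return dict(zip(RANKS, names))


lineage = ("Plantae | Tracheophyta | Magnoliopsida | Ranunculales | "
           "Ranunculaceae | *Pulsatilla* | *Pulsatilla cernua*")
taxon = parse_lineage(lineage)
print(taxon["family"], taxon["genus"], taxon["species"], sep=" / ")
```

Keeping family, genus and species as separate fields is what allows exact-match comparison at each level in the scoring step.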

# *Structural and Spectral Dataset*

The GNPS libraries agglomerate a wide and publicly available ensemble of MS/MS spectra coming from various analytical platforms and thus having different levels of quality (Wang et al., 2016). These spectral libraries were used as a representative source of diverse experimental MS/MS spectra to evaluate the annotation improvement that could be obtained by applying *taxonomically informed scoring*. All GNPS libraries and publicly accessible third-party libraries were retrieved online (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp) and concatenated as a single spectral file containing 66,646 individual entries. The pretreatment described in Material and Methods yielded a dataset of 40,138 structures (8,558 unique structures) with their associated experimental MS/MS spectra acquired on different platforms. See **Supplementary Figure S3**.

### *Structural, Spectral and Biological Sources Dataset (Benchmarking Dataset)*

To apply the *taxonomically informed scoring*, the denominations of the biological sources of both *1)* the queried spectra and *2)* the candidate structures must be resolved using a common taxonomy backbone (i.e. using the accepted denomination). It was thus necessary to build an experimental spectral dataset for which each entry had a unique structure and a properly documented biological source, which constituted the benchmarking set. The structural and spectral dataset was matched against the structural and biological sources dataset, following the procedure detailed in Material and Methods. The full processing resulted in a dataset of 2,107 individual entries (characterized NPs with no distinction between stereoisomers and a unique associated biological source), which was used for the rest of this study. See **Supplementary Figure S3**.

Analysis of the benchmarking dataset showed a chemodiversity comparable to that of the DNP (see panels **A** and **C** in **Figure 3**). Regarding the distribution of the biological sources in the benchmarking dataset, available data mostly matched plant specialized metabolites (see panel **B** in **Figure 3**). Additionally, the distribution of mass analyzer types indicated the heterogeneous spectral quality of the MS/MS spectra of the benchmarking dataset and was representative of commonly used analytical platforms (see **Supplementary Figure S7**).

## Evaluation of the Improvement of Metabolite Annotation on the Benchmarking Set

In order to assess the importance of considering taxonomic information in the annotation process, the outputs of three different computational MS-based metabolite annotation solutions were considered (ISDB-DNP, MS-Finder and Sirius). The 2,107 spectra of the benchmarking dataset were queried using these tools. The precision and accuracy of structural determination with and without the use of *taxonomically informed scoring* were systematically compared according to parameters detailed in the Material and Methods section.

### *Metabolite Annotation Tools Used*

*ISDB-DNP.* The first tool, denominated hereafter ISDB-DNP (*In Silico* DataBase - Dictionary of Natural Products), is a metabolite annotation strategy that we previously developed (Allard et al., 2016). This approach is focused on specialized metabolites annotation and is constituted by a pre-fragmented theoretical spectral DB version of the DNP. The *in silico* fragmentation was performed by CFM-ID (Allen et al., 2015). CFM-ID is, to our knowledge, the only computational solution currently available able to generate a theoretical spectrum with prediction of fragment intensity. The matching phase between experimental spectra and the theoretical DB is based on a spectral similarity measure (cosine score) performed using Tremolo (Wang and Bandeira, 2013). The scores are reported from 0 (worst candidate) to 1 (best candidate).

*MS-Finder.* The second tool is MS-Finder. This *in silico* fragmentation approach considers multiple parameters such as bond dissociation energies, mass accuracies, fragment linkages and various hydrogen rearrangement rules at the candidate ranking phase (Tsugawa et al., 2016). The resulting score ranges from 1 (worst candidate) to 10 (best candidate).

*Sirius.* The third tool is Sirius 4.0. It is considered a state-of-the-art metabolite annotation solution, which combines molecular formula calculation and the prediction of a molecular fingerprint of a query from its fragmentation tree and spectrum (Dührkop et al., 2019). Sirius uses a DB of 73,444,774 unique structures for its annotations. The resulting score is a probabilistic measure ranging between negative infinity (worst candidate) and 0 (best candidate).

### *Computation of the Taxonomically Informed Score*

R scripts were written to perform *1)* cleaning and standardization of the outputs and *2) taxonomically informed scoring* and re-ranking. First, the outputs were standardized to a table containing on each row: the unique spectral identifier (CCMSLIB N°) of the queried spectrum, the short InChIKey of the candidate structure, the score of the candidate (within the scoring system of the metabolite annotation tool used), the biological source of the standard compound and the biological source of the candidate structure. As described above, a score inversely proportional to the taxonomic distance between the biological source of the annotated compound and the biological source of the candidate structure was given when both matched at the family (score of 1), genus (score of 2) or species level (score of 3). The sum of this score (1 to 3) and the original normalized score (0 to 1) yielded the *taxonomically informed score*. This *taxonomically informed score* was then used to re-rank the candidates from highest to lowest score.
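The scoring and re-ranking step can be sketched as follows. The authors' scripts were written in R; this Python version is an independent illustration, not their code. The row layout, the function names, the per-query min-max rescaling and all example candidates (`CAND_A` to `CAND_D`, `FamilyX`, and so on) are assumptions or placeholders.

```python
from collections import defaultdict

# Scores stated in the text: family = 1, genus = 2, species = 3.
TAXON_SCORES = {"species": 3.0, "genus": 2.0, "family": 1.0}


def taxonomic_score(query_taxon, candidate_taxon):
    """Score of the closest level at which both sources match (0 if none)."""
    for level in ("species", "genus", "family"):  # closest level first
        q, c = query_taxon.get(level), candidate_taxon.get(level)
        if q and c and q == c:
            return TAXON_SCORES[level]
    return 0.0


def rerank(rows):
    """Group rows by spectrum, rescale scores to [0, 1], add the taxonomic score."""
    by_spectrum = defaultdict(list)
    for row in rows:
        by_spectrum[row["spectrum_id"]].append(row)
    ranked = {}
    for spectrum_id, candidates in by_spectrum.items():
        scores = [c["score"] for c in candidates]
        low, span = min(scores), max(scores) - min(scores)
        scored = []
        for c in candidates:
            # Assumption: min-max rescaling per query; equal scores all map to 1.0.
            normalized = (c["score"] - low) / span if span else 1.0
            bonus = taxonomic_score(c["query_taxon"], c["candidate_taxon"])
            scored.append((normalized + bonus, c["inchikey"]))
        scored.sort(key=lambda item: item[0], reverse=True)
        ranked[spectrum_id] = scored
    return ranked


# Hypothetical query and candidates; identifiers and scores are placeholders.
QUERY = {"family": "Papaveraceae", "genus": "Glaucium",
         "species": "Glaucium grandiflorum"}
OTHER = {"family": "FamilyX", "genus": "GenusX", "species": "GenusX sp1"}
SAME_GENUS = {"family": "Papaveraceae", "genus": "Glaucium",
              "species": "Glaucium sp2"}
SAME_FAMILY = {"family": "Papaveraceae", "genus": "GenusY",
               "species": "GenusY sp1"}
ROWS = [
    {"spectrum_id": "Q1", "inchikey": "CAND_A", "score": 0.90,
     "query_taxon": QUERY, "candidate_taxon": OTHER},
    {"spectrum_id": "Q1", "inchikey": "CAND_B", "score": 0.40,
     "query_taxon": QUERY, "candidate_taxon": SAME_GENUS},
    {"spectrum_id": "Q1", "inchikey": "CAND_C", "score": 0.20,
     "query_taxon": QUERY, "candidate_taxon": SAME_FAMILY},
    {"spectrum_id": "Q1", "inchikey": "CAND_D", "score": 0.10,
     "query_taxon": QUERY, "candidate_taxon": OTHER},
]

for total, key in rerank(ROWS)["Q1"]:
    print(f"{key}: {total:.3f}")
```

In this toy example the candidate with the best spectral score but no taxonomic match falls behind the genus-level and family-level matches, which is the behavior the additive scheme is designed to produce.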

## *Results Before Taxonomically Informed Scoring*

Using each tool's initial scoring system on the 2,107 experimental MS/MS spectra constituting the benchmarking set, ISDB-DNP returned 214 (10.2%) correct annotations at rank 1, Sirius 975 (46.3%) and MS-Finder 180 (8.5%). The total number of unique correct annotations ranked first by ISDB-DNP, Sirius or MS-Finder prior to *taxonomically informed scoring* reached 1,110, or 52.7% of the benchmarking dataset. Out of these, 29 (less than 1.4% of the dataset) were common to all three tools, indicating the interest of considering various annotation tools when proceeding to metabolite annotation. The Venn diagram in **Figure 4** illustrates the complementarity of the returned annotations. Within all candidates (all ranks), ISDB-DNP returned 1,750 correct annotations, Sirius 1,589 and MS-Finder 574. The ROC curves outline the number of correct hits outside the first rank and indicate the remaining improvement potential. See **Supplementary Figure S4**.

### *Results After Taxonomically Informed Scoring*

After *taxonomically informed scoring* and re-ranking, the number of correct annotations at rank 1 increased to 1,510, 1,508 and 546 for ISDB-DNP, Sirius and MS-Finder, respectively. The total number of correct annotations covered by ISDB-DNP, Sirius and MS-Finder together after *taxonomically informed scoring* reached 1,786, or 84.8% of the benchmarking dataset. Interestingly, a more than 10-fold increase after *taxonomically informed scoring* was also observed for the correctly annotated metabolites returned by all three tools (376, 17%). It should be noted that no stereoisomer distinction could be made, since all correct matches were assessed based on short InChIKey comparison.
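The percentages and fold changes quoted for rank-1 annotations follow directly from the reported counts. The short Python check below uses only numbers given in the text; the variable and function names are chosen for illustration.

```python
TOTAL = 2107  # entries in the benchmarking dataset

# Correct annotations at rank 1, as reported in the text.
rank1_before = {"ISDB-DNP": 214, "Sirius": 975, "MS-Finder": 180}
rank1_after = {"ISDB-DNP": 1510, "Sirius": 1508, "MS-Finder": 546}


def percent(count, total=TOTAL):
    """Share of the benchmarking dataset, in percent, rounded to one decimal."""
    return round(100 * count / total, 1)


def fold_change(tool):
    """Ratio of rank-1 correct annotations after versus before re-ranking."""
    return round(rank1_after[tool] / rank1_before[tool], 2)


for tool in rank1_before:
    print(tool, percent(rank1_before[tool]), percent(rank1_after[tool]),
          fold_change(tool))

# Union of unique rank-1 correct annotations before and after re-ranking.
print(percent(1110), percent(1786))
```

These ratios concern rank-1 counts only; the F1 score discussed next is a separate measure.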

The F1 score (harmonic mean of precision and recall) was used to evaluate the impact of the *taxonomically informed scoring*. More details on the establishment of the score can be found in Material and Methods. The F1 scores of the three metabolite annotation tools before and after *taxonomically informed scoring* are displayed in **Figure 4**. The *taxonomically informed scoring* stage led to a systematic increase of the F1 score for the benchmarked tools. This increase was 7-fold (ISDB-DNP), 1.5-fold (Sirius) and 3-fold (MS-Finder).
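For reference, the standard definition of the F1 score can be sketched in Python as below. How true positives, false positives and false negatives are counted for ranked candidate lists is specified in Material and Methods and is not reproduced here; the counts in the example are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of returned annotations that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of correct annotations that were returned."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only.
print(f1_score(tp=8, fp=2, fn=2))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a tool only reaches a high F1 score when both precision and recall are high.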

# Optimization of Scores Combination for the Taxonomically Informed Scoring

In order to verify our initial hypothesis and define the optimal scores combination (at the family, genus and species taxa level) to be applied for *taxonomically informed scoring* we proceeded to an optimization of the *taxonomically informed scoring* function.

To this end, the taxonomic information related to candidate annotations was artificially degraded. This step made it possible to mimic a "real life" case in which the taxonomic metadata of candidate annotations are not necessarily complete or correct down to the species level. Using the procedure detailed in the corresponding Material and Methods section, a Bayesian optimization algorithm was applied four times on four randomized datasets. It quickly converged (100 iterations) towards a global maximum (max 1,126 hits, see **Figure 5**). The optimal scores were found to be 0.81, 1.62 and 2.55 for the family, genus and species levels, respectively. Such scores are dependent on the nature and completeness of the employed taxonomic metadata. However, the results obtained when applying the Bayesian optimization on the annotation sets for which taxonomic metadata was randomly degraded indicated that optimal results were systematically obtained when *the attributed scores were inversely proportional to the taxa hierarchical position*, thus confirming our initial hypothesis.
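The idea of tuning the three scores against the number of correct rank-1 annotations can be illustrated with a much simpler stand-in. The sketch below uses an exhaustive grid search, not the Bayesian optimization applied in the study, and runs on a toy dataset that is entirely invented; the data layout and function names are assumptions.

```python
from itertools import product

# Each query is a list of candidates: (normalized_score, match_level, is_correct).
# match_level is "species", "genus", "family" or None. All values are invented.
QUERIES = [
    [(1.0, None, False), (0.3, "family", True), (0.0, None, False)],
    [(1.0, "family", False), (0.6, "genus", True)],
    [(1.0, None, True), (0.2, "family", False)],
    [(1.0, "genus", False), (0.5, "species", True)],
]


def rank1_hits(queries, weights):
    """Count queries whose top candidate after re-scoring is the correct one."""
    hits = 0
    for candidates in queries:
        best = max(candidates, key=lambda c: c[0] + weights.get(c[1], 0.0))
        hits += int(best[2])
    return hits


def grid_search(queries, grid):
    """Try every (family, genus, species) combination and keep the first best."""
    best_weights, best_hits = None, -1
    for family, genus, species in product(grid, repeat=3):
        weights = {"family": family, "genus": genus, "species": species}
        hits = rank1_hits(queries, weights)
        if hits > best_hits:
            best_weights, best_hits = weights, hits
    return best_weights, best_hits


GRID = [i * 0.25 for i in range(13)]  # 0.0, 0.25, ..., 3.0
baseline = rank1_hits(QUERIES, {})    # no taxonomic score at all
best_weights, best_hits = grid_search(QUERIES, GRID)
print(baseline, best_hits, best_weights)
```

The third toy query shows why the scores cannot simply be made arbitrarily large: a wrong candidate from the same family overtakes the correct one once the family score is too high, so the objective has an interior optimum. Many grid points tie on such small data, and the function returns the first one encountered.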

# Application of the Taxonomically Informed Scoring to the Annotation of Metabolites From *Glaucium* sp.

The interest of the *taxonomically informed scoring* was further illustrated for the annotation of specialized metabolites from *Glaucium* species (Papaveraceae family). Three species, *G. grandiflorum*, *G. fimbrilligerum* and *G. corniculatum*, were studied. The ethyl acetate and methanolic extracts of the three species were profiled by UHPLC-HRMS in positive ionization mode using data-dependent MS/MS acquisition. After appropriate data treatment and molecular network generation (see corresponding Material and Methods section), the *taxonomically informed scoring* was used to re-rank the candidate annotations returned by the ISDB-DNP. The five best hits were kept. We especially focused on the two major compounds (by MS signal intensity) of *G. grandiflorum*. These were features *m/z* 342.1670 at 1.42 min and *m/z* 356.1860 at 1.83 min. According to the optimization results (see previous section), a score of 0.81 was given to candidates for which the biological source was found to be Papaveraceae at the family level, 1.62 to *Glaucium* at the genus level and 2.55 to *G. grandiflorum* at the species level. The results of the *taxonomically informed scoring* annotation for feature *m/z* 342.1670 at 1.42 min are presented in **Table 1**. See **Supplementary Table S5** for annotation results concerning feature *m/z* 356.1860 at 1.83 min.

Both features were targeted within the extract and, after isolation, the structures of the corresponding compounds were determined by 1D and 2D NMR measurements (see spectra in **Supplementary Material S1** and **S2**). The NMR spectra of feature *m/z* 342.1670 at 1.42 min matched the spectra reported in the literature for predicentrine (Guinaudeau et al., 1979). The NMR spectra of feature *m/z* 356.1860 at 1.83 min matched those of glaucine (Huang et al., 2004). In both cases, the candidate structure proposed at rank 1 *via* the *taxonomically informed scoring* annotation was found to be correct. With the classical spectral matching process, the correct candidates were initially ranked at positions 9 and 7 for predicentrine and glaucine, respectively (see **Table 1** and **S5** in **Supplementary Material**).

Additional predicentrine analogues were annotated in the corresponding cluster (see examples in **S6** in **Supplementary Material**).

# DISCUSSION

The metabolite annotation process can be boiled down to the comparison of attributes (e.g. exact mass, molecular formula (MF), fragmentation spectra) of the queried analyte to attributes of candidate structures present in a DB. When HRMS and appropriate heuristic filters are used, the establishment of the MF of the analyte is relatively straightforward (Kind and Fiehn, 2007). However, this is not sufficient to proceed to metabolite annotation given the isomeric nature of numerous NPs: over all compounds reported in the DNP, less than 10% have a unique chemical formula, the average number of compounds per molecular formula is 8.6 and a maximum of 1,274 isomers is found for C15H20O4. With an MS1 analysis relying on exact mass only, no ranking between those isomers is possible. Computational metabolite annotation tools attribute a score to candidate structures and thus make it possible to discriminate isomeric molecules. However, MF and fragmentation spectra are not the only attributes which can be compared in the metabolite annotation process. Specialized metabolites, as products of biosynthetic clusters themselves part of the genome, are tightly linked to the taxonomic position of the producing organisms (Hoffmann et al., 2018; Ernst et al., 2019). Here, we demonstrate that the taxonomic distance between the biological source of the queried compound and the biological source of the candidate structures is a valuable attribute to integrate into the metabolite annotation process. We show that such information can be considered in a *taxonomically informed scoring* system and automatically applied to the outputs of different computational metabolite annotation programs. The consideration of taxonomic information was shown to systematically improve the F1 score of the evaluated solutions (ISDB-DNP, Sirius, MS-Finder) with a 1.5- to 7-fold increase. The advantages of considering such information in the metabolite annotation process are thus observed *independently* of the tools and their associated structural DBs.

It is worth noting that this benchmarking was carried out to evaluate the importance of considering taxonomic information during the metabolite annotation process. *It was not meant to compare the performances of the tools.* Indeed, all compounds of the benchmarking dataset are present in the DNP, and the ISDB-DNP tool, which is by definition backed by the same DB, is thus favored. On the other hand, the GNPS spectral libraries were also part of the Sirius training set. Furthermore, filters for the selection of [M+H]+ adducts and for the filtering of MS/MS spectra (500 most intense peaks) were applied to meet the restrictions of the ISDB-DNP and MS-Finder, respectively. Finally, a number of entries (197) of the benchmarking dataset were found to have a large mass difference (> 0.01 Da) between their experimental parent ion mass and their calculated exact mass. For example, cevadine [M+H]+ (CCMSLIB00004689734) had an experimental parent ion mass of 632.386 Da, while its calculated exact mass is 591.3407 Da (C32H49NO9). Of course, such erroneous entries cannot be identified by the computational metabolite annotation tools (the list of these problematic entries is available online, Problematic\_entries.csv). Altogether, these elements prevent a fair comparison of each tool's performance. Another precautionary statement concerns the results of the optimization on candidate datasets for which taxonomic information had been randomly degraded at multiple taxa levels. This optimization indicated, for the ISDB-DNP results, that the optimal combination of scores was 0.81, 1.62 and 2.55 (family, genus and species taxa level, respectively). Such results should be taken with caution, and not as absolute optimal values, as such optimization processes are heavily dependent on the training sets. Nevertheless, the optimization indicates that the best results were repeatedly obtained when the assigned scores were inversely proportional to the taxonomic distance between the biological sources of the queried spectra and those of the candidate structures.
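The mass consistency check mentioned above can be illustrated with a minimal Python sketch. It is not the procedure used to build Problematic\_entries.csv: the function names are hypothetical, only C, H, N and O are handled, and the comparison against the protonated ion (neutral exact mass plus one proton) is an assumption made for illustration.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes, rounded to six decimals.
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}
PROTON = 1.007276  # mass of a proton (Da)


def exact_mass(formula):
    """Monoisotopic mass of a neutral molecule from a simple formula such as C32H49NO9."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def is_problematic(formula, experimental_mz, tol=0.01):
    """Flag an entry whose experimental [M+H]+ m/z deviates by more than tol (Da).

    Assumption: the expected value is the neutral exact mass plus one proton.
    """
    expected_mz = exact_mass(formula) + PROTON
    return abs(experimental_mz - expected_mz) > tol


# Cevadine example from the text: reported parent ion mass of 632.386 Da.
cevadine_mass = exact_mass("C32H49NO9")
cevadine_flagged = is_problematic("C32H49NO9", 632.386)
```

With the values quoted in the text, the cevadine entry is flagged because the reported parent ion mass is roughly 40 Da away from the expected protonated ion.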

Other limitations of the described metabolite annotation strategy include its application range and prerequisites. Indeed, it is important to note that such a *taxonomically informed scoring* system will mostly benefit the annotation of *specialized metabolites* and not of ubiquitous molecules (e.g. those coming from primary metabolism), since the latter occur across most taxa and their biological source carries little discriminating information. Furthermore, it heavily depends on the availability and quality of DBs compiling structures and their biological sources reported as a fully and homogeneously resolved taxonomy. To the best of our knowledge, such DBs are not publicly available and downloadable at the moment. KNApSAcK (http://www.knapsackfamily.com/KNApSAcK/) is a comprehensive species-metabolite relationship database compiling 116,315 metabolite-species pair entries; it is accessible online but not downloadable. Other databases such as FooDB (http://foodb.ca) are fully downloadable but are focused on food-related metabolites; furthermore, the biological sources field is not standardized. The NPAtlas (https://www.npatlas.org/) is an interesting initiative; however, biological source information down to the species level is only accessible in query mode and the DB is limited to 24,594 metabolites of microbial origin. The Dictionary of Natural Products, which we used in this study, is the widest compilation of structure/biological source pairs, but is only available commercially. Furthermore, the biological sources are reported as a free text field (codes are available only for the family taxa level and above), thus requiring tedious standardization and name resolving.

It is therefore important for the community to start the systematic reporting of biological sources, together with spectral and structural information, when documenting novel metabolites. In fact, reporting newly described biological occurrences should be encouraged even for previously described metabolites. However, the policy of most journals in NP research is to accept for publication only descriptions of novel and bioactive structures, which hinders these potentially informative reports. The GNPS spectral libraries (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp) and MassIVE repositories (https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp) appear to be optimal places, at the moment, to compile and share NP spectral and structural information. However, although free text comments can complement the documentation of an entry, no standardized fields are available to report the biological sources of the uploaded spectra. The creation of such a feature, ideally directly linking the entered biological sources to existing taxonomy backbones such as GBIF (https://www.gbif.org) or Catalogue of Life (http://www.catalogueoflife.org), would be extremely useful. A recent initiative, the Pharmacognosy Ontology (PHO), which builds on 50 years of development of NAPRALERT (https://www.napralert.org), aims at providing a free and open resource that will link taxonomical, chemical and biological data (http://ceur-ws.org/Vol-1747/IP12\_ICBO2016.pdf). Of course, in addition to the correct and systematic reporting of biological source occurrences in dedicated DBs, it is of utmost importance to count on the expert knowledge of trained taxonomists specialized in the classification of living organisms. Unfortunately, it seems that today these people are very few (Ajmal Ali and Choudhary, 2011; Drew, 2011).

Building on the proposed *taxonomically informed scoring*, further developments will involve a more accurate quantification of taxonomic distances and a strengthening of the metascoring system. Indeed, the approach presented here only considers the identity, at different taxa levels, between the biological sources of the query compounds and those of the candidate structures. Taking into account a more precise phylogenetic position within or across taxa, for example *via* the calculation of taxonomic distinctiveness indexes (Clarke and Warwick, 1998; Weikard et al., 2006), could offer a more accurate distance and eventually improve such a *taxonomically informed scoring* process. Such calculations could, however, prove computationally demanding to perform on the fly. On another front, efforts are still needed towards the establishment of a global metascore (see **Figure 1**). Integrating the proposed taxonomic distance scoring (S2 in **Figure 1**) with as many available metadata as possible (S3, S4, …) when proceeding to metabolite annotation should only be beneficial to the process. However, issues such as the individual weights (*w*1, *w*2, *w*3, …) to attribute to each individual score of the metascore will have to be addressed.
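One possible form of such a metascore is a weighted sum of the individual scores. The sketch below is only an illustration under that assumption: the text does not specify how the scores S1, S2, S3, … and weights *w*1, *w*2, *w*3, … would be combined, and the function name is hypothetical.

```python
def metascore(scores, weights):
    """Weighted sum of individual scores (S1, S2, ...) with weights (w1, w2, ...).

    Assumption: a linear combination; other aggregation schemes are possible.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


# Illustrative values: S1 is a spectral similarity score and S2 a taxonomic score.
example = metascore(scores=[0.8, 2.55], weights=[1.0, 1.0])
```

With all weights set to 1, this reduces to the simple sum of the original score and the taxonomic score used in the present work.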

# CONCLUSION

Efficient characterization of specialized metabolomes is a key challenge in metabolomics and NP research. Recent technical advances allow access to an ever-increasing amount of data, raising the need for *ad hoc* computational solutions for their interpretation. The metabolite annotation process, which can be summarized as the comparison of attributes of the queried features against attributes of the candidate structures, can benefit from information complementary to the classically used MS/MS fragmentation. Ideally, the quantification of multiple attributes' similarities (or dissimilarities) should be integrated within a metascoring system. Here, we demonstrate that the consideration of the taxonomic distance separating the biological sources of the queried analytes from those of the candidate structures can drastically improve the efficiency of existing MS-based computational metabolite annotation solutions. Metabolite annotation is crucial to guide chemical ecology research or drug discovery projects. More than two hundred years later, the present work thus supports the first of De Candolle's assumptions, *"Plant taxonomy would be the most useful guide to man in his search for new industrial and medicinal plants"* (de Candolle, 1804). His correlated postulate, *"Chemical characteristics of plants will be most valuable to plant taxonomy in the future"*, will be equally interesting to verify with computational approaches. Various strategies have been proposed to exploit structural (or biosynthetic) relationships among metabolites and further organize the producing organisms (Liu et al., 2017; Junker, 2018; Ernst et al., 2019; Kang et al., 2019), and interesting developments will appear once robust metabolite annotation solutions are coupled to comprehensive DBs compiling structures and their biological sources. Indeed, specialized metabolome annotation could be a novel way to infer the taxonomic position of an unknown sample, just as valid as genetic sequencing. Metabolite annotation can benefit from taxonomy, and taxonomic relationships can be inferred from precise metabolite characterization. Efforts in both directions should thus fuel a *virtuous cycle of research aiming to better understand Life and its chemistry.*

# MATERIAL AND METHODS

# Outline and Implementation of the Taxonomically Informed Scoring System

To evaluate the importance of considering taxonomic information in the annotation process, three different computational mass spectrometry-based metabolite annotation tools were used (namely, ISDB-DNP, MS-Finder and Sirius). This resulted in three different outputs, each consisting of a list of candidates returned by the tool for the entries of the benchmarking dataset. These candidates were ranked according to the scoring system of each tool. R scripts in the form of markdown notebooks were written to perform *1)* cleaning and standardization of the outputs (1\_taxo\_cleaner.Rmd) and *2) taxonomically informed scoring* and re-ranking (2\_taxo\_scorer.Rmd). First, the outputs were standardized to a table containing on each row: a unique spectral identifier (CCMSLIB N°) of the queried spectra, the short InChIKey of the candidate structures, the score of the candidates (within the scoring system of the metabolite annotation tool used), the biological source of the standard compound and the biological source of the candidate structures. As described in the results section, a score, inversely proportional to the taxonomic distance between the biological source of the annotated compound and that of the candidate structure, was given when an exact match was found between both biological sources at the family, genus and/or species level(s). The sum of this score and the original score yielded the *taxonomically informed score*. This score was then used to re-rank the candidates. See **Figure 2** for a schematic overview of the *taxonomically informed scoring* process. Scripts are available at https://github.com/oolonek/taxo\_scorer.
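The scoring and re-ranking step can be sketched in Python as follows. This is a simplified illustration, not the content of 2\_taxo\_scorer.Rmd: the function names and the dictionary layout are hypothetical, the level scores are the values reported for the ISDB-DNP optimization in the Discussion, and two assumptions are made, namely that only the score of the finest matching level is added and that the original scores are already on a comparable scale.

```python
# Scores per taxonomic level (values reported for the ISDB-DNP optimization).
LEVEL_SCORES = {"family": 0.81, "genus": 1.62, "species": 2.55}


def taxonomic_bonus(query_taxon, candidate_taxon, level_scores=LEVEL_SCORES):
    """Return the score of the finest level at which both biological sources match.

    Assumption: levels are checked from species up to family and only the first
    exact match contributes.
    """
    for level in ("species", "genus", "family"):
        query_name = query_taxon.get(level)
        candidate_name = candidate_taxon.get(level)
        if query_name and candidate_name and query_name == candidate_name:
            return level_scores[level]
    return 0.0


def rerank(candidates, query_taxon):
    """Add the taxonomic bonus to each original score and sort in descending order."""
    rescored = []
    for candidate in candidates:
        bonus = taxonomic_bonus(query_taxon, candidate["taxon"])
        rescored.append({**candidate, "taxo_score": candidate["score"] + bonus})
    return sorted(rescored, key=lambda c: c["taxo_score"], reverse=True)


# Illustrative query and candidates (placeholder identifiers and scores).
query = {"family": "Papaveraceae", "genus": "Glaucium",
         "species": "Glaucium grandiflorum"}
candidates = [
    {"short_inchikey": "AAAAAAAAAAAAAA", "score": 0.9,
     "taxon": {"family": "Fabaceae", "genus": "Medicago",
               "species": "Medicago sativa"}},
    {"short_inchikey": "BBBBBBBBBBBBBB", "score": 0.6,
     "taxon": {"family": "Papaveraceae", "genus": "Glaucium",
               "species": "Glaucium flavum"}},
]
ranked = rerank(candidates, query)
```

In this toy case the second candidate, reported in the same genus as the query, overtakes the first one despite its lower original score.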

# Dataset Preparation

## Structural and Biological Sources Dataset

In the Dictionary of Natural Products (v 27.1), taxonomic information appears in two fields: the *Biological Source* field, a free text field reporting the occurrence of a specific compound, and the *Compound Type* field, which reports various codes corresponding to molecule classes or to the taxonomic position at the family level. As an example, for the entry corresponding to larictrin 3-glucoside (ODXINVOINFDDDD-UHFFFAOYSA-N), the Biological Source field indicates "Isol. from *Larix* spp., *Cedrus* sp. and other plant spp. Constit. of *Vitis vinifera* cv. Petit Verdot grapes and *Abies amabilis.*" and the Compound Type field indicates "V.K.52600 W.I.40000 W.I.35000 Z.N.50000 Z.Q.71600", suggesting that biological sources are found in the Phyllocladaceae (Z.N.50000) and Vitaceae (Z.Q.71600) families. The biological source information is reported in a non-homogeneous way and multiple biological sources are reported in the same row. In order to extract taxonomic information from the free text contents, the *gnfinder* program (https://github.com/gnames/gnfinder) was used. Gnfinder takes UTF-8-encoded text as input and returns JSON-formatted output that contains the detected scientific names. It automatically detects the language of the text and uses complementary heuristic and natural language processing algorithms to detect patterns corresponding to scientific binomial or uninomial denominations. Here, language detection was forced to English. In addition to extracting scientific denominations, gnfinder can match the detected names against the Global Names index services (https://index.globalnames.org). The preferred taxonomy backbone was set to Catalogue of Life. This last step returned the full taxonomy down to the entered taxa level and also resolved synonymy. Since gnfinder is designed to mine raw texts, the JSON-formatted output indicates the position of the detected name in the original input by character position. A Python script was written to output a .csv file with the found name and taxonomy next to the corresponding input. When multiple biological sources were found for an entry, the entry was duplicated in order to obtain a unique structure/biological source pair per row. The script is available online (gnfinder\_field\_scrapper.py).
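The duplication of entries into one structure/biological source pair per row can be sketched as below. This is not gnfinder\_field\_scrapper.py: the input structure and field names are hypothetical and do not reproduce the gnfinder JSON schema; the names are assumed to have been detected and resolved beforehand.

```python
import csv
import io

FIELDS = ["inchikey", "family", "genus", "species"]


def expand_biosources(entries):
    """Return one row per (structure, biological source) pair."""
    rows = []
    for entry in entries:
        for name in entry["detected_names"]:
            rows.append({
                "inchikey": entry["inchikey"],
                "family": name["family"],
                "genus": name["genus"],
                "species": name["species"],
            })
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Larictrin 3-glucoside with two of the species named in its Biological Source field.
entries = [{
    "inchikey": "ODXINVOINFDDDD-UHFFFAOYSA-N",
    "detected_names": [
        {"family": "Vitaceae", "genus": "Vitis", "species": "Vitis vinifera"},
        {"family": "Pinaceae", "genus": "Abies", "species": "Abies amabilis"},
    ],
}]
table = to_csv(expand_biosources(entries))
```

Each detected species ends up on its own row, carrying the InChIKey of the parent entry.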

## Structural and Spectral Dataset

All GNPS libraries and publicly accessible third-party libraries were retrieved online (https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp) and concatenated as a single spectral file (Full\_GNPS\_lib.mgf) in the .mgf format. A Python Jupyter notebook (mgf\_filterer.ipynb) was created to filter the .mgf spectral file according to specific parameters: maximum and minimum number of fragments per spectrum and defined spectral ID (e.g. CCMSLIB N°). The spectral file was filtered to retain only entries having at least 6 fragments. For spectra containing more than 500 fragments, only the 500 most intense were kept. A second Python Jupyter notebook (GNPS\_parser\_cleaner.ipynb) was written to proceed to *1)* extraction of relevant metadata (parent ion mass, SMILES, InChI, library origin, source instrument, molecule name and individual spectrum id value (CCMSLIB N°)), *2)* filtering of entries having at least one piece of structural information associated (SMILES and/or InChI) and corresponding to protonated adducts, and *3)* conversion of structures to their InChIKey, a 27-character hashed version of the full InChI. The InChIKey conversion was performed using the RDKit 2019.03.1 framework (RDKit: Open-source cheminformatics; http://www.rdkit.org). This resulted in a structural dataset (GNPS\_lib\_structural.tsv) of 40138 entries corresponding to 8558 unique compounds. The dataset was further filtered to keep entries whose parent masses were between 100 and 1,500 Da. Duplicate structures and stereoisomers were removed by keeping distinct InChIKeys according to the first layer (first 14 characters) of the hash code. This spectral dataset encompasses spectra acquired on a variety of MS platforms (see **Supplementary Figure S7**). Scripts are available at https://github.com/oolonek/taxo\_scorer. Input and output data are available on OSF at the following address (https://osf.io/bvs6x/).
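The fragment-count and parent-mass filters described above can be sketched as follows. This is not the code of mgf\_filterer.ipynb: the function names are hypothetical, spectra are assumed to be already parsed into lists of (m/z, intensity) tuples, and the inclusive bounds of the mass window are an assumption.

```python
def filter_peaks(peaks, min_peaks=6, max_peaks=500):
    """Filter one spectrum given as a list of (mz, intensity) tuples.

    Returns None when the spectrum has fewer than min_peaks fragments;
    otherwise keeps the max_peaks most intense fragments, sorted by m/z.
    """
    if len(peaks) < min_peaks:
        return None
    most_intense = sorted(peaks, key=lambda p: p[1], reverse=True)[:max_peaks]
    return sorted(most_intense, key=lambda p: p[0])


def in_mass_range(parent_mass, low=100.0, high=1500.0):
    """Keep entries whose parent mass lies between low and high (Da, inclusive)."""
    return low <= parent_mass <= high


# Synthetic spectrum with 600 fragments of increasing intensity.
synthetic = [(100.0 + i, float(i)) for i in range(600)]
filtered = filter_peaks(synthetic)
```

Applied to the synthetic spectrum, the filter discards the 100 least intense fragments and returns the remaining 500 in m/z order.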

## Structural, Spectral, and Biological Sources Dataset (Benchmarking Dataset)

Once the structural and biological sources dataset and the structural and spectral dataset were prepared (as described above), both were joined in order to attribute a biological source to each spectrum. The scripts used for the merging step are part of the Python Jupyter notebook (GNPS\_parser\_cleaner.ipynb). Since in most cases stereoisomers are not expected to be differentiated based on their MS spectra, the combination of both datasets was made using the short InChIKey (first 14 characters of the InChIKey) as a common key. In this merging process, only entries having biological source information resolved against the Catalogue of Life and complete down to the species level were retained. However, this merging implied that, for a given biological source, the information on the 3D aspects of the structure was lost. While this was not an issue for the benchmarking objective of this work, the resulting dataset does not constitute a reliable occurrence dataset for annotations that require stereoisomers to be differentiated. The resulting dataset, containing structural, spectral and biological source information, consisted of 2107 distinct entries and was named the *benchmarking dataset*. The scripts used to generate the benchmarking dataset are available at https://github.com/oolonek/taxo\_scorer. The benchmarking dataset spectral data (benchmarking\_dataset\_spectral.mgf) and associated metadata (benchmarking\_dataset\_metadata.tsv) are available at the following address (https://osf.io/bvs6x/).
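The join on the short InChIKey can be sketched in plain Python as below. This is an illustration rather than the code of GNPS\_parser\_cleaner.ipynb: the function names, record layout, identifiers and InChIKeys are hypothetical placeholders.

```python
from collections import defaultdict


def short_inchikey(inchikey):
    """First layer (first 14 characters) of an InChIKey."""
    return inchikey[:14]


def join_on_short_inchikey(spectra, occurrences):
    """Attach each species-level biological source to the spectra sharing its short InChIKey."""
    by_key = defaultdict(list)
    for occurrence in occurrences:
        if occurrence.get("species"):  # keep only sources resolved to species
            by_key[short_inchikey(occurrence["inchikey"])].append(occurrence)

    joined = []
    for spectrum in spectra:
        key = short_inchikey(spectrum["inchikey"])
        for occurrence in by_key.get(key, []):
            joined.append({
                "spectrum_id": spectrum["spectrum_id"],
                "short_inchikey": key,
                "family": occurrence["family"],
                "genus": occurrence["genus"],
                "species": occurrence["species"],
            })
    return joined


# Placeholder records: the two InChIKeys differ only after the first layer.
spectra = [{"spectrum_id": "SPECTRUM_1",
            "inchikey": "AAAAAAAAAAAAAA-BBBBBBBBSA-N"}]
occurrences = [
    {"inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N", "family": "Papaveraceae",
     "genus": "Glaucium", "species": "Glaucium flavum"},
    {"inchikey": "AAAAAAAAAAAAAA-CCCCCCCCSA-N", "family": "Papaveraceae",
     "genus": "Glaucium", "species": None},
]
benchmark_rows = join_on_short_inchikey(spectra, occurrences)
```

The occurrence lacking a species is dropped, and the remaining one is attached to the spectrum even though the full InChIKeys differ in their stereochemical layer.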

# Computational Metabolite Annotation Tools

## ISDB-DNP

The ISDB-DNP (*In Silico* DataBase - Dictionary of Natural Products) is a metabolite annotation workflow that we previously developed (Allard et al., 2016). A version using the freely available Universal Natural Products Database (ISDB-UNPD) is available online (http://oolonek.github.io/ISDB/). This approach is focused on the annotation of specialized metabolites and consists of a pre-fragmented theoretical spectral DB version of the DNP. The *in silico* fragmentation was performed using CFM-ID, a software using a probabilistic generative model for the fragmentation process and a machine learning approach for learning the parameters of this model from MS/MS data (Allen et al., 2015). CFM-ID is, to the best of our knowledge, the only solution available at the moment that outputs a spectrum with fragment intensity prediction. The matching phase between experimental spectra and the theoretical DB is based on a spectral similarity computation performed using Tremolo as a spectral library search tool (Wang and Bandeira, 2013). The parameters used for the benchmarking dataset analysis were the following: parent mass tolerance 0.05 Da, minimum cosine score 0.1, no limit on the number of returned candidates.
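To make the matching parameters concrete, the sketch below shows a generic cosine-based library search with a parent mass tolerance and a minimum cosine score. It is not Tremolo's implementation: the greedy peak matching, the fragment tolerance of 0.01 Da, the function names and the data layout are all assumptions made for illustration.

```python
import math


def cosine_score(spec_a, spec_b, tol=0.01):
    """Greedy cosine similarity between two spectra given as (mz, intensity) lists.

    Returns (score, number_of_matched_peaks). Each peak is used at most once.
    """
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    # All peak pairs within the fragment tolerance, best products first.
    pairs = sorted(
        ((ia * ib, x, y)
         for x, (mza, ia) in enumerate(spec_a)
         for y, (mzb, ib) in enumerate(spec_b)
         if abs(mza - mzb) <= tol),
        reverse=True,
    )
    used_a, used_b = set(), set()
    dot, matched = 0.0, 0
    for product, x, y in pairs:
        if x in used_a or y in used_b:
            continue
        used_a.add(x)
        used_b.add(y)
        dot += product
        matched += 1
    return dot / (norm_a * norm_b), matched


def search(query, library, parent_tol=0.05, min_cosine=0.1):
    """Return (id, score) of library entries passing both filters, best first."""
    hits = []
    for entry in library:
        if abs(entry["parent_mass"] - query["parent_mass"]) > parent_tol:
            continue
        score, _ = cosine_score(query["peaks"], entry["peaks"])
        if score >= min_cosine:
            hits.append((entry["id"], score))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```

The defaults of `search` mirror the benchmarking parameters quoted above (0.05 Da and 0.1); leaving the number of returned hits unbounded corresponds to the absence of a candidate limit.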

## MS-Finder

This *in silico* fragmentation approach considers multiple parameters such as bond dissociation energies, mass accuracies, fragment linkages and various hydrogen rearrangement rules at the candidate ranking phase (Tsugawa et al., 2016). The resulting scores range from 1 to 10. The parameters used for the benchmarking dataset analysis were the following: mass tolerance setting: 0.1 Da (MS1), 0.1 Da (MS2); relative abundance cut off: 5%; formula finder settings: LEWIS and SENIOR check (yes), isotopic ratio tolerance: 20%, element probability check (yes), element selection (O, N, P, S, Cl, Br); structure finder settings: tree depth: 2, maximum reported number: 100, data sources: all except MINEs DBs (total number of structures: 321,617). MS-Finder v. 3.22 was used; it is available at the following address: http://prime.psc.riken.jp/Metabolomics\_Software/MS-FINDER/.

## Sirius

Sirius 4.0.1 is considered a state-of-the-art metabolite annotation solution, which combines molecular formula calculation and the prediction of a molecular fingerprint of a query compound from its fragmentation tree and spectrum (Dührkop et al., 2019). Sirius uses a DB of 73,444,774 unique structures for its annotations. The parameters used for the benchmarking dataset analysis were the following for the Sirius molecular formula calculation: possible ionization: [M+H]+, instrument: Q-TOF, ppm tolerance: 50 ppm, top molecular formula candidates: 3, filter: formulas from biological DBs. For the CSI:FingerID step, the parameters were the following: possible adducts: [M+H]+, filter: compounds present in biological DBs, maximal number of returned candidates: unlimited. Sirius 4.0.1 is available at the following address: https://bio.informatik.uni-jena.de/software/sirius/.

# Results Analysis

The F1 score was calculated for each evaluated metabolite annotation tool before and after the *taxonomically informed scoring* step. The F1 score is the harmonic mean of the recall (True Positive/(True Positive + False Negative)) and precision rate (True Positive/(True Positive + False Positive)) of a tool. The True Positive (TP) corresponds to the number of correct candidate annotations at rank 1, the False Positive (FP) to the number of wrong candidate annotations at rank 1, and the False Negative (FN) to the number of correct annotations at rank >1. The F1 score is then calculated as follows:

$$F1\ score = 2 \times \frac{\left(\text{Recall\ rate} \times \text{Precision\ rate}\right)}{\left(\text{Recall\ rate} + \text{Precision\ rate}\right)}$$
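As a worked illustration of these definitions, the following Python sketch computes the F1 score from the rank of the correct candidate for each query. It is not taken from taxo\_figures.Rmd: the function name and input encoding are hypothetical, and it assumes that every query returned at least one candidate (`None` meaning the correct structure was absent from the list).

```python
def f1_from_ranks(ranks):
    """Compute the F1 score from the rank of the correct candidate per query.

    ranks: list of 1-based integer ranks, or None when the correct structure
    was not among the returned candidates.
    """
    tp = sum(1 for r in ranks if r == 1)                    # correct at rank 1
    fp = sum(1 for r in ranks if r != 1)                    # wrong candidate at rank 1
    fn = sum(1 for r in ranks if r is not None and r > 1)   # correct at rank > 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (recall * precision) / (recall + precision)


# Five queries: two correct at rank 1, two correct at lower ranks, one absent.
example_f1 = f1_from_ranks([1, 1, 3, None, 2])
```

For this example, TP = 2, FP = 3 and FN = 2, giving a precision of 0.4, a recall of 0.5 and an F1 score of 4/9.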

An R notebook to analyze the results of the *taxonomically informed scoring* process and plot the figures of this manuscript is available online (taxo\_figures.Rmd) at https://github.com/oolonek/taxo\_scorer.

# Optimization of the Scores Combination for the Taxonomically Informed Scoring

In order to establish the optimal scores to be applied for each of the taxonomic distances (family, genus and species), the information related to candidate annotations was artificially degraded. For this, the annotation dataset returned by the ISDB-DNP approach against the benchmarking dataset was randomized. The randomized annotation dataset was then split into four equal blocks. For the first three blocks, the biological source information was deleted, respectively, at the species level; at the genus and species levels; and, finally, at the family, genus and species levels. The fourth block was not modified. Finally, the four blocks were merged back into a single dataset. The process was repeated four times, yielding four datasets with randomly incomplete biological sources. The taxonomic distance informed scoring process was compiled into a single function taking three arguments (scores given when a match was found at the family, genus and species level, respectively) and outputting the number of correct hits ranked at the first position. A parallelizable Bayesian optimization algorithm (https://github.com/AnotherSamWilson/ParBayesianOptimization) was then used, as it is particularly suited to the optimization of black-box functions for which no formal representation is available (arXiv:1807.02811). The bounds were set between 0 and 3 for the exploration of the three parameters of the function. The number of initial points was set to 10 and the number of iterations to 100. The parameter kappa (κ) was set to 5.152 to force the algorithm to explore unknown areas. The chosen acquisition function was Expected Improvement (ei). The epsilon parameter (ε, eps) was set to 0. The whole procedure was run four times on each of the four randomized datasets. The best sets of parameters were then averaged across the 16 result sets. All code required for this optimization step is available online (3\_taxo\_optimizer.Rmd) at https://github.com/oolonek/taxo\_scorer.
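The degradation of taxonomic information by blocks can be sketched in Python as follows. This is not the code of 3\_taxo\_optimizer.Rmd (which is written in R): the function name and row layout are hypothetical, and sending any remainder rows to the last, unmodified block is an assumption for datasets whose size is not a multiple of four. The Bayesian optimization itself is not reproduced here.

```python
import random

# Levels deleted in each of the four blocks; the fourth block is left intact.
LEVELS_TO_DROP = [
    ("species",),
    ("genus", "species"),
    ("family", "genus", "species"),
    (),
]


def degrade_taxonomy(rows, seed=0):
    """Shuffle the rows, split them into four blocks and delete taxonomic levels."""
    rng = random.Random(seed)
    shuffled = [dict(row) for row in rows]  # copy so the input is left untouched
    rng.shuffle(shuffled)
    block_size = len(shuffled) // 4
    for index, row in enumerate(shuffled):
        block = min(index // block_size, 3) if block_size else 3
        for level in LEVELS_TO_DROP[block]:
            row[level] = None
    return shuffled


# Eight identical placeholder rows make the effect of each block easy to count.
rows = [{"family": "Papaveraceae", "genus": "Glaucium",
         "species": "Glaucium flavum"} for _ in range(8)]
degraded = degrade_taxonomy(rows, seed=42)
```

Calling the function with four different seeds would give four datasets with randomly incomplete biological sources, analogous to those described above.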

# Chemical Analysis and Isolation of Compounds From *Glaucium* Extract

## Plant Material

The aerial flowering parts of three *Glaucium* species were collected in May and June of 2015 from the northern part of Iran including Mazandaran and Tehran provinces. The samples were identified by Dr. Ali Sonboli, Medicinal Plants and Drugs Research Institute, Shahid Beheshti University, Tehran, Iran. The voucher specimens MPH-2351 for *G. grandiflorum* (vernacular Shaghayegh goldrosht), MPH-2352 for *G. fimbrilligerum* (vernacular Shaghayegh sharabeie) and MPH-2353 for *G. corniculatum* (vernacular Shaghayegh shakhdar or red horned poppy) have been deposited at the Herbarium of Medicinal Plants and Drugs Research Institute (HMPDRI), Shahid Beheshti University, Tehran, Iran.

## Mass Spectrometry Analysis

Chromatographic separation was performed on a Waters Acquity UPLC system interfaced to a Q-Exactive Focus mass spectrometer (Thermo Scientific, Bremen, Germany), using a heated electrospray ionization (HESI-II) source. Thermo Scientific Xcalibur 3.1 software was used for instrument control. The LC conditions were as follows: column, Waters BEH C18 50 × 2.1 mm, 1.7 μm; mobile phase, (A) water with 0.1% formic acid; (B) acetonitrile with 0.1% formic acid; flow rate, 600 μl·min−1; injection volume, 6 μl; gradient, linear gradient of 5−100% B over 7 min and isocratic at 100% B for 1 min. The optimized HESI-II parameters were as follows: source voltage, 3.5 kV (pos); sheath gas flow rate (N2), 55 units; auxiliary gas flow rate, 15 units; spare gas flow rate, 3.0; capillary temperature, 350.00°C; S-Lens RF Level, 45. The mass analyzer was calibrated using a mixture of caffeine, methionine–arginine–phenylalanine–alanine–acetate (MRFA), sodium dodecyl sulfate, sodium taurocholate, and Ultramark 1621 in an acetonitrile/methanol/water solution containing 1% formic acid by direct injection. The data-dependent MS/MS events were performed on the three most intense ions detected in full scan MS (Top3 experiment). The MS/MS isolation window width was 1 Da, and the stepped normalized collision energy (NCE) was set to 15, 30 and 45 units. In data-dependent MS/MS experiments, full scans were acquired at a resolution of 35,000 FWHM (at *m/z* 200) and MS/MS scans at 17,500 FWHM, both with an automatically determined maximum injection time. After being acquired in an MS/MS scan, parent ions were placed in a dynamic exclusion list for 2.0 s.

### MS Data Pretreatment

The MS data were converted from .RAW (Thermo) standard data format to .mzXML format using the MSConvert software, part of the ProteoWizard package (Chambers et al., 2012). The converted files were treated using the MZMine software suite v. 2.38 (Pluskal et al., 2010).

The parameters were adjusted as follows: the centroid mass detector was used for mass detection with the noise level set to 1.0E6 for MS level 1, and to 0 for MS level 2. The ADAP chromatogram builder was used and set to a minimum group size of scans of 5, a minimum group intensity threshold of 1.0E5, a minimum highest intensity of 1.0E5 and an *m/z* tolerance of 8.0 ppm. For chromatogram deconvolution, the wavelets (ADAP) algorithm was used. The intensity window S/N was used as S/N estimator with a signal to noise ratio set at 25, a minimum feature height at 10,000, a coefficient area threshold at 100, a peak duration range of 0.02 to 0.9 min and an RT wavelet range of 0.02 to 0.05 min. Isotopes were detected using the isotopes peaks grouper with an *m/z* tolerance of 5.0 ppm, an RT tolerance of 0.02 min (absolute), the maximum charge set at 2, and the most intense isotope used as the representative isotope. An adduct (Na+, K+, NH4+, CH3CN+, CH3OH+, C3H8O+ (IPA+)) search was performed with the RT tolerance set at 0.1 min and the maximum relative peak height at 500%. A complex search was also performed using [M+H]+ for ESI positive mode, with the RT tolerance set at 0.1 min and the maximum relative peak height at 500%. Peak alignment was performed using the join aligner method (*m/z* tolerance at 8 ppm), absolute RT tolerance 0.065 min, weight for *m/z* at 10 and weight for RT at 10. The peak list was gap-filled with the same RT and *m/z* range gap filler (*m/z* tolerance at 8 ppm). Finally, the resulting aligned peak list was filtered using the peak-list rows filter option in order to keep only features associated with MS2 scans.

### Molecular Networks Generation

In order to keep the retention time and the exact mass information, and to allow for the separation of isomers, a feature-based molecular network (https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/) was created using the .mgf file resulting from the MZMine pretreatment step detailed above. Spectral data were uploaded to the GNPS molecular networking platform. A network was then created where edges were filtered to have a cosine score above 0.7 and more than six matched peaks. Furthermore, edges between two nodes were kept in the network if and only if each of the nodes appeared in the other's respective top 10 most similar nodes. The spectra in the network were then searched against the GNPS spectral libraries. All matches kept between network spectra and library spectra were required to have a score above 0.7 and at least six matched peaks. The output was visualised using Cytoscape 3.6 software (Shannon et al., 2003). The GNPS job parameters and resulting data are available at the following address (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=a475a78d9ae8484b904bcad7a16abd1f).
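The edge filtering rules described above can be sketched in Python as follows. This is an illustration and not the GNPS implementation: the function name and input layout are hypothetical, and computing the top-10 neighbour lists after the cosine and matched-peak thresholds have been applied is an assumption.

```python
def filter_edges(pairwise, min_cosine=0.7, min_matched=6, top_k=10):
    """Filter candidate edges of a molecular network.

    pairwise: list of (node_a, node_b, cosine, matched_peaks) tuples.
    An edge is kept if its cosine is above min_cosine, it has more than
    min_matched matched peaks, and each node is among the other's top_k
    most similar neighbours.
    """
    kept = [(a, b, c) for a, b, c, m in pairwise
            if c > min_cosine and m > min_matched]

    neighbours = {}
    for a, b, c in kept:
        neighbours.setdefault(a, []).append((c, b))
        neighbours.setdefault(b, []).append((c, a))

    # Top-k most similar neighbours of each node, by decreasing cosine.
    top = {node: {name for _, name in sorted(scored, reverse=True)[:top_k]}
           for node, scored in neighbours.items()}

    return [(a, b, c) for a, b, c in kept if b in top[a] and a in top[b]]


# Toy example with top_k=1 so that the mutual-neighbour rule is visible.
toy_pairs = [
    ("A", "B", 0.90, 8),
    ("A", "C", 0.80, 7),
    ("B", "C", 0.95, 9),
    ("A", "D", 0.60, 9),   # cosine too low
    ("C", "D", 0.85, 5),   # too few matched peaks
]
toy_edges = filter_edges(toy_pairs, top_k=1)
```

In the toy example only the B-C edge survives, because A is not the most similar neighbour of B or of C.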

### Taxonomically Informed Metabolite Annotation

The spectral file (.mgf) and attributes metadata (.clustersummary) obtained after the MN step were annotated using the ISDB-DNP with the following parameters: parent mass tolerance 0.005 Da, minimum cosine score 0.2, maximal number of returned candidates: 50. An R script was written to apply the *taxonomically informed scoring* to GNPS outputs and return an attribute table which can be directly loaded in Cytoscape. The script is available online (taxo\_scorer\_user.Rmd) at https://github.com/oolonek/taxo\_scorer.

### Isolation of Predicentrine and Glaucine From *G. grandiflorum*

The air-dried, ground and powdered plant material (500 g) was successively extracted with solvents of increasing polarity (hexane, ethyl acetate and methanol), 4 × 5.0 L of each solvent (48 h). An aliquot of each ethyl acetate and methanolic extract was submitted to C18 SPE (eluted with 100% MeOH), dried under nitrogen flow and redissolved at 5 mg/ml in MeOH for LC–MS analysis. The methanolic extract of *G. grandiflorum* was concentrated under reduced pressure, then dried under a nitrogen flow until complete evaporation of the residual solvent, yielding 50 g of extract. An aliquot (5 g) was subjected to VLC in order to eliminate sugars and other very polar compounds. A 250 ml sintered-glass Buchner funnel connected to a vacuum line was packed with C18 reverse phase LiChroprep 40–63 μm (Lobar Merck, Darmstadt, Germany). After conditioning the stationary phase with methanol (4 × 250 ml, 0.1% formic acid) and distilled water (4 × 250 ml, 0.1% formic acid), the 5 g of methanolic extract was dissolved in water and the mixture was deposited on the stationary phase. Elution of the sample was conducted using water (4 × 250 ml, 0.1% formic acid) in the first step, followed by methanol (4 × 250 ml, 0.1% formic acid) in the second step. This process yielded 1.4 g of processed methanolic extract. After optimization of the conditions at the analytical level, 50 mg of the extract was solubilized in 500 µl DMSO and injected using a Rheodyne® valve (1 ml loop). Semi-preparative HPLC-UV purification was performed on a Shimadzu system equipped with LC20A module elution pumps, an SPD-20A UV/VIS detector, a 7725I Rheodyne® injection valve, and a FRC-10A fraction collector (Shimadzu, Kyoto, Japan). The HPLC system was controlled by the LabSolutions software. The HPLC conditions were as follows: Waters X-Bridge C18 column (250 × 19 mm i.d., 5 μm) equipped with a Waters C18 pre-column cartridge holder (10 × 19 mm i.d.); solvent system consisting of H2O (2 mM TEA and 2 mM ammonium acetate) (A) and ACN (2 mM TEA) (B). The optimized separation conditions were transferred from the analytical to the semi-preparative scale using a geometric gradient transfer software (Guillarme et al., 2008). The separation was conducted in gradient elution mode as follows: 5% B in 0–5 min, 12% B in 5–10 min, 30% B in 10–30 min, 60% B in 30–55 min, 100% B in 55–65 min. The column was reconditioned by equilibration with 5% B for 15 min. The flow rate was 17 ml/min and UV traces were recorded at 210 nm and 280 nm. The separation procedure yielded 0.3 mg of predicentrine and 3.4 mg of glaucine. Spectra for predicentrine (CCMSLIB00005436122) and glaucine (CCMSLIB00005436123) were deposited on GNPS servers.

### NMR Analysis

The NMR spectra of each isolated compound were recorded on a Bruker BioSpin 600 MHz spectrometer (Avance Neo 600). Chemical shifts (δ) were recorded in parts per million in methanol‒d4 with TMS as an internal standard. NMR data are available as **Supplementary Material S1** and **S2**.

# DATA AVAILABILITY STATEMENT

Scripts and datasets generated and analyzed for this study can be found on GitHub (scripts): https://github.com/oolonek/taxo\_scorer and at the following OSF repository (datasets): https://osf.io/bvs6x/ (DOI 10.17605/OSF.IO/BVS6X). This manuscript has been released as a pre-print at bioRxiv: http://dx.doi.org/10.1101/702308.

# AUTHOR CONTRIBUTIONS

P-MA, AR, MD-K, and J-LW designed the study. JB wrote the Python script for gnfinder output formatting. P-MA and AR wrote the scripts for dataset preparation, *taxonomically informed scoring*, and results analysis. P-MA, AR, MD-K, SO, MB, and TS used the scripts for metabolite annotation and provided feedback. MB and SNE collected the *Glaucium* species. P-MA and AR performed the LC–MS analysis on the *Glaucium* extracts. MB analyzed profiling data, isolated the compounds of *Glaucium*, and established their structures. P-MA wrote the manuscript together with AR and J-LW. All authors discussed the results and commented on the manuscript.

# FUNDING

JB gratefully acknowledges the support of this work by grant U41 AT008706 from NCCIH and ODS. J-LW is thankful to the Swiss National Science Foundation for the support in the acquisition of the NMR 600 MHz (SNF R'Equip grant 316030\_164095).

# ACKNOWLEDGMENTS

MD-K gratefully acknowledges the support of the Yamada Science Foundation and The Nagai Foundation Tokyo. The authors acknowledge Ali Bakiri for fruitful discussions on the optimization of the weights.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.01329/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Rutz, Dounoue-Kubo, Ollivier, Bisson, Bagheri, Saesong, Ebrahimi, Ingkaninan, Wolfender and Allard. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Untargeted Metabolomics of Nicotiana tabacum Grown in United States and India Characterizes the Association of Plant Metabolomes With Natural Climate and Geography

*Dong-Ming Ma1, Saiprasad V. S. Gandra2, Raman Manoharlal2, Christophe La Hovary1 and De-Yu Xie1\**

1 Department of Plant and Microbial Biology, North Carolina State University, Raleigh, NC, United States, 2 ITC Life Sciences and Technology Centre (LSTC), ITC Limited, Karnataka, Bengaluru, India

#### Edited by:

Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan

#### Reviewed by:

Corey Broeckling, Colorado State University, United States

Akira Oikawa, Yamagata University, Japan

> \*Correspondence: De-Yu Xie dxie@ncsu.edu

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 05 May 2019 Accepted: 04 October 2019 Published: 30 October 2019

#### Citation:

Ma D-M, Gandra SVS, Manoharlal R, La Hovary C and Xie D-Y (2019) Untargeted Metabolomics of Nicotiana tabacum Grown in United States and India Characterizes the Association of Plant Metabolomes With Natural Climate and Geography. Front. Plant Sci. 10:1370. doi: 10.3389/fpls.2019.01370

Climate change and geography affect all living organisms. To date, the effects of climate and geographical factors on the plant metabolome remain largely open to worldwide and local investigation. In this study, we designed field experiments with tobacco (*Nicotiana tabacum*) in India and the USA and used untargeted metabolomics to understand the association of two weather factors and two continental locations with tobacco metabolism. Field research stations in Oxford, North Carolina, USA, and Rajahmundry, Andhra Pradesh, India, were selected to grow a commercial tobacco genotype (K326) for 2 years. Plant growth, field management, and leaf curing followed protocols standardized for tobacco cultivation. Gas chromatography–mass spectrometry-based unbiased profiling annotated 171 non-polar and 225 polar metabolites from cured tobacco leaves. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) showed that the two growing years and the two field locations played primary and secondary roles, respectively, in affecting metabolite profiles. PCA and Pearson analysis, which used nicotine, 11 other groups of metabolites, two locations, temperatures, and precipitation, revealed that in North Carolina, temperature changes were positively associated with the profiles of sesquiterpenes, diterpenes, and triterpenes, but negatively associated with the profiles of nicotine, organic acids of the tricarboxylic acid cycle, and sugars; in addition, precipitation was positively associated with the profiles of triterpenes. In India, temperature was positively associated with the profiles of benzenes and polycyclic aromatic hydrocarbons, but negatively associated with the profiles of amino acids and sugars. Further comparative analysis revealed that nicotine levels were affected by weather conditions; nevertheless, the trend of nicotine in leaves was independent of the two geographical locations and of weather changes. All these findings suggest that climate and geographical variation significantly differentiated tobacco metabolism.

Keywords: climate change, continent, geography, metabolomics, nicotine, tobacco, non-polar metabolite, polar metabolite

# INTRODUCTION

A global warming trend has been under way for the last 100 years (Keller, 2007; Schnoor, 2007). If this trend continues, the global mean temperature is predicted to increase by 1–4°C in the next 50–100 years (Jones, 2013; Kintisch, 2013). Accordingly, over the last few decades, research efforts in different countries and regions have been undertaken to understand the effects of both global warming and regional climate change on natural systems and human health, and more are under way. The main research topics focus on the global-scale impacts of climate change. These include the effects of climate change on ecosystems and biodiversity, sea level rise, ocean acidification, water resources and desertification, agriculture and food security, and human health (Gosling et al., 2011). To date, the completed research has provided fundamental evidence that climate change significantly affects global and regional ecosystems (Polley et al., 2013; Schmitz, 2013; Kizildeniz et al., 2015; Park et al., 2015; Hauser et al., 2016; Taylor and Kumar, 2016), agriculture, for example through biotic and abiotic stresses (Adams et al., 1998; Dwivedi et al., 2013; Chauhan et al., 2014; Abberton et al., 2016; Kollah et al., 2016), and human health, for example through the increase of infectious diseases and other public health problems (Patz et al., 1996; Rom and Pinkerton, 2014; Clayton et al., 2015; Kjellstrom et al., 2016). All of these data are fundamentally informative for countries and global organizations developing strategies and policies to slow down global warming. To appropriately understand the effects of climate change on ecosystems, agriculture, and food safety, more basic research is necessary at the molecular, biochemical, chemical, and physiological levels. However, this type of research is limited. For example, few studies have examined the effects of the combination of climate change and geography on the natural product composition associated with crop and food safety and quality. Furthermore, there are even fewer reports of comparative investigations of plant or crop metabolism associated with climate change across continents. One of the main challenges is to conduct experiments using model plants or crops with the same genetic background that can be grown in the field on different continents. Accordingly, development of such a system is necessary to comprehensively enhance our understanding of the biological consequences of climate change.

Tobacco (*Nicotiana tabacum*) is an appropriate plant for studying the effects of climate change on plant metabolism. Tobacco metabolism is well characterized and involves complex biosynthetic networks of diverse plant metabolites such as alkaloids, terpenoids, phenolics, polyketides, alkanes, alkynes, and other metabolites, among which tobacco alkaloids, such as nicotine, cause public health problems (Sierro et al., 2014). Furthermore, more than 4,200 metabolites have been identified from tobacco plants and more than 8,000 metabolites in tobacco smoke (Rodgman and Perfetti, 2009). Currently, approximately 125 countries in Asia, Africa, North America, South America, Europe, and Australia grow tobacco plants on more than 4.3 million hectares (Eriksen et al., 2015). One primary variety grown in all these regions is the flue-cured tobacco cultivar K326, whose genome was sequenced a few years ago (Sierro et al., 2014). All these advantages make K326 an appropriate crop plant for studying the effects of climate change, geographic region, or a combination of both on plant metabolism.

In this study, we used K326 tobacco to conduct a 2-year pilot study to understand the effects of climate change and two continental growth locations on plant metabolism. Based on our previous metabolic profiling of a small number of tobacco samples collected from tobacco companies in India, North Carolina, and Brazil (Ma et al., 2013), we hypothesized that weather and growing sites on different continents differentially alter the tobacco metabolomes associated with abiotic and biotic stresses. To test this hypothesis, we selected two field locations: the Oxford tobacco research station in North Carolina, USA, and the ILTD research station (department) of ITC Limited at Rajahmundry, Andhra Pradesh, India. These two locations have been used for long-term tobacco field research. We grew K326 tobacco plants at these two locations for two consecutive years. To understand the effects of climate on plant metabolism, we followed standard tobacco industry protocols to manage field growth and to collect and cure leaf samples. We recorded daily temperature and precipitation at the two locations. A large number of samples were collected for untargeted profiling of non-polar and polar metabolites using gas chromatography–mass spectrometry (GC-MS). A metabolite analysis pipeline was developed with Mass Profiler Professional (Agilent Technologies, Inc). Metabolite profiles, main weather factors, and geographic locations were integrated for association analysis. The resulting data indicated an association of metabolic variations with climatic and geographical factors.

# MATERIALS AND METHODS

# Field Study Sites and Selection of Tobacco Variety

Two field sites on different continents were selected for plant growth, viz., the Oxford research station in Oxford, North Carolina, USA (latitude: 36, longitude: −78), and the ILTD research station of ITC Limited in Rajahmundry, Andhra Pradesh, India (latitude: 16, longitude: 80). At each research station, three field plots were laid out to grow plants; each plot had an area of 4,047 m² (one acre) and constituted one field replicate (**Supplementary Figure 1**). The *N. tabacum* cultivar used was K326, which is one of the primary flue-cured tobacco crops grown globally. The rationale was that this cultivar produces a large number of plant natural products and its field growth protocol has been industrially standardized for farmers on different continents. In addition, its secondary metabolism has been well documented in the literature. Therefore, it is an appropriate crop plant for investigating the effects of climate and geography on plant metabolism.

# Experimental Design, Planting, and Sampling in North Carolina

Field growth of plants (**Supplementary Figure 1**) followed protocols documented for K326 in these two regions. The growth guide for K326 is published for farmers in North Carolina every year (https://content.ces.ncsu.edu/catalog/collection/23069/flue-cured-tobacco-information). We used this commercial growth guide to perform our experiments. Seeds were planted in soil contained in floating nursery beds in the greenhouse (**Supplementary Figure 2A**) at the end of February. This is the technique used in commercial tobacco farming. Seedlings were not moved out of the greenhouse until the second week of May in 2011 (year 1) and 2012 (year 2), when plants were 72 days old. All seedlings grew strongly and uniformly (**Supplementary Figure 2B**). In North Carolina, prior to transplanting, three plots were supplemented with 560 kg ha⁻¹ of 8-8-24 base fertilizer and treated with clomazone and carfentrazone-ethyl/sulfentrazone for weed control. Each plot was planted with 800 plants (**Supplementary Figures 2C**, **D**) in the second week of May 2011 and 2012 (the planting date depended on the weather). Thus, 2,400 plants in total were grown in each farming season. Plants were later side-dressed with 168 kg ha⁻¹ of 15.5-0-0 fertilizer. Field growth management, such as irrigation (two to three times with an irrigation reel and water from a lake) and the use of fungicides and herbicides, followed protocols documented at the research station. Plant growth was recorded and photographed each month (**Supplementary Figure 1**). When plants started to develop flower buds in early July, we selected 20 plants in each plot and grouped every 10 plants as one biological replicate. Consequently, each plot had two biological replicates. In total, 60 plants were selected from three plots and formed six biological replicates. All these plants grew to almost the same height, flowered on the same date, and had 24 leaves.
After this step, all plants were topped to keep 20 leaves, which were tagged from the bottom to the top with group numbers I to VII as shown in **Supplementary Figure 3**. After topping, plants were sprayed with maleic hydrazide to control suckers (axillary buds). Three weeks after topping, we started to harvest the leaf samples from group I at the bottom to group VII at the top (**Supplementary Figure 3**) on Aug. 3, Aug. 24, Sept. 14, and Oct. 4 (**Supplementary Table 1**). In total, we harvested seven groups of leaf samples, groups I–VII (**Supplementary Table 1** and **Supplementary Figure 3**). All samples were harvested in the morning. After the leaves were excised from the stems, they were immediately separated into two halves, labeled as half 1 and half 2 (**Supplementary Figure 4**). The second half, which included the main middle vein, was immediately cured in a barn for metabolomics experiments. The first half, without the main middle vein, was immediately frozen in liquid nitrogen, transported to the laboratory by car, and then stored in −80°C freezers for other experiments. After 1 month of curing, 42 cured biological samples were obtained for metabolomics analysis in each year. In addition, 10 additional plants in each plot were selected and pooled as one biological sample for leaf harvest. However, leaves from these plants were not cut into two halves. Accordingly, seven uncut groups (I–VII, **Supplementary Figure 3**) of leaves from 30 plants were harvested and immediately cured in the same barn. In total, 21 uncut biological samples were obtained (**Supplementary Table 2**). These samples were used as controls to evaluate whether cutting leaves into two halves could affect metabolite profiles. In summary, 84 cut leaf (biological) samples were collected from seven leaf positions (**Supplementary Figure 3**, **Supplementary Table 1**) and cured for metabolic profiling. To simplify the sample names for data analysis, "U" was used to represent USA, and "F" and "S" were used to represent the first and second year of growth. Based on the positions defined for sampling (**Supplementary Figure 3**), we used U-F1H through U-F7H to describe the USA samples harvested in the first year from positions 1 to 7, and U-S1H through U-S7H to describe the USA samples harvested in the second year from positions 1 to 7. Each group of samples was composed of six biological replicates (**Supplementary Tables 1** and **3**). In addition, 42 entire leaf (uncut, biological) samples were collected as controls from seven positions (**Supplementary Table 2**) and cured for control metabolic profiling. Cured samples were ground into fine powder. For each biological sample, three technical replicates were prepared for the non-polar and polar metabolite extractions described below.

# Experimental Design, Planting, and Sampling in India

In India, field design, experimental design, and field management practices were the same as described above. However, field growth of plants and sampling times (**Supplementary Table 3**) differed from those in North Carolina because of the tropical weather in India. In addition, plants were irrigated 6–8 times with ground water using a furrow irrigation method.

Moreover, the protocol developed for farmers requires keeping 22 leaves on each plant (**Supplementary Figure 5**). Therefore, in our experiments, after topping, each plant had 22 leaves, which were divided into eight groups (**Supplementary Figure 5**). The leaf sampling approach was the same as described above for the North Carolina samples (**Supplementary Figure 4**). In total, 48 biological leaf samples with a main vein (midrib) (**Supplementary Table 4**) were collected and cured for metabolomics in each year. Samples frozen in liquid nitrogen were transported to the ITC laboratory by air and then stored in −80°C freezers. In addition, 24 biological samples of entire, uncut leaves were harvested and then cured as controls for metabolomics in each year. Daily weather conditions during the two growing seasons were obtained from the local weather station. In summary, in 2 years of growth, 96 cut leaf (biological) samples were collected from eight positions (**Supplementary Table 3**) and cured for metabolomics. To simplify the sample names used for data analysis, we used "I" to represent India (**Supplementary Table 3**). We used I-F1H through I-F8H to describe the India samples harvested in the first year from positions 1 to 8, and I-S1H through I-S8H to describe the India samples harvested in the second year from positions 1 to 8. Each group of samples included six biological replicates (**Supplementary Table 3**). In addition, 48 entire leaf (uncut, biological) samples were collected from eight positions (**Supplementary Table 4**) and cured as controls. Cured samples were ground into fine powder. Powdered samples were then shipped to North Carolina State University for metabolic profiling. For each biological sample, three technical replicates were prepared for the non-polar and polar metabolite extractions described below.
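
The sample naming scheme used for both countries can be reproduced with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and the reading of the trailing "H" as "harvest" is an assumption based on the wording of the text.

```python
def group_labels(country, n_positions):
    """Build group labels such as U-F1H.

    country: "U" (USA) or "I" (India)
    year code: "F" (first year) or "S" (second year)
    trailing "H": assumed to stand for harvest
    """
    return [
        f"{country}-{year}{position}H"
        for year in ("F", "S")
        for position in range(1, n_positions + 1)
    ]


REPLICATES_PER_GROUP = 6  # biological replicates per group, as stated in the text

usa = group_labels("U", 7)    # seven leaf positions in North Carolina
india = group_labels("I", 8)  # eight leaf positions in India
all_groups = usa + india

print(len(all_groups))                         # number of sample groups
print(len(all_groups) * REPLICATES_PER_GROUP)  # number of cut leaf biological samples
```

With seven positions in North Carolina and eight in India over 2 years, this gives 30 sample groups and 180 cut leaf biological samples (84 from North Carolina and 96 from India).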

# Extraction of Non-Polar Metabolites

Extraction of non-polar metabolites was performed as reported previously (Ma et al., 2013). In brief, 100 mg of powdered sample was suspended in 1.5 ml of 100% hexane in a 2 ml Eppendorf tube. The extraction tube was vortexed vigorously for 1 min, sonicated for 30 min, and then placed in a 56°C water bath for an additional 50 min. The extraction tube was centrifuged at 11,000 rpm for 10 min. The resulting clean hexane supernatant was pipetted into a new clean tube. A 200 μl aliquot of hexane extract was pipetted into a 400 μl insert contained in a 2 ml glass vial (Agilent) for the gas chromatography–mass spectrometry analysis described below.

# Extraction of Polar Metabolites and Derivatization

One hundred milligrams of powdered sample was suspended in 1.5 ml of extraction solvent (methanol:chloroform:water, 5:2:2) in a 2.0 ml Eppendorf tube and then vigorously vortexed for 1 min. The extraction tube was sonicated for 30 min in a water bath and placed in a 56°C water bath for 50 min. The extraction tube was centrifuged at 11,000 rpm for 10 min. One milligram of the supernatant was transferred to a new clean 1.5 ml tube, to which 300 μl chloroform and 600 μl double-deionized H2O were added. The tube was vigorously vortexed for 1 min to mix the solvents thoroughly, followed by centrifugation at 4,000 rpm for 5 min to separate the upper water-methanol (polar) phase from the lower chloroform (non-polar) phase. One hundred microliters of the water-methanol phase was transferred to a new 1.5 ml Eppendorf tube and then evaporated in a rotary vacuum at room temperature. The residue remaining at the bottom of the tube contained polar metabolites, which were derivatized as reported previously (Ma et al., 2013). In brief, the residue was oximated using 40 μl methoxylamine hydrochloride (20 mg/ml) in anhydrous pyridine at 37°C for 2 h and then silylated at 37°C for 30 min using 70 μl of N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA). The liquid derivatized samples were pipetted into 400 μl glass vials for GC–MS analysis as reported (Duan et al., 2012), which is described below.

# Gas Chromatography–Mass Spectrometry Analysis

An Agilent 6890 gas chromatograph coupled with a 5975 MSD (Agilent Technologies, USA) was used to profile metabolites extracted from 180 cut leaf samples (84 from North Carolina and 96 from India) and 90 uncut entire leaf samples (42 from North Carolina and 48 from India) (**Supplementary Tables 1**–**4**). Prior to injection of samples, we carefully performed instrumental quality assurance and quality control (QA/QC), including but not limited to the use of new consumables, cleaning of systems, tuning, and blank solvent controls. Three technical replicates were prepared for each biological sample, of which two were analyzed. The third technical replicate was assayed for some samples when necessary. As a result, 12 or 18 repetitive assays for each leaf sample group (such as U-F1H from group I, **Supplementary Figures 3** and **4**) were carried out to profile non-polar and polar metabolites. An RTX-5 capillary column (30 m × 0.25 mm × 0.25 μm) was used to separate metabolites. The inlet was operated in splitless mode. One microliter of sample was injected to profile metabolites. The injection temperature was set at 250°C. The column oven temperature was initially set at 60°C for 2 min, then ramped to 320°C at a constant rate of 8°C/min, and held at 320°C for 2 min. Pure helium was used as the carrier gas at a flow rate of 1 ml/min. A positive electron impact ion source (70 eV) was used to ionize compounds, and mass fragments were scanned in the range of 40–800 (m/z) starting at a retention time of 4 min. We recorded total ion chromatograms for all samples. In addition, a mixture of even-numbered n-alkanes (C10–C40) purchased from Restek (Florida, USA, catalog no. 31266) was used to estimate the retention index values, which were used to deconvolute metabolite peaks as reported previously (Almaarri et al., 2010).
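
For a temperature-programmed run such as this one, retention indices are commonly estimated by linear interpolation between the two reference hydrocarbons that bracket a peak. The sketch below illustrates that general calculation and is not the authors' software routine. The function name and the retention times of the standards are hypothetical, and the linear interpolation formula is assumed to apply.

```python
from bisect import bisect_right


def linear_retention_index(rt, standards):
    """Linear retention index for a temperature-programmed run.

    rt: retention time of the peak (min)
    standards: list of (carbon_number, retention_time_min) tuples,
               sorted by retention time
    """
    times = [t for _, t in standards]
    if not times[0] <= rt <= times[-1]:
        raise ValueError("retention time falls outside the range of the standards")
    # Index of the standard eluting just after the peak (clamped to the last one).
    i = min(bisect_right(times, rt), len(times) - 1)
    (n_lo, t_lo), (n_hi, t_hi) = standards[i - 1], standards[i]
    return 100 * (n_lo + (n_hi - n_lo) * (rt - t_lo) / (t_hi - t_lo))


# Hypothetical retention times for three even-numbered standards (C10, C12, C14).
standards = [(10, 6.0), (12, 9.0), (14, 12.0)]
print(linear_retention_index(10.5, standards))
```

Because only even-numbered standards are available, the carbon-number step between bracketing standards is 2, which the interpolation handles through the `n_hi - n_lo` term.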

# Data Processing and Statistical Analysis

The GC-MS ChemStation data for metabolites detected in each assay were translated into the MassHunter data format using Agilent MassHunter GC/MS Translator Software (version B.05.02). Untargeted analysis was then carried out using the MassHunter Qualitative Analysis software (version B.06.00) for metabolite deconvolution, compound search, and identification in the NIST11 library. When the MS profile of an untargeted metabolite showed no less than 80% identity to that of a standard metabolite in the library, it was annotated as that standard. The MassHunter data for all annotated metabolites were then exported in Compound Exchange Format (.cef). All "cef" files were imported into Mass Profiler Professional (MPP, version B.12.5, Agilent) for statistical analysis.

The resulting "cef" files for all samples were imported into MPP for principal component analysis (PCA) and hierarchical clustering analysis (HCA) according to our previous report (Ma et al., 2015). For PCA, we used an eigenvector-based scaling method to perform alignment and normalization (using log2) and then to visualize metabolite profiles in all samples. The abundance of metabolites was evaluated using counts. In MPP, the minimum abundance count and the retention time tolerance across all sample sets were 5,000 and 0.05 min, respectively. These threshold values were used as a filter to determine the presence or absence of metabolites in samples. HCA is an unsupervised statistical method that groups samples into different clusters, or branches, of a hierarchical tree. The MPP processing tool aligned all the data and then normalized them using log2 for each metabolite. MPP identified the median value across all the biological samples (180 cut leaf samples) and then baselined each metabolite count value to the median value to establish the hierarchical clustering (conditional tree) and heat maps. The resulting tree showed the relationships between different groups of samples from both North Carolina and India.
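
The filtering, log2 normalization, and median baselining steps can be illustrated for a single metabolite as follows. This is a minimal sketch and does not reproduce the internal MPP implementation. Treating counts below the threshold as absent and taking the median over the remaining samples are assumptions, and the counts are hypothetical.

```python
import math
from statistics import median

MIN_ABUNDANCE = 5000  # minimum abundance count quoted in the text


def log2_baseline(counts):
    """Log2-transform one metabolite's counts and subtract the median.

    Counts below MIN_ABUNDANCE are treated as absent and returned as None
    (an assumption made for this sketch).
    """
    logged = [math.log2(c) if c >= MIN_ABUNDANCE else None for c in counts]
    present = [value for value in logged if value is not None]
    if not present:
        return logged
    baseline = median(present)
    return [None if value is None else value - baseline for value in logged]


# Hypothetical counts for one metabolite in four samples.
print(log2_baseline([8192, 16384, 32768, 1200]))
```

After baselining, a value of 1.0 means a twofold higher count than the median sample and −1.0 a twofold lower count, which is the scale on which heat map colors are typically assigned.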

# Pearson Analysis and Principal Component Analysis for Metabolites, Temperature and Precipitation in North Carolina and India Alone

Pearson analysis was performed using the SPSS software (IBM SPSS Statistics 22) to understand the potential relevance of one or two climate factors to the metabolite profiles of leaves from both India and North Carolina. Weather stations in Oxford, North Carolina, USA, and Rajahmundry, Andhra Pradesh, India, recorded local field air temperature and daily precipitation. We obtained these data from the weather stations and then calculated the average values for every week of plant growth. The variables used for the Pearson analyses included the average temperature and precipitation values in each harvest week, nicotine (a main tobacco alkaloid in leaves) peak values, and the peak values of 11 groups of metabolites (six groups of non-polar and five groups of polar metabolites, **Supplementary Table 9**). The six non-polar metabolite groups were monoterpenes, sesquiterpenes, diterpenes, triterpenes, benzenes, and polycyclic aromatic hydrocarbons (PAH). The five polar metabolite groups were sugars, amino acids, organic acids, nitrogen-containing secondary metabolites (nicotine-related compounds), and polyphenols. The level of each group was the sum of the peak values of all its individual metabolites. For example, the monoterpene level was the sum of the peak values of all individual monoterpenes. The resulting average temperature values per week, average precipitation values per week, and total peak values of each group of metabolites were input into the SPSS software for Pearson analysis.
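
The calculation behind this analysis can be sketched without SPSS: sum the peak values within a metabolite group for each harvest week, then compute the Pearson coefficient against the weekly weather averages. The function names and all numbers below are hypothetical and serve only to illustrate the procedure.

```python
import math


def group_level(peak_values):
    """Level of a metabolite group: sum of the peak values of its members."""
    return sum(peak_values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical weekly mean temperatures (°C) for four harvest weeks.
weekly_temperature = [24.0, 26.0, 28.0, 30.0]
# Hypothetical peak values of the individual metabolites of one group, per week.
weekly_peaks = [[10.0, 5.0], [12.0, 7.0], [15.0, 8.0], [17.0, 10.0]]

levels = [group_level(peaks) for peaks in weekly_peaks]
print(pearson_r(weekly_temperature, levels))
```

In this toy example the group level rises by the same amount each week as temperature does, so the coefficient is 1; real data would give values between −1 and 1.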

Eigenvector-based PCA was carried out with the JMP Pro 12 software (North Carolina State University, Raleigh, USA) to understand the potential relevance between 12 groups of metabolites (nicotine and 11 groups), average daily temperature, and average daily precipitation at the locations in both India and North Carolina. The PCA was computed on the Pearson correlation matrix, which corresponds to the classical correlation coefficient. The resulting plots were characterized by arrows that represented observations and variables simultaneously. When two variables were distant from the plot center, there were three possible outcomes. First, if the two were close to each other, they were significantly positively correlated (r close to 1). Second, if the two were orthogonal to each other, they were not correlated (r close to 0). Third, if the two were opposite to each other, they were significantly negatively correlated (r close to −1).
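
The reading of the arrows described above amounts to looking at the cosine of the angle between two variable vectors in the PC1/PC2 plane. The sketch below illustrates that reading only. The loadings are hypothetical, the cut-off `tol` is an arbitrary illustrative choice, and the cosine approximates r only for variables that lie far from the plot center.

```python
import math


def arrow_cosine(u, v):
    """Cosine of the angle between two variable arrows (2-D loading vectors)."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))


def describe(u, v, tol=0.2):
    """Rough reading of a pair of arrows; tol is an arbitrary cut-off."""
    c = arrow_cosine(u, v)
    if c > 1 - tol:
        return "positively correlated"
    if c < -1 + tol:
        return "negatively correlated"
    if abs(c) < tol:
        return "uncorrelated"
    return "intermediate"


# Hypothetical loadings of three pairs of variables on PC1 and PC2.
print(describe((0.9, 0.1), (0.8, 0.2)))    # arrows close to each other
print(describe((0.9, 0.0), (0.0, 0.9)))    # orthogonal arrows
print(describe((0.9, 0.1), (-0.9, -0.1)))  # opposite arrows
```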

# RESULTS

# Non-Polar Metabolite Profile Complexity in Leaves From North Carolina and India

We used GC-MS to complete 387 assays for 180 biological samples. These included 84 and 96 cut leaf (biological) samples collected from 2 years of field growth studies in North Carolina and India, respectively. The 84 biological samples from North Carolina were composed of seven groups of leaves and the 96 biological samples from India were composed of eight groups of leaves. Each group of samples was composed of six biological replicates. Two technical replicates for each biological sample were analyzed using GC-MS. In addition, 27 biological samples were randomly selected from the 180 samples for a third technical replicate analysis to confirm the experimental reproducibility of GC-MS. As a result, GC-MS analysis detected more than 700 different metabolite peaks from 387 extractions (**Supplementary Table 5**). A peak was considered positive when it was detected in at least 20 samples. The peak was then annotated as a compound with at least 80% identity to a standard in the library. Accordingly, 171 non-polar metabolites (**Supplementary Table 6**) were annotated from the 387 assays. Based on skeleton structure features, these metabolites were classified into nine groups: linear hydrocarbons, cyclic hydrocarbons, alcohols, aldehydes, ketones, acids, terpenes, polyphenols, and nitrogen-containing natural products.
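
The two criteria applied to each peak (detection in at least 20 samples and at least 80% identity to a library standard) can be written as a simple filter. This is a sketch of the rule as stated, not of the vendor software. The dictionary keys and the example peak table are hypothetical.

```python
MIN_SAMPLES = 20   # a peak must be detected in at least this many samples
MIN_MATCH = 80.0   # minimum identity (%) to a library standard


def annotate(peaks):
    """Return the names of peaks that pass both criteria.

    peaks: iterable of dicts with the hypothetical keys
           "name", "n_samples", and "match_percent"
    """
    return [
        peak["name"]
        for peak in peaks
        if peak["n_samples"] >= MIN_SAMPLES and peak["match_percent"] >= MIN_MATCH
    ]


# Hypothetical peak table.
peaks = [
    {"name": "nicotine", "n_samples": 387, "match_percent": 95.0},
    {"name": "unknown_1", "n_samples": 12, "match_percent": 91.0},   # too few samples
    {"name": "unknown_2", "n_samples": 150, "match_percent": 62.0},  # poor library match
]
print(annotate(peaks))
```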

All 171 annotated metabolites and their ion chromatographic peak values, growing years, and locations were used as entries for PCA in MPP. The resulting two-dimensional ordination plot showed that the first principal component (PC1) and the second principal component (PC2) accounted for 13.89% and 2.92%, respectively, of the total variance of metabolites across the 387 extracted samples from two different years and two growth locations (**Figure 1**). In this plot, the two growing years and the two locations were separated on the basis of variables consisting of 171 metabolites in 387 extracted samples. Regardless of location, the growing years were separated along the PC1 axis, indicating differential effects of growing years on the profiles of the 171 metabolites. In each year, North Carolina (USA) and India were separated along the PC2 axis, indicating that the metabolism of these 171 metabolites differed between the two growing locations.

The abovementioned 171 metabolites and their ion chromatographic peak values, growing years, and locations were also used for HCA. As a result, HCA created two dendrograms (I and II), two heat maps (I and II), and two color bars (I and II) (**Figure 2**). These graphic features allowed visualization of the clustering distance between paired samples from the two growth locations, e.g., I-F1H *vs*. U-F1H, and from the 2 years, e.g., I-F1H *vs*. I-S1H. In addition, these graphs revealed the clustering distance between sample groups from the bottom to the top of plants, e.g., I-F1H *vs*. I-F2H. These data showed the effects of two growing years and two locations on the 171 metabolite profiles from 387 extracted samples (30 groups of samples). Color bar I, heat map I, and dendrogram I showed that, based on metabolite profiles, the samples from the first year and those from the second year each clustered together, regardless of the field location in India or North Carolina. These data were consistent with the results of PCA described above and indicate that the growing years affect the non-polar metabolite complexity in plants. Heat map I and dendrogram I also showed that within each year, samples from India and samples from North Carolina were each grouped together. These results indicate that in the same year, the non-polar metabolite complexity in samples depends upon the growth location. Color bar II, heat map II, and dendrogram II categorize metabolites from high to low abundance in all samples regardless of growing locations and years. Color bar II, from deep red to deep blue, characterizes the peak abundance of the 171 metabolites shown in heat map II, each of which is coded by different colors in the direction of the black arrow (**Figure 2**). Based on color codes, we categorized the 171 metabolites into three groups (G-i, ii, and iii). G-i is highly abundant in all 387 extracted samples regardless of growing years and locations. Example metabolites of this group include squalene, vitamin E, stigmasterol, and some diterpenes [such as 4,8,13-cyclotetradecatriene-1,3-diol,4,8,13-cyclotetradecatriene-1,3-diol, 1,5,9-trimethyl-12-(1-methylethyl)- (C22H34O2), and phytol acetate]. The G-ii metabolites, coded by pink to yellow-brownish colors in the clustering, are moderately abundant and include nicotine and 1,3-bis(1-formylethyl) benzene. The G-iii metabolites, coded by yellowish to bluish colors, are of low abundance. Examples include monoterpenes (such as limonene, 1,7,7-trimethyl-bicyclo[2.2.1]hept-2-ene, and 1,2-dimethyl-3-pentyl-cyclopropane), sesquiterpenes [such as solavetivone, 9-(3,3-dimethyloxiran-2-yl)-2,7-dimethylnona-2,6-dien-1-ol, and 1S,4R,7R,11R-1,3,4,7-tetramethyltricyclo(5.3.1.0{4,11})undec-2-en-8-one], diterpenes (such as thunbergol, andrographolide, and others), triterpenes, 1-butylheptyl-benzene, and decahydro-2,3-dimethyl-naphthalene.

Figure legend (fragment): … by their abbreviations. I-F1H through I-F8H are India's samples in the first year from positions 1 to 8 harvest. U-F1H through U-F7H are USA's samples in the first year from positions 1 to 7 harvest. I-S1H through I-S8H are India's samples in the second year from positions 1 to 8 harvest. U-S1H through U-S7H are USA's samples in the second year from positions 1 to 7 harvest; N = 387: 387 extracted samples.

# Polar Metabolite Profile Complexity in Leaves From North Carolina and India

We used GC-MS to complete a total of 366 assays for 180 biological samples to analyze polar metabolites. Two technical replicates were analyzed for each biological replicate. In addition, six biological samples were randomly selected from the 180 samples for a third technical replicate to examine technical reproducibility. More than 500 metabolite peaks were detected from the 366 assays (**Supplementary Table 7**). Based on an 80% identity criterion, 225 peaks (**Supplementary Table 8**) were annotated to metabolites using the MPP software. These compounds were categorized into seven different groups: sugars, amino acids, organic acids, alcohols and ketones, N-containing natural products, hydrocarbons, and phenylpropanoids (**Supplementary Table 8**).
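The assay total follows directly from the replicate design described above. As a quick check (the variable names are illustrative, not taken from the study):

```python
# Replicate design as stated in the text.
biological_samples = 180
technical_replicates = 2      # every biological sample was assayed twice
extra_third_replicates = 6    # six randomly chosen samples got a third assay

total_assays = biological_samples * technical_replicates + extra_third_replicates
```

Here `total_assays` evaluates to 366, matching the number of assays reported.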

All 225 annotated metabolites and their ion chromatographic peak values, 366 extracted samples, two growing years, and two locations were used as variables for PCA. The resulting ordination plot showed that PC1 and PC2 accounted for 18.13 and 7.96% of the variance across all variables (**Figure 3**). Regardless of growth location, the polar metabolome of the first year's samples was separated from that of the second year's samples along the PC1 axis (**Figure 3**), indicating that growing years altered plant metabolism. In addition, samples from India and the USA were separated along the PC2 axis (**Figure 3**), indicating different metabolism of these compounds in the two growing locations. Based on this plot, the profiles of these 225 metabolites between North Carolina and India were more similar in the first year's leaf samples (U-F1H to U-F7H and I-F1H to I-F8H) than in the second year's leaf samples (U-S1H to U-S7H and I-S1H to I-S8H). Interestingly, in the second year, dynamic metabolite profile distributions were observed between the North Carolina and India samples. In the second year, metabolite profiles in seven groups of samples (U-S1H to U-S7H) from North Carolina were closely grouped together, while the profiles of these metabolites in eight groups of samples from India (I-S1H to I-S8H) were separated into three sections along the PC1 axis (**Figure 3**).

FIGURE 2 | Hierarchical cluster analysis (HCA) for two consecutive years' nonpolar metabolomes from leaves of Nicotiana tabacum K326 grown in the field of India and North Carolina. A heat map resulted from HCA using log2-fold changes of 171 metabolites during two growing seasons in India and USA. Fold changes and color (bar-I) resulted from normalization that was carried out using each log2 value of each metabolite peak value in each sample to compare its median peak value across all 387 extracted samples. Two color bars are used in the heat map. The horizontal bar-I fold color bar (15.9 to −15.9) from deep red to deep blue shows high to low levels of each metabolite in different groups of biological samples. The bar-II entity color by number passed (350.3–56.7) and the long arrow from the top to the bottom featured using deep red to deep blue color indicate the frequency of metabolites detected from the highest (deep red) to the lowest (deep blue) richness in all 387 extracted samples. For example, squalene listed in the top of the heat map was the most abundant metabolite detected in all samples, while naphthalene, decahydro-2,3-dimethyl- listed in the relative bottom with deep blue color was one of the least abundant metabolites detected in all samples. Samples were labeled by their abbreviations. I-F1H through F8H are India's samples in the first year from positions 1 to 8 harvest. U-F1H through F7H are USA's samples in the first year from positions 1 to 7 harvest. I-S1H through S8H are India's samples in the second year from positions 1 to 8 harvest. U-S1H through S7H are USA's samples in the second year from positions 1 to 7 harvest; N = 387: 387 extracted samples.

1 to 8 harvest. U-F1H through F7H are USA's samples in the first year from positions 1 to 7 harvest. I-S1H through S8H are India's samples in the second year from positions 1 to 8 harvest. U-S1H through S7 are USA's samples in the second year from positions 1 to 7 harvest. N = 366: 366 extracted samples.

The abovementioned 225 metabolites and their peak values, 366 extracted samples, two growing years, and two locations were also used as variables for HCA. As described above, HCA created two heat maps (I and II), two dendrograms (I and II), and two color bars (I and II) to visualize metabolite profiles in the 366 extracted samples (**Figure 4**). Bar I, heat map I, and dendrogram I characterized metabolite profiles in the 366 extracted samples (30 groups of samples) from the 2 years. Heat map I and dendrogram I show that the 15 groups of the first year's samples (eight from India and seven from North Carolina) are clustered together, while 13 groups of the second year's samples (seven from North Carolina and six from India) are clustered together. Two groups of samples in the second year, I-S1H and I-S2H from India, are clustered as an out-group, which is clustered together with the first year's samples. These results were in agreement with those obtained from PCA, in which these two groups of samples were relatively distant from the other 13 sample groups of the second year along the PC1 axis (**Figure 3**). These results indicate that metabolite accumulation in plants is associated with the two growing years. Based on the tree in dendrogram I, in each year, samples from India and samples from North Carolina were clustered separately. These results indicate that in the same year, growth locations control metabolic complexities. Bar II, heat map II, and dendrogram II characterized peak abundance profiles of the 225 metabolites in the 366 extracted samples regardless of years and locations. Based on color codes, we categorize the 225 metabolites into three groups, G-i: high abundance, G-ii: middle abundance, and G-iii: low abundance, which are colored by deep red to orange, orange to yellow, and blue, respectively (**Figure 4**). Examples of G-i include glutamic acid, malic acid, fumaric acid, succinic acid, aspartic acid, serine, phenylalanine, threonine, and others.
Examples of G-ii include glucose, arabinose, fructose, and others. Examples of G-iii include xylose, 1'-demethyl nicotine, and other compounds (**Figure 4**).
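The heat map captions describe the fold-change values as each metabolite's log2 peak value compared with its median peak value across all extracted samples. A minimal sketch of that normalization for a single metabolite is given below. It assumes strictly positive peak values, the function name and the numbers are hypothetical, and it is not the MPP implementation.

```python
import math
from statistics import median


def log2_fold_vs_median(peaks):
    """log2(peak / median peak) for one metabolite across all samples.

    Assumes every peak value is strictly positive.
    """
    med = median(peaks)
    return [math.log2(value / med) for value in peaks]


# Hypothetical peak values of one metabolite in three samples.
fold_changes = log2_fold_vs_median([1.0, 2.0, 4.0])
```

A value of 0 marks a sample at the median, positive values mark samples above it, and negative values mark samples below it, which is the scale the horizontal color bar encodes.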

# Association of Two Climatic Factors and 12 Groups of Metabolites in North Carolina

Both air temperature and precipitation of the two growing seasons in North Carolina were recorded daily. Average values for each leaf harvest week were calculated. The trend of average temperature values at the Oxford research station was similar in the 2 years. The values decreased from the first (Aug. 3) to the last (Oct. 4) harvest week (**Figure 5A**). The average temperature values were higher in the 1st (Aug. 3), 2nd (Aug. 14), 3rd (Aug. 24) and 5th (Sept. 14) weeks of harvest in the 1st year than in the 2nd year (**Figure 5A**). The dynamics of average weekly precipitation values were different in the 2 years of growth (**Figure 5B**). During the seven continuous weeks of harvest, the average precipitation values in the 3rd (Aug. 24), 5th (Sept. 14) and 7th (Oct. 4) weeks were relatively close, but were higher in the 1st and 4th, and lower in the 2nd and 6th week of harvest in the first year than in the second year.

FIGURE 4 | Hierarchical cluster analysis (HCA) for two consecutive years' polar metabolomes from leaves of Nicotiana tabacum K326 grown in the field of India and North Carolina. A heat map resulted from HCA using log2-fold changes of 225 metabolites during two growing seasons in India and USA. Fold changes and color (bar-I) resulted from normalization that was carried out using each log2 value of each metabolite peak value in each sample to compare its median peak value across all 366 extracted samples. Two color bars, bar-I and bar-II, are used in the heat map. The horizontal bar-I fold color bar (16.5 to −16.5) from deep red to deep blue shows high to low levels of each metabolite in different groups of biological samples. The bar-II entity color by number passed (329.6–54.4) and the long arrow from the top to the bottom featured using deep red to deep blue color indicate the frequency of metabolites detected from the highest (deep red) to the lowest (deep blue) richness in all 366 extracted samples. For example, glutamic acid listed in the top of the heat map was the most abundant metabolite detected in all samples, while xylose listed in the relative bottom with deep blue color was one of the least abundant metabolites detected in all samples. Samples were labeled by their abbreviations. I-F1H through F8H are India's samples in the first year from positions 1 to 8 harvest. U-F1H through F7H are USA's samples in the first year from positions 1 to 7 harvest. I-S1H through S8H are India's samples in the second year from positions 1 to 8 harvest. U-S1H through S7H are USA's samples in the second year from positions 1 to 7 harvest. N = 366: 366 extracted samples.

seven groups of samples, in which II and III were collected on Aug. 24, IV and V were collected on Sept. 14, and VI and VII were collected on Oct. 4.

The trends in the levels of nicotine and 11 groups of metabolites (triterpenes, monoterpenes, diterpenes, sesquiterpenes, benzene, PAH, amino acid, organic acids of the tricarboxylic acid (TCA) cycle, sugars, nitrogen containing compounds, and polyphenols) in seven groups of leaves were characterized using their peak values (**Figures 5C**, **D**; **Supplementary Figure 6**). Except for nicotine, the level of each group of metabolites was summed from its individual members; for example, the level of monoterpenes was summed from all individual monoterpenes. The resulting plots showed three types of trends in the 2 years. In the first type, the trend was similar in the 2 years but the levels of the metabolite group at each harvest date were lower in the first year than in the second year. This type included nicotine and benzenes. The trends of nicotine in the 2 years of growth were similar, from the lowest level in the group I leaves to the highest in the group VII leaves (**Figure 5C**). In comparison, the levels of nicotine in each group of leaves at each harvest date were higher in the second year than in the first year. The trends of benzenes were also similar in the 2 years. The levels of benzenes were lower in each group of leaves at each harvest date in the first year than in the second year (**Supplementary Figure 6D**). In the second type, the levels of groups of metabolites in all or most sample groups (I–VII) were higher in the first year than in the second year. These groups of metabolites included triterpenes (**Figure 5D**), monoterpenes, diterpenes, amino acids, N-containing compounds (natural products), and polyphenols (**Supplementary Figures 6A**, **C**, **F**, **I**, and **J**). The accumulation patterns of triterpenes showed a trend from the highest in the group I leaves to the lowest in the group VII samples in the 2 years (**Figure 5D**). Except for the group VI leaves, the levels of triterpenes in leaves were higher in the second year than in the first year. The levels of monoterpenes, amino acids, and polyphenols in each group of samples at each harvest date were higher in the first year than in the second year. The third type of trend, for sesquiterpenes, PAH, organic acid, and sugar (**Supplementary Figures 6B**, **E**, **G**, **H**), was dynamic in the seven groups of samples between the 2 years. Taken together, these results indicate that the metabolism of the 12 groups of metabolites responds differentially to the growing years.
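The group-level values used for these trends are sums of the peak values of the individual metabolites in each group. A minimal sketch of that aggregation is shown below. The peak values are hypothetical, the helper name is illustrative, and the group assignments follow the examples named earlier in the text.

```python
from collections import defaultdict


def sum_by_group(peak_values, group_of):
    """Total peak value per metabolite group for one sample group."""
    totals = defaultdict(float)
    for metabolite, value in peak_values.items():
        totals[group_of[metabolite]] += value
    return dict(totals)


# Group membership as given by the examples in the text.
group_of = {
    "limonene": "monoterpenes",
    "solavetivone": "sesquiterpenes",
    "thunbergol": "diterpenes",
    "andrographolide": "diterpenes",
}
# Hypothetical peak values for one group of leaves at one harvest date.
peaks = {
    "limonene": 1.5,
    "solavetivone": 4.0,
    "thunbergol": 2.0,
    "andrographolide": 3.5,
}
group_levels = sum_by_group(peaks, group_of)
```

Nicotine is left out of the mapping because it was tracked as a single metabolite rather than as a summed group.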

These 12 groups of metabolites, seven groups of samples, and 2 years were used as variables for PCA (with JUMpro software) to understand the potential correlation of growing years and metabolite profiles. The resulting two-dimensional ordination plot showed that PC1 and PC2 accounted for 41.9 and 23.8% of the total variance from 2 years in North Carolina (**Figure 6A**). In general, the seven groups of samples were separated between the first and second years. The ordination plot also showed that in each year, the ordination order of the seven groups of samples was relatively similar from I to VII. These data indicate that the positions (groups) of leaves and harvest times are associated with the profiles of the 12 groups of metabolites.

factors and 12 groups of metabolites (including nicotine) as well as correlation among 12 groups of metabolites themselves in 2 years of field growth.

These 12 groups of metabolites and two climatic factors (average temperature and precipitation values) were used as variables for PCA to understand ordination relevance. The resulting PCA plot showed an ordinate relevance between two climate factors and 12 groups of metabolites (**Figure 6B**). Temperature, precipitation, triterpene, and sesquiterpene fell in the same ordinate region, suggesting a close positive association. By contrast, nicotine, sugar, and organic acid fell in the opposite ordinate region against both temperature and precipitation, suggesting a negative ordinate relevance. Average precipitation also had a positive association with triterpene and sesquiterpene but had a negative relevance to nicotine, sugar, and organic acid. Other groups of metabolites were localized in two other ordinate regions, suggesting a potential partial relevance to these two weather factors.

Furthermore, a Pearson analysis was performed to evaluate the correlation between 12 groups of metabolites and these two climatic factors established by PCA. A bi-variate Pearson analysis revealed a linear relationship among variables (**Table 1**). The resulting data showed that temperature was positively associated with triterpene (P-value less than 0.01), sesquiterpene (P-value less than 0.05), and diterpene (P-value less than 0.05). By contrast, the resulting data showed that temperature was negatively associated with nicotine (P-value less than 0.01), sugar (P-value less than 0.01), and organic acid of TCA (P-value less than 0.05). The resulting data also showed a significantly positive association between average precipitation and triterpene (P-value less than 0.01).
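For readers unfamiliar with the statistic, the sketch below computes a Pearson correlation coefficient and the t statistic on which a two-tailed significance test is based. The temperature and nicotine numbers are hypothetical and the function names are illustrative. The study's own software is not reproduced here, and converting the t statistic to a P-value requires the t distribution with n - 2 degrees of freedom, which a statistics package would normally supply.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length, n >= 3")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero (undefined when |r| == 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical weekly mean temperatures and nicotine peak values.
temperature = [27.0, 26.0, 25.0, 23.0, 22.0, 20.0, 18.0]
nicotine = [1.0, 1.4, 1.3, 2.0, 2.2, 2.1, 3.0]
r = pearson_r(temperature, nicotine)
t = t_statistic(r, len(temperature))
```

With these made-up numbers `r` is negative, the same direction as the temperature and nicotine association reported for North Carolina.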

# Association of Two Climatic Factors and 12 Groups of Metabolites in India

As described above, daily temperature and precipitation values were recorded at the field in the two growing seasons in India. The average temperature values were calculated for each harvest week. Over the entire growing seasons, the average temperature values showed an increasing trend from Feb. 9 through March 31 in both years (**Figure 7A**). In the weeks of Feb. 9, Feb. 23, and Mar. 31, average temperature values were very close in the 2 years, while the temperature values in the other weeks of March were higher in the second year than in the first year. Regarding daily precipitation, it was mostly dry during the harvest period in the two growing years (**Figure 7B**). From Feb. 9 to March 31, it rained only once in each year, on Feb. 20 in the first year and on March 13 in the second year (**Figure 7B**).
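The weekly averages can be derived from the daily records in a straightforward way. The sketch below assumes that a harvest week means the seven days ending on the harvest date, which the text does not specify. The calendar year, the readings, and the function name are hypothetical.

```python
from datetime import date, timedelta


def weekly_mean(daily, week_end, days=7):
    """Mean of the daily readings in the `days`-day window ending on week_end.

    Days without a reading are skipped rather than treated as zero.
    """
    window = []
    for offset in range(days):
        day = week_end - timedelta(days=offset)
        if day in daily:
            window.append(daily[day])
    if not window:
        raise ValueError("no readings in the requested window")
    return sum(window) / len(window)


# Hypothetical daily mean temperatures (degrees C) for Feb. 3 to Feb. 9.
start = date(2001, 2, 3)  # the year is a placeholder, not from the study
readings = {start + timedelta(days=i): 20.0 + i for i in range(7)}
feb9_mean = weekly_mean(readings, date(2001, 2, 9))
```

The same helper applies to precipitation, although in a season with a single rain event most weekly values would be zero.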

The trends of nicotine and 11 groups of metabolites were analyzed with their total peak values in eight groups of leaves (from different positions) collected on eight dates from Feb. 9 through Mar. 31. The resulting plots showed three types of dynamic trends. The first trend was that the levels of four groups of metabolites in all groups (at all dates) or in groups V–VIII (at the later dates) were higher in the second year than in the first year. Nicotine, triterpenes, benzenes, and PAH followed this trend (**Figures 7C**, **D** and **Supplementary Figures 7D** and **E**). The nicotine levels in the eight groups of leaves showed an increasing trend in both years (**Figure 7C**). In comparison, the levels of nicotine in each group of leaves were higher in the second year than in the first year. The second trend was that the total levels of four groups of metabolites in all groups (all dates) or most groups (most dates) of samples were higher in the first year than in the second year. These groups of metabolites included monoterpenes, sesquiterpenes, diterpenes, and polyphenols (**Supplementary Figures 7A–C, J**). The third trend was that the levels of four groups of metabolites were dynamic in the eight groups of samples in the 2 years. These include amino acids, organic acids, sugars, and nitrogen-containing compounds (**Supplementary Figures 7F**, **G**, **H**, and **I**). Although the levels of sugars were dynamic, their values increased from I-F1H and I-S1H through I-F8H and I-S8H (**Supplementary Figure 7H**).

TABLE 1 | Data of Pearson correlation analysis for two weather factors and 12 groups of metabolites in North Carolina.


Bold numbers mean significant difference. \*significant correlation (2-tailed) with P-value less than 0.05, \*\*significant correlation (2-tailed) with P-value less than 0.01, NCSM, nitrogen-containing secondary metabolites.


PCA (eigenvector-based scaling) was performed to characterize the correlation of 12 groups of metabolites, 2 years, and eight groups of samples. The resulting plot showed that PC1 and PC2 accounted for 35.2 and 22.7% of the total variance from the two different years (**Figure 8A**). The eight groups of samples (I–VIII) of the first year and those of the second year each grouped together. In comparison, the grouping was tighter in the first year than in the second year. These results indicate differential effects of the two growing years on the profiles of the 12 groups of metabolites.

FIGURE 8 | Scaling principal component analysis (PCA) showing associations of metabolite profiles with growing years and two climatic factors in India. This analysis was completed with JUMpro software. (A) A plot of PCA results shows an association between two growing years and metabolite profiles in eight groups of samples. (B) A plot of PCA results shows associations between the 2 years' temperature and 12 groups of metabolites (including nicotine) as well as correlations among the 12 groups of metabolites themselves in 2 years of field growth.

Ma et al. Effects of Climate on Tobacco Metabolism

Eight average temperature values, eight average precipitation values, and 12 groups of metabolites were used as variables for PCA. The resulting PCA plot showed potential ordinate relevance between temperature and 12 groups of metabolites (**Figure 8B**). Given that most precipitation values were zero, precipitation was absent from the plot. This plot showed an opposite ordinate relevance between temperature and sugars, polyphenols, amino acids, organic acids of TCA, and nitrogen-containing secondary metabolites, suggesting a potential negative association. By contrast, this plot showed a close ordinate relevance of temperature with PAH, triterpenes, and benzene along the PC1 axis, suggesting a potential positive association. Pearson analysis was carried out using eight average temperature values, levels of 12 groups of metabolites in 2 years, and eight groups of samples. As in PCA, precipitation was not included because there was almost no rain. A bi-variate Pearson analysis revealed a linear relationship among variables (**Table 2**). The resulting correlation parameters showed that temperature was positively correlated with benzene and PAH (P-values less than 0.05), but negatively correlated with amino acid and organic acid (P-values less than 0.05).

# Leaf Nicotine Trends Are Not Altered by Two Growing Locations

Although our experimental design did not specifically target nicotine, we used untargeted data to compare nicotine profiles in leaves from India *vs*. North Carolina. The rationale is that unlike other metabolites, nicotine is the signature metabolite of tobacco, which is biosynthesized in roots and stored in leaves. The resulting data showed that although the levels of nicotine differed between the 2 years in samples from both India and North Carolina, the profile trends from the bottom to top leaves were similar (**Figures 5C**, **7C**). Pearson analysis showed a negative association between nicotine profiles and temperature in North Carolina in the 2 years. By contrast, Pearson analysis did not show a significant association between temperature and nicotine profiles in India. In addition, there was no significant association between precipitation and nicotine profiles in North Carolina. Pearson analysis could not be performed to evaluate the association between precipitation and nicotine profiles in India because there was almost no rain in the two farming seasons (**Figure 7B**). In summary, these findings show that the trends of nicotine levels from the bottom to top leaves are independent of local climates and farming locations.

# DISCUSSION

# Association of Growing Locations and Plant Metabolomes

To date, environmental changes have been reported to highly influence the sensitivity of plants to environments and plant productivity (Aggarwal, 2008; Ahmad et al., 2010; Walter et al., 2010; Gosling et al., 2011). Furthermore, numerous studies have particularly reported that global climate change can highly affect agricultural productivity and crop yield (Walter et al., 2010; Gosling et al., 2011; Bowen and Friel, 2012; Dwivedi et al., 2013). Most of this type of research has focused on single climate factors, such as temperature in a controlled condition or a specific local environment. A recent study was completed to understand the metabolic responses of rice to temperature changes (Glaubitz et al., 2015). This study used 12 rice cultivars, having different sensitivity to high temperature conditions, to understand their metabolic reactions in vegetative tissues in a given condition. The results showed that the tricarboxylic acid cycle and amino acid biosynthesis in sensitive cultivars were significantly affected by high night temperature and the levels of certain metabolites such as putrescine, spermidine, and spermine were increased in sensitive cultivars (Glaubitz et al., 2015). Another greenhouse study showed that effects of elevated CO2, elevated temperature, and water deficit alone or in combination differentially affected grape performance in both grape yield and plant growth in a cultivar-dependent manner (Kizildeniz et al., 2015). Drought under an elevated temperature was found to drastically inhibit the vegetative growth with reduced bunch fresh and dry weights of two grape cultivars. Increased carbon dioxide, elevated temperature, and drought were found to reduce the total polyphenolics (Kizildeniz et al., 2015). Similar studies have been completed to understand the effects of single or combined climate factors on other crop growth and production (Sengar et al., 2015). 
However, studies on plant metabolomes responsive to global climate changes are limited. Moreover, the effects of real field climate on crop metabolism in the farming field remain largely open for investigation. In the study reported herein, the objective was to use the globally commercialized K326 tobacco cultivar to understand the effects of real field environment conditions on the plant metabolome. We selected two research station fields in two completely different environmental conditions, India and the USA. The planting of K326 and field management exactly followed location-specific farming protocols. Using untargeted metabolomics to profile metabolites, we were able to annotate 171 non-polar metabolites from 387 assays and 225 polar metabolites from 366 assays. Further PCA and HCA using these metabolites gave the same result: although plants were grown on two continents, the annotated non-polar and polar metabolomes were primarily grouped and clustered by growing years (**Figures 1**, **2**, **3**, and **4**). These results indicated that the two growing years played a primary role in the differentiation of metabolomes. Moreover, plants were grown in the same place and managed with the same field protocols during both years. Both Pearson analysis and PCA revealed positive or negative relevance between temperature and six groups of metabolites in the two years in North Carolina (**Table 1**). In addition, precipitation in North Carolina was also significantly associated with triterpenes (**Table 1**). These results suggest that climate changes between the 2 years are the main factor associated with the observed plant metabolism differentiation.

# Effects of Field Air Temperature on Metabolome

All environmental factors, such as temperature, precipitation, intensity of light, photoperiod, soil condition, wind, pests, diseases, ultraviolet lights, nutrients, soil moisture, and others can significantly affect plant growth and metabolism in the field. Each environmental factor plays a significant role. Therefore, metabolite profile data obtained from 2 years of investigation actually resulted from the comprehensive effect of all field factors in two growing locations. We understood that data for all environmental factors were essential to understand their effects on tobacco metabolism. Although it was difficult to track accurate data of all factors, we could obtain accurate values (hourly and daily) for two weather factors, air temperature and precipitation, from local weather stations. Therefore, both air temperature and precipitation of the two growing seasons were used to understand their association with metabolite profiles.

TABLE 2 | Data of Pearson correlation analysis for temperature and 12 groups of metabolites in India.

Bold numbers mean significant difference. \*significant correlation (2-tailed) with P-value less than 0.05, \*\*significant correlation (2-tailed) with P-value less than 0.01, NCSM, nitrogen-containing secondary metabolites.

Temperature has been reported to be a primary factor that can affect plant metabolism. A single temperature factor under controlled conditions has been shown to affect plant metabolism in different ways. A genome-wide study of *Arabidopsis* indicated metabolic network changes caused by different temperature conditions (Topfer et al., 2013). The increase of certain amino acids during seed imbibition of *Ricinus communis* was reported to be responsive to a temperature increase to 35°C (Ribeiro et al., 2015). Furthermore, reprogramming of the metabolome was reported to occur under temperature stress. For example, the central carbohydrate metabolism can be regulated by temperature stress (Guy et al., 2008). In our study, we used Pearson analysis to characterize the association of the field air temperatures in two different geographical areas with leaf metabolome composition. To understand the potential association of temperature and metabolome, we selected nicotine and 11 groups of metabolites (**Figures 5** and **7**, **Supplementary Figures 6** and **7**, and **Supplementary Table 9**). The 11 groups included both plant primary and secondary metabolites (**Supplementary Table 9**). The total level of each group at each harvest date was summed from individual metabolites and then used for characterization of level trends, PCA, and Pearson analysis. The resulting data allowed evaluating the association of metabolite groups with field air temperature. We selected nicotine as a representative metabolite, because this health-associated alkaloid is the signature metabolite of tobacco. The resulting trend data from each harvest time, PCA, and Pearson analysis consistently showed that the levels of nicotine were negatively associated with field air temperatures in the 2 years of growth in North Carolina (**Figures 5A**, **C**, **6B**, **Supplementary Figure 6D**, and **Table 1**).
By contrast, although the trends of temperature and nicotine levels in India (**Figures 7A**, **C**) were similar to those in North Carolina, the results from PCA and Pearson analysis showed a potentially partial association (**Figure 8B** and **Table 2**). This observation was consistent with our previous data reporting nicotine differentiation in tobacco samples from India, USA, and Brazil (Ma et al., 2013).

In addition, we observed associations between temperature and other metabolites. It was interesting to note that the total level of organic acids of the TCA cycle (**Supplementary Table 8**) was negatively associated with temperature in both North Carolina and India (**Tables 1** and **2**). One of the reduced organic acids is malonic acid, an intermediate in the TCA cycle (Fernie et al., 2004). The reduction of this acid and other acids of TCA by high temperature was also observed in soybean (Sicher, 2013). These data suggest that the accumulation of organic acids of TCA is associated with the field air temperature. Moreover, we observed different responses of metabolites to temperature in the two locations. In North Carolina, temperature was positively associated with sesquiterpenes, diterpenes, and triterpenes, but negatively associated with sugars (**Table 1**). In India, temperature was positively associated with PAH and benzene but was negatively associated with amino acids. These differences in response likely result from multiple environmental factors in the field. Experimental evidence has shown that a combination of environmental factors such as heat shock and drought can lead to multiple alterations in plant metabolism, as in photosynthesis and enzyme activity (Rizhsky et al., 2002). Extreme temperature (together with water deficit) and high solar radiation were reported to strongly affect grapevine growth, which further led to negative effects on fruit and wine quality (Teixeira et al., 2014). A field study in China revealed that different field conditions also affected the profiles of polar metabolites, such as flavonoids, in tobacco leaves (Li et al., 2014). In the 2 years of our field experiments, it rained only once during each year in India. We hypothesize that this extreme precipitation pattern can modify the effects of temperature on plant metabolism in India. In summary, temperature conditions play a primary role in regulating plant metabolism in the field. We further believe that as more field experiments are performed, continuous documentation of data will enhance the understanding of the combined effects of temperature and other factors on plant metabolism.

# Effects of Precipitation on Metabolome

Global changes in precipitation regimes have been continuously documented (Keller, 2007; Schnoor, 2007). These precipitation changes are associated with different factors, such as rising temperature (Le Roux et al., 2013; Vlam et al., 2014). Variation in precipitation patterns, such as water deficit (drought), has been reported to lower photosynthetic rates and plant growth and to alter other physiological features (Munne-Bosch and Penuelas, 2003; Tezara et al., 2003). In comparison, the effects of variation in precipitation patterns, such as drought, on metabolite accumulation remain open for study. In this study, we recorded daily precipitation and calculated average precipitation during the growing seasons in two consecutive years. The weather was dry in India during these 2 years, and the field received rain only once during the growing season in each year (**Figure 7B**). Consequently, we could analyze the relationship between precipitation and metabolomes only for the North Carolina samples. The weekly average and daily precipitation values were used for PCA and Pearson analysis, respectively. The PCA showed a positive association among average precipitation, temperature, triterpenes, and sesquiterpenes (**Figure 6B**). The Pearson analysis showed a significant association between daily precipitation and triterpenes. These data provide evidence that precipitation can affect plant metabolism, although precipitation is hard to predict from year to year.
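The Pearson analysis mentioned above can be illustrated with a minimal Python sketch. It is not the analysis pipeline used in this study; the function name `pearson_r` and the example values are hypothetical and serve only to show how a correlation coefficient between daily precipitation and a metabolite level would be computed.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the centered values.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not data from this study).
daily_precip_mm = [0.0, 2.5, 5.1, 0.0, 12.7, 7.6]
triterpene_level = [1.0, 1.4, 1.9, 1.1, 3.2, 2.4]
r = pearson_r(daily_precip_mm, triterpene_level)
```

A coefficient close to +1 or −1 indicates a strong linear association; assessing its significance would additionally require a test that accounts for the sample size.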

# Leaf Nicotine Trends Independent of Environmental Factors

Nicotine is the main signature metabolite of tobacco. It is formed in the roots, transported to above-ground tissues, and stored in the leaves. Our 2 years of study reveal that the trend of nicotine levels is independent of climate. Although Pearson analysis and PCA showed positive and negative associations between nicotine and temperature in North Carolina and India, respectively, the trend of nicotine levels was not altered across the 2 years of experiments. In both fields, the bottom leaves had the lower nicotine levels, while the top leaves had the higher nicotine levels (**Figures 6C** and **7C**). In contrast, the level trends of 11 other groups of metabolites in different groups of samples were dynamic in tobacco leaves from both North Carolina and India. These results indicate that the trend of nicotine levels is stably controlled by tobacco plants. The mechanism behind this trend stability of nicotine remains uninvestigated, although its biosynthetic pathway has been studied intensively (Xie and Fan, 2016; Kajikawa et al., 2017). Furthermore, it has been widely documented that nicotine is exclusively biosynthesized in roots and then transported to leaves for storage. In our study, we used standardized tobacco farming protocols to grow plants in the field. Therefore, the use of fertilizer and the time of fertilization were consistent in the 2 years. Based on these observations, we hypothesize that although temperature and other environmental factors can affect nicotine levels in leaves, weather factors might have limited effects on the trend of nicotine levels controlled by the root-specific biosynthesis. This finding suggests that it will be valuable to examine the effects of environmental factors on the formation of root-specific metabolites in the future.

# CONCLUSION REMARKS

Herein, we integrate unbiased untargeted metabolomics with standard farming practices followed for commercial cultivation of tobacco to understand the effects of field environments on plant metabolism. From the large number of peaks detected, we annotated 171 non-polar and 225 polar metabolites and grouped them into different classes. Nicotine, the main tobacco alkaloid, is one of the non-polar metabolites and was used as a signature metabolite in our data analysis. PCA and HCA indicated that two different years and two different geographical locations played primary and secondary roles in controlling metabolite complexity. Field air temperature was characterized as one of the main factors associated with profiles of several main groups of metabolites. We further used nicotine as a signature metabolite to indicate that its profile trend in different leaves, controlled by root-specific biosynthesis, is independent of climate factors. This pilot study provides useful findings to enhance the understanding of the effects of two diverse climate factors on plant metabolism across two continents.

# DATA AVAILABILITY STATEMENT

The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.

# AUTHOR CONTRIBUTIONS

D-YX and SG conceived this research and designed all experiments in North Carolina and India, respectively. D-YX and SG also participated in sample collection and collected weather data in North Carolina and India, respectively. D-YX participated in data analysis and figure preparation, and drafted and finalized this manuscript. D-MM managed plant growth in the field in North Carolina, collected samples, performed metabolite extraction, GC-MS analysis, and other lab experiments, analyzed data, and prepared figures and the manuscript. CH participated in field design and management of plant growth and was involved in sampling and metabolite extraction. RM participated in field design, management, and sampling in India.

# ACKNOWLEDGMENTS

This project was supported by Life Sciences and Technology Centre, Bangalore of ITC Limited, India.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.01370/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Ma, Gandra, Manoharlal, La Hovary and Xie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Contribution of Functional Divergence Through Copy Number Variations to the Inter-Species and Intra-Species Diversity in Specialized Metabolites

*Kazumasa Shirai and Kousuke Hanada\**

Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, Fukuoka, Japan

#### Edited by:

Takayuki Tohge, Nara Institute of Science and Technology (NAIST), Japan

#### Reviewed by:

Dirk Walther, Max Planck Institute of Molecular Plant Physiology, Germany Qing Liu, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia

> \*Correspondence: Kousuke Hanada kohanada@bio.kyutech.ac.jp

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 18 July 2019 Accepted: 08 November 2019 Published: 26 November 2019

#### Citation:

Shirai K and Hanada K (2019) Contribution of Functional Divergence Through Copy Number Variations to the Inter-Species and Intra-Species Diversity in Specialized Metabolites. Front. Plant Sci. 10:1567. doi: 10.3389/fpls.2019.01567

There is considerable diversity in the specialized metabolites within a single plant species (intra-species) and among different plant species (inter-species). The functional divergence associated with gene duplications largely contributes to the inter-species diversity in the specialized metabolites, whereas the intra-species diversity is attributed to gene dosage changes via gene duplications [i.e., copy number variants (CNVs)]. This is because CNVs are thought to be associated with less functional divergence at the intra-species level of evolution. However, functional divergence caused by CNVs may induce specialized metabolite diversity at the intra-species and inter-species levels of evolution. We herein discuss the functional divergence of CNVs in metabolic quantitative trait genes (mQTGs). We focused on 5,654 previously identified mQTGs in 270 *Arabidopsis thaliana* accessions. The ratio of nonsynonymous to synonymous variations tends to be higher for mQTGs with CNVs than for mQTGs without CNVs within *A. thaliana* accessions, suggesting that CNVs are responsible for the functional divergence among mQTGs at the intra-species level of evolution. To evaluate the contribution of CNVs to inter-species diversity, we calculated the ratio of nonsynonymous to synonymous substitutions in the *Arabidopsis* lineage. The ratio tends to be higher for the mQTGs with CNVs than for the mQTGs without CNVs. Additionally, we determined that mQTGs with CNVs are subject to positive selection in the *Arabidopsis* lineage. Our data suggest that CNVs are closely related to functional divergence contributing to adaptations via the production of diverse specialized metabolites at the intra-species and inter-species levels of evolution.

Keywords: specialized metabolite, adaptation, Arabidopsis, copy number variant, gene duplication

# INTRODUCTION

Plants produce various specialized metabolites, the diversity of which is closely related to adaptive evolution (Pichersky and Lewinsohn, 2011). Specialized metabolites vary among different species as well as within single species (Chan et al., 2010; Weigel, 2012; Carreno-Quintero et al., 2013; Alseekh et al., 2015; Matsuda et al., 2015). The diversity of the specialized metabolites resulted from gene duplications among various plant species. We previously revealed that copy number variants (CNVs) derived from gene duplications are associated with specialized metabolites (Shirai et al., 2017).

Gene duplications contribute to the diversity in specialized metabolites because of two possible effects. The first effect is functional divergence. After gene duplication events, the copied genes tend to accumulate nonsynonymous mutations because of relaxed selection pressures (Scannell and Wolfe, 2008). Consequently, the copied genes induce functional divergence (Ohno, 1970), ultimately leading to the variability in the specialized metabolites among various plant species (Hanada et al., 2008; Kliebenstein and Osbourn, 2012; Panchy et al., 2016). The second effect involves gene dosage changes. Specifically, gene duplications increase gene dosage (Ohno, 1970). In particular, CNVs are believed to be the main cause of intra-species gene dosage changes (Zmienko et al., 2014). There is experimental evidence that the abundance of specialized metabolites within a single species is critically controlled by altered gene dosages due to CNVs (Kliebenstein, 2001). However, it remains unclear whether CNVs associated with specialized metabolites tend to induce functional divergence at the genomic scale.

We herein discuss the functional divergence of CNVs associated with specialized metabolites. For this discussion, we performed additional analyses involving our previously published data. On the basis of the analyses, we propose that CNVs induce functional divergence that generates various specialized metabolites during the evolution of *A. thaliana*.

# Functional Divergence of CNVs at the Intra-Species Level of Evolution

It is believed that CNVs mainly cause quantitative changes rather than qualitative changes (Zmienko et al., 2014), likely because of an insufficient amount of time for CNVs to accumulate nonsynonymous mutations leading to the diversity in specialized metabolites. However, several studies have identified a few nonsynonymous mutations responsible for the functional divergence of genes related to specialized metabolites (Chye et al., 2000; Yu et al., 2015; Bunsupa et al., 2016). These reports suggest CNVs may induce functional divergence.

To examine the functional divergence of duplicated genes, it is useful to assess the selection pressure based on the ratio of the nonsynonymous mutation/substitution rate to the synonymous mutation/substitution rate (Hanada et al., 2009). High and low values of this ratio are associated with functional divergence and constraint, respectively. Therefore, we estimated the dNSNP/dSSNP ratio, which is the ratio of the number of nonsynonymous mutations per nonsynonymous site (dNSNP) to the number of synonymous mutations per synonymous site (dSSNP) (Nei and Gojobori, 1986; Hanada et al., 2009), for 27,130 annotated protein-coding genes in 270 *A. thaliana* accessions.

The single nucleotide polymorphism (SNP) data examined in this study were compiled from 270 *A. thaliana* accessions analyzed in several studies (http://1001genomes.org, 1001 Genomes; Mouille et al., 2006; Cao et al., 2011; Gan et al., 2011; Schmitz et al., 2013; Shirai et al., 2017). A total of 7,624,270 SNPs were included. For each accession, nonsynonymous and synonymous variations were annotated according to the TAIR10 database with the SnpEff program (https://www.arabidopsis.org; Cingolani et al., 2012). Of the 7,624,270 SNPs, 1,330,920 were located in 27,130 annotated protein-coding genes in the reference *A. thaliana* genome. Across the 270 accessions, the 1,330,920 SNPs were classified as 733,796 nonsynonymous and 597,124 synonymous mutations in the 27,130 coding sequences. There was an average of 27 nonsynonymous and 22 synonymous mutations in each of the 27,130 coding sequences. Because the number of nonsynonymous and synonymous sites varies among codons, we calculated the number of synonymous and nonsynonymous sites in all 27,130 coding sequences with scripts that we developed following the Nei-Gojobori method (Nei and Gojobori, 1986; Hanada et al., 2009). The 27,130 genes were classified as metabolic quantitative trait genes (mQTGs) and mQTGs with CNVs. We previously predicted 5,654 mQTGs for 1,335 specialized metabolites in *A. thaliana* (Shirai et al., 2017). In that study, mQTGs were detected by combining a genome-wide association study (GWAS) and a metabolite-transcriptome correlation analysis (MTCA). This method enabled the prediction of mQTGs with a lower false positive rate than that of the general GWAS method. Genes with CNVs were previously detected by comparing genomic read counts among *A. thaliana* accessions (Gan et al., 2011). Of the 27,130 genes, 929 were predicted as genes with CNVs (*P* < 0.05).
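As a rough illustration of the site-counting step, the following Python sketch counts synonymous and nonsynonymous sites per codon in the spirit of the Nei-Gojobori method and then forms the dNSNP/dSSNP ratio for one gene. It is not the authors' script. The function names are hypothetical, the standard genetic code is assumed, and single-base changes that create a stop codon are simply counted as nonsynonymous, which is a simplifying assumption that other implementations may handle differently.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third base varies fastest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_sites(cds):
    """Return (synonymous_sites, nonsynonymous_sites) for an in-frame CDS."""
    syn_sites = 0.0
    n_codons = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3].upper()
        aa = CODON_TABLE.get(codon)
        if aa is None or aa == "*":
            continue  # skip ambiguous codons and stop codons
        n_codons += 1
        for pos in range(3):
            # The three possible single-base changes at this codon position.
            changes = [codon[:pos] + b + codon[pos + 1:]
                       for b in BASES if b != codon[pos]]
            # Fraction of those changes that keep the amino acid unchanged.
            syn_sites += sum(CODON_TABLE[c] == aa for c in changes) / 3.0
    return syn_sites, 3 * n_codons - syn_sites


def dn_ds_ratio(n_nonsyn, n_syn, cds):
    """dNSNP/dSSNP for one gene; None when the ratio is undefined."""
    syn_sites, nonsyn_sites = count_sites(cds)
    if n_syn == 0 or syn_sites == 0 or nonsyn_sites == 0:
        return None
    return (n_nonsyn / nonsyn_sites) / (n_syn / syn_sites)
```

For example, the codon TTA (Leu) contributes 2/3 of a synonymous site, because one of the three possible changes at the first position and one of the three at the third position leave the amino acid unchanged.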

To assess whether the functional divergence of CNVs is associated with the diversity in specialized metabolites, we compared the dNSNP/dSSNP ratios among mQTGs, mQTGs with CNVs, and randomly selected genes (**Figure 1A** and **Supplementary Table S1**). The dNSNP/dSSNP ratios were significantly higher for the mQTGs and mQTGs with CNVs than for the randomly selected genes (Wilcoxon rank sum test: *P* < 0.001; **Figure 1A**). Moreover, the ratios of mQTGs with CNVs were also significantly higher than the ratios of mQTGs (Wilcoxon rank sum test: *P* < 0.001; **Figure 1A**). These results imply that nonsynonymous variations tend to accumulate in mQTGs more frequently than in genes not associated with specialized metabolites. Specifically, mutations that alter the amino acid sequence accumulated in mQTGs with CNVs at a higher rate than in mQTGs without CNVs. These findings suggest that the diversity in specialized metabolites due to CNVs is the result of the functional divergence of mQTGs in addition to gene dosage changes at the genomic scale. Additionally, mQTGs with CNVs tended to be associated with a larger number of specialized metabolites than mQTGs without CNVs (Wilcoxon rank sum test: *P* = 1.92 × 10⁻³; **Supplementary Figure S1**), implying that the functional divergence derived from CNVs enhances the divergence of specialized metabolites.
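The group comparisons above rely on the Wilcoxon rank sum test. The sketch below shows one way such a test can be computed in plain Python, using the normal approximation with mid-ranks for ties. It is only an illustration: the function name and the example ratios are hypothetical, and the software and settings used for the published *P* values are not specified here.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum (Mann-Whitney U) test, normal approximation.

    Returns (U statistic for x, two-sided p-value). Tied values receive
    mid-ranks and the variance is tie-corrected; no continuity correction
    is applied.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        mid_rank = (i + 1 + j) / 2.0   # average of ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical dNSNP/dSSNP ratios for illustration only.
cnv_ratios = [0.62, 0.71, 0.55, 0.80, 0.67]
other_ratios = [0.31, 0.42, 0.28, 0.50, 0.36]
u, p = rank_sum_test(cnv_ratios, other_ratios)
```

The normal approximation is adequate for the thousands of genes compared in a genome-wide analysis; for very small samples an exact test would be preferable.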

It was unclear whether CNVs induce functional divergence for mQTGs only or for other genes as well. Therefore, we compared the dNSNP/dSSNP ratios of randomly selected genes and non-mQTGs with CNVs. The dNSNP/dSSNP ratios were significantly higher for the non-mQTGs with CNVs than for the randomly selected genes (Wilcoxon rank sum test: *P* < 2.2 × 10⁻¹⁶; **Supplementary Figure S2**). Thus, CNVs are generally responsible for the functional divergence of genes at the intra-species level of evolution.

**Abbreviations:** CNV, copy number variant; mQTG, metabolic quantitative trait gene; NI, neutrality index; PacBio, Pacific Biosciences; SNP, single nucleotide polymorphism.

# Functional Divergence of CNVs at the Inter-Species Level of Evolution

It was recently reported that CNVs are associated with various phenotypic differences within a plant species (Lye and Purugganan, 2019). By contrast, the contribution of CNVs to inter-species diversity remains relatively unknown in plants.

The dNSNP/dSSNP ratio indicates the intra-species level of evolution. Therefore, to characterize the functional divergence of CNVs at the inter-species level of evolution, we examined the KA/KS ratio, which is the ratio of the number of nonsynonymous substitutions per nonsynonymous site (KA) to the number of synonymous substitutions per synonymous site (KS). The KA/KS ratio was estimated for 20,498 orthologs between *A. thaliana* and *Arabidopsis lyrata*. These orthologs were detected based on the reciprocal best hit (E-value < 1.0 × 10⁻³ and coverage > 90%) of a BLASTP (version 2.8.1) analysis of *A. thaliana* and *A. lyrata* (https://www.arabidopsis.org, TAIR10; http://genome.jgi.doe.gov, Phytozome v12: Alyrata\_384\_v2.1; Rawat et al., 2015; Boratyn et al., 2013). The coding sequences were aligned according to the amino acid sequences aligned by MAFFT (version 7.407) (Katoh and Standley, 2013). To evaluate the functional divergence between *A. thaliana* and *A. lyrata*, the nonsynonymous and synonymous substitutions in the 20,498 orthologs were counted. The KA/KS ratio was calculated according to Yang and Nielsen's method in the "yn00" program of PAML (version 4.8a) (Yang and Nielsen, 2000; Yang, 2007).
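The reciprocal best hit criterion can be sketched as follows in Python. This is a minimal illustration rather than the pipeline used here: the hits are assumed to be already parsed into tuples of (query, subject, E-value, coverage, bit score), the "best" hit is assumed to be the one with the highest bit score, and the gene identifiers in the example are hypothetical.

```python
def best_hits(hits, max_evalue=1.0e-3, min_coverage=0.90):
    """Map each query to its highest-scoring subject among hits passing the filters."""
    best = {}
    for query, subject, evalue, coverage, bitscore in hits:
        # Keep only hits with E-value < threshold and coverage > threshold.
        if evalue >= max_evalue or coverage <= min_coverage:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **filters):
    """Return sorted (a, b) pairs that are each other's best hit."""
    fwd = best_hits(a_vs_b, **filters)
    rev = best_hits(b_vs_a, **filters)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical hits: (query, subject, E-value, coverage, bit score).
ath_vs_aly = [
    ("ath1", "aly1", 1e-50, 0.95, 200.0),
    ("ath1", "aly2", 1e-20, 0.92, 90.0),
    ("ath2", "aly2", 1e-30, 0.80, 120.0),   # fails the coverage filter
    ("ath3", "aly3", 1e-40, 0.99, 150.0),
]
aly_vs_ath = [
    ("aly1", "ath1", 1e-50, 0.95, 200.0),
    ("aly2", "ath1", 1e-20, 0.93, 90.0),
    ("aly3", "ath2", 1e-45, 0.97, 160.0),   # aly3 prefers ath2, not ath3
    ("aly3", "ath3", 1e-40, 0.99, 150.0),
]
orthologs = reciprocal_best_hits(ath_vs_aly, aly_vs_ath)
```

In this toy example only the pair (ath1, aly1) is retained, because ath3 and aly3 are not each other's best hit.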

We compared the KA/KS ratios of mQTGs, mQTGs with CNVs, and randomly selected genes (**Figure 1B** and **Supplementary Table S1**). The mQTGs were found to have significantly higher KA/KS ratios than the randomly selected genes (Wilcoxon rank sum test: *P* < 0.001; **Figure 1B**), indicating that functional divergence was more commonly detected for mQTGs than for the other genes. Additionally, the KA/KS ratios were higher for mQTGs with CNVs than for mQTGs and randomly selected genes (Wilcoxon rank sum test: *P* < 0.001; **Figure 1B**), suggesting that CNVs enhanced the functional divergence of mQTGs between *A*. *thaliana* and *A*. *lyrata*.

# Selection Pressure for CNVs in a Species Lineage

A strong positive selection decreases the nucleotide diversity around target sites throughout the genome (i.e., selective sweep). The mQTGs with CNVs are more frequently affected by a selective sweep than the other genes in *A*. *thaliana* accessions (Shirai et al., 2017). This suggests that CNVs contribute to local adaptations at the intra-species level of evolution. The results of the present study suggest that CNVs contribute to the functional divergence of mQTGs at the inter-species and intra-species levels. However, it remains unclear whether positive or relaxed selection pressure controls mQTGs with CNVs at the inter-species level of evolution.

In earlier investigations, determining the selection pressure generally involved comparisons between variations at the inter-species and intra-species levels of evolution (McDonald and Kreitman, 1991; Rand and Kann, 1996; Smith and Eyre-Walker, 2002; Stoletzki and Eyre-Walker, 2011). These studies compared the number of nonsynonymous mutations (Pn), the number of synonymous mutations (Ps), the number of nonsynonymous substitutions (Dn), and the number of synonymous substitutions (Ds). The neutrality index [NI; i.e., (Pn/Ps)/(Dn/Ds)] is one of the parameters for comparing the variations and inferring the selection pressure (Rand and Kann, 1996). The NI quantifies the direction and extent of the departure from neutrality, under which Pn/Ps equals Dn/Ds. That is, an NI of 1 means the intra-species and inter-species functional divergences are the same. Additionally, NI < 1 and NI > 1 reflect greater inter-species and intra-species functional divergences, respectively. Moreover, NI < 1 and NI > 1 represent the effects of positive and negative selection, respectively. We calculated the NI based on the variations of mQTGs with CNVs within *A. thaliana* accessions (intra-species) and between *A. thaliana* and *A. lyrata* (inter-species) among 20,214 genes. The Pn and Ps were estimated according to the SNPs of the 270 accessions (dNSNP/dSSNP calculation). The Dn and Ds were estimated based on the substitutions of the orthologs between *A. thaliana* and *A. lyrata* (KA/KS calculation).
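The definition of the NI translates directly into code. The short Python sketch below, with a hypothetical function name and made-up counts, shows the calculation for a single gene and how the result would be read under the interpretation given above.

```python
def neutrality_index(pn, ps, dn, ds):
    """NI = (Pn/Ps) / (Dn/Ds); returns None when a denominator would be zero."""
    if ps == 0 or dn == 0 or ds == 0:
        return None
    return (pn / ps) / (dn / ds)


# Hypothetical counts for three genes, for illustration only.
ni_neutral = neutrality_index(10, 20, 5, 10)    # Pn/Ps equals Dn/Ds
ni_positive = neutrality_index(5, 10, 20, 10)   # excess nonsynonymous substitutions
ni_negative = neutrality_index(20, 10, 5, 10)   # excess nonsynonymous polymorphisms
```

Here `ni_neutral` is 1, `ni_positive` is below 1 (greater inter-species divergence), and `ni_negative` is above 1 (greater intra-species divergence). Genes with no synonymous polymorphisms or no substitutions yield an undefined NI, which is one reason an alternative statistic can be useful when counts are small.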

We found that mQTGs and mQTGs with CNVs tend to have a lower NI than the randomly selected genes (Wilcoxon rank sum test: *P* < 0.05; **Figure 1C** and **Supplementary Table S1**). These results indicate that mQTGs and mQTGs with CNVs enhanced the inter-species functional divergence over the intra-species functional divergence. To address whether mQTGs with CNVs are associated with positive selection due to functional divergence, we examined the proportion of mQTGs with CNVs in positively selected genes and in other genes. We defined positively selected genes as genes with NI < 1 and a significant difference between Pn/Ps and Dn/Ds (false discovery rate < 0.05 according to the chi-squared test; **Supplementary Table S1**). The proportion of mQTGs with CNVs was significantly higher for positively selected genes (0.37% = 4/1,076) than for the other genes (0.23% = 45/19,138) (chi-squared test: *P* = 2.42 × 10⁻⁴⁹; **Supplementary Table S2**). These results imply that CNVs tend to be contained in the mQTGs related to the adaptive evolution of *A. thaliana*.

The NI reportedly leads to the incorrect determination of natural selection when there is an insufficient number of substitutions and mutations (Stoletzki and Eyre-Walker, 2011). Therefore, we validated the inferred selection pressure based on the direction of selection (DoS) (Stoletzki and Eyre-Walker, 2011). The DoS was defined as Dn/(Dn + Ds) − Pn/(Pn + Ps). Additionally, DoS > 0 and DoS < 0 represent the effects of positive and negative selection, respectively. We defined positively selected genes as genes with DoS > 0 and a significant difference between Pn/Ps and Dn/Ds (false discovery rate < 0.05 according to the chi-squared test; **Supplementary Table S1**). We examined the proportion of mQTGs with CNVs in positively selected genes and in other genes. Similar to the results of our analyses of NI, the proportion of mQTGs with CNVs was significantly higher for positively selected genes (0.37% = 4/1,090) than for the other genes (0.23% = 46/19,440) (chi-squared test: *P* = 6.12 × 10⁻⁴⁶; **Supplementary Table S3**). Thus, the DoS analysis supported the NI results.
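The DoS and the per-gene test of Pn/Ps against Dn/Ds can be sketched as follows in Python. The sketch assumes a Pearson chi-squared test without continuity correction on the 2 × 2 table of counts; whether a correction was applied in the published analysis is not stated, and the false discovery rate adjustment across genes is omitted. Function names and counts are hypothetical.

```python
import math


def direction_of_selection(pn, ps, dn, ds):
    """DoS = Dn/(Dn + Ds) - Pn/(Pn + Ps); None if either class has no counts."""
    if dn + ds == 0 or pn + ps == 0:
        return None
    return dn / (dn + ds) - pn / (pn + ps)


def chi2_2x2(pn, ps, dn, ds):
    """Pearson chi-squared test (1 d.f., no continuity correction) on the
    table [[Pn, Ps], [Dn, Ds]]. Returns (statistic, p-value) or None."""
    n = pn + ps + dn + ds
    row1, row2 = pn + ps, dn + ds
    col1, col2 = pn + dn, ps + ds
    if 0 in (row1, row2, col1, col2):
        return None  # the test is undefined when a margin is empty
    stat = n * (pn * ds - ps * dn) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 d.f., the upper tail of the chi-squared distribution equals
    # the two-sided tail of a standard normal at sqrt(stat).
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts for one gene, for illustration only.
dos = direction_of_selection(pn=10, ps=10, dn=30, ds=10)
stat, p_value = chi2_2x2(pn=10, ps=10, dn=30, ds=10)
```

Unlike the NI, the DoS remains defined when one of the four counts is zero, provided each of the polymorphism and divergence classes contains at least one observation.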

# Conclusion and Perspectives

The current study examined the relationship between CNVs and the functional divergence of mQTGs at the inter-species and intra-species levels of evolution (**Figure 2**). Gene duplications induce nonsynonymous mutations *via* relaxed selection pressures. The CNVs derived from gene duplications seem to have accelerated nonsynonymous mutations. Thus, the mQTGs with CNVs have a high functional divergence at the intra-species level of evolution. Additionally, this intra-species functional divergence increases the inter-species functional divergence of the mQTGs. In fact, the functional divergence of mQTGs with CNVs tends to be high between *A. thaliana* and *A. lyrata*. Therefore, CNVs contribute to the functional divergence related to the diversity in specialized metabolites at the inter-species and intra-species levels. Consequently, CNVs tend to contribute to adaptations at the inter-species and intra-species levels. We propose that CNV is an important adaptive mechanism for generating diverse specialized metabolites in plants.

Our analyses are based on SNP calling with short-read sequencing. When SNPs are predicted in genes with CNVs based on the short reads, the SNPs are detected in the representative sequence of copied genes. The SNP detection over- or under-estimates the number of SNPs depending on the number of copied genes. In this study, we focused only on the rate of nonsynonymous or synonymous mutations. It is unlikely that the miscalling of SNPs is biased between nonsynonymous and synonymous mutations. Therefore, we believe that the effect of miscalling on our analyses is limited.

A. thaliana, mQTGs tend to have copy number variants (CNVs) because of gene duplications. After a gene duplication event, the duplicated copies accumulate nonsynonymous mutations. This causes the functional divergence of the mQTGs among A. thaliana accessions (intra-species level). Consequently, CNVs induce the functional divergence of mQTGs between A. lyrata and A. thaliana.

In the past 10 years, short-read sequencing has mainly been applied in investigations at the genome scale. Unfortunately, detecting structural variants is difficult based on short-read sequencing (van Dijk et al., 2018). Therefore, there have been relatively few studies on the CNVs in plants. However, third-generation sequencing platforms, such as Pacific Biosciences (PacBio), that can generate long reads (> 5 kb) have recently been applied for plant genomic research (Zhang et al., 2016; Fukushima et al., 2017; Lan et al., 2017; Baek et al., 2018; Edger et al., 2019). The long-read sequencing data may enable the accurate detection of structural variants (Jiao and Schneeberger, 2017; van Dijk et al., 2018). For example, structural variants were recently detected by PacBio in a tropical maize inbred line (Yang et al., 2019). If this experimental approach becomes more affordable, CNVs in plants may be more easily detected. Therefore, in the near future, it will be possible to verify our conclusions in other plant species.

# DATA AVAILABILITY STATEMENT

The datasets for this study are available in the 1001 Genomes (http://1001genomes.org), TAIR10 (https://www.arabidopsis.org), and Phytozome v12 (http://genome.jgi.doe.gov) databases.

# REFERENCES


# AUTHOR CONTRIBUTIONS

KS analyzed the data and wrote the manuscript. KS and KH designed the data analysis method and revised and approved the manuscript.

# FUNDING

This work was supported by Grants-in-Aid for Scientific Research (25710017, 15H02433, 17H03727, 18KK0176, 18H02420, and 19H05348; to KH) as well as research grants from the Takeda Science Foundation (to KH), the Sumitomo Foundation (to KH), Kurume Research Park (to KH), and the Asahi Glass Foundation (to KH).

# ACKNOWLEDGMENTS

We thank the National Institute of Genetics of the Research Organization of Information and Systems for providing excellent supercomputer services. We also thank Edanz Group (www.edanzediting.com/ac) for editing a draft of this manuscript.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2019.01567/full#supplementary-material

melanogaster strain w1118; iso-2; iso-3. *Fly (Austin).* 6, 80–92. doi: 10.4161/fly.19695


glucosinolate biosynthesis in *Arabidopsis*. *Plant Cell Online* 13, 681–693. doi: 10.1105/tpc.13.3.681


Ohno, S. (1970). *Evolution by Gene Duplication*.


specialized metabolites in *Arabidopsis thaliana* accessions. *Mol. Biol. Evol.* 34, 3111–3122. doi: 10.1093/molbev/msx234


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Shirai and Hanada. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Molecular Basis of C-30 Product Regioselectivity of Legume Oxidases Involved in High-Value Triterpenoid Biosynthesis

#### Edited by:

Dae-Kyun Ro, University of Calgary, Canada

#### Reviewed by:

Yansheng Zhang, Chinese Academy of Sciences, China Bjoern Hamberger, Michigan State University, United States

#### \*Correspondence:

Toshiya Muranaka muranaka@bio.eng.osaka-u.ac.jp

#### †Current address:

Hiroshi Sudo, School of Pharmacy and Pharmaceutical Sciences, Hoshi University, Shinagawa, Japan Kiyoshi Ohyama, Leaf Tobacco Research Center, Japan Tobacco Inc., Oyama, Japan

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 15 June 2019 Accepted: 31 October 2019 Published: 26 November 2019

#### Citation:

Fanani MZ, Fukushima EO, Sawai S, Tang J, Ishimori M, Sudo H, Ohyama K, Seki H, Saito K and Muranaka T (2019) Molecular Basis of C-30 Product Regioselectivity of Legume Oxidases Involved in High-Value Triterpenoid Biosynthesis. Front. Plant Sci. 10:1520. doi: 10.3389/fpls.2019.01520

*Much Zaenal Fanani1, Ery Odette Fukushima1,2, Satoru Sawai1,3,4,5, Jianwei Tang3, Masato Ishimori4, Hiroshi Sudo5†, Kiyoshi Ohyama3,6†, Hikaru Seki1,3, Kazuki Saito3,4 and Toshiya Muranaka1,3\**

1 Department of Biotechnology, Graduate School of Engineering, Osaka University, Suita, Japan, 2 Department of Biotechnology, Faculty of Life Sciences, Universidad Regional Amazónica IKIAM, Tena, Ecuador, 3 RIKEN Center for Sustainable Resource Science, Yokohama, Japan, 4 Graduate School of Pharmaceutical Sciences, Chiba University, Chiba, Japan, 5 Tokiwa Phytochemical Co., Ltd., Sakura, Japan, 6 Department of Chemistry and Materials Science, Tokyo Institute of Technology, Meguro, Japan

The triterpenes are a structurally diverse group of specialized metabolites with important roles in plant defense and human health. Glycyrrhizin, with a carboxyl group at C-30 of its aglycone moiety, is a valuable triterpene glycoside, the production of which is restricted to legume medicinal plants belonging to the *Glycyrrhiza* species. Cytochrome P450 monooxygenases (P450s) are important for generating triterpene chemodiversity by catalyzing site-specific oxidation of the triterpene scaffold. CYP72A154 was previously identified from the glycyrrhizin-producing plant *Glycyrrhiza uralensis* as a C-30 oxidase in glycyrrhizin biosynthesis, but its regioselectivity is rather low. In contrast, CYP72A63 from *Medicago truncatula* showed superior regioselectivity in C-30 oxidation, improving the production of glycyrrhizin aglycone in engineered yeast. The underlying molecular basis of C-30 product regioselectivity is not well understood. Here, we identified two amino acid residues that control C-30 product regioselectivity and contribute to the chemodiversity of triterpenes accumulated in legumes. Amino acid sequence comparison combined with structural analysis of the protein model identified Leu149 and Leu398 as important amino acid residues for C-30 product regioselectivity. These results were further confirmed by mutagenesis of CYP72A154 homologs from glycyrrhizin-producing species, functional phylogenomics analyses, and comparison of corresponding residues of C-30 oxidase homologs in other legumes. These findings could be combined with metabolic engineering to further enhance the production of high-value triterpene compounds.

Keywords: chemodiversity, cytochrome P450 monooxygenase, legume, product regioselectivity, triterpene

# INTRODUCTION

The triterpenoids are a large group of plant specialized metabolites consisting of six isoprene units. Plants produce structurally diverse triterpenoids that often have important roles in plant defense (Osbourn, 1996; Kuzina et al., 2009; Liu et al., 2019). Moreover, some triterpenoids exhibit properties beneficial for human health (Ito et al., 1988; Kenarova et al., 1990; Zhao et al., 2006; Kojoma et al., 2010). Due to this structural diversity, triterpenoids are considered important sources for new drug leads (Geisler et al., 2013; Vo et al., 2017). However, harnessing the potential of the structural diversity of triterpenoids has been hampered by limited information regarding the molecular mechanisms underlying their structural diversity.

Decoration of the triterpene scaffold catalyzed by cytochrome P450 monooxygenases (P450s) is the second step of triterpene biosynthesis (Seki et al., 2015). Generally, P450s have the ability to stereo- and regioselectively oxidize non-activated carbon by introducing various functional groups, such as hydroxyl, carbonyl, carboxyl, and even epoxy moieties (Qi et al., 2006; Ghosh, 2017). Moreover, the introduction of a hydroxyl group into the triterpene scaffold allows the generation of glycosylated and acylated triterpenes (Osbourn et al., 2011; Seki et al., 2015). Therefore, P450s are believed to play important roles in the diversity of triterpene structures (Ghosh, 2017; Miettinen et al., 2017).

Glycyrrhizin is a triterpene saponin that is the main active compound in legume medicinal *Glycyrrhiza* plants (Shibata, 2000). In addition to its sweet taste (150 times sweeter than sucrose; Kitagawa, 2002), glycyrrhizin also shows various pharmacological activities (Shibata, 2000), including anti-inflammatory (Kroes et al., 1997), hepatoprotective (Jeong et al., 2002), and antiviral effects (Ito et al., 1988). Among the *Glycyrrhiza* species, *Glycyrrhiza uralensis*, *Glycyrrhiza glabra*, and *Glycyrrhiza inflata* are known to produce glycyrrhizin (Hayashi et al., 2000). Large amounts of glycyrrhizin accumulate in their roots and stolons, accounting for an estimated 2%–8% of the dry weight (Shibata, 2000). Glycyrrhizin itself has been used as an ingredient in a number of commercial products, including foods, personal health care products, and medicines. However, the production of glycyrrhizin is dependent on natural resources that require an approximately 2–3-year growth period before harvesting (Chen et al., 2014). Due to the economic value and market demand for licorice, overexploitation of wild licorice has led to significant environmental issues (Marui et al., 2011). Therefore, a rapid and environmentally friendly system for glycyrrhizin production is required.

Metabolic engineering has been studied extensively for production of plant specialized metabolites in engineered organisms. The biosynthesis of glycyrrhizin involves the initial cyclization of 2,3-oxidosqualene to the pentacyclic triterpene β-amyrin, followed by a series of oxidative reactions at positions C-11 and C-30 (Seki et al., 2008; Seki et al., 2011). Previously, we identified two P450s (CYP88D6 and CYP72A154) involved in glycyrrhizin biosynthesis (Seki et al., 2008; Seki et al., 2011). Functional characterization of CYP72A154 showed that this enzyme catalyzed oxidation at C-30, accompanied by the production of isomers as minor products (Seki et al., 2011). *Medicago truncatula* does not produce glycyrrhizin, and the homologous enzyme, CYP72A63, showed superior regioselective oxidization of the C-30 position only (Seki et al., 2011). Use of CYP72A63 for production of glycyrrhizin aglycone in engineered yeast further enhanced its yield (Zhu et al., 2018). Although improvement of glycyrrhizin aglycone production in yeast has been achieved, it still produces a number of byproducts, such as 11α,30-dihydroxy-β-amyrin and 30-hydroxy-β-amyrin (Zhu et al., 2018). Further improvement of glycyrrhizin production by combining metabolic engineering and protein engineering is hampered by limited knowledge regarding the structure–function relationship of P450s involved in glycyrrhizin biosynthesis.

A great deal of research effort has focused on discovering the P450s involved in decoration of the triterpene skeleton. Based on amino acid sequence identity, CYP72A154 and CYP72A63 are classified as members of the CYP72A subfamily. The CYP72A subfamily is known to be a P450 subfamily involved in generating triterpene chemodiversity, whose members catalyze site-specific oxidation of oleanane-type triterpenoid scaffolds at C-2β (Biazzi et al., 2015), C-21β (Yano et al., 2017; Leveau et al., 2019), C-22β (Ebizuka et al., 2011; Fukushima et al., 2013), C-23 (Fukushima et al., 2013; Biazzi et al., 2015; Liu et al., 2019), and C-30 (Seki et al., 2011). Prall et al. (2016) reported that the CYP72A subfamily in flowering plants showed high variability of amino acid residues among substrate recognition sites (SRSs). However, there have been no experimental studies investigating the roles of amino acid residues in the SRSs in this subfamily. Moreover, no CYP72A subfamily protein crystal structures have yet been reported, even for closely related P450s with more than 40% identity. Therefore, it is still difficult to apply rational protein engineering to the CYP72A subfamily to improve product specificity and regioselectivity. Some reports suggested that gene mining of publicly available genomic or transcriptome databases is a more practical method for obtaining candidate genes encoding CYP72A enzymes showing better catalytic activity (Moses et al., 2014; Suzuki et al., 2018; Zhu et al., 2018). However, the majority of natural enzymes still exhibit some properties unfavorable for application in metabolic engineering, such as low product specificity and regioselectivity (Jung et al., 2011; Goldsmith and Tawfik, 2017). Therefore, it would be useful and interesting to determine the molecular mechanism underlying product regioselectivity of C-30 oxidases involved in high-value triterpenoid biosynthesis.

In this study, we mined the CYP72A subfamily from *M. truncatula* and characterized its enzymatic activity against β-amyrin. Functional characterization of the CYP72A subfamily from *M. truncatula* showed that CYP72A63 is an enzyme with high C-30 product regioselectivity. Interestingly, CYP72A62v2 and CYP72A64v2, which share more than 90% amino acid sequence identity, showed completely different product regioselectivity. By comparing the SRS sequences of CYP72A63 and its homologs, and by protein homology modeling, we identified Leu149 and Leu398 as key amino acid residues responsible for C-30 regioselective oxidation activity in CYP72A63. Analysis of CYP72A154 variants from both glycyrrhizin-producing and non-glycyrrhizin-producing *Glycyrrhiza* species also indicated that amino acid residue #398 differentiated the product regioselectivity of CYP72A154 variants. The results of this study will provide opportunities to engineer P450s for manipulation of product regioselectivity by rational protein engineering, to achieve the production of valuable triterpenoids such as glycyrrhizin.

# MATERIALS AND METHODS

# Plant Materials

The seed plants used in this experiment are listed in **Supplementary Table 1**. Seeds were germinated by mechanical scarification and imbibition in the dark at 23°C for 2 days. Germinated seeds were planted in soil and grown in a plant room with a controlled temperature of 23°C under a 16-h light/8-h dark photoperiod. Plant samples for RNA isolation were collected from 4-week-old seedlings, immediately frozen in liquid nitrogen, and then stored at -80°C until use. Underground parts of *Glycyrrhiza pallidiflora* were harvested from the Medicinal Plant Garden of Chiba University (Chiba, Japan); *Glycyrrhiza lepidota* and *Glycyrrhiza macedonica* were obtained from Osaka University of Pharmaceutical Science (Osaka, Japan), and *Glycyrrhiza glabra* was obtained from Health Sciences University of Hokkaido (Hokkaido, Japan).

# Authentic Standards

β-Amyrin was purchased from Extrasynthese (Lyon, France). Sophoradiol, 30-hydroxy-β-amyrin, and 11-deoxoglycyrrhetinic acid were synthesized as described in our previous report (Seki et al., 2008).

# Gene Mining for the CYP72A Subfamily

CYP72A subfamily candidates were identified by BLAST search using the amino acid sequences of CYP72A61v2, CYP72A63, and CYP72A67 as queries against the *M. truncatula* genome project Mt4.0v1 proteins (Tang et al., 2014). Hits showing >50% identity with unique sequential gene ID numbers were manually checked for surrounding 50 kb in genome JBrowser (Krishnakumar et al., 2015). Natural variants of gene cluster subgroup III were mined from *M. truncatula* Hapmap (Zhou et al., 2017) using the genomic sequences of *CYP72A62* (Medtr8g042060), *CYP72A63* (Medtr8g042040), *CYP72A64* (Medtr8g042020), and *CYP72A65* (Medtr8g042000) (obtained from *M. truncatula* genome database) as queries. Amino acid sequences were predicted according to the known coding sequences of gene references in *M. truncatula*. CYP72A63 homologs from other legumes were mined from publicly available genomic sequences in the 1KP database (Matasci et al., 2014), Clover GARDEN (www.clovergarden.jp/) (Hirakawa et al., 2016), Vigna Genome Server (viggs.dna.affrc.go.jp/) (Sakai et al., 2016), Cool Season Food Legume Genome Database (www.coolseasonfoodlegume.org/) (Main et al., 2013), and Legume Information System (legumeinfo.org/) (Dash et al., 2016).
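
The identity filter applied to the BLAST hits can be illustrated with a short sketch. This is not the pipeline used in the study: it assumes the hits were exported in the standard 12-column BLAST+ tabular format (`-outfmt 6`, where the third column is percent identity), and the hit IDs and values in `example` are hypothetical placeholders.

```python
# Minimal sketch: keep BLAST hits above a percent-identity threshold.
# Assumption: input lines follow BLAST+ tabular output (-outfmt 6).
# The example rows are invented for illustration, not data from the study.

def filter_hits(lines, min_identity=50.0):
    """Return sorted unique subject IDs whose percent identity exceeds min_identity."""
    kept = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        subject_id, pident = fields[1], float(fields[2])
        if pident > min_identity:
            kept.add(subject_id)
    return sorted(kept)

example = [
    "CYP72A63\thit_A\t100.0\t520\t0\t0\t1\t520\t1\t520\t0.0\t1050",
    "CYP72A63\thit_B\t82.5\t518\t90\t1\t1\t518\t1\t517\t0.0\t870",
    "CYP72A63\thit_C\t41.3\t500\t290\t5\t10\t505\t8\t500\t1e-120\t360",
    "CYP72A67\thit_B\t55.0\t515\t230\t2\t1\t515\t1\t514\t0.0\t560",
]

candidates = filter_hits(example)
print(candidates)
```

Collecting subject IDs in a set means a gene hit by several queries is listed only once, which matches the idea of a single candidate list built from three query sequences.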

# Cloning and Vector Construction

RNA preparation and cloning methods are explained briefly in the **Supplementary Methods**. All candidates were verified by sequencing and submitted to the P450 Committee for naming. Yeast expression clones, using pELC-CPR-GW (Seki et al., 2008), pYES2-DEST52 (Thermo Fisher Scientific, Waltham, MA), and a Gateway-compatible version of pESC-HIS (Seki et al., unpublished) as destination vectors, were constructed by LR reaction using LR clonase II Enzyme mix (Thermo Fisher Scientific).

# Accession Numbers

Sequence data from this experiment have been submitted to the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank databases under the following accession numbers: MK534530 (CYP72A695), MK534531 (CYP72A696), MK534532 (GmaxCYP72A141), MK534533 (GgCYP72A154), MK534534 (GpCYP72A154), MK534535 (LcCYP72A698), MK534536 (PsCYP72A698), MK534537 (CYP72A302), MK534538 (CYP72A694), MK534539 (CYP72A697), MK534540 (CYP72A336v2), MK534541 (CYP72A70), MK534542 (GsojaCYP72A141), MK534543 (CYP72A337v2), MK534544 (CYP72A557), MK534545 (CYP72A558), MK534546 (CYP72A559), MK534547 (CYP72A560), MK534548 (CYP72A64v2), MK534549 (CYP72A699), MK792941 (CYP72A66v2).

# In Vivo Enzymatic Assay in Yeast

*In vivo* enzymatic assay in yeast was performed by co-expression of *Lotus japonicus* cytochrome P450 reductase (CPR) and β-amyrin synthase (OSC1), with three yeast expression clones for each of the CYP72A subfamily genes. Each corresponding set of pELC-CPR-CYP72A, pYES2-DEST52-CYP72A, and pESC-HIS-CYP72A was transformed into *Saccharomyces cerevisiae* INVSc1 (*MATa his3Δ1 leu2 trp1-289 ura3-52*; Thermo Fisher Scientific) carrying pYES3-ADH-OSC1 (Seki et al., 2008) using Frozen-EZ Yeast Transformation II™ (Zymo Research, Orange, CA), and a yeast strain harboring the three empty vectors (pELC-CPR-GW, pYES2-DEST52, and pESC-HIS) was used as a control. All genes tested in this experiment are listed in **Supplementary Table 3**. *In vivo* enzymatic assay in yeast was performed as reported previously (Fukushima et al., 2011) with some modifications. Yeast strains were pre-cultured in appropriate synthetic defined medium (Clontech, Palo Alto, CA) containing 2% glucose, and incubated overnight at 30°C, 200 rpm. Aliquots of 50 µl of yeast pre-cultures were added into 5 ml of appropriate synthetic defined medium (Clontech) containing 2% glucose and incubated overnight at 30°C, 200 rpm. Yeast cells were collected by centrifugation, resuspended in 5 ml of appropriate synthetic defined medium (Clontech) containing 2% galactose and incubated at 30°C for 4 days at 200 rpm. Yeast metabolites were extracted using ethyl acetate after sonication, three times for 30 minutes each time, and portions of the extracts were analyzed by gas chromatography–mass spectrometry (GC-MS) after derivatization with *N*-methyl-*N*-(trimethylsilyl)trifluoroacetamide (Sigma-Aldrich, St. Louis, MO). GC-MS analysis was performed using a gas chromatograph (7890B; Agilent Technologies, Santa Clara, CA) connected to a mass spectrometer (5977A; Agilent Technologies) and HP-5MS capillary column (0.25 mm × 30 m, 0.25 µm) (Agilent Technologies).
The initial oven temperature was 150°C with a hold time of 1 min, increasing from 150°C to 260°C at 30°C/min and 260°C to 300°C at 1°C/min. Samples were injected in splitless mode with an injection temperature of 250°C, with helium as the carrier gas at a flow rate of 1.0 ml/min. Comparison of retention times and mass fragmentation patterns of detectable compounds with those of authentic standards was performed to assign the peaks.
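
As a worked check of the oven program, the run time implied by the stated hold and ramps can be computed directly. The sketch below uses only the values given in the text and assumes there is no final hold at 300°C, since none is stated; the function names are illustrative.

```python
# Oven program from the text: 1 min hold at 150 °C, then two linear ramps.
# Assumption: no final hold at 300 °C (none is stated in the text).
HOLD_MIN = 1.0
RAMPS = [(150.0, 260.0, 30.0),  # (start °C, end °C, rate °C/min)
         (260.0, 300.0, 1.0)]

def run_time_minutes(hold_min, ramps):
    """Total program time: initial hold plus the duration of each ramp."""
    return hold_min + sum((end - start) / rate for start, end, rate in ramps)

def oven_temperature(t_min, start_temp=150.0, hold_min=HOLD_MIN, ramps=RAMPS):
    """Oven temperature (°C) at t_min minutes after the start of the program."""
    if t_min <= hold_min:
        return start_temp
    elapsed = t_min - hold_min
    temp = start_temp
    for start, end, rate in ramps:
        duration = (end - start) / rate
        if elapsed <= duration:
            return start + rate * elapsed
        elapsed -= duration
        temp = end
    return temp  # program finished; temperature stays at the final value

total = run_time_minutes(HOLD_MIN, RAMPS)
print(round(total, 2))  # 1 + 110/30 + 40 minutes
```

Most of the run is spent in the slow 1°C/min ramp between 260°C and 300°C, the range in which the oven program separates most finely.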

# Bioinformatics Analyses

The predicted amino acid sequences of identified CYP72A subfamily members were used for multiple sequence alignment using ClustalW in MEGA7 (Kumar et al., 2016). A neighbor-joining tree was generated using MEGA6 with the Jones–Taylor–Thornton substitution model and bootstrap analysis of 1,000 replicates. The SRSs of the *M. truncatula* CYP72A subfamily were predicted as described previously (Gotoh, 1992; Prall et al., 2016). A CYP72A63 structure model was constructed using 6C93.A as the template with about 27% identity in SWISS-MODEL (Arnold et al., 2006). Heme was docked with 40 × 40 × 40 grid points, spacing 0.375, and grid center -14.738; 21.827; 9.827 using Autodock4.0 (Morris et al., 2009). PyMol (Schrödinger, 2017) was used to visualize the amino acids in the active site of the enzyme.
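
To give a sense of the search volume defined by these docking parameters, the following sketch computes the extent of the grid box. It rests on two assumptions that the text does not state: that the spacing is in Å, and that the number of grid points per axis follows the AutoGrid convention of counting intervals, so that the box edge equals points × spacing.

```python
# Docking grid parameters as given in the text.
NPTS = (40, 40, 40)                # grid points per axis
SPACING = 0.375                    # assumed to be in Å
CENTER = (-14.738, 21.827, 9.827)  # grid center (x, y, z)

def grid_box(center, npts, spacing):
    """Return (min_corner, max_corner) of the box, assuming edge = npts * spacing."""
    half = [n * spacing / 2.0 for n in npts]
    lo = tuple(c - h for c, h in zip(center, half))
    hi = tuple(c + h for c, h in zip(center, half))
    return lo, hi

lo, hi = grid_box(CENTER, NPTS, SPACING)
edge = [h - l for l, h in zip(lo, hi)]
print([round(e, 3) for e in edge])
print([round(v, 3) for v in lo], [round(v, 3) for v in hi])
```

Under these assumptions the box is a 15 Å cube centered on the stated coordinates.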

# Site-directed Mutagenesis

Designed protein variants were generated using the site-specific nucleotides listed in **Supplementary Table 2**. Mutagenesis experiments were performed using a PrimeSTAR Mutagenesis Basal Kit (TaKaRa Bio, Kyoto, Japan) and the entry clone was used as a template.
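
The variant names used throughout this article (for example, CYP72A62v2V149L/V398L) encode substitutions as original residue, position, new residue. A minimal sketch of how such a notation maps onto a protein sequence is given below; the helper `apply_substitutions` and the six-residue toy sequence are hypothetical and serve only to illustrate the naming convention, not the laboratory procedure.

```python
import re

def apply_substitutions(sequence, notation):
    """Apply substitutions such as 'V149L/V398L' (1-based positions) to a sequence."""
    residues = list(sequence)
    for token in notation.split("/"):
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", token.strip())
        if match is None:
            raise ValueError(f"Unrecognized substitution: {token!r}")
        old, pos, new = match.group(1), int(match.group(2)), match.group(3)
        if pos < 1 or pos > len(residues):
            raise ValueError(f"Position {pos} is outside the sequence")
        if residues[pos - 1] != old:
            # Guard against numbering mistakes before changing anything.
            raise ValueError(f"Expected {old} at position {pos}, found {residues[pos - 1]}")
        residues[pos - 1] = new
    return "".join(residues)

toy = "MKVLAG"  # hypothetical toy sequence, not a real CYP72A sequence
mutant = apply_substitutions(toy, "V3L/A5G")
print(mutant)
```

Checking the original residue before substitution catches off-by-one numbering errors, which matter here because residue numbers for CYP72A154 variants are given relative to CYP72A63.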

# Elucidation of the Structure of Compound 2 (Peak 2)

A yeast strain carrying β-amyrin synthase co-expressed together with CPR and CYP72A63L149V/L398V was cultured in appropriate synthetic defined medium with a total volume of 25.2 L (250 ml × 72, 150 ml × 48). Yeast metabolites were obtained by saponification prior to extraction with *n*-hexane three times. Yeast extracts were evaporated and the residues were applied to silica gel chromatography (60N, spherical, neutral) (Kanto Chemical, Tokyo, Japan). Hexane:ethyl acetate (1:9) was used as the mobile phase, and afforded about 8 mg of compound 2. Nuclear magnetic resonance (NMR) data were recorded on a Bruker Avance III 600 MHz spectrometer (Bruker Daltonic, Bremen, Germany) using CDCl3 as the solvent.

# RESULTS

# The Chromosomal Localization of *M. truncatula* CYP72As Corresponds to the Phylogenetic Tree Topology

Using representatives of each subgroup in the CYP72A subfamily described previously (Seki et al., 2011), we identified 20 candidate genes encoding CYP72A subfamily enzymes in the *M. truncatula* genome. *CYP72A59-like6*, *CYP72A59-like7*, and *CYP72A68-like* showed shorter sequences and no transcripts in transcriptome data (**Supplementary Table 4**). We considered them to be pseudogenes and excluded them from subsequent experiments. Among the 17 genes identified as encoding CYP72A subfamily enzymes from *M. truncatula* Mt4.0v1, only eight (*CYP72A59v2*, *CYP72A61*, *CYP72A62v2*, *CYP72A63*, *CYP72A65v2*, *CYP72A67*, *CYP72A68-430*, *CYP72A68-470*) had been functionally characterized (Seki et al., 2011; Fukushima et al., 2013; Biazzi et al., 2015; Reed et al., 2017). The remaining nine genes (*CYP72A64v2*, *CYP72A66v2*, *CYP72A336v2*, *CYP72A337v2*, *CYP72A70*, *CYP72A557*, *CYP72A558*, *CYP72A559*, *CYP72A560*) are reported in this study.

In the BlastP results, hits had unique sequential gene ID numbers (**Supplementary Table 5**). Based on this observation, we next analyzed the chromosomal localization of the CYP72A subfamily in the *Medicago truncatula* Mt4.0v1 genome database using JBrowser (Krishnakumar et al., 2015). The CYP72A subfamily genes were shown to be clustered in tandem arrays on chromosomes 2 and 8, with the exception of *CYP72A67*, *CYP72A61*, and *CYP72A70* (**Figure 1A**). Notably, the gene cluster on chromosome 2 contains half of the total number of CYP72A subfamily genes present in the *M. truncatula* genome. The CYP72A subfamily enzymes within the cluster showed 81%–94% identity except for CYP72A337 (**Supplementary Table 6**). The constructed phylogenetic tree showed that the CYP72A subfamily genes clustered on chromosomes 2 and 8 are also grouped into the same subgroups (IV and II, respectively; **Figure 1B**). Consistent with its amino acid sequence identity, CYP72A337v2 is located out of the clade within the gene cluster, and was further classified into a new subgroup IV. These results suggested that gene duplication likely occurred multiple times within the CYP72A subfamily in *M. truncatula*.

# CYP72A63 Regioselectively Oxidized β-Amyrin C-30

The enzymatic activities of CYP72A subfamily enzymes against β-amyrin were characterized by co-expression of CYP72A subfamily genes together with CPR and β-amyrin synthase in *S. cerevisiae* INVSc1. The CYP72A subfamily enzymes within the cluster showed a range of regioselective oxidation activity (**Figure 2**, **Supplementary Figure 1A**). CYP72A61v2 of subgroup I showed oxidation activity against β-amyrin at position C-22β, producing sophoradiol. β-Amyrin C-22β oxidation activity was also detected in CYP72A66v2 (subgroup III), which showed oxidation activity at more than one site producing sophoradiol, 30-hydroxy-β-amyrin, compound **2** (peak 2 in **Figure 2**, which may correspond to a monohydroxylated β-amyrin product with a hydroxyl group on the D or E ring, based on the characteristics of retro Diels-Alder fragmentation at the C ring in the mass fragmentation pattern shown in **Supplementary Figure 1A**), 11-deoxoglycyrrhetinic acid, and some minor unknown compounds. CYP72A557 and CYP72A558 also showed oxidation activity against β-amyrin producing unknown compound 1 (peak 1 in **Figure 2**), compound **2** (peak 2 in **Figure 2**), and unknown compound 3 (peak 3 in **Figure 2**), which were predicted to be monohydroxylated β-amyrin products with a hydroxyl group on the D or E ring, based on the characteristics of retro Diels-Alder fragmentation at the C ring in the mass fragmentation pattern shown in **Supplementary Figure 1A**. Unlike other enzymes in the same cluster, CYP72A559 and CYP72A560 showed oxidation activity against β-amyrin at the D or E ring, to produce unknown compound **1** (peak 1 in **Figure 2**) alone.

CYP72A subfamily enzymes in tandem array subgroup II showed oxidation activity with different regioselectivities (**Figure 2**, **Supplementary Figure 1A**). CYP72A65v2 showed oxidation activity against β-amyrin, producing unknown compound **1** (peak 1 in **Figure 2**) as the major product and compound **2** (peak 2 in **Figure 2**) as a minor product. CYP72A62v2 and CYP72A64v2 showed oxidation activity against β-amyrin, producing compound **2** (peak 2 in **Figure 2**) and unknown compound 4 (peak 4 in **Figure 2**) as minor carboxylated form products. In addition, unknown compound **1** (peak 1 in **Figure 2**) was detected in trace amounts in the reaction product of CYP72A62v2. CYP72A63 showed oxidation activity against β-amyrin at the C-30 position producing 30-hydroxy-β-amyrin and 11-deoxoglycyrrhetinic acid. The *in vivo* enzymatic assay clearly showed that CYP72A63 was the only CYP72A subfamily enzyme from *M. truncatula* with high regioselectivity at β-amyrin C-30.

No oxidized β-amyrin product was detected in the *in vivo* enzyme assays of CYP72A70, CYP72A336v2, CYP72A59v2, and CYP72A337v2 (**Figure 2**, **Supplementary Figure 1A**). The lack of detectable enzymatic activity of these four CYP72A enzymes may have several causes: the current detection method may not be able to detect trace amounts of their oxidation products, the enzymes may not be expressed correctly, the enzymes may have different substrate specificities, or a mutation in the signature region may cause loss of function. Possible mutations in the signature region of these four enzymes were investigated by multiple alignment of the putative oxygen activation region of the CYP72A subfamily (**Supplementary Figure 3**). A substitution of a conserved acidic amino acid (Glu) with a basic amino acid (Lys) was found in CYP72A336v2. To examine whether this substitution of the conserved amino acid residue Glu327 caused loss of function in CYP72A336v2, the mutant CYP72A336v2K327E was generated. *In vivo* enzymatic assay of CYP72A336v2K327E showed that substitution of Lys327 with Glu327 restored the enzyme activity of CYP72A336v2, producing sophoradiol, 30-hydroxy-β-amyrin, and compound **2** (peak 2 in **Supplementary Figure 3C**).

# Leu149 and Leu398 are Essential for Regioselective Oxidation at C-30

To identify residues important for C-30 regioselectivity of CYP72A63, we examined amino acid residues in the predicted SRS (**Figure 3A**). Two selection criteria were applied: the amino acid residue must be conserved in CYP72A62v2 and CYP72A64v2 but not in CYP72A63, or must not be conserved at all. Six positions (Val149, His150, Ile244, Gln246, Glu269, and Val398) were selected (marked in the box, **Figure 3A**). *In vivo* enzymatic assay of CYP72A62v2 mutants showed that CYP72A62v2V398L was sufficient to alter the product regioselectivity, as compound **2** (peak 2) and 30-hydroxy-β-amyrin were detected on GC-MS analysis (**Figure 3C**, **Supplementary Figure 1B**). In parallel, we also mapped the positions of key residues important for C-30 product regioselectivity by generating protein chimeras of CYP72A63 and CYP72A62v2 using the segment exchange approach (**Supplementary Figure 4**). *In vivo* enzymatic assay of chimeric proteins suggested that key amino acid residues for C-30 product regioselectivity are located between residues #133 and #409, and more than one amino acid residue is required (**Supplementary Figure 4**). These results suggested that Leu398 is not the sole residue responsible for C-30 product regioselectivity.
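
The residue selection criteria can be expressed as a simple scan over aligned sequences. The sketch below is only an illustration of the logic, not the procedure used in the study: the function name, the five-column toy alignment, and the 0-based column indices are all hypothetical.

```python
# Minimal sketch of the selection rule: keep an aligned position if the residue is
# shared by CYP72A62v2 and CYP72A64v2 but differs in CYP72A63, or if all three differ.
# The sequences below are invented toy strings, not real CYP72A sequences.

def candidate_positions(seq_62, seq_64, seq_63, positions):
    """Return the alignment columns (0-based) that satisfy either criterion."""
    selected = []
    for i in positions:
        a, b, c = seq_62[i], seq_64[i], seq_63[i]
        if "-" in (a, b, c):
            continue  # skip alignment gaps
        shared_by_pair_only = (a == b) and (a != c)
        all_different = len({a, b, c}) == 3
        if shared_by_pair_only or all_different:
            selected.append(i)
    return selected

toy_62 = "GVSKV"  # stands in for CYP72A62v2
toy_64 = "GVSRI"  # stands in for CYP72A64v2
toy_63 = "GLSRL"  # stands in for CYP72A63

srs_columns = range(len(toy_63))  # in practice, only columns inside the predicted SRSs
picked = candidate_positions(toy_62, toy_64, toy_63, srs_columns)
print(picked)
```

In the toy alignment, column 1 is shared by the first two sequences only and column 4 differs in all three, so both are kept, whereas column 3 (where the second sequence agrees with the third) is not.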

To identify the second important amino acid residue determining product regioselectivity, we next mapped the position of Leu398 in the three-dimensional homology model of CYP72A63 (**Figure 3B**). The homology model of CYP72A63 showed that Leu398 is located in the area surrounding the reaction center of P450, where the enzyme catalytic reaction takes place. Based on these findings, we hypothesized that the second important amino acid residue may be located close to Leu398. Therefore, we examined amino acid residues surrounding Leu398. Among five amino acid residues in this region, Leu149 (SRS1) is located in a relatively face-to-face position to Leu398 (SRS5) (**Figure 3B**). Thus, both amino acid residues may determine the regioselectivity of CYP72A63. To examine the role of Leu149 together with Leu398 in C-30 product regioselectivity, the mutant CYP72A62v2V149L/V398L was generated. *In vivo* enzymatic assay of CYP72A62v2V149L/V398L showed that CYP72A62v2V149L/V398L produced only C-30 oxidized products, 30-hydroxy-β-amyrin and 11-deoxoglycyrrhetinic acid (**Figure 3C**). These results showed that Leu149 and Leu398 are important amino acid residues for regioselective oxidation of β-amyrin at C-30.

Figure 3 caption (continued): … sites (SRSs). Amino acid residue candidates for mutagenesis are indicated by gray boxes. Key residues are indicated by red stars. (B) Structural analysis of key amino acid residues in the CYP72A63 model. The gray boxes show enlargements of areas where key residues are located in the active site above heme. (C) Reciprocal mutagenesis studies to identify key amino acid residues. Molecular ions with m/z 306 and 320 were selected for EIC analysis of β-amyrin-oxidized products. C-30 product regioselectivity-related products or enzymes are indicated in red. Peaks corresponding to unconfirmed β-amyrin are indicated with asterisks.

Substitution of amino acid residues #149 and #398 markedly altered the product regioselectivity of the CYP72A62v2 enzyme (**Figure 3C**, **Supplementary Figure 1C**). We also generated mutants of CYP72A63, designated as CYP72A63L149V/L398V and CYP72A63L149V/L398I, which resembled CYP72A62v2 and CYP72A64v2, respectively. *In vivo* enzymatic assay showed that CYP72A63L149V/L398V and CYP72A63L149V/L398I had altered product regioselectivity resembling their counterparts, CYP72A62v2 and CYP72A64v2, respectively (**Figure 3C**, **Supplementary Figure 1C**). These results clearly showed that Leu149 and Leu398 are important amino acid residues for C-30 product regioselectivity of CYP72A63. To determine the regioselective oxidation activities of CYP72A62v2 and CYP72A64v2, we examined the structure of compound 2 (peak 2) by NMR spectroscopy. Complete ¹³C assignment of purified compound 2 (peak 2) was not achieved due to incomplete removal of impurities. However, the data indicated the presence of 29-hydroxy-β-amyrin (**Supplementary Data**).

We used the *M. truncatula* Hapmap to further investigate the roles of residues #149 and #398 in product regioselectivity (Zhou et al., 2017). By focusing on amino acid residues #149 and #398, amino acid residue variants present in CYP72A62v2 were identified from *M. truncatula* accessions (Zhou et al., 2017) (**Figure 4A**). The variants were classified into three types based on differences in amino acid residues #149 and #398: Type VV (Val149, Val398), Type VL (Val149, Leu398), and Type IL (Ile149, Leu398). To evaluate the effects of divergent amino acid residues on product regioselectivity, we generated mutants of CYP72A63 mimicking each type (VV, VL, and IL). *In vivo* enzymatic assay showed that these artificial mutant enzymes had differences in product regioselectivity (**Figure 4B**, **Supplementary Figure 1D**). Unexpectedly, Type IL oxidized at the C-30 position, producing 30-hydroxy-β-amyrin and 11-deoxoglycyrrhetinic acid, which resembled CYP72A63 rather than CYP72A62v2. These results clearly showed that amino acid residues #149 and #398 are essential for fine tuning of product regioselectivity.
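
The three-way classification of natural variants can be sketched as a lookup on the residue pair at positions #149 and #398. The type definitions below come from the text; the accession names and their residues are hypothetical placeholders, and the `other` label is an assumption added to handle pairs outside the three described types.

```python
from collections import Counter

# Type definitions from the text: (residue #149, residue #398) -> type label.
TYPE_BY_RESIDUES = {
    ("V", "V"): "VV",
    ("V", "L"): "VL",
    ("I", "L"): "IL",
}

def classify_variant(res_149, res_398):
    """Return the variant type, or 'other' for a pair not described in the text."""
    return TYPE_BY_RESIDUES.get((res_149, res_398), "other")

# Hypothetical accessions with made-up residue pairs, for illustration only.
accessions = {
    "acc_1": ("V", "V"),
    "acc_2": ("V", "L"),
    "acc_3": ("I", "L"),
    "acc_4": ("V", "V"),
    "acc_5": ("L", "L"),
}

types = {name: classify_variant(*pair) for name, pair in accessions.items()}
counts = Counter(types.values())
print(types)
print(dict(counts))
```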

# Residue #398 may be Involved in Generating Triterpene Chemodiversity in Glycyrrhiza Species

To investigate the roles of amino acid residues #149 and #398 in generating triterpene chemodiversity in *Glycyrrhiza* species, CYP72A154 variants from glycyrrhizin-producing species (*Glycyrrhiza uralensis*, *Gu*CYP72A154; *G. glabra*, *Gg*CYP72A154) and non-glycyrrhizin-producing species (*Glycyrrhiza pallidiflora*, *Gp*CYP72A154; *G. lepidota*, *Gl*CYP72A154; *G. macedonica*, *Gmac*CYP72A154) were investigated. As the full-length amino acid sequences from the three non-glycyrrhizin-producing species were identical, we selected *Gp*CYP72A154 as the representative. Sequence alignment of CYP72A154 variants showed that amino acid residue #149 (numbering based on CYP72A63) is Val149 for both types, but amino acid residue #398 (numbering based on CYP72A63) differs between them, i.e., Gly398 for glycyrrhizin-producing species and Ala398 for non-glycyrrhizin-producing species (**Figure 5**). To characterize the role of divergent amino acid residue #398 in product regioselectivity in CYP72A154 variants, *in vivo* enzymatic assay was performed. *Gu*CYP72A154 and *Gg*CYP72A154 oxidized β-amyrin with less regioselectivity at the D or E ring, producing unknown compound 1 (peak 1) and 30-hydroxy-β-amyrin as the main products and 29-hydroxy-β-amyrin as a trace product (**Figure 5**, **Supplementary Figure 1E**). In contrast, *Gp*CYP72A154 from the non-glycyrrhizin-producing species oxidized β-amyrin, producing unknown compound 1 (peak 1) and 29-hydroxy-β-amyrin as the main products and 30-hydroxy-β-amyrin as a trace product (**Figure 5**, **Supplementary Figure 1E**). These results suggested that the product regioselectivity of CYP72A154 variants differs between glycyrrhizin-producing species and non-glycyrrhizin-producing species.

To characterize the product regioselectivity of CYP72A154 variants, we also elucidated two unknown products in the residue remaining after purification of 30-hydroxy-11-oxo-β-amyrin from yeast co-expressing β-amyrin synthase, CYP88D6, *Gu*CYP72A154, and CPR (Peaks 4b and 4c in **Figure 2A** shown in Seki et al., 2011). NMR spectroscopy showed that the unknown compounds of 30-hydroxy-11-oxo-β-amyrin isomers were 29-hydroxy-11-oxo-β-amyrin and 21β-hydroxy-11-oxo-β-amyrin (**Supplementary Data**). These results showed that *Gu*CYP72A154 catalyzed oxidation at C-21β, C-29, and C-30.

To investigate whether the differences in product regioselectivity of CYP72A154 variants are associated with divergence of amino acid residue #398, we generated mutant *Gu*CYP72A154G398A resembling *Gp*CYP72A154. *In vivo* enzymatic assay showed that product regioselectivity of *Gu*CYP72A154G398A changed markedly, more closely resembling *Gp*CYP72A154 product regioselectivity and producing unknown compound 1 (peak 1, putative 21β-hydroxy-β-amyrin) and 29-hydroxy-β-amyrin as main products (**Figure 5**, **Supplementary Figure 1E**). In addition, we also generated mutants of CYP72A63, CYP72A63L149V/L398G and CYP72A63L149V/L398A, carrying the amino acids at residues #149 and #398 in glycyrrhizin-producing species *Gu*CYP72A154 and non-glycyrrhizin-producing species *Gp*CYP72A154 (**Figure 5**, **Supplementary Figure 1E**). *In vivo* enzymatic assay of CYP72A63L149V/L398G (mimicking glycyrrhizin-producing variants) showed oxidation activity mainly at the C-30 position, producing 30-hydroxy-β-amyrin as the main product, while CYP72A63L149V/L398A (mimicking non-glycyrrhizin-producing variants) showed oxidation activity mainly at the C-29 position producing 29-hydroxy-β-amyrin as the main product. Mutagenesis of CYP72A63 mimicking CYP72A154 variants showed good agreement with the product regioselectivity of CYP72A154 variants from glycyrrhizin-producing and non-producing species. These results suggested that differences in amino acid residue #398 may be involved in generating triterpene chemodiversity in *Glycyrrhiza* species by conferring variable product regioselectivity.

# Divergent Amino Acid Residues #149 and #398 in Legume CYP72A63 Homologs

To further analyze the roles of amino acid residues #149 and #398 in CYP72A63 homologs from other legumes, we performed phylogenomic analyses of CYP72A63 homologs by constructing a phylogenetic tree, comparing amino acid residues #149 and #398, and performing *in vivo* enzymatic assays. CYP72A63 homologs were searched for in publicly available genomic information and transcriptome databases of legume plants. Selected CYP72A63 homologs were cloned, confirmed by DNA sequencing, and submitted to the P450 Committee for naming (**Supplementary Table 3**). Phylogenetic analysis of CYP72A63 homologs showed that the legume species have a variable number of CYP72A63 homologs present in their genome (**Figure 6**); amino acid residues #149 and #398 varied among them. To investigate the relationships of amino acid residues #149 and #398 to product regioselectivity, *in vivo* enzymatic assays against β-amyrin were performed (**Figure 6**, **Supplementary Figure 1F**). None of the CYP72A63 homologs from these legumes showed C-30 oxidation activity, except CYP72A66v2 from *M. truncatula* and CYP72A154 from glycyrrhizin-producing *Glycyrrhiza* plants (**Figure 7**). The combinations of amino acid residues #149 and #398 differed among *Vigna angularis* *Va*CYP72A694 (Ile149, Thr398), *Glycine max* *Gmax*CYP72A141 (Leu149, Thr398), and *Lotus japonicus* *Lj*CYP72A697 (Val149, Val398), but *in vivo* enzymatic assays showed that they have regioselectivity in the C-29 position. *Trifolium pratense* *Tp*CYP72A699 and *Phaseolus vulgaris* *Pv*CYP72A302 have a combination of amino acid residues, Val149 and Val398, as seen in *Lotus japonicus* *Lj*CYP72A697, but *in vivo* enzymatic assay showed that they differed in product regioselectivity; however, regioselectivity in the C-29 position was common among them. These results support the important roles of amino acid residues #149 and #398 in determining regioselective oxidation activity.

Figure caption (continued): … the CYP72A63 amino acid sequence. Total ion current chromatograms are shown in enlargement mode. Identified peaks are indicated. Amino acid residue #398 and product regioselectivity are indicated in red and blue, for C-30 and C-29, respectively.

# DISCUSSION

In the absence of protein structures for P450s involved in triterpene biosynthesis, comprehensive analyses of the structure–function relationships could not be performed. Recent studies successfully demonstrated that comparative functional analyses of natural variants or evolutionarily related enzymes could be useful in narrowing down important amino acid residues, such as those involved in substrate and product specificity (Komori et al., 2013; Chen and Li, 2017; Xue et al., 2018). Elucidation of the molecular basis of product specificity and regioselectivity of the enzymes enables us to improve product specificity and produce desired compounds more efficiently. In this study, investigation of tandem duplicated CYP72A subfamily genes in *M. truncatula* identified two amino acid residues, Leu149 and Leu398, responsible for C-30 regioselective oxidation activity.

CYP72A subfamily genes were found in tandem arrays in the genome of *M. truncatula* (**Figure 1A**). The tandem array of CYP72A subfamily members is not a specific feature of *M. truncatula* because it was also found in other legumes (**Supplementary Figure 2**), and even in non-legumes, such as *Barbarea vulgaris* (Liu et al., 2019), *Oryza sativa*, and *Arabidopsis thaliana* (Saika et al., 2014). Recently, Liu et al. (2019) reported that *CYP72A552*, one of the CYP72A subfamily genes present in tandem arrays, is involved in hederagenin-based saponin biosynthesis in *Barbarea vulgaris*. Similarly, Saika et al. (2014) also reported that *CYP72A31*, one of the CYP72A subfamily genes present in tandem arrays, is involved in the mechanism of herbicide tolerance in rice. These observations indicated that genes present in tandem arrays may have different functions.

Gene duplication followed by subsequent mutations is a well-known mechanism by which genes gain new functions through neofunctionalization and escape from adaptive conflict (Panchy et al., 2016). In neofunctionalization, one copy of a duplicated gene maintains the original function, while the other copy gains a novel function by accumulation of mutations (Panchy et al., 2016). Although the reasons why members of the CYP72A subfamily are commonly found in tandem arrays are still unknown, these findings suggest a mechanism of functional diversification of CYP72A subfamily enzymes in plants. The existence of the CYP72A subfamily is not specific to legumes. However, its contribution to the synthesis of structurally diverse triterpenes has been reported almost exclusively in legumes, with the exception of the C-21 oxidase of *Avena* spp. (Leveau et al., 2019) and C-23 oxidase of *Kalopanax septemlobus* (Han et al., 2018) and *Barbarea vulgaris* (Liu et al., 2019).

CYP72A63 is located in the tandem array together with CYP72A65v2, CYP72A64v2, and CYP72A62v2 on chromosome 8. A previous report suggested that CYP72A65 also showed C-30 oxidation activity (Zhu et al., 2018). However, our *in vivo* enzymatic assay of CYP72A subfamily enzymes from *M. truncatula* clearly showed that CYP72A63 is the one enzyme that selectively oxidizes the C-30 position. Considering the chromosomal localization and phylogenetic tree topology, this tandem array likely evolved from a common ancestor with accumulation of mutations. Mutations can directly affect enzyme function, or can be silent. In the case of CYP72A64v2 and CYP72A62v2, substitution of Ile398 to Val398 did not alter the enzyme regioselectivity. However, substitution of Val149 to Leu149 and Ile/Val398 to Leu398 changed the product regioselectivity from C-29 toward the C-30 position. Thus, amino acid residues #149 and #398 determined the substrate orientation, which controlled its product regioselectivity.

The enzymes catalyzing oxidation of β-amyrin at the C-29 position were identified for the first time in this study (**Figure 7**). Among the C-29 oxidases, *Va*CYP72A694 exhibited greater accumulation of the carboxylated product (putative C-29 carboxylated product; up to 70% product ratio; **Supplementary Figure 1F**). C-29-derived saponins have been found in legume plants, including adzukisaponins in *V. angularis* (Kitagawa et al., 1983) and macedonosides in *G. macedonica* (Hayashi et al., 2000), and some have shown promise as high-value triterpenoids. Yoshikawa et al. (2002) identified albiziasaponin B, a triterpene saponin with a carboxyl group at C-29 of its aglycone moiety, from the Thai medicinal plant *Albizia myriophylla* (Cha-em Thai), and reported a sweetness 600 times greater than that of sucrose. In Thai folk medicine, the stem of *A. myriophylla* has been used as a substitute for licorice because of its sweetness (Yoshikawa et al., 2002), and it is one of the ingredients in traditional medicine used for treatment of diabetes (Neamsuvan et al., 2015). Thus, identification of C-29 oxidases provides a new genetic tool for production of high-value C-29-derived triterpenoids by synthetic biology.

*Glycyrrhiza* species were classified into two types according to the accumulation of glycyrrhizin, i.e., glycyrrhizin-producing and non-glycyrrhizin-producing species. Glycyrrhizin-producing species mainly show accumulation of C-30-derived saponins, while non-producing species accumulate C-29-derived saponins (Hayashi et al., 2000). The enzymatic activity of CYP72A154 variants showed good agreement with saponin accumulation in both types of *Glycyrrhiza* species. Comparison of amino acid residues #149 and #398 in CYP72A154 variants suggested that divergence in amino acid residue #398 may be involved in the generation of triterpene chemodiversity in *Glycyrrhiza* species.

A substrate–enzyme complex model could not be obtained due to the low quality of the protein model. However, C-30 product regioselectivity was illustrated by *in silico* mutagenesis of CYP72A63 (**Figure 8**). Regioselectivity for the C-30 and C-29 positions was controlled by amino acid residues #149 and #398, but the combinations of amino acid residues were different among them. The combination of amino acid residues with long nonpolar side chains (Ile/Leu149 and Leu398) resulted in C-30 product regioselectivity, while the combination of an amino acid residue with a nonpolar short side chain (Val149) and long nonpolar side chain (Ile398) resulted in C-29 product regioselectivity. Other combinations of amino acid residues with nonpolar short side chains (Val, Gly, Ala) at both positions resulted in broad regioselectivity, producing a number of isomers, i.e., C-30, C-29, and 21β. This suggested that C-30 product regioselectivity required amino acid residues with a nonpolar long side chain (Leu/Ile149 and Leu398) for specific placement of the methyl-30 group in the favorable position for enzyme reaction (**Figure 8**). Shortening the side chains of amino acid residues #149 and #398 increased the volume of the active site cavity, which allowed positioning of the substrate in multiple orientations (**Figure 8A**). Thus, C-30 product isomers could be produced by nonspecific positioning of methyl-29, methyl-30, and methylene-21 functional groups in proximity to mononuclear iron (**Figure 8B**). Interestingly, amino acid residues #149 and #398 were nonpolar amino acids (Ile, Leu, Val, Ala), except for Gly and Thr. This suggested that hydrophobic interactions may be important for positioning of the substrate in the most favorable orientation for enzyme catalysis. To explain the effects of the 20 possible amino acid residues and the binding mode of the enzyme and substrate, mutagenesis studies and crystal structure analysis of CYP72A63 are required.

A recent study showed that CYP72A69 from soybean, *G. max*, catalyzed oxidation at the C-21β position in soyasapogenol A biosynthesis (Yano et al., 2017). *G. max* C-21 oxidase has short side chain amino acids at residues #149 and #398 (i.e., Ala149 and Gly398, respectively; numbering based on CYP72A63). Similarly, *M. truncatula* CYP72A65v2, which catalyzed oxidation at putative C-21 (peak **1**), also has short side chain amino acids at both residues #149 and #398 (i.e., Ala149 and Val398, respectively). The substitutions Leu149Ala and Leu398Val in CYP72A63 did not shift the product regioselectivity to resemble that of CYP72A65v2 (data not shown), suggesting that C-21β product regioselectivity may be controlled by amino acid residues other than residues #149 and #398. The amino acid residues involved in product regioselectivity of CYP72A65v2 remain to be determined; doing so would lead to a better understanding of the evolutionary diversification of product regioselectivity in the tandem array on chromosome 8.

are indicated. Potential rotation of β-amyrin is indicated by the red dashed line. The distance of C-30 from the heme iron is indicated by gray dashed line.

Although we successfully altered the product regioselectivity of CYP72A62v2 from the C-29 position toward the C-30 position, by substituting Val149 to Leu149 and Val398 to Leu398, the mutant CYP72A62v2V149L/V398L produced only trace amounts of C-30 carboxylated product (**Figure 3B**). However, the original enzyme, CYP72A63, produced the carboxylated product, suggesting that additional amino acid residues, other than #149 and #398, may be required for successive oxidation to produce the C-30 carboxylated product. Komori et al. (2013) showed that an amino acid residue located in the loop of SRS6, Ser479, in CYP71AV1 is important for successive oxidation of amorpha-4,11-diene to produce the carboxylated product (artemisinic acid) in the biosynthesis of the antimalarial sesquiterpenoid, artemisinin, in *Artemisia annua*. This suggested that the amino acid residues involved in C-30 carboxylated product specificity may also be located in SRS regions. Further investigations are required to identify the amino acid residues involved in C-30 successive oxidation to produce the C-30 carboxylated product.

We identified the key amino acid residues, Leu149 and Leu398, controlling C-30 product regioselectivity in CYP72A63. The results reported here will enable us to improve the product specificity and produce desired compounds more efficiently. Rational engineering of C-30 oxidase by site-saturation mutagenesis of amino acid residues #149 and #398 may be useful for fine tuning of the methyl-30 functional group to a favorable position for enzyme catalysis, to improve product specificity and production yield. Alternatively, the results presented here suggested that it may be possible to redirect the product regioselectivity of *Va*CYP72A694, a high carboxylated product producer (up to 70% accumulation), toward the C-30 position by rational protein engineering. The application of protein engineering in combination with metabolic engineering has been shown to significantly improve the production of natural products. Our findings will provide opportunities to further enhance the production of the valuable triterpene glycyrrhizin through rational protein engineering of C-30 oxidase.
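To give a sense of the size of the variant library implied by site-saturation mutagenesis at residues #149 and #398, the following Python sketch enumerates every single and double substitution of CYP72A63 at these two positions. It is a counting illustration only, assuming the 20 standard amino acids and the wild-type residues Leu149 and Leu398 named in the text; the function name `saturation_variants` and the labeling scheme (modeled on the "Leu149Ala" notation used above) are introduced here for illustration.

```python
from itertools import product

# The 20 standard amino acids in three-letter code.
AMINO_ACIDS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]

# Wild-type residues of CYP72A63 at the two positions discussed in the text.
WILD_TYPE = {149: "Leu", 398: "Leu"}


def saturation_variants(wild_type):
    """List all variants that differ from the wild type at one or more positions."""
    positions = sorted(wild_type)
    variants = []
    for residues in product(AMINO_ACIDS, repeat=len(positions)):
        changes = [
            f"{wild_type[pos]}{pos}{res}"
            for pos, res in zip(positions, residues)
            if res != wild_type[pos]
        ]
        if changes:  # skip the unchanged wild-type combination
            variants.append("/".join(changes))
    return variants


if __name__ == "__main__":
    variants = saturation_variants(WILD_TYPE)
    singles = [v for v in variants if "/" not in v]
    print(len(variants), "variants in total,", len(singles), "of them single substitutions")
```

Under these assumptions the full two-position library contains 20 × 20 − 1 = 399 variants, of which 38 are single substitutions, which is small enough to screen exhaustively in principle.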

# DATA AVAILABILITY STATEMENT

The nucleotide sequences isolated in this study have been submitted to the GenBank at NCBI, ncbi.nlm.nih.gov/genbank/.

# AUTHOR CONTRIBUTIONS

MF, EF, and SS designed experiments. MF, SS, JT, MI, HSu, KO, and HSe performed experiments. MF, EF, SS, KO, and HSe wrote the article. EF, KS, and TM supervised the research. All authors discussed the results and approved the article.

# FUNDING

This study was supported in part by the Grants-in-Aid for Scientific Research of the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number JP19H02921; the Scientific Technique Research Promotion Program for Agriculture, Forestry, Fisheries, and Food Industry, Japan; The Program for Promotion of Basic and Applied Researches for Innovations in Bio-oriented Industry (BRAIN); The Special Fund from the Director of RIKEN Yokohama Institute; The RIKEN Rijicho Fund; and the Monbukagakusho Scholarship.

# ACKNOWLEDGMENTS

We thank Toshio Aoki (Nihon University) for his valuable discussions and technical advice, Mareshige Kojoma (Health Sciences University of Hokkaido, Japan) for providing *G. glabra*, Makio Shibano (Osaka University of Pharmaceutical Sciences, Japan) for providing *G. lepidota* and *G. macedonica*, David R. Nelson (University of Tennessee, USA) for the naming of P450s, and Kyoko Inoue (Osaka University, Japan) for technical assistance with NMR analysis.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at https://www.frontiersin.org/articles/10.3389/fpls.2019.01520/full#supplementary-material

# REFERENCES

temperature and Ca2+ ion as environmental stress based on field investigation. *J. Fac. Agr. Kyushu Univ.* 56, 2.


legume *Medicago truncatula*. *BMC Genomics* 15, 312. doi: 10.1186/1471-2164-15-312


assemblies of 15 *Medicago* genomes. *BMC Genomics* 18, 261. doi: 10.1186/s12864-017-3654-1

Zhu, M., Wang, C. X., Sun, W. T., Zhou, A. Q., Wang, Y., Zhang, G. L., et al. (2018). Boosting 11-oxo-β-amyrin and glycyrrhetinic acid synthesis in *Saccharomyces cerevisiae* *via* pairing novel oxidation and reduction system from legume plants. *Metab. Eng.* 45, 43–45. doi: 10.1016/j.ymben.2017.11.009

**Conflict of Interest:** Authors HSu and SS were employed by company Tokiwa Phytochemical Co., Ltd.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Fanani, Fukushima, Sawai, Tang, Ishimori, Sudo, Ohyama, Seki, Saito and Muranaka. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Evolution of Structural Diversity of Triterpenoids

#### *Pablo D. Cárdenas†, Aldo Almeida† and Søren Bak\**

Department of Plant and Environmental Science, University of Copenhagen, Frederiksberg, Denmark

Plants have evolved to produce a blend of specialized metabolites that serve functional roles in plant adaptation. Among them, triterpenoids are one of the largest subclasses of such specialized metabolites, with more than 14,000 known structures. They play a role in plant defense and development and have potential applications within food and pharma. Triterpenoids are cyclized from oxidized squalene precursors by oxidosqualene cyclases, creating more than 100 different cyclical triterpene scaffolds. This limited number of scaffolds is the first step towards creating the vast structural diversity of triterpenoids, followed by extensive diversification, in particular by oxygenation and glycosylation. Gene duplication, divergence, and selection are major forces that drive triterpenoid structural diversification. The triterpenoid biosynthetic genes can be organized in non-homologous gene clusters, such as in Avena spp., Cucurbitaceae and Solanum spp., or scattered along plant chromosomes as in Barbarea vulgaris. Paralogous genes organized as tandem repeats reflect the extended gene duplication activities in the evolutionary history of the triterpenoid saponin pathways, as seen in B. vulgaris. We review and discuss examples of convergent and divergent evolution in triterpenoid biosynthesis, and the apparent mechanisms occurring in plants that drive their increasing structural diversity within and across species. Using B. vulgaris' saponins as examples, we discuss the impact a single structural modification can have on the structure of a triterpenoid and how this affects its biological properties. These examples provide insight into how plants continuously evolve their specialized metabolome, opening the way to study uncharacterized triterpenoid biosynthetic pathways.

Keywords: triterpenoid saponins, structural diversity, convergent evolution, plant specialized metabolism, unlinked versus clustered pathways

#### Edited by:

Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan

#### Reviewed by:

Tetsuo Kushiro, Meiji University, Japan; Tessa Moses, University of Edinburgh, United Kingdom

> \*Correspondence: Søren Bak bak@plen.ku.dk

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 19 July 2019 Accepted: 01 November 2019 Published: 17 December 2019

#### Citation:

Cárdenas PD, Almeida A and Bak S (2019) Evolution of Structural Diversity of Triterpenoids. Front. Plant Sci. 10:1523. doi: 10.3389/fpls.2019.01523

# INTRODUCTION

Plants have evolved the strategy of constantly diversifying the chemical structures they produce, giving rise to an astonishing number of so-called plant specialized metabolites, the majority of which are thought to be involved in plant defense. The general conception has been that divergent evolutionary processes are the major driving force. However, biochemical and molecular knowledge of how these compounds have evolved has demonstrated that convergent evolution is surprisingly common (Pichersky and Lewinsohn, 2011).

Triterpenoids are one of the largest groups of plant specialized metabolites, with over 14,000 structures known (Hamberger and Bak, 2013; http://dnp.chemnetbase.com). In the current model of triterpenoid biosynthesis, the isoprenoid precursor 2,3-oxidosqualene is cyclized by the signature enzymes oxidosqualene cyclases (OSCs) to a number of cyclical triterpene scaffolds. Subsequently, these structures are oxygenated by various cytochromes P450 (P450s) and finally glycosylated by UDP-glycosyltransferases (UGTs) (**Figure 1**). Plant genomes contain about a dozen OSCs, about 250 P450s and about 110 UGTs, which are involved in multiple pathways. The sheer size of these gene families makes it a challenge to determine, based on sequence alone, the specific genes coding for enzymes involved in generating the diversity of triterpenoids. The OSC step constitutes the key branching point leading to biosynthesis of either sterols and steroidal saponins or non-steroidal triterpenoids (e.g., triterpenoid saponins) (Thimmappa et al., 2014) (**Figure 1**). It is widely accepted that the cyclization of 2,3-oxidosqualene into sterols occurs *via* the protosteryl cation and proceeds in a "chair-boat-chair" conformation. On the other hand, triterpenoid biosynthesis, *via* the dammarenyl cation, involves the "chair-chair-chair" conformation of 2,3-oxidosqualene. In plants, more than 100 different triterpene scaffolds are derived from 2,3-oxidosqualene by OSCs, with lupeol and β-amyrin being among the most widespread and best studied regarding their biosynthesis and role in plants (Xu et al., 2004; Stephenson et al., 2019).
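A back-of-the-envelope calculation helps to show why sequence alone is a weak guide. The Python sketch below multiplies the approximate family sizes quoted above to count the candidate gene combinations for a pathway that, purely as an illustrative assumption, uses exactly one OSC, one P450, and one UGT; "about a dozen" is taken as 12, and the names `GENE_FAMILY_SIZES` and `candidate_combinations` are introduced here. Real pathways often involve several P450s and UGTs, so the true search space is larger.

```python
from math import prod

# Approximate gene family sizes per plant genome, as quoted in the text.
GENE_FAMILY_SIZES = {"OSC": 12, "P450": 250, "UGT": 110}


def candidate_combinations(family_sizes):
    """Count combinations that pick exactly one gene from each family."""
    return prod(family_sizes.values())


if __name__ == "__main__":
    print(candidate_combinations(GENE_FAMILY_SIZES))
```

With these figures, even the simplest three-enzyme pathway has 12 × 250 × 110 = 330,000 candidate combinations.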

The increasing availability of genome sequences of evolutionarily distant plant species and in-depth studies of triterpenoid pathways in recent years have revealed an emerging picture in which triterpenoid pathways evolve recurrently and thus represent an interesting case of both divergent and convergent evolution. Typically, it is the same gene families (OSCs, P450s, and UGTs) that are constantly recruited and afterwards expanded through gene duplications leading to tandem repeats. Here, we review the current knowledge about five triterpenoid scaffolds (and their derivatives) as examples of both divergent and convergent evolution in specialized metabolism and discuss the selection pressure on the genes, their organization, and how plants cope with the toxic triterpenoids they produce.

# CONVERGENT EVOLUTION IN BIOSYNTHESIS OF TRITERPENOID SCAFFOLDS

# Dammarenyl Cation-Derived Triterpenoids Have Arisen Multiple Times During Angiosperm Evolution

The majority of the known triterpenoids arise from the dammarenyl cation (Xiong et al., 2005). Resins from trees of the Dipterocarpaceae are known as dammar, hence the first triterpenoids isolated from dammar resins were coined dammaranediol I and II (Mills and Werner, 1955; Mills, 1956). Currently, the most commercially valuable representatives of this class of compounds are the diverse ginsenosides. They can be obtained from mature six-year-old rhizomes of *Panax ginseng*, and because of their anticancer activities some ginsenosides are used for chemotherapy treatment (Leung and Wong, 2010). Tansakul et al. (2006) were the first to clone a dammarenediol-II synthase (PgDDS) and showed that this OSC from *P. ginseng* catalyzes the first committed step of ginsenoside biosynthesis. The only other DDS characterized thereafter belongs to *Centella asiatica* (Kim et al., 2009). *P. ginseng* and *C. asiatica* belong to the Apiales, and phylogenetic analysis showed that both DDSs grouped in the same branch, suggesting that the DDSs in these species evolved from a common ancestor. As DDSs have not been elucidated from the phylogenetically distant Dipterocarpaceae, future work on this family will shed some light on the evolutionary history of DDSs and could indicate whether they arose from convergent or divergent evolution.

Lupeol and β-amyrin are prevalent pentacyclic triterpenoids derived from the dammarenyl cation and they are found in many different plant species (**Figure 1**). Nevertheless, phylogenetic analyses have shown that the genes producing these scaffolds group in different clades. Shibuya et al. (1999) first distinguished two clades of lupeol synthases in plants: one composed of specific lupeol synthases and another composed of multi-functional OSCs producing α-amyrin, β-amyrin, and lupeol (Thimmappa et al., 2014; Khakimov et al., 2015). Site-directed mutagenesis experiments have shown that a single amino acid replacement could convert a lupeol synthase into a β-amyrin synthase (and conversely), indicating the notable role that specific residues may have played in the evolution of OSC product specificity and the generation of triterpenoid diversity (Kushiro et al., 1999; Kushiro et al., 2000). Furthermore, phylogenetic analyses of both monocot and dicot OSCs by Xue et al. (2012) and Augustin et al. (2011) distinguished two distinct clades of β-amyrin synthases, in monocots and in dicots.

Lupeol and β-amyrin can be present in plants as unmodified compounds, typically found in resins or waxes (Szakiel et al., 2012), or they can serve as precursors for other specialized triterpenoid metabolites, usually involved in plant defense and development. Lupeol is involved in nodule formation in *Lotus japonicus* through regulation of *ENOD40* gene expression (Delis et al., 2011). Lupeol is also part of the cuticular wax surface of the castor bean plant (*Ricinus communis*), where it was suggested to have a physiological role in protection against dehydration (Guhling et al., 2006). Additionally, lupeol is the precursor for betulinic acid, a triterpenoid that accumulates in the bark of birch trees and has potent anticancer activity (Pisha et al., 1995; Kumar et al., 2018). β-amyrin is a precursor of glycyrrhizin, a triterpenoid saponin found in the roots and stolons of liquorice (*Glycyrrhiza glabra*) that is well known for its pharmaceutical properties and as a natural sweetener (Hayashi et al., 2002). β-amyrin is also a precursor for the antifungal saponin avenacin A-1 in oat and for the potent insect-feeding deterrent hederagenin glycosides in the Brassicaceae *Barbarea vulgaris* (Kuzina et al., 2009; Nielsen et al., 2010; Khakimov et al., 2016; Liu et al., 2019). Additionally, β-amyrin seems to play a role in root development in oat (Kemen et al., 2014) and in *Lotus japonicus* (Krokida et al., 2013), suggesting that triterpenoids like lupeol and β-amyrin are not exclusively involved in plant defense.

# α-Onocerin—A Seco-Triterpenoid (sensu lato) That Evolved Convergently

*Seco*-triterpenoids are characterized by the absence of a C-C bond that would normally form one of the consecutive rings in a triterpenoid scaffold. They are known to be distributed across the plant kingdom. The *seco*-triterpenoid (*sensu lato*) α-onocerin consists of two bicyclic systems connected by a two-carbon linkage; its occurrence appears to be limited to Lycopod species and some species within the *Ononis* genus in the Fabaceae. Lycopods and the Fabaceae originated in very distant evolutionary times (Garratt et al., 1984; Giraud and Lejal-Nicol, 1989), which implies that the α-onocerin trait evolved convergently in Lycopods and in the *Ononis* genus. The biological function of α-onocerin still remains unknown.

FIGURE 1 | Simplified representation of the biosynthesis of sterols and triterpenoids in plants. (A) OSC signature enzymes catalyze the cyclization of 2,3-oxidosqualene, and in more rare cases bis-oxidosqualene, into several triterpenoid scaffolds. These structures can be further modified by tailoring enzymes, including oxygenation by P450s, glycosylation by UGTs, acylation by ACT, and methylation by MT. Selected structures are depicted and discussed in more detail in the text. Dashed arrows represent multiple biosynthetic reactions whereas solid arrows represent a single step. (B) Biosynthesis of plant triterpenoids can be mediated by non-homologous clustered genes or through non-linked genes. In Avena spp., a cluster of five genes are involved in the biosynthesis of avenacin A-1. In Arabidopsis thaliana, two clusters have been reported: thalianol cluster with four genes (up) and marneral cluster with three genes (down). In Cucumis sativus, six genes associated with cucurbitacin biosynthesis are located in a cluster in chromosome 6, while four other genes are elsewhere in the genome. The core genes for biosynthesis of SGAs are clustered in chromosome 7 and 12 of S. tuberosum and S. lycopersicum, respectively. The key genes for biosynthesis of the insect-feeding deterrent hederagenin cellobioside are distributed along B. vulgaris genome in tandem repeats located at different pseudomolecules (PM). OSC, oxidosqualene cyclase; P450, cytochrome P450; UGT, UDP-glycosyltransferase; ACT, acyltransferase; MT, methyltransferase.

On a biochemical level, α-onocerin biosynthesis differs from that of other triterpenoids as it starts from 2,3;22,23-oxidosqualene (bis-oxidosqualene) instead of the typical triterpenoid precursor 2,3-oxidosqualene (**Figure 1**). In *Lycopodium clavatum*, biosynthesis of α-onocerin is carried out in two sequential steps by two paralogous OSCs (Araki et al., 2016). Pre-α-onocerin synthase (LcLCC) initiates cyclization from one of the epoxide bonds of bis-oxidosqualene. The cyclization terminates after formation of the A and B rings through the generation of pre-α-onocerin. Subsequently, α-onocerin synthase (LcLCD) carries out the cyclization of the D and E rings from the remaining epoxide ring to yield α-onocerin. Our group demonstrated that in *Ononis spinosa*, α-onocerin is biosynthesized by a single OSC (OsONS1) (Almeida et al., 2018). In *O. spinosa*, neofunctionalized squalene epoxidases (OsSQEs) provide the OSC with the necessary bis-oxidosqualene. Fluorescence imaging microscopy experiments demonstrated protein-protein interactions between OsONS1 and the neofunctionalized OsSQEs (Almeida et al., 2018).

Phylogenetic analysis revealed that OsONS1 branches off from lupeol synthases, and is phylogenetically distant from the two *L. clavatum* α-onocerin synthases which branch off directly from sterol biosynthesis; thus, α-onocerin biosynthesis evolved convergently in these two plant species (Almeida et al., 2018). In addition, molecular docking simulation showed that OsONS1 alone produces α-onocerin in two cyclization steps, as opposed to the specialized LcLCC and LcLCD in Lycopods.

# Protosteryl Cation-Derived Triterpenoids Are Found Across Angiosperm Phylogeny

Cucurbitacins are highly oxygenated tetracyclic triterpenes initially discovered in members of the Cucurbitaceae family and well known for their bitterness and toxicity (Metcalf et al., 1980). Cucurbitacins have been reported in 17 taxonomically distantly related families in eudicots and some monocots. They consist of hundreds of derivatives of 20 main cucurbitacin molecules named from cucurbitacin A to T (Chen et al., 2005). At least 100 species in 30 genera of the Cucurbitaceae have been shown to contain cucurbitacins (Raemisch and Turpin, 1984), with cucurbitacin B (**Figure 1A**) being present in 91% of the species (Chen et al., 2005).

Cucurbitacins are extremely toxic to mammals (David and Vallance, 1955) and function as feeding deterrents against several insects (Nielsen et al., 1977; Bigger and Chaney, 1998), but they have also been found to be feeding stimulants for leaf beetles belonging to the Luperini tribe of the Chrysomelidae (Metcalf, 1986). Cucurbitacin B can displace the insect steroidal hormone 20-hydroxyecdysone, thus affecting morphological changes in *Drosophila melanogaster* (Dinan et al., 1997).

The cucurbitacin biosynthesis pathway has mainly been studied in members of the Cucurbitaceae. The OSC cucurbitadienol synthase from *Cucurbita pepo* (CpCPQ) cyclizes 2,3-oxidosqualene to cucurbitadienol (**Figure 1A**) (Shibuya et al., 2004). The CpCPQ gene evolved after the divergence of dicots and monocots when the ancestral cycloartenol synthase gene duplicated, creating two clades of cycloartenol synthases (CASI and CASII) of which CpCPQ evolved from the *C. pepo* cycloartenol synthase (CpCPX) in the CASII clade (Xue et al., 2012).

Cucurbitacins are also present outside the Cucurbitaceae; for example, *Iberis amara*, which belongs to the Brassicaceae, contains cucurbitacins (Nielsen et al., 1977). The elucidation of the cucurbitacin biosynthetic pathway in *Iberis* will help to clarify the evolutionary history of cucurbitacin biosynthesis.

Steroidal glycoalkaloids (SGAs) are triterpene-derived compounds found in major dicot Solanaceae crops such as tomato and potato. SGAs contain a nitrogen atom incorporated into their steroidal scaffold and they provide the plant with a barrier against a broad range of herbivores and pathogens, but are bitter and considered anti-nutritional compounds for humans (Cárdenas et al., 2015). SGAs and non-nitrogenous steroidal saponins are also present in distantly related monocot Liliaceae plant species. In these plant species, they have received special attention due to their pharmacological properties. For instance, the Liliaceae *Veratrum californicum* produces the potent anticancer molecule cyclopamine (Augustin et al., 2015). Biochemical and phylogenetic analysis of the enzymes involved in the biosynthesis of *Solanum* and *Veratrum* SGAs suggested their convergent origins and their partial recruitment from primary phytosterol metabolism (Augustin et al., 2015; Sonawane et al., 2016).

# GENOME ORGANIZATION OF TRITERPENOID PATHWAYS

The biosynthetic genes for triterpenoids display different genome organizations across plant species (**Figure 1B**). While in some species they are arranged in clusters of non-homologous genes (e.g., avenacins in oat, Qi et al., 2004; cucurbitacins in cucumber, Shang et al., 2014), the biosynthesis of other triterpenoids is mediated by genes scattered along the plant genome, typically organized in tandem repeats (e.g., saponins in *Barbarea vulgaris*, Khakimov et al., 2015; Erthmann et al., 2018; Liu et al., 2019; and mogrosides in *Siraitia grosvenorii*, Itkin et al., 2016).

# Clustered Genes Mediating Biosynthesis of Avenacins, Cucurbitacins and Steroidal Glycoalkaloids

Organization of genes in operon-like clusters has typically been associated with fungal genomes and operons present in bacteria. Nevertheless, nearly 20 metabolic gene clusters for plant specialized metabolites have been reported in multiple plant species (Boycheva et al., 2014). The avenacin, thalianol, and marneral triterpenoid biosynthetic pathways were among the first reported to be organized in non-homologous gene clusters, in oat and *Arabidopsis thaliana*, respectively (**Figure 1B**, Qi et al., 2004; Field and Osbourn, 2008; Field et al., 2011). The organization of metabolic genes in clusters is thought to provide co-inheritance and co-regulation and to avoid accumulation of toxic intermediates, and may thus be beneficial for securing stable inheritance of functional chemical defense pathways in a dynamic ecological context of natural populations (Osbourn, 2010; Takos and Rook, 2012).

Triterpenoid saponins are found mainly in dicotyledonous species, and *Avena* spp. are the only known triterpenoid saponin-producing monocotyledons. Oat species accumulate avenacins, saponins that are produced in the roots and provide the plant with a potent barrier against soil-borne fungi (Papadopoulou et al., 2002) (**Figure 1**). SAD1, a β-amyrin synthase, catalyzes the first committed step in avenacin biosynthesis (Qi et al., 2004), and is present in the oat genome in an operon-like cluster together with genes coding for the tailoring enzymes required for avenacin biosynthesis, including SAD2, the P450 CYP51H10 that oxidizes β-amyrin, and three genes that act together in the acylation steps of avenacin biosynthesis: SAD9, an N-methyltransferase; SAD10, the glycosyltransferase UGT74H5; and SAD7, a serine carboxypeptidase-like acyltransferase (Mugford et al., 2013). CYP51H10 (SAD2) evolved from the CYP51 family, whose members catalyze the conserved 14α-demethylation reaction in eukaryotic sterol metabolism, building on a predisposition already present in that sterol-metabolizing enzyme (Geisler et al., 2013).

In Cucurbitaceae, part of the cucurbitacin pathway is similarly clustered. In cucumber, the cucurbitadienol synthase gene is flanked by four P450 genes and an ACT gene, of which two P450s and the ACT have been functionally characterized (Shang et al., 2014). The P450s in this cluster belong to different subfamilies, indicating that the cluster was formed not by tandem duplications but by genome reorganization. Later, Zhou et al. (2016) uncovered the same conserved syntenic loci in melon and watermelon by comparative analyses of their genomes. While the core cluster is overall syntenically conserved, CYP88A60, which catalyzes the hydroxylation at position C19, and CYP87D20, which catalyzes hydroxylations at C11 and C20, both lie outside the non-homologous gene cluster.

In Solanaceae, the core genes required for the biosynthesis of steroidal glycoalkaloids are part of a metabolic operon-like gene cluster (**Figure 1B**). Itkin et al. (2013) showed that in tomato, six SGA genes are arranged in a cluster on chromosome 7 (UGT73L5: GAME1; UGT93: GAME17; UGT94: GAME18; UGT73L4: GAME2; Dioxygenase: GAME11; and CYP72A188: GAME6), whereas two other genes are located next to each other on chromosome 12 (CYP88B1: GAME4; and Transaminase: GAME12). Similarly, in potato, four SGA-related genes are located on chromosome 7 (UGT73: SGT3; Dioxygenase: GAME11; CYP72A188: GAME6; and UGT73: SGT1) and two on chromosome 12 (Transaminase: GAME12; and CYP88B1: GAME4).

# Tandem Repeats of Triterpenoid-Biosynthetic Genes and Triterpenoid Diversity in B. vulgaris, Medicago truncatula and Glycyrrhiza uralensis

Triterpenoid saponin biosynthesis has evolved recurrently, and thus the organization of the genes may not be conserved. In recent years, the *Barbarea* genus has emerged as a unique plant model, as it is the only genus in the agronomically important cabbage family (Brassicaceae) known to accumulate saponins (Khakimov et al., 2016). Some of these saponins (e.g., hederagenin cellobioside) are highly deterrent to *Brassica* specialist herbivores such as flea beetles (*Phyllotreta nemorum*) and the diamondback moth (*Plutella xylostella*) (Kuzina et al., 2009; Kuzina et al., 2011; Liu et al., 2019).

Genome (Byrne et al., 2017) and genetic (Kuzina et al., 2009; Kuzina et al., 2011) analyses showed that the genes mediating the biosynthesis of the deterrent triterpenoid saponins in *B. vulgaris* are not linked, but are present in tandem repeats spread along the genome (**Figure 1B**). The OSCs, P450s, and UGTs in the pathway are generally characterized as rather promiscuous enzymes with respect to both substrates and products, which may explain how more than 49 different saponin structures can be generated in *B. vulgaris* with a limited number of genes (Khakimov et al., 2016). QTLs for flea beetle resistance and accumulation of saponins co-localize in *B. vulgaris* (Kuzina et al., 2009; Kuzina et al., 2011; Khakimov et al., 2015). Two unlinked QTLs, containing two OSCs (i.e., LUP2 and LUP5) and a tandem repeat of eight P450s (CYP72As), respectively, were identified. *In vitro* and *in planta* assays showed that LUP5 was preferentially expressed in the insect-deterrent *B. vulgaris* G-type. LUP5 leads to predominant accumulation of β-amyrin, the precursor for the deterrent hederagenin cellobioside. Conversely, LUP2 was preferentially expressed in the insect-susceptible P-type plants, and produces mainly lupeol, the precursor for lupeol-derived saponins that appear not to be deterrent. Of the eight CYP72As, only CYP72A552 oxidizes oleanolic acid at the C23 position, leading to the formation of insect-deterrent hederagenin glycosides (Liu et al., 2019). CYP716A80 and CYP716A81 catalyze C28 carboxylation and two additional hydroxylations (Khakimov et al., 2015), while a tandem repeat of at least five UGT73Cs (Augustin et al., 2012; Erthmann et al., 2018) is also involved in the pathway. Neither the CYP716As nor the UGT73Cs are linked to QTLs for insect resistance or saponin accumulation. All members of the UGT73C tandem repeat accepted both hederagenin and oleanolic acid as substrates but generated different glycosylated products, creating a spectrum of mono- and bisdesmosidic saponins (Erthmann et al., 2018). Whether the unlinked organization of the *B. vulgaris* saponin genes helps to create multiple end products from a limited number of genes is an open question, but interestingly, the genes of other pathways that produce a wide range of end products (e.g., anthocyanins and glucosinolates) are also known to be unlinked.

As in *B. vulgaris*, the genome of *Medicago truncatula* and the draft genome of *Glycyrrhiza uralensis* did not reveal operon-like gene clusters of OSCs, P450s, and UGTs involved in either soyasaponin or glycyrrhizin biosynthesis; rather, the gene candidates are organized as tandem repeats of each of these gene families (Naoumkina et al., 2010; Mochida et al., 2017). While most of the genes in these tandem repeats remain functionally uncharacterized, UGT73K1 was shown to glycosylate soyasapogenol B and E at the C3 position (Achnine et al., 2005).

# WHAT DETERMINES TOXICITY VERSUS AUTOTOXICITY OF PLANT TRITERPENOID SAPONINS?

Saponins are generally assumed to target and disrupt vital functions common among organisms, such as the integrity of the cell membrane. In this regard, a fundamental paradox in plant evolution is that many defense metabolites may be harmful to the plants themselves. Therefore, plants have had to evolve strategies to both biosynthesize and accumulate defense molecules without causing autotoxicity. Strategies used by plants to avoid autotoxicity include the compartmentalization of defense compounds in specialized structures and chemical modifications that decrease toxicity (e.g., glycosylation). For instance, avenacin A-1 is normally localized in the vacuoles of root epidermal cells. Mutants for the *sad3* glucosyltransferase accumulate partially glycosylated avenacin A-1 and show root epidermis defects and an altered cellular distribution of saponins (Mylona et al., 2008), illustrating the detrimental effects of pathway intermediates.

Even though saponins are among the largest classes of natural products, their mode of action is not well understood; this may partly be attributed to the often complex mixtures of saponins in plants. Saponins have been shown to cause increased insect mortality, lower food intake, weight reduction, and developmental problems, among other effects (Agerbirk et al., 2003; Christensen et al., 2019), and due to their amphiphilic nature, their most studied effect is their membrane-permeabilizing activity. Early studies using artificial lipid bilayers showed that avenacin A-1 induces permeabilization in a cholesterol-dependent manner and requires the presence of an intact glycoside moiety at the C3 position of the triterpenoid scaffold (Armah et al., 1999). The dependence of the permeabilizing activity on the presence of cholesterol in the membrane and on the sugar moiety was also shown for the steroidal saponin digitonin and for glycoalkaloids. For example, α-chaconine had strong lytic activity toward cholesterol-containing lipid vesicles compared to α-solanine, which has the same aglycone structure but different types of glycosides (Nishikawa et al., 1984; Keukens et al., 1995; Keukens et al., 1996).

Consequently, a prominent role in the avoidance of autotoxicity can be attributed to modifications in cell membrane composition and to further modifications of the saponin structure. Saponins are also present in marine organisms such as sea stars and sea cucumbers (Yendo et al., 2014). Interestingly, recent studies on how sea cucumbers tolerate the toxic saponins they produce suggest that sea cucumbers have replaced cholesterol with other sterols (e.g., Δ7 and Δ9 sterols) in their membranes, presumably in order to modulate the lytic action of their own saponins (Popov, 2003). A detailed analysis of the sterols present in sea cucumbers showed that although cholesterol and Δ7 sterols have essentially the same chemical formula and molecular weight (**Figure 2**), the double bond present in Δ7 sterols has a dramatic effect on their 3D structure, possibly affecting their molecular interaction with the sea cucumbers' own saponins (Claereboudt et al., 2018). A similar effect on the 3D structure has been speculated to play a role in the biological activity of saponins from *B. vulgaris* (**Figure 2**). The hydroxylation of oleanolic acid at position C23, catalyzed by CYP72A552 and leading to the formation of the highly deterrent hederagenin, causes a rotation of the monoglucoside at C3 of *c.* 90° relative to the plane of the aglycone. When a carboxyl group is introduced instead, as in gypsogenic acid, the glucose is situated in about the same plane as in oleanolic acid. Gypsogenic and oleanolic acid glycosides are much less toxic to larvae of both *Manduca sexta* and *Plutella xylostella* than hederagenin monoglucoside (Liu et al., 2019). Accordingly, the rotation of the sugar moiety in hederagenin glycosides may change their physicochemical interactions with membrane sterols, which could in turn alter their biological activity.

# CONCLUDING REMARKS

Increasing characterization of biosynthetic pathways for plant specialized metabolites is revealing that convergent evolution is surprisingly common (Pichersky and Lewinsohn, 2011). Nevertheless, we lack a biological understanding of how identical classes of specialized metabolites evolved recurrently across lineages and even across Kingdoms of Life. Biochemical analysis of the enzymes combined with molecular phylogenetic analysis has until now given detailed insights into how the enzymes for these compounds might have evolved. Further studies are needed to elucidate which biological pressures drive the recurrent emergence of the same or similar specialized metabolites across lineages of plants.

Genome analyses have revealed that some triterpenoid pathways are unlinked while others are organized in operon-like clusters. The driving forces behind the genome organization of triterpenoid pathways are currently not understood and are difficult to address experimentally. However, as more plant genomes become readily available, this key question in the evolution of triterpenoid diversity might become addressable.

# AUTHOR CONTRIBUTIONS

All the authors conceived and wrote the manuscript.

# REFERENCES


# FUNDING

PC work was supported by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 752437. SB and AA were supported by grants from the Independent Research Fund Denmark grant No. 7017-00275B and Novo Nordisk Foundation grant No. NNF17OC0027646.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Cárdenas, Almeida and Bak. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# The Evolution of Flavonoid Biosynthesis: A Bryophyte Perspective

Kevin M. Davies<sup>1</sup>\*, Rubina Jibran<sup>1</sup>, Yanfei Zhou<sup>1</sup>, Nick W. Albert<sup>1</sup>, David A. Brummell<sup>1</sup>, Brian R. Jordan<sup>2</sup>, John L. Bowman<sup>3</sup> and Kathy E. Schwinn<sup>1</sup>

<sup>1</sup> The New Zealand Institute for Plant and Food Research Limited, Palmerston North, New Zealand, <sup>2</sup> Faculty of Agriculture and Life Sciences, Lincoln University, Christchurch, New Zealand, <sup>3</sup> School of Biological Sciences, Monash University, Melbourne, VIC, Australia

The flavonoid pathway is one of the best characterized specialized metabolite pathways of plants. In angiosperms, the flavonoids have varied roles in assisting with tolerance to abiotic stress and are also key for signaling to pollinators and seed dispersal agents. The pathway is thought to be specific to land plants and to have arisen during the period of land colonization around 550–470 million years ago. In this review we consider current knowledge of the flavonoid pathway in the bryophytes, consisting of the liverworts, hornworts, and mosses. The pathway is less characterized for bryophytes than angiosperms, and the first genetic and molecular studies on bryophytes are finding both commonalities and significant differences in flavonoid biosynthesis and pathway regulation between angiosperms and bryophytes. This includes biosynthetic pathway branches specific to each plant group and the apparent complete absence of flavonoids from the hornworts.

### Edited by:

Kazuki Saito, RIKEN Center for Sustainable Resource Science (CSRS), Japan

#### Reviewed by:

Stefan Martens, Fondazione Edmund Mach, Italy

Brenda S.J. Winkel, Virginia Tech, United States

#### \*Correspondence:

Kevin M. Davies kevin.davies@plantandfood.co.nz

#### Specialty section:

This article was submitted to Plant Metabolism and Chemodiversity, a section of the journal Frontiers in Plant Science

Received: 29 October 2019 Accepted: 07 January 2020 Published: 04 February 2020

#### Citation:

Davies KM, Jibran R, Zhou Y, Albert NW, Brummell DA, Jordan BR, Bowman JL and Schwinn KE (2020) The Evolution of Flavonoid Biosynthesis: A Bryophyte Perspective. Front. Plant Sci. 11:7. doi: 10.3389/fpls.2020.00007

Keywords: anthocyanin, auronidin, dirigent, polyphenol oxidase, transcription factor

# OPENING COMMENTS

The flavonoid pathway is a core component enabling land plants to interact with their environment. Flavonoids have been demonstrated to assist in tolerance to both abiotic and biotic stresses, and their production can be induced by cold, UV-B light (UVB) or strong white light, nutrient deprivation, desiccation, salinity, metal toxicity, and pest and pathogen attack (Agati and Tattini, 2010; Cheynier et al., 2013; Landi et al., 2015; Davies et al., 2018). In angiosperms, flavonoids are also key for signaling to pollinators and seed dispersal agents. Flavonoids are formed within the larger phenylpropanoid pathway, starting with chalcones as the first flavonoids. During plant evolution, flavonoid pathway diversity has increased greatly, with more than 8,000 different structures reported from the relatively small number of plant species studied to date (Andersen and Markham, 2006). This multiplicity of structures and functions is thought to have assisted land plants to colonize the wide range of environments they now occupy. The over 8,000 compounds are grouped into relatively few classes of flavonoids, based on the core structure and/or biosynthetic origin. The major flavonoid classes are the flavones, flavonols, isoflavonoids, aurones, 3-deoxyanthocyanins, anthocyanins, proanthocyanidins (condensed tannins), and the recently reported "auronidins" (Berland et al., 2019) (Figures 1 and 2). However, there are also notable groups of related non-flavonoid compounds produced with the same starting precursors as used for chalcone formation, such as the stilbenes and bibenzyls. Most flavonoids are targeted to the vacuole as water-soluble glycosides, although some are transported to the cell wall or are released from the plant to the environment. Many can absorb light in the UV-spectrum, while anthocyanins and auronidins provide colored pigments that can screen in the visible part of the light spectrum (Lee and Gould, 2002; Landi et al., 2015; Berland et al., 2019).

Flavonoid biosynthesis is frequently considered unique to land plants. Some charophycean algae do tolerate terrestrial conditions (Karsten and Holzinger, 2014; Holzinger and Pichrtová, 2016) and can produce polyphenolics in response to abiotic stresses such as UVB, salinity, or dehydration. Indeed, one hypothesis is that land plants arose from algal ancestors that were already terrestrial (Harholt et al., 2016) and therefore may have had some of the biosynthetic pathways characteristic of land plants. However, there are as yet no convincing reports of the flavonoid pathway branch existing outside of land plants, with the possible exception of some fungi. A common adaptation for UVB tolerance in extant algae is the production of mycosporine-like amino acids (MAA). However, while MAA production is frequently reported for red algae (Rhodophyta) and other marine organisms, there are few reports for the Chlorophyta (a green algal clade that together with Streptophyta forms the Viridiplantae) and none for the charophyte algae that are thought to be most closely related to the land plant ancestor. Although genes for the initial steps of the phenylpropanoid pathway may be present in the genomes of extant Rhodophyta, Glaucophyta, Chlorophyta, and charophytes (Labeeuw et al., 2015; de Vries et al., 2017; Davies et al., 2020), there are no substantiated examples of flavonoid-specific genes being identified. Indeed, many metabolite studies on algae report only amounts of "total flavonoids" measured using general assays. Nevertheless, there are some reports detailing specific flavonoid structures in algal preparations, including compounds such as chalcones, flavones, flavonols, isoflavonoids, and proanthocyanidins (Klejdus et al., 2010; Goiris et al., 2014; El Shoubaky et al., 2016; Agregán et al., 2017; Ben Saad et al., 2017). Although the concentrations of flavonoid compounds reported in most examples are extremely low compared with those commonly found in land plants (ng gDW<sup>−1</sup> amounts compared with mg gDW<sup>−1</sup>), these reports mean that the presence of a biosynthetic pathway to flavonoids in algae cannot be ruled out.

The phenolics most commonly produced by algae are the phlorotannins, diverse oligomers derived from phloroglucinol that are found in brown algae (Heterokonts) (Imbs and Zvyagintseva, 2018). In Ectocarpus siliculosus, the phloroglucinol precursor is formed by the condensation of malonyl-CoA by a polyketide synthase (PKS) (Meslet-Cladière et al., 2013). This is analogous to the action of CHALCONE SYNTHASE (CHS) in the condensation of malonyl-CoA with a p-coumaroyl-CoA starter molecule for flavonoid biosynthesis. However, given the very large phylogenetic distance between brown algae and land plants, this may be an example of parallel evolution. Notably, in fungal species there have been recent reports of both the production of flavonoids and the presence of genes with significant sequence similarity to those of the phenylpropanoid pathway of land plants. Findings include the detection of a range of phenylpropanoids, including flavonoids, in Fusarium (Bilska et al., 2018) and detection of flavonoids and candidate genes for stilbene production in Alternaria (Lu et al., 2019). The significance of phenylpropanoid biosynthesis being present in fungi for the current proposals on the evolutionary origins of flavonoid biosynthesis has yet to be addressed. The separation of the fungi and the algae/plant ancestors is thought to be an ancient event, preceding the divergence of fungi and animals (Burki, 2014; Burki et al., 2020).

# Origins and Vegetative Functions of the Flavonoid Pathway

Regardless of whether the ancestral genes for flavonoid biosynthesis were present in algal ancestors, the flavonoid pathway we see in extant land plants is hypothesized to have arisen when the land plant ancestors were first colonizing the land about 550–470 million years ago (MYA) (Markham, 1988; Stafford, 1991; Jorgensen, 1994; Koes et al., 1994; Kenrick and Crane, 1997; Rozema et al., 2002). Two major hypotheses have been proposed for the initial role of flavonoids. Firstly, that flavonoids may have helped in coping with the additional abiotic stresses resulting from a terrestrial lifestyle, in particular increased exposure to UVB, but potentially also drought and extreme temperature fluctuations (Markham, 1988; Jorgensen, 1994; Kenrick and Crane, 1997; Cockell and Knowland, 1999; Rozema et al., 2002; Ligrone et al., 2012; Mouradov and Spangenberg, 2014; Demarsy et al., 2017; Davies et al., 2018; de Vries and Archibald, 2018; Rensing, 2018). The alternative proposal is that flavonoids arose as physiological regulators or chemical messengers. This was outlined in Stafford (1991) for the regulation of auxin action, with signaling to mycorrhizal and symbiotic fungi proposed as possible additional communication functions. It was argued that flavonoids would probably have been present at only low concentrations when the pathway first evolved, limiting their efficacy as UVB-screening compounds. This was in the context of other arguments against the need for flavonoids as UVB-screens, such as the effective UVB-screening properties of non-flavonoid phenylpropanoids like the hydroxycinnamic acids (HCAs). 
More recently, arguments for the early functions of flavonoids being other than UVB-screening have been extended by discoveries on their antioxidant properties and possible signaling actions through the redox pathway or by affecting H2O2 retrograde signals between the chloroplast and nucleus (Taylor and Grotewold, 2005; Agati and Tattini, 2010; Pollastri and Tattini, 2011; Agati et al., 2012; Brunetti et al., 2018; Foyer, 2018; Muhlemann et al., 2018; Brunetti et al., 2019).

That flavonoids affect auxin transport has now been demonstrated in several angiosperm species, including Arabidopsis, apple, and tomato, by analysis of mutants or transgenic lines with reduced flavonoid biosynthesis (Brown et al., 2001; Buer and Muday, 2004; Taylor and Grotewold, 2005; Peer and Murphy, 2007; Dare and Hellens, 2013; Maloney et al., 2014). Altered developmental traits in such plants include dwarfing, loss of pollen fertility and altered root development and gravitropic responses (Van der Meer et al., 1992; Napoli et al., 1999; Brown et al., 2001; Dare and Hellens, 2013; Maloney et al., 2014; Muhlemann et al., 2018). The phenotypes observed vary between species, for example the complete loss of flavonoids in the Arabidopsis chs mutant affects root patterning but not pollen viability (Burbulis et al., 1996; Ylstra et al., 1996). To date, the great majority of data are for angiosperms, and studies on other plant groups are required to determine whether these flavonoid functions are shared across land plants and so may have a common evolutionary origin in the early land plant ancestor. Differential distribution of flavonols and auxin has been observed accompanying stem reorientation in a gymnosperm (Ramos et al., 2016), supporting a conserved function in auxin transport within seed plants. If the genetic mechanisms involved are also conserved, then that would support an evolutionary origin before 350–300 MYA. However, while definitive experiments on flavonoids and hormone function have not been conducted in bryophytes, indications are that flavonoids are not necessary for normal development in this plant group. No flavonoids have been detected to date in hornworts, and a genetic mutant of the liverwort Marchantia polymorpha lacking flavonoids has normal developmental patterns (Clayton et al., 2018). 
Addition of phenylpropanoids or phenylpropanoid pathway inhibitors can alter bryophyte development in culture, but whether this is because of altered hormone action has not been tested (Chattopadhyay et al., 2018). Even with additional data it may be difficult to decide between two options: that developmental roles for flavonoids were acquired in seed plants after the last common ancestor, or that they were present in that ancestor but then lost during subsequent evolution of the bryophytes. Nevertheless, establishing whether flavonoids regulate auxin action in bryophytes is an important goal.

Fossils, such as those found in the Rhynie Chert in Scotland (about 410 MYA) (Ligrone et al., 2012), provide detail of the structure of early land plants but little information on their specialized/secondary metabolism. The presence of specialized biosynthetic or storage structures in fossils, such as possible equivalents to the terpenoid-accumulating oil bodies of extant liverworts (Labandeira and Currano, 2013), can support the presence of specialized metabolite pathways but cannot provide details of the specific compounds produced. To generate hypotheses on the origins and subsequent evolution of the flavonoid pathway we need to compare the genetics and biochemistry of the pathway across diverse extant plant groups, as this can identify conserved pathway components that may have originated with the last common ancestor. In this respect, bryophytes are of key importance (Figure 3). "Bryophytes" is the collective name for non-vascular land plants, comprising the liverworts (Marchantiophyta, approximately 9,000 species), hornworts (Anthocerotophyta, approximately 300 species) and mosses (Bryophyta, approximately 12,000 species). Evidence such as morphological comparison with the fossil record has placed liverworts as the "sister" group to all other extant land plants (that is, at the base of the land plant evolutionary tree), making bryophytes paraphyletic. However, DNA sequencing data have suggested alternatives, in particular, either paraphyletic bryophytes with hornworts as a sister group to all other land plants, or a single monophyletic bryophyte clade that is sister to the vascular plants (Wickett et al., 2014; Puttick et al., 2018). It is generally accepted that land plants evolved from an ancestral charophyte, and the extant algal sister group is probably the order Zygnematales or a clade of the Zygnematales and Coleochaetales together (Zhong et al., 2014; Delwiche and Cooper, 2015; de Vries and Archibald, 2018).

# Flavonoids and Tolerance to Ultraviolet B Light

The hypothesis that the flavonoid pathway originated to provide tolerance to UVB has been supported by recent studies on the liverwort species Marchantia. ("Marchantia" is used in this article to refer to M. polymorpha subsp. ruderalis, which is the model experimental species.) Marchantia is an excellent research model: it is small; has a rapid growth rate; can reproduce asexually in large numbers through single-cell-derived clonal gemmae; and has a small genome (approximately 220 Mb) which, although larger than that of Arabidopsis (at 135 Mb), contains significantly fewer genes (around 19,000 gene models compared with around 28,000 protein-coding genes in Arabidopsis) (Ishizaki et al., 2015; Bowman et al., 2016; Shimamura, 2016; Bowman et al., 2017). It also offers efficient CRISPR/Cas9 mutagenesis in a dominant haploid gametophytic generation (Sugano et al., 2018).
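The genome comparison above can be restated as gene density. The short calculation below uses only the approximate figures quoted in the text (220 Mb and about 19,000 gene models for Marchantia; 135 Mb and about 28,000 protein-coding genes for Arabidopsis); the variable and function names are illustrative, and the results are rough estimates rather than annotated values.

```python
# Approximate genome sizes (Mb) and gene counts as quoted in the text.
genomes = {
    "Marchantia polymorpha": {"size_mb": 220, "genes": 19_000},
    "Arabidopsis thaliana": {"size_mb": 135, "genes": 28_000},
}


def genes_per_mb(size_mb, genes):
    """Average number of genes per megabase of genome."""
    return genes / size_mb


density = {
    species: genes_per_mb(info["size_mb"], info["genes"])
    for species, info in genomes.items()
}

for species, value in density.items():
    print(f"{species}: about {value:.0f} genes per Mb")

# Ratio of Arabidopsis to Marchantia gene density (roughly 2.4-fold).
ratio = density["Arabidopsis thaliana"] / density["Marchantia polymorpha"]
print(f"Arabidopsis/Marchantia density ratio: {ratio:.1f}")
```

On these figures Marchantia carries roughly 86 genes per Mb against roughly 207 for Arabidopsis, which is consistent with the statement that the larger Marchantia genome contains fewer genes.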

Significantly, the UVB response of Marchantia has many components in common with that of Arabidopsis. Flavonol O-glycosides are key for UVB tolerance of Arabidopsis (Kusano et al., 2011; Morales et al., 2013; Yin and Ulm, 2017), while the related flavone O-glycosides contribute to Marchantia UVB tolerance (Clayton et al., 2018). In both Arabidopsis and Marchantia, mutants with reduced flavonoid production are more easily damaged by UVB, while mutants or transgenics with increased flavonoid content have increased UVB tolerance. Moreover, the signaling pathway for flavonoid pathway induction through the UV RESISTANCE LOCUS8 (UVR8) photoreceptor, the bZIP transcription factor (TF) ELONGATED HYPOCOTYL5 (HY5), and the modifiers of protein stability such as CONSTITUTIVELY PHOTOMORPHOGENIC1 (COP1) and REPRESSOR OF UVB PHOTOMORPHOGENESIS1 (RUP1) is also conserved between the species (Clayton et al., 2018; Kondou et al., 2019). Equivalent functional studies are lacking on other major basal plant groups, such as mosses and lycophytes. However, phenolics do seem to be important for UVB tolerance in mosses (Clarke and Robinson, 2008; Wolf et al., 2010; Waterman et al., 2017; Soriano et al., 2018b); and UVB exposure of the Antarctic moss Pohlia nutans increased transcript abundance for genes of the UVR8 and flavonoid pathways (Li et al., 2019). This suggests that the core UVB protection mechanism of UVR8-induced flavonoid production may already have been established in the last common ancestor of bryophytes and angiosperms. While some of the same genetic components, such as UVR8 and HY5-like genes, have been identified in distantly related groups within the Viridiplantae (Allorent et al., 2016; Bowman et al., 2017; Clayton et al., 2018; Soriano et al., 2018a; Kondou et al., 2019), and algae can produce purple phenolic pigments in response to abiotic stress (Aigner et al., 2013; Holzinger and Pichrtová, 2016), UVB-induced flavonoid production has not been characterized in algae. Thus, the UVR8-induction pathway for UVB-absorbing flavonoids (potentially flavone glycosides as the first compounds) may have been a character rapidly acquired during the water-to-land transition.

There are variations among land plants to the UVR8/flavonoid system for providing tolerance to UVB exposure, and further research would be beneficial to establish whether these may also have an early evolutionary origin. In Arabidopsis, the HCA compounds sinapate esters have an important role in UVB-screening. Comparison of Arabidopsis mutants at different biosynthetic steps found that sinapate esters have a protective role comparable to, or perhaps more important than, that of flavonols (Li et al., 1993; Landry et al., 1995). Furthermore, a genetic screen for UVB-tolerance genes identified the transcriptional repressor AtMYB4 as being downregulated in response to UVB to facilitate increased sinapate ester production (Jin et al., 2000). Their principal absorption maxima, from 290 to 330 nm, make HCAs particularly effective UVB-screening compounds, and they would be a more carbon-efficient screen than the flavonoids. In addition to flavonol glycosides, mosses have been shown to produce biflavonoids or cell wall-bound phenolics in response to UVB exposure (Clarke and Robinson, 2008; Wolf et al., 2010; Waterman et al., 2017; Soriano et al., 2018b). Biflavonoid induction is a notable feature of the moss Ceratodon purpureus, a species found from Antarctica to hot desert environments (Waterman et al., 2017). The production of cell wall-localized phenolics as part of the UVB screening capacity may be a common feature of bryophytes, including the flavonoid-lacking hornworts (Monforte et al., 2018; Soriano et al., 2018c).

# Pigmented Flavonoids and Tolerance to Abiotic Stress

The other major group of flavonoids shown to be involved in tolerance to abiotic stress is the 3-hydroxyanthocyanins (typically cyanidin-derivatives) that are found in gymnosperms and almost all angiosperms (Lee and Gould, 2002; Landi et al., 2015). Under the physiological conditions commonly found in vegetative tissues, these provide red pigmentation. The structurally similar 3-deoxyanthocyanins have been extensively characterized in ferns, and also reported for lycophytes and mosses (Andersen and Markham, 2006). Although the color of 3-deoxyanthocyanins is shifted toward orange compared to the equivalent 3-hydroxyanthocyanins, in vegetative tissues similar red pigmentation typically occurs from both compound types. Additionally, cell wall-bound red flavonoids have been reported from liverworts and mosses: riccionidin (an auronidin) and sphagnorubin, respectively (Vowinkel, 1975; Kunz et al., 1993; Berland et al., 2019). Until recently, riccionidin and sphagnorubin were considered anthocyanidins (the nonglycosylated anthocyanin core molecule) with additional rings. Thus, an evolutionary path could be envisioned of cell wall-bound anthocyanidin being the basal state that could have been present in the last common ancestor, then a progression to 3-deoxyanthocyanins, and then 3-hydroxyanthocyanins. However, now that we know that riccionidins represent a separate flavonoid class unrelated to anthocyanins (see later sections for biosynthetic details), it is less clear what class of red flavonoid pigment (if any) was present in the early land plants. Auronidin and anthocyanin biosynthesis may be derived characters in each lineage, or the missing pathway in each lineage may have been lost during evolution. Furthermore, no proanthocyanidins have been reported from bryophytes or lycophytes, suggesting this branch of flavonoid biosynthesis probably arose later in vascular plant evolution, since proanthocyanidins are found in extant ferns, gymnosperms, and angiosperms.

With regard to function, given the lack of flowers or enclosed seeds in the early land plants, any initial functions of pigmented flavonoids were probably unrelated to animal interactions. In vascular plants anthocyanins have been linked to improving tolerance to a range of abiotic stresses, as well as some biotic challenges. However, how anthocyanins achieve this in the different stress situations, and whether there is a single mechanism or specific functional variations, is the subject of much research and debate. Anthocyanins, auronidins and sphagnorubins can screen out white light to reduce photooxidative damage, and it has been estimated that anthocyanins could absorb over 40% of photosynthetically active radiation in the range containing the most damaging wavelengths for photoinhibition (Merzlyak and Chivkunova, 2000; Pietrini et al., 2002). However, anthocyanins can simultaneously reduce the cellular stress associated with photooxidation through quenching reactive oxygen species (ROS). The relative importance of these two mechanisms is unresolved, even for extensively studied situations such as the appearance of red anthocyanins during autumn senescence of leaves of deciduous trees. Localization of the pigmented flavonoids in the cell wall and vacuole could be thought to argue for a screening mechanism, as ROS are principally generated in plastids and mitochondria. However, it has been suggested that this does not rule out an antioxidant primary function (Agati et al., 2012). Many of the same arguments for or against light screening versus antioxidant primary functions can also be applied to the function of flavones/flavonols in UVB tolerance (Agati and Tattini, 2010; Agati et al., 2012; Davies et al., 2018). 
Further complicating the development of a unifying theory for anthocyanin abiotic stress function are alternative hypotheses that involve neither light screening nor ROS scavenging, such as drought tolerance through decreased osmotic potential, increasing light absorption to help warm leaves, providing camouflage against insect herbivores, "honest" signaling to herbivores that leaves contain antifeedant compounds and/or are about to be shed, and making leaves more noticeable to insect predators (anti-crypsis) (Gould et al., 1995; Lee and Gould, 2002; Manetas, 2006; Archetti, 2009; Hughes, 2011; Agati et al., 2012; Landi et al., 2015; Davies et al., 2018). Additionally, as mentioned earlier, there is also evidence supporting flavonoid roles as signaling molecules (Taylor and Grotewold, 2005; Agati and Tattini, 2010; Agati et al., 2012; Foyer, 2018).

The cell wall-bound nature of auronidins and sphagnorubins complicates the theory of an antioxidant or signaling role for red flavonoids. The cell wall-localization of the red pigments of mosses and liverworts, termed "tissue fixed", has been known from early studies on Sphagnum and various liverworts (Nagai, 1915; Rudolph et al., 1977). Nevertheless, it was noted as far back as 1915 that the environmental stimuli inducing cell wall-bound red pigments in liverworts and mosses appear similar to those that trigger anthocyanin production in vascular plants (Nagai, 1915). Kny (1890) and Cesares Gil (1902) noted that M. polymorpha or Reboulia hemisphaerica, respectively, grown in sunny locations produced more pigment than those growing in shady environments. Nagai (1915) was able to show that limiting nitrogen and phosphorus supply intensified the pigmentation of M. polymorpha and M. paleacea, but upon transfer to nutrient-rich media the newly developed tissue lacked pigmentation. Moreover, combinations of stresses that can cause oxidative stress are strong signals for red flavonoid biosynthesis in many bryophytes. Thus, as with angiosperms, in liverworts and mosses cold and light can individually induce reddening, but strong sunlight in cold conditions induces much stronger pigmentation, whether this be at altitude, in the Antarctic, or during the cold nights and bright days of autumn (Gerdol, 1996; Gerdol et al., 1998; Newsham et al., 2005; Hooijmaijers and Gould, 2007; Glime, 2007; Bonnett et al., 2010).

Detailed studies on Marchantia (Albert et al., 2018; Kubo et al., 2018) and Ricciocarpos natans (Kunz and Becker, 1995) found nitrogen deprivation and increased white light exposure induced auronidin accumulation, as has also been shown for anthocyanin accumulation in Arabidopsis and apple (Rubin et al., 2009; Wang et al., 2018). The signaling pathways for anthocyanin induction by nitrogen and phosphorus deficiency are well-characterized for Arabidopsis, with R2R3MYBs being the key activating transcription factors (Lillo et al., 2008; Rubin et al., 2009). Induction of auronidin in Marchantia by nitrogen and phosphorus also requires an R2R3MYB (Kubo et al., 2018), suggesting signaling components may be conserved. For Antarctic liverworts and mosses UVB exposure also induced production of red flavonoids, which most commonly were cell wall-bound (Newsham et al., 2005; Waterman et al., 2018). UVB induces anthocyanin production in some angiosperms, but it is a much less common response than induction of flavones/flavonols. Flavones and flavonols are more effective at screening UVB than anthocyanins, although aromatic acylation can give anthocyanins absorbance maxima in the UV range. The induction of anthocyanins by UVB has thus been suggested to be more for ROS scavenging and/or screening of white light than for UVB screening. In the case of the non-acylated cell wall-bound flavonoid pigments of mosses and liverworts, it seems probable that production is induced to screen white light and prevent further ROS generation, especially as the summer conditions in the Antarctic present a combination of stresses from continuous white light, cold, and drought.

There are other red/purple plant pigments besides the flavonoids able to screen in photosynthetically active wavelengths. Notable among these are the betacyanins, which are produced in the many species of the core Caryophyllales that do not produce anthocyanins (Polturak and Aharoni, 2019), and the phenolic pigments of algae. Zygogonium ericetorum is a charophyte green alga that can grow in alpine environments and when exposed to abiotic stress produces vacuolar-localized purple pigments, thought to be polymers of glucose and gallic acid, which can absorb in both UVB and photosynthetically active wavelengths (Aigner et al., 2013). In brown algae, phlorotannins can accumulate to more than 15% of dry weight (Imbs and Zvyagintseva, 2018). Phlorotannins are highly hydrophilic polymers, and may be cell wall-bound, stored intracellularly in vesicles, or exported.

Progress on determining the biological roles of cell wall-bound pigments in bryophytes has been limited by the lack of genetic systems, and the difficulty of extracting the pigments. However, genetic tools are now available in Marchantia that will allow tests of the functions of the pigments in abiotic or biotic stress tolerance. Mutants are available that have loss of auronidin pigmentation but retain flavone production (Albert et al., 2018; Kubo et al., 2018), have loss of flavone production but retain auronidin pigmentation, or have reduced amounts of both compounds. These have been used for physiological studies with respect to UVB tolerance (Clayton et al., 2018) and pathogen attack (Carella et al., 2019). In angiosperms, a range of flavonoids are localized to the cell wall (Agati et al., 2012), including rare examples of cell wall-bound anthocyanins (Philpott et al., 2009), although the physiological roles of these are generally unclear. The cell wall localization of other phenylpropanoids, in particular HCA derivatives, is common in angiosperms. These may contribute to lignin formation or be accumulated as monomers or dimers in the wall. Besides having structural roles these polymers may also contribute to physical barriers to pathogens (Zhao and Dixon, 2014). Although lignin is thought to be absent from non-vascular plants, cinnamic acid derivatives such as rosmarinic acid and (neo)lignans are common in bryophytes (Asakawa et al., 2013; Asakawa, 2017), and may be cell wall-localized (Wang et al., 2013). In Sphagnum moss, oxidative derivatives of sphagnum acid, p-hydroxyacetophenone, hydroxybutenolide, and p-hydroxybenzoic acid, as well as the phenolics p-coumaric acid and trans-cinnamic acid, were predominantly bound to the cell wall (Verhoeven and Liefveld, 1997). It seems a strong possibility that the red flavonoid pigments of bryophytes contribute, along with the cinnamic acid derivatives, to forming a physical barrier against pathogens. 
The recent study of Carella et al. (2019) demonstrated that the production of auronidin in Marchantia greatly enhanced resistance to Phytophthora palmivora infection, with a lack of hyphae penetration into the highly pigmented regions of plants. In relation to the mechanism of action, it would be of much interest to determine the nature of the incorporation of auronidin and sphagnorubins into the wall and whether polymerization occurs. Dimers of auronidin/riccionidin A have been isolated (termed riccionidin B) (Kunz et al., 1993), providing a basis for polymerization.

Several thalloid liverwort genera, and many moss genera, have species with considerable drought tolerance, with examples in both plant groups of individuals withstanding continuous desiccation for more than 20 years (Breuil-Sée, 1993; Stark et al., 2017). As the plants of liverwort genera such as Riccia and Targionia dry out, the sides of the thallus roll over the dorsal surface so that it is covered by the darkly pigmented ventral scales and rhizoids [see Figure 4 and Reeb et al. (2018) for examples]. This forms a "capsule" that can recover and renew growth even after extended periods without additional water. The function of the very strong pigmentation of the ventral scales, presumably by cell wall-bound auronidin, is not known. It may provide protection of the DNA against UVB damage during a period when DNA repair mechanisms are not active, given that auronidin accumulation is induced by UVB in some Antarctic species. Alternatively, the modification of the cell wall could prevent pathogen ingress, as demonstrated for Marchantia (Carella et al., 2019), or reduce water loss. Plants of desiccation-tolerant species in leafy liverwort genera such as Herbertus and Cephaloziella also often have dark red pigmentation (Vitt et al., 2012). A related but little studied example first described in the 1890s (Campbell, 1896) is the formation of "tubers" by some liverwort species, notably Geothallus tuberosus. G. tuberosus can form thickened inner regions of the thallus that are presumed to store carbohydrates. As the tubers form, the associated cells become strongly dark red pigmented with "thick walls". The tubers can become buried in the soil and, although the surrounding plant may die, the thallus and associated meristem survive the long dry season of the Southern Californian regions to which the species is native.

The ventral scales of many thalloid liverworts, frequently strongly pigmented by auronidin, often extend around the apex of the thallus to provide a barrier layer between the meristem and the soil (Figure 5). Protection of the meristem from physical damage and pathogen ingress could explain this pigmentation. This suggestion could be extended to include protection against herbivory. The extended ventral scales of the aquatic form of R. natans also have strong auronidin-based purple pigmentation (Figure 4), so perhaps auronidins contribute to aquatic herbivore deterrence in these cells.

There are few studies on the biological functions of the 3-deoxyanthocyanins that are common in mosses and ferns, but there is evidence they also are involved in plant defense. Greater amounts of 3-deoxyanthocyanins in fronds of the aquatic fern Azolla correlated with increased feeding deterrence to snails and the tadpoles of frogs (Cohen et al., 2002a). In the same species, 3-deoxyanthocyanins also may promote the establishment of the symbiosis with the cyanobacterium Nostoc (Cohen et al., 2002b).

This suggests there are specific biological functions of the flavonoid pigments in different bryophyte and fern species, although the induction of 3-deoxyanthocyanin production in ferns by general abiotic stresses also indicates a general light screening/antioxidant function in common with that of 3-hydroxyanthocyanins in angiosperms. Perhaps the cell wall-bound pigments of the bryophytes have evolved to have elegant multi-functionality, providing abiotic stress tolerance through antioxidant and/or light screening actions and altering physical properties of the cell for biotic stress resistance.

# The Evolutionary Significance of the Occurrence of Different Flavonoid Structural Groups

The identification of different flavonoid groups across land plants has been conducted for many years, both to further understand the evolutionary significance of flavonoid distribution by chemotaxonomy and for the discovery of novel bioactives (Markham, 1988; Asakawa, 2017; Jiang et al., 2016; de Vries et al., 2017; Yonekura-Sakakibara et al., 2019). Flavones and/or flavonols are almost ubiquitous across land plants (Berim, 2016) but variations in the specific types of flavonols or flavones produced have occurred during evolution, for example resulting in the rarity of flavone O-glycosides in leafy liverworts (Markham, 1988) or of polymethoxylated flavones in gymnosperms (Berim, 2016). Overall, it is probable that their biosynthesis was acquired very early during land plant evolution as an important stress adaptation. The remarkable exception is the hornworts. Hornworts produce polyphenolics, notably rosmarinic acid and lignan-like compounds (e.g., anthocerotonic acid and megacerotonic acid) (Petersen and Simmonds, 2003; Soriano et al., 2018b) (Figure 1), but there is no report of any flavonoid being found (Markham, 1988). Thus, either hornworts diverged from the last common land plant ancestor before the evolution of the flavonoid pathway, or the ability to make flavonoids was subsequently lost in this lineage. The completion of a hornwort genome sequence (Szövényi et al., 2015) and transcriptomic studies examining land plant evolution (Wickett et al., 2014; Puttick et al., 2018) may provide the data to help in resolving this question. 
Analysis of the transcriptomic (SRA PRJEB21674) and genomic (SRA ERR771108 and SRR1278954 for Anthoceros agrestis and Anthoceros punctatus, respectively) data currently available on GenBank can identify with confidence hornwort deduced sequences corresponding to the early steps of the phenylpropanoid and flavonoid pathway, including for PHENYLALANINE AMMONIA LYASE (PAL), CINNAMATE 4-HYDROXYLASE (C4H), 4-COUMARATE-COA LIGASE (4CL), CHS, and CHALCONE ISOMERASE-LIKE (CHIL) (Supplementary Table 1). However, additional analysis is required to show whether these produce functional enzymes. No clear CHALCONE ISOMERASE (CHI)-encoding sequence is present in the data, but mosses can make flavonoids without a gene corresponding to the typical CHI (Cheng et al., 2018). Rosmarinic acid is also found in some algae (Agregán et al., 2017), but comparison of the biosynthetic pathways between land plants and algae has not been made.

Until the recent clarification of the riccionidin structures as auronidins, anthocyanins were thought to be present in all extant lineages of land plants except hornworts. A progression in anthocyanin complexity was suggested, with liverworts producing "primitive" anthocyanidins (the non-glycosylated anthocyanin core), mosses and ferns 3-deoxyanthocyanins, seed plants the 3-hydroxyanthocyanins, and angiosperms a great range of substituted anthocyanins (including 5'-hydroxylation and variation in glycosylation, acylation, and methylation). However, as mentioned earlier, it is now more difficult to speculate on the possible red pigments present in the last common ancestor of land plants. Riccionidin A has been reported from the root cultures of the angiosperm Rhus chinensis (syn. Rhus javanica) (Taniguchi et al., 2000), but it has not been examined whether this is synthesized via an aurone intermediate route.

In addition to the core flavonoid pathway found across most land plants, there are groups of flavonoids prevalent in specific taxonomic groups, such as the isoflavonoids typical of legumes. There are also flavonoid types that occur sporadically, such as aurones that are found in liverworts and some angiosperms. For aurones this may well represent convergent evolution, as even within angiosperms there are alternative biosynthetic mechanisms (Boucherle et al., 2017). New metabolomic technologies combined with additional genome sequences for non-angiosperm species should help clarify the distribution across the land plants of different flavonoid types and the associated biosynthetic genes (Yonekura-Sakakibara et al., 2019).

# The Phenylpropanoid Biosynthetic Pathway in Bryophytes

The core steps of the phenylpropanoid pathway through to the first flavonoids (the chalcones) are conserved across land plants (Tohge et al., 2013), including the presence of PAL, C4H, 4CL, CHS, and CHIL gene sequences in hornworts. Sequences relating to some of these genes are present in the genome sequences of charophyte and chlorophyte algae (Labeeuw et al., 2015; de Vries et al., 2017), but without functional assays the conclusions that can be drawn are limited. Most phenylpropanoid pathway enzymes are thought to have evolved from primary metabolism enzymes (Tohge et al., 2013; Yonekura-Sakakibara et al., 2019), and so related sequences might be expected to be present. For PAL, whether it arose during land plant evolution or is an ancestral gene from algae has yet to be resolved (de Vries et al., 2017). It was suggested that PAL was acquired by the land plant ancestor via a horizontal gene transfer event (Emiliani et al., 2009), but genes related to PAL are present in the charophyte Klebsormidium flaccidum and could have been acquired by endosymbiotic gene transfer from cyanobacteria to algal ancestors of land plants (de Vries et al., 2017). C4H, which belongs to the CYP73A sub-family of cytochrome P450 monooxygenases (Cyp450s), shows strong sequence conservation across land plants, including characteristic motifs and residues, but no authentic gene sequences are apparent in chlorophyte genomes (Tohge et al., 2013; Davies et al., 2020). In contrast, sequences with similarity to 4CL do occur in rhodophyte and chlorophyte genomes (Labeeuw et al., 2015; de Vries et al., 2017), suggesting the existence of this enzyme in a shared ancestor of land plants and algae before the ancestral divergence of the red algae (Labeeuw et al., 2015). A further aspect yet to be addressed is the presence in fungi of genes with significant sequence similarity to those of the phenylpropanoid pathway (Bilska et al., 2018; Lu et al., 2019). 
As the separation of fungi and plants is thought to have occurred during the early stages of eukaryote divergence (Burki, 2014; Burki et al., 2020), it is possible that these may represent cases of convergent evolution.

The type III PKS superfamily that contains CHS is present in all plant genomes examined to date (Pandith et al., 2020). PKS genes are found also in fungi, and some bacteria and algae, and the plant PKS genes contain conserved structural elements with the bacterial PKS genes involved in primary metabolism. Across plants there is a wide variety of PKS enzymes with close sequence similarity to CHS but which either use alternative substrates (such as acridone synthases and pyrone synthases) or catalyze different cyclisation reactions using the same starter molecules (notably, STILBENE SYNTHASE, STS). It is thought that STS has independently evolved from CHS several times in the course of evolution (Yonekura-Sakakibara et al., 2019; Pandith et al., 2020). It is probable that there are many novel PKS activities still to be discovered in plants, including bryophytes. This may include steps in bibenzyl biosynthesis, a group of liverwort phenylpropanoid compounds related to plant defense that includes cannabinoid-like structures (Hussain et al., 2018). The presence of at least 24 PKS genes in the Marchantia genome suggests potential biosynthetic diversity, and at least one gene (Mapoly0014s0122) is closely related to the anther-specific chalcone synthase-like enzymes (ASCLs) involved in the biosynthesis of sporopollenin in angiosperms (Bowman et al., 2017). However, the majority of the annotated MpPKS genes appear to have resulted from a strongly conserved duplication of a CHS/PAL gene pair (Bowman et al., 2017).

The occurrence and function of CHI and CHIL in basal plants is proving to be an interesting question. Liverworts have both types of gene, and knockout chi mutants of Marchantia completely lose production of flavones (Clayton et al., 2018). Thus, in Marchantia, as in angiosperms examined, CHI is an essential in planta activity for flavanone production. However, no gene sequences for CHI have been found in moss or hornwort genome sequences or transcriptomes (Ngaki et al., 2012; Cheng et al., 2018; Berland et al., 2019). Although spontaneous closure to form the C-ring to produce flavanones from chalcones has been shown to occur in vitro, comparative studies on the spontaneous and enzyme catalyzed reactions suggest this is unlikely to be significant in planta (Jez and Noel, 2002). Studies on mutants for chi in Arabidopsis (tt5), carnation (i), and rice (gh1) found that flavonoid biosynthesis was not fully prevented (Stich et al., 1992; Hong et al., 2012; Jiang et al., 2015), suggesting some spontaneous conversion. However, in the case of carnation at least, the residual production of flavanones in the chi mutant has been found to be due to a second, weakly expressed, CHI gene (Miyahara et al., 2018). Thus, how flavonoid biosynthesis occurs in mosses is an open question. CHI and CHIL are thought to be examples of the rare event of catalytic activity arising in a noncatalytic scaffold protein (Kaltenbach et al., 2018). The mechanism of action of CHIL is unclear, and it may have differing activities across land plants, perhaps based on the promotion of activity of different biosynthetic enzymes through protein-protein interaction. In hop (Humulus lupulus), HlCHIL2 enhances the activities of CHS and an aromatic prenyltransferase (HlPT1L) through protein–protein interaction (Ban et al., 2018), and the promotion of flavonol and proanthocyanidin biosynthesis in Arabidopsis is proposed to be through direct interaction of CHIL and CHI (Jiang et al., 2015). 
In Marchantia, CHIL may interact with CHS or more than one phenylpropanoid pathway enzyme, since the production of both flavones and auronidins in chil mutants is only about 10% of wild-type amounts (Clayton et al., 2018). Thus, one possibility is that in mosses and hornworts CHIL can replace CHI. However, the moss CHIL genes assayed to date do not have CHI activity (Cheng et al., 2018), making this less probable.

Two major hydroxylase groups, the Cyp450s and 2-oxoglutarate dioxygenases (2OGDs, divided into the three classes DOXA, B, and C), contribute several enzymes to the phenylpropanoid pathway of angiosperms. Cyp450s include C4H, FLAVONOID 3'-HYDROXYLASE (F3'H), and FLAVONE SYNTHASE II (FNSII). 2OGDs include the FLAVANONE 2-HYDROXYLASE (F2H), FLAVANONE 3-HYDROXYLASE (F3H), FLAVONOL SYNTHASE (FLS), FLAVONE SYNTHASE I (FNSI), and ANTHOCYANIDIN SYNTHASE/LEUCOANTHOCYANIDIN DIOXYGENASE (ANS). The evolutionary aspects of these gene families with regard to flavonoid biosynthesis were recently reviewed by Yonekura-Sakakibara et al. (2019). C4H is conserved in bryophytes, and the presence of all the other enzymes in liverworts and/or mosses would be expected based on the compounds produced. However, the close similarity of the sequences within the Cyp450 and 2OGD enzyme groups means that assignments based only on sequence similarity to the angiosperm genes should be treated with caution, and conclusive identification of other genes requires functional analysis. FNSI, F2H, and F3H have high sequence similarity and are in the DOXC28 clade, whereas FLS and ANS are close in sequence and are in the DOXC47 clade. A review of the two clades, and possible evolutionary timing of the origin of each, is given in Yonekura-Sakakibara et al. (2019).

Based on the occurrence of flavones in liverworts and mosses, it is expected that F3H and FNS activities evolved early in land plants, and two DOXC28 genes have increased transcript abundance during UVB-induced flavone production in Marchantia (Clayton et al., 2018). However, to date, the only functional characterization is for a F2H that may contribute to flavone biosynthesis in the liverwort Plagiochasma appendiculatum (Han et al., 2014). The biosynthesis of flavones illustrates the difficulties of making assumptions about gene function, as a variety of alternative routes to flavone O- and C-glycosides have evolved in angiosperms (Jiang et al., 2016). A further complication is that the 2OGD enzymes (particularly the FLS and ANS) show promiscuous and sometimes overlapping activities when assayed in vitro (reviewed in Martens et al., 2010). Studies with Arabidopsis have shown that these "secondary" activities can also be present in planta, as ANS can contribute to (relatively weak) flavonol biosynthesis in the Arabidopsis fls-1 mutant (Martens et al., 2010).

As yet, it is not clear precisely what phenylpropanoid biosynthetic activities may be present in bryophytes but not found in other plant groups. There are certainly some major pathway branches prevalent in bryophytes that are absent or rare in other groups, such as those for bibenzyls, auronidins, and sphagnorubins. Corresponding evolutionary divergence of specialized metabolic pathways would be expected to underpin the occurrence of the differing compound types. Phylogenetic analysis of the 148 Cyp450, 38 2OGD, and 41 Family-1 UDP-glycosyltransferase (the UGT family containing the "plant secondary product glycosyltransferase" motif) genes of Marchantia found that the majority formed individual clades that also suggested substantial lineage-specific diversification of specialized metabolism (Bowman et al., 2017). Moreover, the emerging transcriptome and genome sequence information from bryophytes is suggesting expanded functionality may have occurred for other classes of enzymes involved in phenylpropanoid biosynthesis. In the next section we examine two specialized metabolism gene families that show unexpectedly large gene family sizes in the Marchantia genome: those for POLYPHENOL OXIDASE (PPO) and DIRIGENT (DIR) proteins.

[Figure caption fragment] …PPO type. RNA-seq data from Berland et al. (2019) were used to check for expression. The top structure (not to scale) is standard for plant PPOs. Transit/signal peptide (P), copper binding domain TYR: PFAM 00264 (CuA/CuB), PPO1\_DWL: PFAM 12142 domain (DWL), tyrosine motif (YxY), PPO1\_KFDV: PFAM 12143 domain (KFDV). (B) Weblogo display of conserved residues of DWL and KFDV domains based on all gene models having an intact region (generated using https://weblogo.berkeley.edu/logo.cgi). DWL, tyrosine, and KFDV core motifs are underlined in black. Residues for the regions identified by Tran et al. (2012) with core motifs of EEEVLV (left) and EFAGSF (right) are underlined in blue.

# Liverworts May Have Expanded Functional Roles in Specialized Metabolism for Polyphenol Oxidase and Dirigent Proteins

PPO genes are found throughout land plants, as well as in bacteria, fungi, and animals, but are absent from algae. PPOs are type-III-copper proteins and the name PPO covers two major enzyme types: tyrosinases, which hydroxylate para-substituted monophenols to ortho-diphenols (monophenolase activity) and use molecular oxygen to oxidize ortho-diphenols to orthoquinones (diphenolase activity); and the catechol oxidases, which have only the diphenolase activity. However, it has been recently proposed that monophenolase activity could be a widespread feature of PPOs, but that the activity has remained cryptic because activity assays usually use tyrosine rather than the natural substrates, which are often not known (Molitor et al., 2016). PPOs are commonly thought of as plant defense enzymes that oxidize and/or polymerize a range of phenolic substrates with which they come into contact during cell disruption, resulting in the familiar browning reactions following tissue damage, for example in cut apples or potatoes. However, in addition to these general activities, some PPOs can conduct cross-linking reactions in biosynthetic pathways, such as latex formation; and new specific roles for PPOs have emerged in recent years (Figure 6). The published PPO gene family size in plants varies from zero (e.g., Arabidopsis) to 13 in Physcomitrella patens (Tran et al., 2012). Several angiosperm species examined have only a single PPO gene, but 11 genes have been found in genome sequences of Glycine max (the legume soybean), Populus trichocarpa (poplar), and Selaginella moellendorffii (a lycophyte) (Tran et al., 2012). However, we found a much larger PPO gene family in the Marchantia genome: there are 64 candidate PPO genes (including gene fragments and unresolved gene models). Excluding those having partial gene models, 46 of the 64 PPO genes were represented in the RNA-seq data of Berland et al. (2019) and so are actively transcribed. 
Given the relatively small number of total gene models in the draft genome sequence of Marchantia, this represents a significant gene family, larger than the annotated 2OGD and UGT families.

Plant PPOs characterized to date are produced in a latent state as proteins of about 64–68 kDa. Besides the N-terminal targeting peptide (usually for plastid localization), PPOs contain a catalytically active domain of about 40 kDa, and a C-terminal domain of about 19 kDa that shields the active site and is later cleaved off to release the active protein. The C-terminal domain is ubiquitous in plant PPOs examined to date. Based on predicted amino acid sequences, PPOs with this typical structure are found in Marchantia; however, there are also members of the PPO family that lack this C-terminal domain (Figure 7), including the auronidin-related Mapoly0021s0041. These "short" type PPOs have also been found in fungi and bacteria (Huber et al., 1985; Shuster and Fishman, 2009; Gasparetti et al., 2010). Only a few of this short type, from the bacteria Streptomyces and Bacillus, have been extensively studied. The Streptomyces PPO is thought to be initially in an inactive form that is bound with a "caddie" protein. The caddie protein subsequently transfers copper to the PPO and disassociates to release an active PPO (Chen et al., 1992; Matoba et al., 2006). In contrast, the PPO from Bacillus does not need a caddie protein (Sendovski et al., 2011).
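As an illustration of how candidate gene models might be screened for the "typical" versus "short" architecture, the following minimal Python sketch reports where the DWL, YxY, and KFDV core motifs occur in a predicted protein sequence. It rests on loud assumptions: the literal motif strings are used as stand-ins for the full Pfam domain models, absence of the KFDV motif is treated as a rough proxy for a missing C-terminal domain, and the two sequences are invented toy strings, not Marchantia proteins. A real analysis would use profile-based domain searches rather than literal string matching.

```python
import re

# Core motifs named in the text. Treating these literal strings as stand-ins
# for the full Pfam domain models is an assumption made for illustration only.
MOTIFS = {
    "DWL": r"DWL",
    "YxY": r"Y.Y",   # tyrosine motif: Y, any residue, Y
    "KFDV": r"KFDV",
}


def find_motifs(protein):
    """Return a dict mapping motif name to 0-based start positions in the sequence."""
    seq = protein.upper()
    hits = {}
    for name, pattern in MOTIFS.items():
        # Zero-width lookahead so that overlapping matches are all reported.
        hits[name] = [m.start() for m in re.finditer(f"(?={pattern})", seq)]
    return hits


def lacks_kfdv(protein):
    """Hypothetical proxy: True when no KFDV core motif is found."""
    return not find_motifs(protein)["KFDV"]


# Invented toy sequences (not real PPO proteins).
TYPICAL_TOY = "MAAAHHHDWLGGYAYSSKFDVTT"
SHORT_TOY = "MAAAHHHDWLGGYAYSS"

if __name__ == "__main__":
    for label, seq in (("typical", TYPICAL_TOY), ("short", SHORT_TOY)):
        print(label, find_motifs(seq), "lacks KFDV:", lacks_kfdv(seq))
```

A motif scan of this kind can only flag candidates; whether a gene model truly lacks the C-terminal domain, or is simply an incomplete model, still has to be checked against the genome assembly and expression data.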

The first PPO found to have an unexpected role in plant specialized metabolism was the AUREUSIDIN SYNTHASE (AUS) that converts chalcone 4'-O-glucosides to aurone 6-O-glucosides in Antirrhinum majus (Nakayama et al., 2000; Davies et al., 2006; Ono et al., 2006; Elumalai and Liu, 2011). AmAUS differed from previously characterized PPOs in three important aspects: it was vacuole localized (Ono et al., 2006), it was a glycoprotein, and it lacked activity against common PPO substrates such as tyrosine or 3,4-dihydroxy-L-phenylalanine (L-DOPA). AmAUS conducts oxygenation of the B-ring of the chalcone, which is followed by cyclization into the aurone (Nakayama et al., 2000) (Figure 6). Although it can use chalcone aglycones in vitro, in planta aurone production in A. majus requires the activity of the CHALCONE 4'-O-GLUCOSYLTRANSFERASE (C4'GT) since only the glucoside is transported into the vacuole (Ono et al., 2006; Bradley et al., 2017). Subsequently, PPOs that form aurones were identified in other species, with the AURONE SYNTHASE of Coreopsis grandiflora that makes 4-deoxyaurones being studied in detail (Molitor et al., 2015; Molitor et al., 2016). In contrast to AmAUS, the CgAUS has the N-terminal chloroplast transit peptide and thylakoid transfer domain characteristic of plastid-localized PPOs involved in browning reactions, and uses chalcone aglycones to make aurone aglycones (Kaintz et al., 2014; Molitor et al., 2015). A PPO (Mapoly0021s0041) is strongly upregulated by MpMYB14 in association with auronidin production in Marchantia (Berland et al., 2019). Loss-of-function mapoly0021s0041 mutants have greatly reduced amounts of auronidin, suggesting it too may encode an aurone biosynthetic activity, or is involved in later steps of auronidin biosynthesis and/or polymerization. Aurones have been found across land plant groups, but with sporadic occurrence. This suggests that their biosynthesis may have arisen independently on a number of occasions. 
Although both the aurone biosynthetic enzymes characterized to date are PPOs, the sequences are phylogenetically distinct and have differing activities and sub-cellular localization. Additionally, the biosynthesis of the aurone hispidol in Medicago truncatula may be conducted by a peroxidase rather than a PPO (Farag et al., 2009). Besides AUS, PPOs have been implicated in tyrosine or phenylpropanoid biosynthetic pathways of walnut (Araji et al., 2014) and creosote bush (Larrea tridentata) (Cho et al., 2003).

Forty of the Marchantia PPO genes (66%) occur as tandem repeats or small gene clusters (local tandemly arrayed genes; TAGs). Although this figure may be either an under- or overestimate as it is based on the initial scaffold assembly of the genome (Bowman et al., 2017), it is nevertheless a much higher value than the overall percentage of Marchantia TAGs estimated on the same basis, which at 5.9% is near the lower end of the range observed in flowering plants (Bowman et al., 2017). The TAG percentage is also relatively high for some of the other characterized specialized metabolite gene families of Marchantia. For example, there are 18 occurrences of neighboring PAL and/or CHS genes. TAGs are notable in some angiosperm species that have prominent specialized metabolic characteristics, such as the terpenoid pathways of the tree species Eucalyptus grandis and teak (Tectona grandis). Teak has at least 14 TAGs for the terpene synthase gene family (Zhao et al., 2019). E. grandis has the largest number of genes in tandem repeats reported among sequenced plant genomes, at 34% of total genes (Myburg et al., 2014). For the Marchantia phenylpropanoid biosynthetic pathway, 10 multigene families have expanded, mostly through tandem duplication, to result in a total of 174 genes. In angiosperms, gene diversification is a result of a combination of local duplication events and whole genome duplications, but it is probable that no whole-genome duplication events have occurred during liverwort evolution (Bowman et al., 2017). Therefore, although there is no overall increase in the frequency of TAGs in liverworts (at least for Marchantia), local gene duplication events are likely to have been a common mechanism for generating gene neofunctionalization in specialized metabolism. Whether this is typical of other liverworts requires the completion of further genome sequences.
However, BLAST analysis of the Lunularia cruciata transcriptome (www.polebio.lrsv.ups-tlse.fr/Luc\_v1/Luc\_v1.fa) identified more than 20 sequences with the conserved features of PPOs (data not shown), suggesting a large gene family in this species also.
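The TAG proportions discussed above depend on how a tandem array is defined. The following is a minimal Python sketch of one possible definition, not the criteria used for the figures quoted here: it assumes that two family members belong to the same array when they lie on the same scaffold with at most `max_intervening` unrelated genes between them. The function names, the cutoff, and the toy gene identifiers are all hypothetical.

```python
from collections import defaultdict


def find_tags(genes, family, max_intervening=1):
    """Return lists of family members that form tandem arrays.

    genes: iterable of (gene_id, scaffold, start) for ALL annotated genes.
    family: set of gene_ids in the gene family of interest.
    max_intervening: assumed cutoff for the number of non-family genes
        allowed between neighbouring members of one array.
    """
    by_scaffold = defaultdict(list)
    for gene_id, scaffold, start in genes:
        by_scaffold[scaffold].append((start, gene_id))

    arrays = []
    for members in by_scaffold.values():
        members.sort()  # order genes along the scaffold
        current, last_rank = [], None
        for rank, (_, gene_id) in enumerate(members):
            if gene_id not in family:
                continue
            # Too many unrelated genes since the last member: close the array.
            if last_rank is not None and rank - last_rank - 1 > max_intervening:
                if len(current) > 1:
                    arrays.append(current)
                current = []
            current.append(gene_id)
            last_rank = rank
        if len(current) > 1:
            arrays.append(current)
    return arrays


def tag_fraction(genes, family, max_intervening=1):
    """Fraction of family members that sit in a tandem array."""
    arrays = find_tags(genes, family, max_intervening)
    return sum(len(a) for a in arrays) / len(family)


# Toy annotation (invented identifiers and coordinates, for illustration only).
toy_genes = [
    ("g1", "s1", 100), ("g2", "s1", 200), ("g3", "s1", 300),
    ("g4", "s1", 400), ("g5", "s1", 500),
    ("g6", "s2", 100), ("g7", "s2", 200),
]
toy_family = {"g1", "g2", "g5", "g6"}
print(find_tags(toy_genes, toy_family))
print(tag_fraction(toy_genes, toy_family))
```

Because the result changes with `max_intervening` and with assembly contiguity, any percentage obtained this way is only comparable with values estimated on the same basis.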

Dirigent proteins (DIR) are small (~16–18 kDa) cell wall-localized proteins that may control the regio- and stereospecific outcome of phenoxy radical coupling in lignin and lignan polymerization reactions (Davin et al., 1997; Gang et al., 1999). The polymerization reactions also require the activity of a laccase or peroxidase to provide the one-electron oxidative capacity needed to generate the phenoxy radical. In vascular plants, lignins are complex, amorphous heteropolymers involved in wall strengthening and pathogen resistance, with species-specific composition produced by polymerization of coniferyl, sinapyl, and p-coumaryl alcohols. A role for DIRs in directing the reactions leading to the formation of lignin has been proposed but not definitively established, although there is strong genetic evidence in support of some specific cases (Hosmani et al., 2013).

The role of DIRs in determining stereospecificity has been best described in the formation of lignans, a class of 8-8' linked C6C3 phenylpropanoid dimers involved in pathogen resistance, for example in the production of (+)- or (−)-pinoresinol compounds in flax (Linum usitatissimum) and pea (Pisum sativum) (Kim et al., 2015; Corbin et al., 2018). The X-ray crystal structure of PsDRR206 involved in (+)-pinoresinol formation suggested that the active protein had a trimeric structure (Kim et al., 2015). Recent work has suggested that at least some DIRs may do more than the hypothesized positioning of phenoxy radicals prior to coupling, and may themselves possess enzymatic activity. The crystal structure of Arabidopsis AtDIR6 identified potentially catalytic residues including aspartic acids that were essential for activity, and it was proposed that this protein catalyzed the cyclization of the bisquinone methide intermediate during (+)- or (–)-pinoresinol formation (Gasper et al., 2016). Also, a recombinantly expressed DIR from Glycyrrhiza echinata was found to possess isoflavanol dehydratase activity and carry out the final ring-closure step of the biosynthesis of the anti-microbial phytoalexin (–)-pterocarpan (Uchida et al., 2017).

DIR gene families can be quite large, with 26 genes in Arabidopsis (Paniagua et al., 2017) and 44 genes in flax (of which seven appeared to be gene fragments or result from chromosomal rearrangements; Corbin et al., 2018). Of the 37 genes with classical DIR structure in flax, 15 paralogous gene pairs were identified. Kubo et al. (2018) identified 52 dirigent-like predicted proteins in the Marchantia genome sequence. Our analysis for this article found that at least 35 of these occur as TAGs. However, the deduced protein sequences of the family members are diverse, and the functionality of the proteins has not yet been established. Our BLAST analysis of the Marchantia genome and transcript resources with the 24 annotated Arabidopsis DIR genes gave us 60 initial candidate gene models, with strong evidence of some very recent gene duplications giving groups of adjacent genes with highly similar or identical deduced amino acid sequences.

Phenylpropanoid biosynthesis and lignification are common plant responses to biotic and abiotic stress (Zhao and Dixon, 2014; Paniagua et al., 2017) and consequently, as a component of lignification, DIRs have been implicated in responses to pathogen and drought stress (e.g., Thamil Arasan et al., 2013; Paniagua et al., 2017). In P. patens, fungal infection resulted in increased incorporation of phenolic compounds into the wall and up-regulation of a DIR gene (Reboledo et al., 2015). In Marchantia, abiotic stresses such as UVB irradiation, N deficiency and salinity (Albert et al., 2018; Kubo et al., 2018), and pathogen attack (Carella et al., 2019) increased the expression of MpMYB14. MpMYB14 promotes auronidin production and up-regulates transcript abundance for at least three DIR genes (Mapoly0006s0216, Mapoly0006s0217, Mapoly0078s0058) (Albert et al., 2018; Kubo et al., 2018). The deduced protein products of these genes possess predicted signal peptides (SignalP 5.0, http://www.cbs.dtu.dk/services/SignalP/), indicative of secretion to the vacuole or, extracellularly, to the cell wall. However, prediction of subcellular localization using WoLFPSORT (https://wolfpsort.hgc.jp) indicated with low confidence different compartments for the three predicted protein products of the genes: extracellular (Mapoly0006s0217), vacuolar (Mapoly0006s0216), and cytoplasmic (Mapoly0078s0058). Intracellular coupling of monolignol radicals has been described in Arabidopsis (Dima et al., 2015). There have been no studies on DIR genes in hornworts. However, as lignans are prominent specialized metabolites of hornworts, and DIRs have roles in lignan biosynthesis in angiosperms, this could be a worthwhile area to investigate.

# Evolution of the Transcriptional Regulation of the Phenylpropanoid Pathway

In angiosperms and gymnosperms, the key regulatory complex consists of R2R3MYB and bHLH TFs joined with a WD-Repeat (WDR) protein, a composition of proteins known as an "MYB-bHLH-WD repeat (MBW)" complex. The MBW complex that activates anthocyanin and proanthocyanidin production contains R2R3MYB proteins from sub-group (SG) 5 or 6, and commonly promotes transcription of the biosynthetic genes throughout the pathway. The action of the MBW complex is modified by a WRKY class activator TF and a series of proteins with repressor actions (Lloyd et al., 2017). In particular, R2R3MYBs from SG4 can join an activating MBW complex and turn it into one that represses target gene transcription, and R3MYBs can bind the bHLH to prevent it from forming the MBW complex, thus competitively inhibiting activation (Albert et al., 2014). The SG4 R2R3MYBs are characterized by the presence of an ethylene response factor (ERF)-associated amphiphilic repression (EAR) motif (LxLxL or DLNxxP) or a TLLLFR motif in the C terminus that mediates transcriptional repression (Chen et al., 2019a; Chen et al., 2019b; Ma and Constabel, 2019). For the activation of the flavonol and flavone branches, a SG7 R2R3MYB acts without being part of the complex. There is also regulation upstream of the flavonoid pathway, as HY5 activates the production of the SG7 R2R3MYB. Additionally, in Arabidopsis it has been shown that HY5 directly activates transcription of some flavonoid biosynthetic genes, such as CHS. The conservation of HY5 function in the UVB responses of both bryophytes and angiosperms was mentioned earlier, although its target gene set has yet to be resolved.

The expansion of TF families during evolution has been a driver of diversity in land plants, as a consequence of multicellularity and increased organismal complexity and/or for coping with the increased stress of a sessile land-based lifestyle. The MYB gene family is one of the largest TF families in plants, with Arabidopsis having 137 R2R3MYB genes (Feller et al., 2011). This includes one SG5, four SG6, and three SG7 genes in Arabidopsis for proanthocyanidin, anthocyanin, and flavonol production, respectively. The presence of small gene families for sub-groups regulating specialized metabolic pathways is common for angiosperms, and has enabled subfunctionalization and diversification of flavonoid temporal and spatial regulation in flowers, seeds, and vegetative tissues. The bHLH and WDR components are less specific in their regulatory targets, and can regulate other characters as well as flavonoid biosynthesis, such as epidermal cell differentiation.

The great majority of information on the transcriptional regulation of specialized metabolite pathways is available from studies on angiosperms, with only a small number of studies on gymnosperm, fern, or bryophyte species. Identifying the genetic components for flavonoid pathway regulation in these other plant groups will help establish a model for how regulation of specialized metabolism may have changed during evolution. For bryophytes, notable questions relating to flavonoid pathway regulation include: are R2R3MYB and bHLH genes the key direct activators? If MYBs are the direct activators, which SGs are present in bryophytes and do small gene families occur for each SG? Does a MBW complex form in bryophytes? Do repressor TFs modify pathway regulation? Characterizing these aspects in species such as M. polymorpha, P. patens, and the lycophyte model S. moellendorffii should indicate which aspects of flavonoid regulation are conserved across land plants, and thus may have been present in the early land plant ancestor, and which aspects may have arisen as part of evolutionary diversification of the different land plant groups.

Compared with angiosperms, the characterized bryophytes and lycophytes have small TF families. There are only 22, 49, and 62 R2R3MYBs in the genomes of M. polymorpha, P. patens, and S. moellendorffii, respectively (Feller et al., 2011; Bowman et al., 2017). For Marchantia genes, a phylogenetic comparison of this gene family shows that MpMYB02 and MpMYB14 fall basal to a clade that contains all the phenylpropanoid-related R2R3MYB genes of Arabidopsis (SGs 4, 5, 6, 7, 15, and 44) (Bowman et al., 2017). Concluding whether these correspond to descendants of the flavonoid regulatory R2R3MYBs of the ancestral land plant requires further study, although both MpMYB02 and MpMYB14 activate phenylpropanoid biosynthetic genes. MpMYB02 is required for production of bibenzyls (Kubo et al., 2018) while MpMYB14 is essential for auronidin production and promotes the production of flavone O-glycosides (Albert et al., 2018; Clayton et al., 2018; Kubo et al., 2018). The profiles of transcripts up-regulated by MpMYB02 and MpMYB14 include DIR genes (Albert et al., 2018; Kubo et al., 2018; Berland et al., 2019). For MpMYB14, this includes the three DIR genes discussed earlier as well as other DIR genes that have been shown to be direct targets (Kubo et al., 2018). Co-expression analysis in flax found that MYB TFs were up-regulated along with DIR genes during secondary wall biosynthesis (Corbin et al., 2018), suggesting that MYB proteins could control DIR expression in both angiosperms and bryophytes.

MpMYB14 must act redundantly with other uncharacterized TFs for flavone production, as Mpmyb14 mutants still show the induction of flavones in response to UVB (Clayton et al., 2018), nutrient stress, or high-irradiance white light (Albert et al., 2018). Flavone production is reduced in Mphy5 mutants, so it is possible that HY5 is a direct activator of flavonoid biosynthetic genes as in Arabidopsis, but there may also be HY5-independent activation pathways for flavone production (Kondou et al., 2019). Analysis of changes in transcriptomes in response to UVB treatment does not present any alternative R2R3MYB candidate for flavone regulation (Clayton et al., 2018). Thus, Marchantia may lack the equivalent of the angiosperm SG7 activators of flavonol and flavone biosynthesis.

The PabHLH gene of the liverwort P. appendiculatum is a probable activator of bibenzyl biosynthesis (Wu et al., 2018). Over-expression of PabHLH in P. appendiculatum increased bibenzyl concentration and up-regulated transcript abundance from known phenylpropanoid biosynthetic genes (PAL, 4CL) and candidate bibenzyl biosynthetic genes, whereas RNA interference-induced suppression down-regulated the same genes and reduced bibenzyl accumulation. Phylogenetically, PabHLH falls within clades containing the flavonoid MBW bHLH sequences of angiosperms (within bHLH subgroup IIIf), suggesting it may be homologous to them. In Marchantia, MpbHLH12 is the gene with the highest sequence identity to PabHLH and the flavonoid-related bHLHs of angiosperms, and transcriptomic analysis of MpBHLH12 overexpression transgenics suggests it may also be involved in flavonoid regulation (Arai et al., 2019). However, although R2R3MYB and bHLH genes do regulate flavonoid biosynthesis in liverworts, and there are conserved WDR sequences in the genome (Bowman et al., 2017), there is no answer yet on whether the MBW complex exists in bryophytes. A flavonoid-related MBW complex has been characterized in the gymnosperm Norway spruce (Picea abies) (Nemesio-Gorriz et al., 2017), supporting an origin for the MBW complex in the plant lineage prior to the last common ancestor of gymnosperms and angiosperms, around 350–300 MYA. However, although the conserved amino acid motif ([D/E]Lx2[R/K]x3Lx6Lx3R) identified as necessary for R2R3MYB proteins to bind the bHLH partners (Zimmermann et al., 2004) is present in the S. moellendorffii sequence SmXP002978781, it is lacking in bryophyte R2R3MYBs studied to date. The closest matches in P. patens (PpXP001752936) and Marchantia (MpMYB02 and MpMYB14) lack one and two deduced amino acid residues, respectively.
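The bHLH-binding motif quoted above, [D/E]Lx2[R/K]x3Lx6Lx3R, can be written directly as a regular expression. The short Python sketch below shows how a deduced protein sequence might be screened for it. The function name is hypothetical, and the test sequences are synthetic strings built for illustration rather than real MYB proteins.

```python
import re

# [D/E]Lx2[R/K]x3Lx6Lx3R as a regular expression; "x" is any residue.
BHLH_BINDING_MOTIF = re.compile(r"[DE]L.{2}[RK].{3}L.{6}L.{3}R")


def find_bhlh_motif(protein_seq):
    """Return (0-based start, matched text) for each occurrence of the motif."""
    seq = protein_seq.upper()
    return [(m.start(), m.group()) for m in BHLH_BINDING_MOTIF.finditer(seq)]


# Synthetic examples: alanine filler stands in for the variable "x" positions.
motif_example = "DL" + "AA" + "R" + "AAA" + "L" + "A" * 6 + "L" + "AAA" + "R"
with_motif = "MGS" + motif_example + "KK"

# Same layout but with five residues instead of six in the x6 spacer,
# mimicking a sequence that "lacks one residue" relative to the motif.
one_residue_short = "MGS" + "DL" + "AA" + "R" + "AAA" + "L" + "A" * 5 + "L" + "AAA" + "R" + "KK"

print(find_bhlh_motif(with_motif))
print(find_bhlh_motif(one_residue_short))
```

An exact pattern of this kind reports only full matches, so sequences with altered spacing, such as those described for P. patens and Marchantia, return no hit and would need alignment-based inspection to be recognized as near matches.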

Whether bryophytes possess MYB genes with a repressive action in phenylpropanoid regulation, either the R2R3MYB active repressors that form part of the MBW complex or the R3MYBs that are thought to "compete" for the bHLH proteins, is also an open question. We were unable to identify (known) repression motifs in any of the Marchantia R2R3MYB sequences. Both P. patens and S. moellendorffii have R2R3MYB genes with putative EAR motif sequences (LxLxL), but the possible function of these in regulating phenylpropanoid biosynthesis has not been examined. Analysis of the auxin signaling pathway of Marchantia has identified an orthologue of TOPLESS, which in angiosperms interacts with the EAR motif to mediate transcriptional repression (Flores-Sandoval et al., 2015). The Marchantia genome contains an expanded R3MYB gene family (Bowman et al., 2017), but no analysis of these with regard to flavonoid biosynthesis has been published.
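A screen for the repression motifs named in this section (LxLxL, DLNxxP, and TLLLFR) could be sketched as follows in Python. This is an illustrative outline only and not necessarily the procedure used for the analysis described above: the function name, the dictionary keys, and the caller-supplied `myb_domain_end` index are assumptions, and the toy sequence is invented.

```python
import re

# Motifs as written in the text; "x" is any residue. Key names are illustrative.
REPRESSION_MOTIFS = {
    "EAR_LxLxL": re.compile(r"L.L.L"),
    "EAR_DLNxxP": re.compile(r"DLN..P"),
    "TLLLFR": re.compile(r"TLLLFR"),
}


def scan_c_terminus(protein_seq, myb_domain_end=0):
    """List motif hits located downstream of the MYB domain.

    myb_domain_end: 0-based index at which the C-terminal region starts.
        It must be supplied by the caller (for example from a domain
        annotation); 0 scans the whole sequence.
    Returns (motif name, start, matched text) tuples ordered by start.
    """
    seq = protein_seq.upper()
    hits = []
    for name, pattern in REPRESSION_MOTIFS.items():
        # The second argument restricts the search to positions >= myb_domain_end.
        for m in pattern.finditer(seq, myb_domain_end):
            hits.append((name, m.start(), m.group()))
    return sorted(hits, key=lambda hit: hit[1])


# Invented sequence: an LxLxL match at the N terminus, then DLNxxP and TLLLFR.
toy_seq = "LALAL" + "G" * 10 + "DLNGGP" + "GG" + "TLLLFR"
print(scan_c_terminus(toy_seq, myb_domain_end=5))
```

Short patterns such as LxLxL are expected to occur by chance in many proteins, so hits from a scan like this are candidates only and say nothing about repressor function without experimental support.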

In summary, based on the evidence from Marchantia, it seems probable that the ancestral R2R3MYB regulators of phenylpropanoid metabolism were activators acting outside of an MBW complex. R2R3MYB-repressive TFs and the MBW complex probably evolved after the last common ancestor of liverworts and gymnosperms/angiosperms. As the flavone pathway probably evolved prior to anthocyanin biosynthesis, it could be expected that R2R3MYBs most similar to SG7 might be the ancestral type. However, the specific flavone activators of Marchantia have yet to be identified. MpMYB02 and MpMYB14 may correspond to the ancestral phenylpropanoid pathway activators, and like SG7 probably act outside the MBW complex, but it is difficult to state which is the most closely related SG because of the extent of sequence divergence, with no conservation of sequence outside the MYB domains themselves. Furthermore, additional data are required from other bryophyte species, as the evolutionary path to Marchantia will have resulted in extensive genetic changes and the loss of characters that were present in the last common ancestor.

# CONCLUDING COMMENTS

The diversification of both specialized metabolite biosynthesis and the transcription factors that regulate the pathways are thought to be important contributors to the evolution of plants to occupy the varied ecological niches offered on land (Pichersky and Gang, 2000). To date, much of our understanding of the genetic basis of the diversification process has been based on studies of flowering plants. However, the completion of the first genome sequences for a moss (P. patens) and liverwort (M. polymorpha) has started to reveal the details of the specialized metabolite gene families, such as for phenylpropanoid biosynthesis. Notably, the TF families thought to regulate the phenylpropanoid pathway are much smaller in Marchantia than in flowering plants. However, there are relatively large Marchantia gene families for enzymes that are often involved in specialized metabolism, such as the Cyp450, 2OGD, and UGT families. Moreover, Marchantia has large PPO and DIR gene families compared to angiosperms, suggesting these enzyme groups may make a greater contribution than previously anticipated to phenylpropanoid and other specialized metabolite biosynthesis in the liverworts. Thus, in liverworts some of the gene families involved in the biosynthesis of specialized metabolites appear to have undergone more gene duplication (allowing consequent sub- and neofunctionalization for particular family members) than the TFs that regulate the same pathways. Expansion of the regulatory TF families through duplication and sub/neofunctionalization is seen in the angiosperms, probably reflecting increased organismal complexity.

The commonality of phenylpropanoid biosynthetic genes between bryophytes and angiosperms, and the conserved functions of flavonoids in assisting in tolerance to stresses such as UVB and pathogen attack, support the proposal that the pathway arose before the last common ancestor of these land plant groups, relatively early during the process of land colonization. The exception to this is the hornworts, which lack flavonoids. Unless the divergence of hornworts occurred before the pathway arose, the hornwort ancestor must have acquired mutations that caused loss of the biosynthetic or regulatory capacity. This may be analogous to the loss of anthocyanin biosynthesis in some lineages of the Caryophyllales, where they are replaced by betalains. As the main red pigments of angiosperms (soluble anthocyanins) and bryophytes (cell wall-bound auronidins and sphagnorubins) differ in structure and cellular properties, it is difficult to suggest what the original common ancestor may have possessed with regard to red pigments. Establishing which components of anthocyanin biosynthesis are present or lacking in bryophytes may help in this regard.

# AUTHOR CONTRIBUTIONS

KD, RJ, YZ, NA, DB, BJ, JB and KS reviewed literature, formulated ideas and wrote the manuscript. KD and KS prepared the figures. KD, RJ, YZ, DB and KS conducted bioinformatic analysis.

# FUNDING

Financial support was provided by the Marsden Fund of New Zealand Grant PAF1701.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpls.2020.00007/full#supplementary-material

# REFERENCES

Agati, G., and Tattini, M. (2010). Multiple functional roles of flavonoids in photoprotection. New Phytol. 186, 786–793. doi: 10.1111/j.1469-8137.2010.03269.x

Agati, G., Azzarello, E., Pollastri, S., and Tattini, M. (2012). Flavonoids as antioxidants in plants: Location and functional significance. Plant Sci. 196, 67–76. doi: 10.1016/j.plantsci.2012.07.014

Agregán, R., Munekata, P. E. S., Franco, D., Dominguez, R., Carballo, J., and Lorenzo, J. M. (2017). Phenolic compounds from three brown seaweed species using LC-DAD–ESI-MS/MS. Food Res. Int. 99, 979–985. doi: 10.1016/j.foodres.2017.03.043


Marchantia polymorpha and flowering plants. Plant J. 96, 503–517. doi: 10.1111/tpj.14044


the liverwort Marchantia polymorpha. PloS Genet. 11, e1005207. doi: 10.1371/journal.pgen.1005207


polyphenol oxidase with unique characteristics. Planta 242, 519–537. doi: 10.1007/s00425-015-2261-0


Dianthus caryophyllus L. (carnation). Planta 187, 103–108. doi: 10.1007/BF00201630


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Davies, Jibran, Zhou, Albert, Brummell, Jordan, Bowman and Schwinn. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.