# OMICS AND SYSTEMS APPROACHES TO STUDY THE BIOLOGY AND APPLICATIONS OF LACTIC ACID BACTERIA

EDITED BY: Konstantinos Papadimitriou, Jan Kok, Pierre Renault and Kimberly Kline

PUBLISHED IN: Frontiers in Microbiology

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-717-1 DOI 10.3389/978-2-88963-717-1


Topic Editors:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

Jan Kok, University of Groningen, Netherlands

Pierre Renault, Institut National de la Recherche Agronomique (INRA), France

Kimberly Kline, Nanyang Technological University, Singapore

The economic importance of lactic acid bacteria (LAB) for the food industry and their implication in health and disease has rendered them attractive models for research in many laboratories around the world. Over the past three decades, molecular and genetic analysis of LAB species provided important insights into the biology and application of starter and probiotic LAB and in the virulence of LAB pathogens. The knowledge obtained prepared LAB researchers for the forthcoming opportunities provided by the advent of microbial genomics. Today, developments in next-generation sequencing technologies have rocketed LAB genome research and the sequences of several hundreds of strains are available.

This flood of information has revolutionized our view of LAB. First of all, a detailed picture has emerged about the evolutionary mechanisms allowing LAB to inhabit the very diverse ecological niches in which they can be found. Adaptation of LAB to nutrient-rich environments has led to degenerative evolution processes that resulted in shortening of chromosomes and simplified metabolic potential. Gene acquisition through horizontal transfer, on the other hand, is also important in shaping LAB gene pools. Horizontally acquired genes have been shown to be essential in technological properties of starters and in probiosis or virulence of commensals. Progress in bioinformatics tools has allowed rapid annotation of LAB genomes and the direct assignment of genetic traits among species/strains through comparative genomics. In this way, the molecular basis of many important traits of LAB has been elucidated, including aspects of sugar fermentation, flavor and odor formation, production of textural substances, stress responses, colonization of and survival in the host, cell-to-cell interactions and pathogenicity. Functional genomics and proteomics have been employed in a number of instances to support in silico predictions. Given that the costs of advanced next-generation methodologies like RNA-seq are dropping fast, bottlenecks in the in silico characterization of LAB genomes will be rapidly overcome.

Another crucial advancement in LAB research is the application of systems biology approaches, by which the properties and interactions of components or parts of a biological system are investigated to accurately understand or predict LAB behavior. Practically, systems biology involves the mathematical modeling of complex biological systems that can be refined iteratively with wet-lab experiments. High-throughput experimentation generating huge amounts of data on the properties and quantities of many components such as transcripts, enzymes and metabolites has resulted in several systems models of LAB. Novel techniques allow modeling of additional levels of complexity including the function of small RNAs, structural features of RNA molecules and post-translational modifications. In addition, researchers have started to apply systems approaches in the framework of LAB multispecies ecosystems in which each species or strain is considered as a part of the system. Metatranscriptomics, metaproteomics and metametabolomics offer the means to combine cellular behavior with population dynamics in microbial consortia.

Citation: Papadimitriou, K., Kok, J., Renault, P., Kline, K., eds. (2020). Omics and Systems Approaches to Study the Biology and Applications of Lactic Acid Bacteria. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-717-1

# Table of Contents

*Editorial: Omics and Systems Approaches to Study the Biology and Applications of Lactic Acid Bacteria*

Konstantinos Papadimitriou, Kimberly Kline, Pierre Renault and Jan Kok

*Early Transcriptome Response of* Lactococcus lactis *to Environmental Stresses Reveals Differentially Expressed Small Regulatory RNAs and tRNAs*

Sjoerd B. van der Meulen, Anne de Jong and Jan Kok

*Comparative Genomics of* Lactobacillus acidipiscis *ACA-DC 1533 Isolated From Traditional Greek Kopanisti Cheese Against Species Within the* Lactobacillus salivarius *Clade*

Maria Kazou, Voula Alexandraki, Jochen Blom, Bruno Pot, Effie Tsakalidou and Konstantinos Papadimitriou

Shruti Gupta, Adriána Fečkaninová, Jep Lokesh, Jana Koščová, Mette Sørensen, Jorge Fernandes and Viswanath Kiron

*Corrigendum: Lactobacillus Dominate in the Intestine of Atlantic Salmon Fed Dietary Probiotics*

Shruti Gupta, Adriána Fečkaninová, Jep Lokesh, Jana Koščová, Mette Sørensen, Jorge Fernandes and Viswanath Kiron

Joseph R. Spangler, Scott N. Dean, Dagmar H. Leary and Scott A. Walper

*Systems Biology – A Guide for Understanding and Developing Improved Strains of Lactic Acid Bacteria*

Jianming Liu, Siu Hung Joshua Chan, Jun Chen, Christian Solem and Peter Ruhdal Jensen

Alessandra Fontana, Irene Falasconi, Paola Molinari, Laura Treu, Arianna Basile, Alessandro Vezzi, Stefano Campanaro and Lorenzo Morelli

*Genomic Insights Into the Distribution and Evolution of Group B Streptococcus*

Swaine L. Chen

Voula Alexandraki, Maria Kazou, Jochen Blom, Bruno Pot, Konstantinos Papadimitriou and Effie Tsakalidou

Shane Thomas O'Donnell, R. Paul Ross and Catherine Stanton

# Editorial: Omics and Systems Approaches to Study the Biology and Applications of Lactic Acid Bacteria

Konstantinos Papadimitriou<sup>1</sup>\*, Kimberly Kline<sup>2</sup>, Pierre Renault<sup>3</sup> and Jan Kok<sup>4</sup>

<sup>1</sup> Department of Food Science and Technology, University of Peloponnese, Kalamata, Greece, <sup>2</sup> Singapore Centre on Environmental Life Sciences Engineering, School of Biological Sciences, Nanyang Technological University, Singapore, Singapore, <sup>3</sup> Micalis Institute, INRAE, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France, <sup>4</sup> Department of Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, Netherlands

Keywords: genomics, transcriptomics, proteomics, metagenomics, metabolomics, meta-omics, evolution, niche

#### **Editorial on the Research Topic**

#### **Omics and Systems Approaches to Study the Biology and Applications of Lactic Acid Bacteria**

#### Edited by:

John R. Battista, Louisiana State University, United States

#### Reviewed by:

Marcus Vinicius Canário Viana, Federal University of Pará, Brazil

#### \*Correspondence:

Konstantinos Papadimitriou k.papadimitriou@us.uop.gr; kostas.papadimitriou@gmail.com

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 12 May 2020 Accepted: 08 July 2020 Published: 20 August 2020

#### Citation:

Papadimitriou K, Kline K, Renault P and Kok J (2020) Editorial: Omics and Systems Approaches to Study the Biology and Applications of Lactic Acid Bacteria. Front. Microbiol. 11:1786. doi: 10.3389/fmicb.2020.01786

Early definitions classified lactic acid bacteria (LAB) as Gram-positive, non-sporulating, microaerophilic but aerotolerant, catalase- and oxidase-negative bacteria that produce lactic acid. LAB were mostly related to foods as starters, non-starters (NSLAB) and less frequently as spoilers. These definitions necessitated some common evolutionary traits but still remained rather technical and beyond a true evolutionary perspective. For example, diverse bacteria were considered LAB, like members of the Bifidobacteriaceae and Lactobacillaceae families, the first belonging to the Actinobacteria and the second to the evolutionarily distant Firmicutes.

The initial technical definitions of LAB also distort our understanding of the ecological niches they occupy and of their functionality, with all the concomitant negative implications. Many articles present LAB as being found solely in the food environment. The use of LAB to produce the vast majority of fermented foods around the world and their unproblematic consumption have led to the notion that LAB are safe sensu lato. Furthermore, many LAB are commensals of humans and have probiotic properties. Based on these facts, several studies have inaccurately suggested that all LAB are probiotics.

The problems deriving from a technical definition of LAB were always apparent, but systematically neglected. Researchers in the field are familiar with the description of LAB genera related to food. These traditionally included Lactobacillus, Lactococcus, Streptococcus, Enterococcus, Pediococcus, Leuconostoc, Oenococcus, Tetragenococcus, Carnobacterium, and Weissella. However, even a superficial examination would reveal that not all of these genera are truly food related. In the Streptococcus genus, which contains more than 70 species, there is currently only one species acknowledged to be food related, i.e., Streptococcus thermophilus. The majority of streptococci are commensals, including notorious opportunistic pathogens that cause the death of many humans worldwide every year, like Streptococcus pyogenes (Group A Streptococcus, GAS), Streptococcus agalactiae (Group B Streptococcus, GBS), and Streptococcus pneumoniae. Thus, it is, at the very least, inaccurate to characterize the Streptococcus genus as food related. A detailed view reveals that important pathogenic bacteria may be present in the LAB genera and have been associated with many different diseases in humans and animals. It should be emphasized that some LAB species are opportunistic pathogens yet are found in food, as is the case of Enterococcus faecalis, which is a common member of the microbiota of traditional fermented dairy foods yet can cause some of the most serious nosocomial infections.

One way to address these discrepancies is to adopt a definition of LAB based on an evolutionary perspective. Such an approach has been attempted in a number of studies, but their review is beyond the scope of this editorial. Nevertheless, we would like to point out the first chapter of the latest edition of the book entitled "Lactic Acid Bacteria: Biodiversity and Taxonomy" edited by Holzapfel and Wood (2014) where LAB are defined as those bacteria that comprise the order Lactobacillales within the class Bacilli of the phylum Firmicutes. The order Lactobacillales currently includes six families, i.e., Aerococcaceae, Carnobacteriaceae, Enterococcaceae, Lactobacillaceae, Leuconostocaceae, and Streptococcaceae, with 40 genera and a continually increasing number of species (>400). Other bacteria, which are phylogenetically distant from Firmicutes or the order Lactobacillales but may have common features with LAB (mostly the ability to produce lactic acid), are called "LAB-related." Such bacteria include members of the Bifidobacterium, Bacillus, and Sporolactobacillus genera. This strict evolutionary definition may have a major impact on the way we perceive LAB since it circumvents any technical criterion that was used to delineate this group of bacteria. Firstly, LAB as sole members of Lactobacillales acquire a "true" phylogenetic relationship beyond that which derives mostly from the ecological niche that they occupy. For example, it should come as no surprise that the dairy S. thermophilus is phylogenetically closer to pathogenic streptococci than to Lactococcus lactis, which is also mostly a dairy species. Secondly, we may better understand the biology of an LAB member in relation to the entire group. Again, S. thermophilus may have traits that are conserved in pathogenic streptococci, yet differences between the two may suggest how the former became avirulent and adapted to the food environment or how the latter exert virulence on the host. Thirdly, we will be able to objectively address the ecological distribution of LAB. The economic importance and health relevance of starter, probiotic and commensal/pathogenic LAB have attracted most attention up to now. However, we should not overlook that LAB are major food spoilers, that they are extensively associated with plants and plant material, and that they have been isolated from unexpected niches, such as soil, marine and fresh waters, dust, air, etc. With this Research Topic, we would like to raise awareness of the need to address the shortcomings of technical definitions of LAB while presenting some of the latest developments of LAB research in the era of omics and systems biology.

Over the past three decades, molecular and genetic analysis of LAB species provided important insights into the biology and application of starter and probiotic LAB and in the virulence of LAB pathogens. The knowledge obtained prepared LAB researchers for the forthcoming opportunities provided by the advent of microbial genomics. Today, developments in next-generation sequencing technologies have rocketed LAB genome research and the sequences of several thousands of strains are available. This flood of information has revolutionized our view of LAB. First of all, a detailed picture has emerged about the evolutionary mechanisms allowing LAB to inhabit the diverse ecological niches in which they can be found. Adaptation of LAB to nutrient-rich environments has led in many instances to degenerative evolution processes that resulted in the reduction of the size of their chromosome and the simplification of their metabolic potential. On the other hand, gene acquisition through horizontal transfer is also important in shaping LAB gene pools. Horizontally acquired genes have been shown to underpin niche specialization and technological, probiotic or virulence properties. Progress in bioinformatics tools has allowed rapid annotation of LAB genomes and the direct assignment of genetic traits among species/strains through comparative genomics. In this way, the molecular basis of many important traits of LAB has been elucidated, including aspects of sugar fermentation, flavor formation, production of textural substances, stress responses, colonization of and survival in the host, cell-to-cell interactions, and pathogenicity. Functional genomics and proteomics have been employed in a number of instances to support in silico predictions. Given that the costs of advanced sequencing methodologies like RNA-seq are dropping fast, bottlenecks in the in silico characterization of LAB genomes will be rapidly overcome with experimentally verified functional data.

Another crucial advancement in LAB research is the application of systems biology approaches, by which the properties and interactions of components or parts of LAB are investigated to accurately understand or predict their behavior. Practically, systems biology involves the mathematical modeling of complex biological systems that can be refined iteratively with wet-lab experiments. High-throughput experimentation, generating huge amounts of data on the properties and quantities of many components, such as transcripts, enzymes and metabolites, has resulted in several system models of LAB. Novel techniques allow modeling of additional levels of complexity including the function of small RNAs, structural features of nucleic acid molecules, and post-translational modifications. In addition, researchers have started applying systems approaches to LAB multispecies ecosystems in which each species or strain is considered as a part of the system. Metagenomics, metatranscriptomics, metaproteomics, and metametabolomics offer the means to combine cellular behavior with population dynamics in microbial consortia.

In this Research Topic we have collected five different review articles. One of the reviews concerns the distribution and functionality of LAB in distinct ecological niches, including recent metagenomics evidence for their presence in the human and animal microbiome. George et al. highlight many of the aspects discussed above, including the wide presence of LAB in natural ecosystems beyond the traditional ones. The authors clearly argue that in some sites of the human microbiome (e.g., the small intestine or the colon) LAB are certainly subdominant and exhibit relatively low abundances. This fact may contradict the commonly accepted hypothesis that LAB are of major importance for the human gut microbiome. Taking a step forward, the authors discuss why the presence of LAB should not be considered advantageous for the host at all times.

The next two review articles focus on how developments in omics technologies are already being employed to study LAB and describe novel applications yet to come in the field. More specifically, in the first review, O'Donnell et al. present a detailed overview of omics methodologies that have shaped our knowledge of LAB. The need to integrate data from multi-omics experiments to assist systems-level understanding of LAB is clearly described. Additionally, the problems arising from integrating and interpreting the multifaceted data produced by multi-omics approaches are addressed and specific strategies to bypass such difficulties are proposed. The second review, by Liu et al., concerns the application of systems biology, which may lead to improved LAB strains acting as cell factories for the production of food additives and therapeutics, but also as starter cultures. The authors highlight the application of adaptive laboratory evolution (ALE), a non-GMO strategy that may result in industrially important phenotypes by taking advantage of the inherent genetic variation in a bacterial population exposed to appropriately selected conditions. Mutants deriving from such methods are not the result of single gene manipulations but rather of complex genomic adaptations. In contrast, mutants obtained through metabolic engineering at the systems level are rationally designed by approaches such as optimization of an enzyme/pathway activity, synthetic biology and in silico simulations. Beyond their application as cell factories or improved starters, ALE mutants or mutants produced by systems metabolic engineering can be studied by multi-omics approaches to reveal/verify the mechanisms underlying the observed phenotypes.

The final two reviews describe the most important developments in the genomics of specific streptococcal pathogens. Chen reviews how genomics was employed to tackle questions related to the virulence mechanisms and epidemiology of S. agalactiae (GBS). As the author underlines, S. agalactiae marks an important milestone in bacterial genomics, since it was the first bacterium for which a pangenome, based on eight strains, was analyzed. This allowed the identification of conserved and variable regions in the genomes. Variable regions could be related to virulence, host and disease specification. Today, more than 7,000 GBS strains have been sequenced, providing a wealth of information concerning the molecular epidemiology of human, cattle and fish isolates. Chen also presents cases of emerging GBS diseases, including data suggesting that a foodborne route exists for the transmission of these pathogens, as in the 2015 outbreak in Singapore, which was associated with the consumption of contaminated raw fish. In the final review, Hiller and Sá-Leão focus on the genomic diversity and plasticity of S. pneumoniae (pneumococcus). The bacterium is under strong selective pressure from antibiotics and pneumococcal multivalent conjugate vaccines, yet it remains a major human pathogen. S. pneumoniae colonizes sites of the upper respiratory tract and middle ear in the form of multi-strain biofilms. Studies of the pneumococcal pangenome suggest that more than 20% of the genes in any strain belong to the accessory genome and, thus, a substantial number of genes are differentially distributed across S. pneumoniae strains. This gene pool may be transferred among pneumococcal strains in the multi-species biofilms through genetic exchange, aiding their prevalence over competitors, their survival of the host immune system and their resistance to antibiotics and vaccines. The authors discuss the striking capacity of S. pneumoniae for genetic exchange and its implications for the pneumococcal pangenome and biology.
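The core/accessory distinction used in the pangenome studies summarized above can be illustrated with a minimal Python sketch. It assumes that each strain has already been reduced to a set of gene (ortholog group) identifiers; the strain names, gene identifiers and function names below are hypothetical and are not taken from the reviewed studies, which rely on dedicated pangenome software and real annotations.

```python
from typing import Dict, Set, Tuple


def partition_pangenome(strain_genes: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split a pangenome into core genes (present in every strain)
    and accessory genes (missing from at least one strain)."""
    if not strain_genes:
        raise ValueError("at least one strain is required")
    gene_sets = list(strain_genes.values())
    pangenome = set().union(*gene_sets)      # every gene seen in any strain
    core = set.intersection(*gene_sets)      # genes shared by all strains
    accessory = pangenome - core
    return core, accessory


def accessory_fraction(strain_genes: Dict[str, Set[str]], strain: str) -> float:
    """Fraction of one strain's genes that belong to the accessory genome."""
    core, _ = partition_pangenome(strain_genes)
    genes = strain_genes[strain]
    return len(genes - core) / len(genes)


# Toy, invented presence/absence data for three hypothetical strains.
toy = {
    "strain_A": {"g1", "g2", "g3", "g4", "g5"},
    "strain_B": {"g1", "g2", "g3", "g6"},
    "strain_C": {"g1", "g2", "g3", "g4", "g7"},
}

core, accessory = partition_pangenome(toy)
print(sorted(core))
print(sorted(accessory))
print(accessory_fraction(toy, "strain_A"))
```

In this toy data set, two of the five genes of `strain_A` are accessory. A strict "present in all strains" rule is only one possible definition of the core genome; published analyses often relax it to tolerate incomplete assemblies.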

In this Research Topic, we also collected 12 research articles and one method article. Seven of them largely concern the genomics of LAB species. Wels et al. employ comparative genomics in combination with machine learning to resolve the disparity in Lactococcus lactis of strains exhibiting a cremoris genotype while displaying a lactis phenotype. The authors manage to identify orthologous groups of protein sequences that can be used to predict either the taxonomic position or the source of isolation of L. lactis strains. In another study, Kelleher et al. define the plasmidome of L. lactis. The authors proceed with comparative analysis of newly sequenced and publicly available plasmids to identify technologically important traits they may carry. Alexandraki et al. performed comparative genomics of several fully sequenced genomes of S. thermophilus to reveal aspects of the evolution, biology and technological properties of the species. The study provides evidence for the existence of lineages within S. thermophilus based on the distribution of several genomic traits among its strains. Fontana et al. analyzed the phenotypic and genomic traits of six newly sequenced Lactobacillus helveticus strains. Pangenome analysis among many Lb. helveticus strains led to the identification of specific genes (e.g., genes involved in folate biosynthesis and maltose degradation as well as multiple copies of 6-phospho-β-glucosidase genes) that may support adaptation to the gut environment and a probiotic potential for some of the new strains. Kazou et al. present the first comprehensive comparative genomics study of Lactobacillus acidipiscis isolated from traditional Greek Kopanisti cheese. Lactobacillus acidipiscis does not seem to possess probiotic genomic traits similar to those found in the phylogenetically related Lactobacillus salivarius, but it carries pathways for the production of major volatile compounds during the catabolism of amino acids, which may contribute to cheese flavor. Leong et al. performed whole-genome analyses on more than three hundred vancomycin-resistant Enterococcus faecium (VREfm) isolates from Tasmania's hospitals. They applied different in silico multi-locus sequence typing (MLST) techniques and single-nucleotide polymorphism (SNP) analysis of the isolated strains and determined the relatively rapid spread of a specific sequence type (ST1421) to multiple hospitals in an Australian state. Finally, Ayala et al. present some potential probiotic strains with antimicrobial activity against foodborne pathogens. The strains were characterized by typical wet-lab experiments and were sequenced to uncover relevant genes by in silico methods.
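To make the in silico MLST and SNP analyses mentioned above more concrete, the following Python sketch shows the two underlying ideas in their simplest form: a sequence type is looked up from the combination of allele numbers at a fixed set of loci, and two aligned sequences are compared by counting the positions at which they differ. The locus names, allele profiles, sequence type labels and sequences are all hypothetical placeholders, and this is not the pipeline used by Leong et al.

```python
from typing import Dict, Optional, Tuple

# Hypothetical typing scheme: ordered loci and a profile-to-ST lookup table.
LOCI: Tuple[str, ...] = ("locus1", "locus2", "locus3")
PROFILES: Dict[Tuple[int, ...], str] = {
    (1, 1, 1): "ST-A",
    (1, 2, 1): "ST-B",
}


def assign_sequence_type(alleles: Dict[str, int]) -> Optional[str]:
    """Return the sequence type matching an isolate's allele profile,
    or None when the profile is not in the lookup table."""
    key = tuple(alleles[locus] for locus in LOCI)
    return PROFILES.get(key)


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count differing positions between two aligned sequences,
    ignoring positions where either base is not A, C, G or T."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in valid and b in valid
    )


isolate = {"locus1": 1, "locus2": 2, "locus3": 1}
print(assign_sequence_type(isolate))
print(snp_distance("ACGTAC", "ACCTAN"))
```

Real typing schemes use curated allele databases, and SNP analyses are normally run on whole-genome alignments against a reference; the sketch only conveys why isolates sharing a sequence type can still be distinguished by their SNP distances.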

The Research Topic also includes functional studies of LAB based on transcriptomics or proteomics data. van der Meulen et al. investigate the early transcriptome response of L. lactis to environmental stresses using RNA sequencing. The data presented suggest the involvement of many novel, previously uncharacterized stress-related genes and small regulatory RNAs, which warrants further investigation of how these genomic traits contribute to stress resistance under technologically relevant conditions. Prechtl et al. used a label-free, quantitative proteomics approach to determine the sucrose-induced response of Lactobacillus sakei in relation to dextran formation. Proteomics data in combination with comparative genomics suggest that Lb. sakei, typically isolated from fermented meat products, also exhibits adaptation to plant environments and may have a role as a starter in sucrose-based fermented foods. Spangler et al. test the hypothesis that Lactobacillus plantarum may respond to the quorum sensing molecule N-(3-oxododecanoyl)-L-homoserine lactone (3OC12) produced by Pseudomonas aeruginosa. Transcriptomics as well as proteomics data indicated that Lb. plantarum could sense and respond to 3OC12 by up-regulating genes and proteins, some of which were related to its native quorum sensing system.

Finally, three studies deal with metagenomics. Melkonian et al. present a computational approach that allows identifying properties that distinguish an LAB species within a microbial community using genomic and metagenomic information as well as gene annotation. The authors demonstrate the applicability of their approach by relating the dominance of Lb. kefiranofaciens in kefir and the thriving of Lb. plantarum in wine fermentation to specific genomic traits of the two species. Seol et al. present a methodological paper that attempts to tackle the issue of the precise determination of probiotic species in food products and probiotic supplements. The authors propose a new sequence coverage-based pipeline that relies on a reference database and can precisely determine probiotic species in products, independently of the sequencing platform used. Lastly, Gupta et al. use 16S rDNA metagenomics to assess alterations in the distal intestinal microbial communities of Atlantic salmon (Salmo salar) during administration of two different probiotic LAB. Each LAB changed the microbial composition of the community, resulting in co-occurrence networks that differed not only from that of the control salmon population but also from each other.

The studies presented in this Research Topic represent major contributions by a wide range of authors, drawing on different present-day technological and scientific approaches, to the various and contrasting facets through which LAB interact with food and human health.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### ACKNOWLEDGMENTS

We would like to thank the editorial staff at Frontiers in Microbiology for their initial invitation and support throughout.

#### REFERENCES

Holzapfel, W. H., and Wood, B. J. (Eds.). (2014). "Introduction to the LAB," in Lactic Acid Bacteria: Biodiversity and Taxonomy (Hoboken, NJ: John Wiley & Sons), 1–12. doi: 10.1002/9781118655252.ch1

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Papadimitriou, Kline, Renault and Kok. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Early Transcriptome Response of *Lactococcus lactis* to Environmental Stresses Reveals Differentially Expressed Small Regulatory RNAs and tRNAs

#### Sjoerd B. van der Meulen<sup>1,2</sup>, Anne de Jong<sup>1,2</sup> and Jan Kok<sup>1,2</sup>\*

*<sup>1</sup> Department of Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, Netherlands, <sup>2</sup> Top Institute Food and Nutrition, Wageningen, Netherlands*

Bacteria can deploy various mechanisms to combat environmental stresses. Many genes have previously been identified in *Lactococcus lactis* that are involved in sensing stressors or in regulating and mounting a defense against stressful conditions. However, the expression of small regulatory RNAs (sRNAs) during industrially relevant stress conditions has not yet been assessed in *L. lactis*, while sRNAs have been shown to be involved in many stress responses in other bacteria. We have previously reported the presence of hundreds of putative regulatory RNAs in *L. lactis*, and have used high-throughput RNA sequencing (RNA-seq) in this study to assess their expression under six different stress conditions. The uniformly designed experimental set-up enabled a highly reliable comparison between the different stress responses and revealed that many sRNAs are differentially expressed under the conditions applied. The primary stress responses of *L. lactis* NCDO712 were benchmarked to earlier work and, for the first time, the differential expression was assessed of transfer RNAs (tRNAs) and the genes from the six recently sequenced plasmids of NCDO712. Although we only applied stresses for 5 min, the majority of the well-known specific stress-induced genes are already differentially expressed. We find that most tRNAs decrease after all stresses applied, except for a small number, which are increased upon cold stress. Starvation was shown to induce the highest differential response, both in terms of number and expression level of genes. Our data pinpoint many novel stress-related uncharacterized genes and sRNAs, which calls for further assessment of their molecular and cellular function. These insights furthermore could impact the way parameters are designed for bacterial culture production and milk fermentation, as we find that very short stress conditions already greatly alter gene expression.

Keywords: RNA-Seq, sRNAs, transcriptomics, environmental stress, *L. lactis*

#### *Edited by:*

*Rachel Susan Poretsky, University of Illinois at Chicago, United States*

#### *Reviewed by:*

*Sinead M. Waters, Teagasc, The Irish Agriculture and Food Development Authority, Ireland*

*Dave Siak-Wei Ow, Bioprocessing Technology Institute (A\*STAR), Singapore*

> *\*Correspondence: Jan Kok jan.kok@rug.nl*

#### *Specialty section:*

*This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology*

*Received: 23 May 2017 Accepted: 23 August 2017 Published: 14 September 2017*

#### *Citation:*

*van der Meulen SB, de Jong A and Kok J (2017) Early Transcriptome Response of Lactococcus lactis to Environmental Stresses Reveals Differentially Expressed Small Regulatory RNAs and tRNAs. Front. Microbiol. 8:1704. doi: 10.3389/fmicb.2017.01704*

### INTRODUCTION

Bacteria display many general as well as specific molecular responses to environmental changes. Sudden alterations in the environment can be of a physical or chemical nature and can threaten the lifespan of a microbial cell, especially if the stress condition is too severe in duration or intensity. Stress alters the metabolic activity of bacteria used in industrial fermentations, for example their acidification rates and flavor formation (Xie et al., 2004; Taïbi et al., 2011). Gaining insight into the effects of stress could improve the predictability, quality, and safety of fermentations.

The lactic acid bacterium (LAB) Lactococcus lactis is of eminent importance in the dairy industry, where it is used worldwide for the production of a large variety of cheeses and of buttermilk. The main function during milk fermentation of LAB such as L. lactis is to convert lactose into lactic acid. The consequent lowering of the pH leads to an increased shelf life of the fermented products as it prevents outgrowth of spoilage or pathogenic organisms. In addition, L. lactis provides texture, flavors, and aromas to the end products (Marilley and Casey, 2004; Smit et al., 2005). During their preparation as a starter culture, as well as in the actual fermentation process, fluctuations in temperature, osmolarity, pH, and nutrient availability cause significant stress to the L. lactis cells. The cells have evolved different response systems to sense and respond to potentially lethal conditions and to defend themselves accordingly, in order to survive. Many of these mechanisms and the regulatory systems involved have been identified in L. lactis, on the basis of homology to proteins with known functions in other organisms and/or by experimental validation (Sanders et al., 1999; Smith et al., 2010). The potential role in stress of small non-coding regulatory RNAs (sRNAs) has not yet been assessed in L. lactis, while sRNAs have been shown to play an important function in a variety of stress conditions in other bacteria (Gottesman et al., 2006; Romby and Charpentier, 2010; Hoe et al., 2013).

Bacterial regulatory RNAs are generally non-coding transcripts that modulate gene expression post-transcriptionally (Waters and Storz, 2009). They are usually classified according to whether or not genes are encoded on the strand opposite to the strand from which they derive. Non-coding RNAs that are located within intergenic regions (IGRs) are referred to as small RNAs (sRNAs) and roughly contain between 50 and 350 nucleotides. They are very heterogeneous in size and structure, and act in trans to target mRNAs (Gottesman and Storz, 2011; Storz et al., 2011). Transcripts that overlap in an antisense fashion with mRNAs from the opposite strand are called antisense RNAs (asRNAs) (Thomason and Storz, 2010; Georg and Hess, 2011). sRNAs and asRNAs with proven functions in regulating other RNAs and/or proteins can be considered regulatory RNAs. Base-pairing between a regulatory RNA and its target mRNA(s) can affect mRNA stability as well as translation, the latter by influencing the accessibility of the ribosomal binding site (RBS) on the target transcript (Morita and Aiba, 2011; Prevost et al., 2011; Bandyra et al., 2012; Papenfort and Vanderpool, 2015). Since the discovery of the first regulatory RNAs, starting with antisense RNAs involved in plasmid copy number control (Stougaard et al., 1981; Tomizawa et al., 1981) and the first genomically encoded sRNA MicF (Mizuno et al., 1984), many others have been described, especially since recent advances have been made in high-throughput RNA sequencing (RNA-seq) as well as in high-density tiling arrays (Nicolas et al., 2012). Various mechanisms of action have been elucidated since, but determining the functions and mechanisms of action of novel sRNAs is still the major challenge. This may be illustrated by the abundant non-coding RNA 6S, which was discovered as early as 1967 (Hindley, 1967) but of which the function in modifying RNA polymerase activity was uncovered only in 2000 (Wassarman and Storz, 2000).

Several examples exist of sRNAs that perform a crucial role in the adaptation and survival of bacteria during stressful conditions. For instance, the sRNA RybB from E. coli and Salmonella is activated by the extracytoplasmic stress sigma factor, σ<sup>E</sup>. Accumulation of misfolded outer membrane proteins (OMPs), for example in the stationary phase, can cause cell envelope stress. RybB downregulates many different omp mRNAs in order to prevent the synthesis of OMPs and to restore envelope homeostasis (Johansen et al., 2006; Papenfort et al., 2006; Papenfort and Vogel, 2009). Another striking example of how effective and diverse the scope of one sRNA can be is the bifunctional sRNA SgrS from E. coli and Salmonella. The transcription factor SgrR is triggered upon glucose-phosphate stress and activates SgrS transcription. The stress is immediately relieved by detoxification of the phosphosugar. This is mediated by SgrS base-pairing with yigL, blocking yigL degradation by RNase E and resulting in increased YigL expression (Papenfort et al., 2013). To decrease phosphosugar accumulation in the cell by PtsG and ManXYZ, SgrS base-pairs with and blocks translation of the mRNAs of manXYZ and ptsG, which are eventually degraded (Bobrovskyy and Vanderpool, 2014). The 5′ end of SgrS contains an ORF, sgrT. This SgrT peptide interacts with PtsG in such a way that it blocks the uptake of glucose (Wadler and Vanderpool, 2007).

Recently, we have re-annotated the genome of L. lactis MG1363 by extensive mining of differential RNA sequencing (dRNA-seq) data of samples taken at various points during growth as a batch culture (van der Meulen et al., 2016). This has led to the addition of 186 small non-coding RNAs and 60 antisense RNAs to the L. lactis genome annotation. Here we have treated the parent strain of L. lactis MG1363, L. lactis NCDO712, which carries a number of plasmids with industrial relevance (Tarazanova et al., 2016), to a number of industrial stresses. Specifically, the strain was exposed for a relatively short period of time of 5 min to cold, heat, acid, osmotic, oxidative or starvation stress to explore the organism's early transcriptome responses by strand-specific RNA-seq. The expression of recently annotated sRNAs and asRNAs was assessed and a significant number of them were observed to be differentially expressed under the various stress conditions. Moreover, extensive data mining allowed pinpointing many genes that are involved in the investigated, industry-related stress conditions.

### MATERIALS AND METHODS

### Bacterial Strains and Media

Lactococcus lactis subsp. cremoris NCDO712 is an industrial strain containing six plasmids of which one, pLP712, carries lactose and casein utilization genes (Tarazanova et al., 2016). L. lactis MG1363 is its plasmid-free derivative (Gasson, 1983). L. lactis NCDO712 was grown as a standing culture at 30°C in M17 broth (Difco, Becton Dickinson, Le Pont de Claix, France) containing 0.5% (w/v) glucose (GM17), and on GM17 agar plates.

#### Stress Treatments

All stress conditions were applied in two biological replicates by starting with two single colonies of L. lactis NCDO712 grown on GM17 agar plates. These were each used to inoculate 10 ml of fresh GM17 medium and grown overnight. Each overnight culture was diluted 100-fold in a bottle with 500 ml GM17 and grown until an optical density at 600 nm (OD600) of 0.9 was reached. The content of each bottle was divided over seven 50-ml Greiner tubes and centrifuged at 4,000 g for 1.5 min. Subsequently, the cultures from each bottle were subjected to all conditions tested. The cell pellets were re-suspended in fresh GM17 containing 0.25% glucose (G∗M17) under the following conditions: control (G∗M17 at 30°C), cold (G∗M17 at 10°C), heat (G∗M17 at 42°C), acid (G∗M17 set at pH 4.5 with lactic acid), osmotic stress (G∗M17 containing 2.5% NaCl), and oxidative stress (G∗M17, shaking at 250 rpm). The cold stress was applied in an incubator while the heat stress was performed in a water bath to keep the temperatures stable. For starvation stress, the cell pellets were re-suspended in filter-sterilized phosphate-buffered saline (PBS). The stress conditions were applied for 5 min, after which the cells were harvested by centrifugation at 10,000 rpm for 1 min, snap frozen in liquid nitrogen and stored at −80°C prior to RNA isolation.
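The design described above (seven treatments, each applied to both biological replicates, for 14 samples in total) can be written down as a small sample sheet. The following is a minimal sketch; the field names, the sample naming scheme and the `build_sample_sheet` helper are illustrative assumptions and are not part of the original protocol.

```python
from itertools import product

# Treatments as described in the text; the dictionary keys are illustrative labels.
CONDITIONS = {
    "control": "G*M17 at 30 C",
    "cold": "G*M17 at 10 C",
    "heat": "G*M17 at 42 C",
    "acid": "G*M17 at pH 4.5 (lactic acid)",
    "osmotic": "G*M17 with 2.5% NaCl",
    "oxidative": "G*M17, shaking at 250 rpm",
    "starvation": "filter-sterilized PBS",
}
REPLICATES = (1, 2)   # two biological replicates, each from a single colony
EXPOSURE_MIN = 5      # every treatment lasted 5 min


def build_sample_sheet():
    """Return one record per (condition, replicate) combination."""
    return [
        {
            "sample": f"{condition}_rep{replicate}",  # hypothetical naming scheme
            "condition": condition,
            "replicate": replicate,
            "medium": medium,
            "exposure_min": EXPOSURE_MIN,
        }
        for (condition, medium), replicate in product(CONDITIONS.items(), REPLICATES)
    ]


for row in build_sample_sheet():
    print(row["sample"], row["medium"], sep="\t")
```

Because every bottle is split over all seven treatments, each stress sample has a control that originates from the same culture, which is what makes the pairwise contrasts directly comparable.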

#### RNA Isolation

RNA was isolated as described previously (van der Meulen et al., 2016). All procedures were executed at 4°C unless otherwise stated and all solutions were DEPC-treated and subsequently autoclaved. Frozen cell pellets were re-suspended in 400 µl TE buffer (10 mM Tris-HCl, 1 mM EDTA, pH 7.4) and added to 50 µl 10% sodium dodecyl sulfate (SDS), 500 µl phenol/chloroform (1:1 v/v), and 0.5 g glass beads (75–150 µm, Thermo Fisher Scientific, Rockford, IL, United States). The cells were disrupted by shaking 2 times for 45 s in a Biospec Mini-BeadBeater (Biospec Products, Bartlesville, OK, United States) with cooling on ice for 1 min between the shaking steps. Subsequently, the cell suspension was centrifuged at 14,000 rpm for 10 min. The upper phase containing the nucleic acids was treated with 500 µl chloroform and centrifuged as above. Nucleic acids in the water phase were precipitated by sodium acetate and ethanol. The nucleic acid pellet was re-suspended in 100 µl buffer consisting of 82 µl MQ, 10 µl 10x DNase I buffer, 5 µl RNase-free DNase I (Roche Diagnostics GmbH, Mannheim, Germany), and 3 µl RiboLock RNase inhibitor (Fermentas/Thermo Scientific, Vilnius, Lithuania), and treated for 30 min at 37°C. The RNA was then purified using standard phenol/chloroform extraction and sodium acetate/ethanol precipitation. RNA pellets were resuspended in 50 µl elution buffer from the High Pure RNA Isolation Kit (Roche Diagnostics, Almere, the Netherlands) and stored at −80°C.

### RNA Treatment, Library Preparation, and RNA Deep Sequencing

RNA concentration was measured with a Nanodrop ND-1000 (Thermo Fisher Scientific). As a measure of RNA quality, the integrity of the 16S/23S rRNA and the presence of any DNA contamination were assessed by using an Agilent 2100 Bioanalyser (Agilent Technologies, Waldbronn, Germany). cDNA libraries were prepared by employing a ScriptSeq™ Complete Kit for Bacteria (Epicentre, Madison, WI, United States) including Ribo-Zero™ for rRNA removal. The cDNA libraries were sequenced at Otogenetics Corporation (Norcross, GA, United States) on an Illumina HiSeq2000.

#### Data Analysis

Raw sequence reads were analyzed for quality and trimmed to a PHRED score >28. Read alignment was performed on the genomic DNA of L. lactis NCDO712 (nucleotide sequences of the chromosome and all six plasmids of NCDO712; Tarazanova et al., 2016) using Bowtie 2 (Langmead and Salzberg, 2012). RPKM values were used as an input for the T-REx analysis pipeline (de Jong et al., 2015) together with a text file describing the factors, contrasts, and classes. In the class file, genes from the NCDO712 plasmids were colored green while sRNAs were colored red (see **Supplementary Information S1**). T-REx was used to perform all statistical analyses (de Jong et al., 2015).
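The text does not spell out how the expression values fed into T-REx were computed. As an illustration only, the standard definition of reads per kilobase of transcript per million mapped reads can be sketched as below; the function name and the example numbers are hypothetical and are not taken from the study.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = reads mapped to the gene * 1e9 / (gene length in bp * total mapped reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical numbers for illustration only: 1,500 reads on a 1-kb gene
# in a library of 15 million mapped reads.
example = rpkm(read_count=1_500, gene_length_bp=1_000, total_mapped_reads=15_000_000)
print(f"{example:.1f}")
```

Normalizing by both gene length and library size matters here because the features compared range from short sRNAs and tRNAs to long protein-coding genes, and the libraries differ in depth.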

#### Data Access

The RNA-seq data are publicly available in GEO under accession number GSE98499.

### RESULTS

#### RNA-Seq Reveals That Starvation Has a Large Impact on the Transcriptome of *L. lactis*

The transcriptomic response of L. lactis subsp. cremoris NCDO712 (hereafter named L. lactis NCDO712) after 5 min of cold, heat, acid, osmotic, oxidative, or starvation stress was determined by high-throughput RNA sequencing. This resulted in a total of 246M reads, of which 209M reads (85%) were successfully mapped on the genome and plasmids of L. lactis NCDO712. The libraries varied between 11M and 19M reads per individual sample (**Figure 1A**). The data was normalized using the T-REx software (de Jong et al., 2015) and plotted in a box plot of normalized signals for all samples (**Figure 1B**). A Principal Component Analysis (PCA) shows that the stress conditions are well separated from each other (**Figure 1C**). The transcriptomes of cells exposed to osmotic stress or to starvation were most different from that of the control. The absolute numbers of differentially expressed genes underpin these observations; the largest transcriptome changes were observed after starvation (756 genes involved), while oxidative stress had the least impact (91 affected genes). **Table 1** gives an overview of the counts of the differentially expressed genes and **Figure 2** shows the distribution of affected genes for each stress condition. To gain insight into the distribution of the stress-responsive genes, those of which the expression changed highly (fold change ≥ 5 and p ≤ 0.01) under all conditions were visualized in a heatmap. T-REx was used to pinpoint nine clusters that vary strongly in size and in the functions of the constituting genes. One cluster is rich in stress-related genes, while another one contains predominantly sRNAs. See **Figure 3** for the heatmap and the complete list of cluster descriptions. Genes that were affected ≥ 10-fold (p ≤ 0.01) are listed in **Table 2** and are discussed in more detail below.
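The two cut-offs given in the footnotes to Table 1 (fold change ≥ 2 with p ≤ 0.05, and fold change ≥ 5 with p ≤ 0.01) can be expressed as a simple classifier. This is a minimal sketch and not the T-REx implementation; the labels `"high"`, `"significant"` and `"unchanged"`, and the example values, are assumptions made for illustration.

```python
def de_class(fold_change, p_value):
    """Classify a gene by the two cut-offs listed under Table 1.

    fold_change is the unsigned magnitude of the change (>= 1),
    regardless of whether the gene went up or down.
    """
    if fold_change >= 5 and p_value <= 0.01:
        return "high"          # fold change >= 5 and p <= 0.01
    if fold_change >= 2 and p_value <= 0.05:
        return "significant"   # fold change >= 2 and p <= 0.05
    return "unchanged"


# Hypothetical (fold change, p-value) pairs for three placeholder genes.
examples = {"gene_a": (12.0, 0.001), "gene_b": (2.5, 0.03), "gene_c": (6.0, 0.2)}
for name, (fc, p) in examples.items():
    print(name, de_class(fc, p))
```

Note that a large fold change alone is not sufficient: the third placeholder gene changes 6-fold but fails both p-value cut-offs and is therefore left unclassified.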

### sRNAs Are Highly Affected after 5 Min of Stress Induction

TABLE 1 | Absolute numbers of differentially expressed genes after 5 min of exposure to the indicated stress.

\**Genes with a fold change* ≥ *5 and a p* ≤ *0.01.*

*#Genes with a fold change of* ≥ *2 and a p* ≤ *0.05.*

circles: sRNA genes, gray circles: all other genes.

We assessed the expression of the 186 sRNAs that have recently been identified in the genome of L. lactis MG1363 (van der Meulen et al., 2016). The majority of the sRNAs that were significantly changed after applying acid, oxidative, or starvation stress showed a decrease in expression. In contrast, after cold stress more sRNAs were upregulated than downregulated (see **Figure 4**). Of the 186 sRNAs, the expression of 110 was significantly changed after applying at least one stressor, while 42 sRNAs were differentially expressed under at least three stress conditions (**Table 3**). This list of sRNAs was restricted to those with a logCPM value >1. The expression of some of these 110 sRNA genes was highly affected under only one specific stress condition. For example, LLMGnc\_152 (10.7-fold) and LLMGnc\_153 (6.0-fold) were increased specifically after cold stress, while LLMGnc\_176 (6.3-fold) showed a higher expression after salt stress. The sRNA LLMGnc\_092 (−9.2-fold) was only decreased after starvation, while LLMGnc\_025/064/065/073 were downregulated in all conditions.
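Two bookkeeping steps sit behind numbers such as "−9.2-fold" and "differentially expressed under at least three stress conditions". The sketch below shows one way to do both. It assumes that signed fold changes are derived from log2 ratios, a convention the text does not state, and the sRNA names and calls are placeholders rather than data from the study.

```python
def signed_fold(log2_ratio):
    """Convert a log2 ratio into a signed fold change.

    Positive ratios give the fold increase; negative ratios give the
    fold decrease with a minus sign (for example, -3.2 gives about -9.2).
    """
    if log2_ratio >= 0:
        return 2.0 ** log2_ratio
    return -(2.0 ** -log2_ratio)


def responsive_in_at_least(calls, min_conditions):
    """Return the sRNAs that pass the cut-offs in >= min_conditions conditions.

    calls maps an sRNA name to the set of conditions in which it was
    called differentially expressed.
    """
    return sorted(
        name for name, conditions in calls.items()
        if len(conditions) >= min_conditions
    )


# Hypothetical calls for three placeholder sRNAs.
calls = {
    "sRNA_A": {"cold"},
    "sRNA_B": {"acid", "osmotic", "starvation"},
    "sRNA_C": set(),
}
print(responsive_in_at_least(calls, 1))
print(responsive_in_at_least(calls, 3))
print(round(signed_fold(-3.2), 1))
```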

### Most Transfer RNAs Decrease Rapidly after a Short Pulse of Stress

Transfer RNAs (tRNAs) play a crucial role in the translation of mRNAs and it is important for cells to balance their tRNA levels, as well as to ensure optimal utilization of amino acids. Most of the L. lactis tRNAs are downregulated under all of the stresses applied. A number of tRNAs, such as NCDO\_2402 (Val-CAG) and NCDO\_2022 (Lys-TTT), are upregulated under some of the conditions. A distinct tRNA response is observed upon cold stress; seven tRNAs are upregulated by at least 2-fold. Exposure to acid or starvation stress induced the most severe changes in tRNA expression. See **Figure 4** for a complete overview of tRNA expression under the various stress conditions.

#### Cold Stress Induces a Zinc Uptake System

During a 5-min cold stress at 10°C, 412 genes were differentially expressed (**Table 1**). The most highly upregulated transcripts include those from the zit operon, a gene cluster involved in Zn<sup>2+</sup> uptake and regulation (Llull and Poquet, 2004). Also, expression of the gene sugE, encoding a presumed multidrug resistance protein, was increased (∼18-fold), as well as genes involved in nucleotide synthesis (pur and pyr operons) and the fab and acc operons for saturated fatty acid biosynthesis. Transcripts encoding Cold Shock Proteins A, B, C, and D were upregulated at least 4-fold, as would be expected in cells under cold stress (Wouters et al., 1998). The gene of an uncharacterized protein (Llmg\_1848) with high sequence similarity to a bacteriocin in other L. lactis species is located downstream of cspA and was also upregulated by a factor of ∼10. Downregulation was observed, for example, for the fructose utilization fru operon (∼18-fold), the lysine-specific permease lysP gene (∼8-fold) and the ribosomal RNA 5S (∼9-fold). Thirty-nine sRNAs from intergenic regions were significantly affected, of which 12 were changed at least 5-fold (see **Figure 4** and **Table 3**).

#### Heat Stress Induces, Next to Protein Chaperone Genes, the *Arc* Operon

At high temperatures, proteins may be at risk of denaturation and cells may encounter difficulties in the synthesis of new proteins (Nguyen et al., 1989; Parsell and Lindquist, 1993). The genes of several protein chaperones, such as GroEL, GroES, DnaJ, DnaK, and GrpE, are usually upregulated after heat stress. The chaperones aid in protein folding and maturation. Their genes were upregulated ∼14–18 times (except dnaJ, which was increased ∼5-fold) after L. lactis was placed for 5 min at 42°C (see **Table 2**). The Clp protease genes clpE, clpP, and clpB were also upregulated, as was the gene with an unknown function upstream of clpE, llmg\_0527. Surprisingly, expression of genes belonging to the arginine deaminase pathway (arc operon) was increased ∼11-fold, while this operon was downregulated after 30 min of incubation of L. lactis IL1403 at 42°C in previous work (Xie et al., 2004). An operon predicted to be involved in the utilization of maltose (llmg\_0485-llmg\_0490), the ribose operon (rbsABCDK) and llmg\_1210, predicted to encode the drug resistance transporter EmrB, were all upregulated. We also observed that the lacR-lacABCDEF gene cluster, located on the largest plasmid pLP712 of L. lactis NCDO712 (Wegmann et al., 2012) and involved in lactose utilization, was upregulated ∼3-fold. The heat treatment caused a decrease by a factor of 2–3 of several rRNA species, as well as of most tRNA transcripts (**Figure 5**). An operon (llmg\_2513-llmg\_2515) containing a gene for the universal stress protein A (UspA) and two uncharacterized genes was decreased most severely after heat shock, followed in severity of fold change by the multidrug resistance protein B gene (lmrB, llmg\_1104). In total, 37 sRNAs were differentially expressed (**Table 3**). Expression of two of them was decreased over 5-fold: LLMGnc\_065 (−6.0) and LLMGnc\_079 (−6.2).

#### TABLE 2 | Genes differentially expressed as a result of various stress conditions.



*Genes are shown in this table when they are differentially expressed* ≥*10-fold, with a p* ≤ *0.05 and logCPM* ≤ *1 in at least one of the stress conditions, or previously reported in the literature during the relevant stress condition. sRNA and tRNA genes are excluded from this table as they are reported separately.*

#### Acid Stress Induces Nucleotide Biosynthesis and Cysteine/Methionine Metabolism

Genes for de novo synthesis of pyrimidines (pyr) and purines (pur) were highly upregulated after 5 min of exposure to pH 4.5. Increased expression of these operons was also seen during starvation and, to a lesser extent, upon cold shock. Among the top upregulated genes were metC-cysK, encoding a cystathionine gamma-lyase and cysteine synthase, respectively, involved in cysteine and methionine metabolism. The genes llmg\_0333-0340 were all affected by the acid stress applied, although they are not transcribed from the same operon. Among these genes are those of a putative methionine ABC transporter system (llmg\_0335-0340, upregulated), and thiT (llmg\_0334) encoding the thiamine transporter (Erkens and Slotboom, 2010) and a putative transcriptional regulator gene (llmg\_0333) (both downregulated). The fab operon, responsible for the biosynthesis of saturated fatty acids (SFAs) for membrane phospholipids, was upregulated. Expression of the operon llmg\_2513-2515 was increased, while these genes were strongly downregulated after heat (see above) and salt stress (see further). The putative Mn<sup>2+</sup>/Fe<sup>2+</sup> transporter gene mntH, which is located on plasmid pNZ712, was upregulated 2.5-fold under pH 4.5 stress. Interestingly, the chromosomal gene mgtA, putatively specifying Mg<sup>2+</sup> transport, was downregulated 3.7-fold. Upregulation was furthermore observed for llmg\_1915-1917 (ykgEFG) and for the following genes with predicted functions: llmg\_1066 (unknown), llmg\_1133 (exonuclease), and llmg\_1702 (glutathione reductase). The fruAKR operon for fructose transport, conversion of fructose 1-phosphate to fructose 1,6-bisphosphate and their regulation by the FruR repressor (Barriere et al., 2005), was downregulated on average ∼28-fold. The fhuABCD operon for putative ferric siderophore transport was downregulated by a factor of 3–7. Of the 51 sRNA genes of which the expression was significantly changed, only 8 were upregulated; among these was LLMGnc\_121, which was exclusively upregulated after acid stress (**Table 3**).

### Osmotic Stress Induces Chaperones and a Putative Stress-Responsive Regulator

Osmotic stress was induced by adding 2.5% NaCl to the cell culture for 5 min. Among the highest upregulated genes are those of an operon (llmg\_2163 - llmg\_2164) specifying a putative stress-responsive transcriptional regulator with a PspC domain (Llmg\_2163). Both genes were ∼10-fold upregulated; the operon was also induced upon overproduction in L. lactis of the membrane protein BcaP (Pinto et al., 2011) and after exposure to the bacteriocin Lcn972 (Martínez et al., 2007). A deletion mutant of llmg\_2164 was shown to be very sensitive to NaCl (Roces et al., 2009). In E. coli, the psp operon is induced after application of various types of stresses including salt stress (Brissette et al., 1991). As expected, induction was seen of the genes specifying the glycine betaine ABC transport system BusAA-BusAB (∼7- to 8-fold). Some of the responses observed during the exposure to the high concentration of salt were similar to those seen after heat stress. In particular, transcripts encoding the chaperones GroEL, GroES, DnaJ, DnaK, and GrpE were upregulated in the same fold change range. Induction of these proteins has been reported previously for both heat and salt stress (Kilstrup et al., 1997). Both operons for oligopeptide transport were also upregulated under the salt stress applied here.

Genes encoding transporters for various substrates were strongly downregulated, among which the PTS IIA component genes fruA and ptcA, that of the IIB PTS component, bglP, lmrB specifying multidrug resistance protein B (both llmg\_0967 and llmg\_1104), msmK, encoding a multiple sugar ABC transporter, and amtB, involved in ammonium transport. Also, the gene encoding the glycerol uptake facilitator protein GlpF2 was downregulated by ∼10-fold, as was the lac operon on pLP712. As shown before (Xie et al., 2004), the potABCD operon involved in spermidine/putrescine transport and the fatty acid biosynthesis operons fab and acc were downregulated under osmotic stress. As mentioned above, the llmg\_2513-2515 operon was downregulated. Of the 44 significantly affected sRNA genes, the expression of LLMGnc\_036 (214-fold down), LLMGnc\_118 (75-fold down), and LLMGnc\_176 (6.3-fold up) changed only after salt stress, using a threshold of 5-fold change.

### Shaking of a Culture of *L. lactis* Triggers Saturated Fatty Acid Biosynthesis Genes

Of all stress conditions tested, oxidative stress applied by shaking of the culture resulted in the smallest number of differentially expressed genes (**Table 1**). The ones that did change did so with relatively minor fold changes. Among the few upregulated genes were those of the pathway for saturated fatty acid biosynthesis, including fabT, the transcriptional repressor of this route (Eckhardt et al., 2013). Downregulation was seen of arcABD1C1C2, llmg\_1915-1917 (ykgEFG), and of genes involved in the uptake and/or utilization of maltose, trehalose, and lactose. The heat shock chaperone genes groEL and groES were downregulated ∼2-fold. Unexpectedly, the gene for the manganese superoxide dismutase SodA, an enzyme well-known for its role in oxidative stress, was decreased 2-fold relative to the unstressed control. Twenty sRNAs were affected by at least 2-fold. LLMGnc\_019 was downregulated 9.4-fold, while it was upregulated in all the other stress conditions (see **Table 3**).

### Starvation in PBS Resulted in the Most Dramatic Transcriptome Changes

TABLE 3 | Differential expression of sRNA genes under industrially relevant stress conditions.

*The 110 sRNA genes that are differentially expressed in at least one stress condition, with a cut-off fold change* ≥ *2, p* ≤ *0.05 and logCPM* > *1 are indicated. Blue: upregulated, red: downregulated, color intensity is a measure of fold change. The names of the sRNAs are those that have been previously annotated in L. lactis MG1363, the plasmid-free derivative of L. lactis NCDO712 (van der Meulen et al., 2016).*

Incubation of the cells for 5 min in PBS greatly affected the transcriptome of L. lactis, as witnessed by the large number of 150 genes of which the expression had changed significantly, and at least by 5-fold. Different biological functions were switched on or off in response to the sudden absence of basically all nutrients. For example, the operons for the de novo biosynthesis of nucleotides, pyr, pur, and car, were all highly upregulated. Moreover, the genes of several transport systems were upregulated in an apparent attempt to import a (any) carbon source (lactose: lac operon, multiple sugar ABC transporter: msmK, ribose: rbs operon, cellobiose/glucose: ptcC), ions (manganese: mtsA, zinc: zit operon), amino acids (arginine: arc, methionine: metQ, branched-chain amino acid transporter: ctrA/bcaP), and vitamins (riboflavin: rib operon). On the other hand, some uptake system genes were downregulated, such as thiT, specifying the thiamin transporter, and the fru operon for fructose uptake and utilization. Strong downregulation was also observed for the genes of all six cold shock proteins. Dozens of (predicted) transcriptional regulators were affected upon the starvation stress applied here, among which all of the spxA genes. Interestingly, their expression profiles were very different. For instance, spxA with locus tag llmg\_0640 was decreased ∼19-fold, while expression of llmg\_1703 and llmg\_1130 was increased 12.5- and 53-fold, respectively. Notably, the gene for the putative transcriptional repressor CadC, which is located on plasmid pSH73 (Tarazanova et al., 2016), was 4.9-fold upregulated. Besides a strong decrease in the expression of different tRNA genes, transcripts for ribosomal proteins were also downregulated. Starvation changed the expression of 75 sRNAs, of which the majority was downregulated. The nine sRNA genes that were upregulated after starvation were also increased under at least one of the other conditions tested here, with the exception of LLMGnc\_060, which was only affected after starvation.

#### DISCUSSION

High-throughput RNA sequencing was used in this study to examine the transcriptome changes caused by various industrially relevant stress conditions applied to L. lactis NCDO712. Previous studies using DNA microarray and proteomics technologies have identified genes and proteins involved in the various environmental stress responses in L. lactis. However, little to no insight has been obtained so far as to which small regulatory RNAs (sRNAs) and antisense transcripts (asRNAs) are affected by stress, and to what extent. Because DNA microarray probes are strand-specific, this technology does not detect antisense transcripts. Also, conventional DNA microarrays usually do not carry tRNA probes and, of course, no probes for as yet undefined transcripts, although high-density tiling arrays can detect unknown transcripts. RNA sequencing can be used to uncover all transcripts in an organism at a specific moment in time. It also provides a higher dynamic range for quantitative gene expression analysis than DNA microarrays, provides single-base resolution and suffers less from background noise signals (Wang et al., 2009). In the present study, we have applied a size cut-off of 50 nt to detect sRNAs.

Bacterial cells employed during starter culture or cheese production encounter stress conditions that are similar to the ones applied here, albeit not always as instant and short-lived (5 min induction in our experiments). We chose such a very brief duration of the stressors because we were interested in the ensuing very first transcriptional responses, while in previous studies stress conditions were applied for 10 min up to 4 h (Sanders et al., 1999). The longer the exposure time, the more secondary effects are activated that obscure the actual first response. The expression of stress-induced genes can quickly build up to a certain level, after which it decreases again. This is, for instance, observed for the L. lactis hrcA, groESL, dnaJ, and dnaK genes, which all reach a maximum expression level after 15 min of heat shock (Arnau et al., 1996). A study in Salmonella typhimurium shows a detailed overview of sRNA expression over time, by sequencing of Hfq-bound transcripts. While some sRNAs were expressed throughout growth, others were only dominant at one specific growth phase (Chao et al., 2012). Growth-phase and stress-dependent expression of sRNAs in L. lactis was examined only for a selection of sRNAs in our previous work (van der Meulen et al., 2016). Here, we focused our analyses on the changes in expression levels of sRNA genes and of tRNA genes. Although the current study was of a fundamental nature, some of the results presented here might ultimately be used in starter culture production or milk fermentation by applying short pulses of stress to the bacteria.

We observed that a number of operons and individual genes were highly affected by three or more stress conditions. For example, the pur and pyr operons for the de novo synthesis of purines and pyrimidines were upregulated after cold, acid and starvation stress. The fruAKR operon was downregulated upon cold, acid, osmotic, and starvation stress. The fru operon was previously reported to be upregulated in response to cell envelope stress by the bacteriocin Lcn972 in L. lactis (Martínez et al., 2007), and upregulated in Lactococcus garvieae after cold stress (Aguado-Urda et al., 2013). Therefore, we could argue that fru is highly reactive to different stressors. The metC-cysK operon was upregulated under all stress conditions applied here except salt addition. The strongest effect was observed under acid stress. The Lactobacillus plantarum metC-cysK genes were upregulated after exposure to 0.1% porcine bile (Bron et al., 2006), but downregulated by p-coumaric acid (Reverón et al., 2012). Genes from the saturated fatty acid (SFA) biosynthesis pathway (fab and acc) were upregulated after oxidative, cold and acid stress, while they were downregulated after starvation, heat and salt stress. For cold stress, however, one would have expected to see a decrease in SFAs, to maintain membrane fluidity at lower temperatures (Tsakalidou and Papadimitriou, 2011). The differences observed during oxidative stress might not be caused by higher levels of oxygen, but rather by the shaking that was used to induce oxidative stress. Previously, it was reported that cell envelope stress caused by membrane protein overproduction also affects fab. The exact response in that study depended on the identity of the overproduced membrane protein, leading to either an increase or a decrease in fab expression (Marreddy et al., 2011). We therefore propose that fatty acid biosynthesis in L. lactis is highly adaptive to various stressors and might rapidly fluctuate in time after the stress has been applied.

Major groups of defense mechanisms were significantly induced, despite the short exposure times employed. These include transcripts encoding protein chaperones such as GroEL, GroES, DnaK, DnaJ, and GrpE, which were induced during heat and salt stress. These conditions also induced protease genes such as clpE, clpP, and clpB. Cold shock induces cspA, cspB, cspC, and cspD, specifying the major cold shock proteins that bind to DNA or RNA (Ermolenko and Makhatadze, 2002). During cold stress, the zit operon for the uptake of Zn<sup>2+</sup> was highly upregulated, suggesting that zinc ions play an important role during cold stress, possibly as a cofactor for cold stress-related proteins, and/or have an effect on membrane fluidity. As expected, osmotic stress induced the expression of the genes of the transport proteins BusAA-BusAB. However, no significant induction of gadCB was observed after osmotic and heat stress. Strong downregulation of glpF2 was observed specifically after osmotic stress, while its expression was slightly increased after starvation. The glycerol uptake facilitator protein GlpF2 from L. plantarum was shown to facilitate the diffusion of water, dihydroxyacetone and glycerol (Bienert et al., 2013), and could be an important factor for osmotic homeostasis in the cell. The upregulation of the arc operon after incubation for 5 min at 42°C was unexpected, since it has been reported before that arc expression decreases upon 30 min of heat stress at 42°C (Xie et al., 2004). This might be explained by differences in the heat exposure times and/or the specific strains used. Also, the arc operon is under complex regulation, with several protein regulators being involved [CcpA, ArgR/AhrC and, possibly, CodY (den Hengst et al., 2005; Zomer et al., 2007; Larsen et al., 2008)], which apparently leads to a rather dynamic expression profile (J.P. Pinto, PhD thesis, Groningen, 2015). Another unexpected result was the decrease of sodA expression during oxidative stress, while this gene is usually reported to be strongly induced under these circumstances (Sanders et al., 1995; Miyoshi et al., 2003). The short time of shaking (5 min) perhaps did not allow stressful oxygen levels to build up. A more pronounced effect could have been generated with baffled shake flasks or by active addition of oxygen to the flasks, although these conditions would be far from industrial reality. Stress-related operons and genes reported before and observed here are summarized in **Figure 6**.

The plasmids of L. lactis NCDO712 have recently been sequenced and annotated, allowing examination of the responses of their genes to the various stresses applied here. Indeed, various plasmid genes were affected under stress, such as the lactose utilization lac operon on pLP712, the putative Mn<sup>2+</sup>/Fe<sup>2+</sup> transporter specified by mntH (pNZ712; acid stress) and the gene for a possible transcriptional repressor, cadC (pSH73; starvation). Since we applied a very short stress exposure of 5 min, it is unlikely that these differences were the result of a change in plasmid copy numbers.

Transfer RNAs (tRNAs) are crucial components in translation. The rate of translation of a certain codon is directly coupled to the amount of cognate tRNA in the cell (Varenne et al., 1984), which is a measure of gene expression potential of the bacterial cell. Recently, the tRNAome of L. lactis was determined including the positions of 16 post-transcriptional modifications (Puri et al., 2014). Protein-overexpression stress employed in that study led to changes in tRNA expression required for up-regulation of housekeeping genes. An increase in tRNAs that would reflect the codon usage of the gene of the overexpressed protein was not seen. In the study presented here we generally observe a decrease of most of the tRNAs when L. lactis is placed under stress. Heat, cold, and oxidative stress seem to affect cellular tRNA transcript levels the least; we noted that cold stress induces the expression of seven tRNA genes. The RNA-seq data did not allow uncovering the actual charging or state of modification of the tRNAs.

Of the 186 sRNA genes currently annotated in L. lactis, more than half were shown to be differentially expressed in response to one or more of the six different stress conditions that we employed. Since functional characterization has only been performed for a small number of these sRNAs (van der Meulen et al., 2016), only a few conclusions can be drawn as yet. Some of the downregulated sRNAs might perform housekeeping functions and would not be required under conditions in which cells no longer grow. On the other hand, sRNAs that are induced might respond to the stressor either to bring the cells into a protective state of slower or no growth, or they might act more specifically in order to overcome the harmful environmental change. Research on the latter group of sRNAs could increase our insight into their functioning during industrially relevant stress conditions and is currently ongoing.

### AUTHOR CONTRIBUTIONS

SM and JK designed the experiments; SM performed the experiments; SM and AJ analyzed the data; SM and JK wrote the manuscript.

#### ACKNOWLEDGMENTS

We kindly thank the members of the functional fermentation team within the Top Institute Food and Nutrition (TIFN). In particular, we would like to thank Herwig Bachmann, Eddy Smid, Arjen Nauta, Claire Price, and Hans Brandsma for helpful discussions.

#### REFERENCES


#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmicb.2017.01704/full#supplementary-material

Supplementary Information S1 | Excel table containing the input files for T-Rex.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 van der Meulen, de Jong and Kok. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Comparative Genomics of Lactobacillus acidipiscis ACA-DC 1533 Isolated From Traditional Greek Kopanisti Cheese Against Species Within the Lactobacillus salivarius Clade

Maria Kazou<sup>1</sup>, Voula Alexandraki<sup>1</sup>, Jochen Blom<sup>2</sup>, Bruno Pot<sup>3</sup>, Effie Tsakalidou<sup>1</sup> and Konstantinos Papadimitriou<sup>1</sup> \*

<sup>1</sup> Laboratory of Dairy Research, Department of Food Science and Human Nutrition, Agricultural University of Athens, Athens, Greece, <sup>2</sup> Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, Germany, <sup>3</sup> Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Department of Bioengineering Sciences (DBIT), Vrije Universiteit Brussel, Brussels, Belgium

#### Edited by:

Frank T. Robb, University of Maryland, Baltimore, United States

#### Reviewed by:

Jinshui Zheng, Huazhong Agricultural University, China

Seong Woon Roh, World Institute of Kimchi (WIKIM), South Korea

Sergei Kozyavkin, Fidelity Systems, United States

\*Correspondence:

Konstantinos Papadimitriou kpapadimitriou@aua.gr

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 25 October 2017 Accepted: 23 May 2018 Published: 11 June 2018

#### Citation:

Kazou M, Alexandraki V, Blom J, Pot B, Tsakalidou E and Papadimitriou K (2018) Comparative Genomics of Lactobacillus acidipiscis ACA-DC 1533 Isolated From Traditional Greek Kopanisti Cheese Against Species Within the Lactobacillus salivarius Clade. Front. Microbiol. 9:1244. doi: 10.3389/fmicb.2018.01244

Lactobacillus acidipiscis belongs to the Lactobacillus salivarius clade and it is found in a variety of fermented foods. Strain ACA-DC 1533 was isolated from traditional Greek Kopanisti cheese and among the available L. acidipiscis genomes it is the only one with a fully sequenced chromosome. L. acidipiscis strains exhibited a high degree of conservation at the genome level. Investigation of the distribution of prophages and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) among the three strains suggests the potential existence of lineages within the species. Based on the presence/absence patterns of these genomic traits, strain ACA-DC 1533 seems to be more related to strain JCM 10692<sup>T</sup> than strain KCTC 13900. Interestingly, strains ACA-DC 1533 and JCM 10692<sup>T</sup>, which lack CRISPRs, carry two similar prophages. In contrast, strain KCTC 13900 seems to have acquired immunity to these prophages according to the sequences of spacers in its CRISPRs. Nonetheless, strain KCTC 13900 has a prophage that is absent from strains ACA-DC 1533 and JCM 10692<sup>T</sup>. Furthermore, comparative genomic analysis was performed among L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and Lactobacillus ruminis ATCC 27782. The chromosomes of the three species lack long-range synteny. Important differences were also determined in the number of glycobiome related proteins, proteolytic enzymes, transporters, insertion sequences and regulatory proteins. Moreover, no obvious genomic traits supporting a probiotic potential of L. acidipiscis ACA-DC 1533 were detected when compared to the probiotic L. salivarius UCC118. However, the existence of more than one glycine-betaine transporter within the genome of ACA-DC 1533 may explain the ability of L. acidipiscis to grow in fermented foods containing high salt concentrations. Finally, in silico analysis of the L. acidipiscis ACA-DC 1533 genome revealed pathways that could underpin the production of major volatile compounds during the catabolism of amino acids that may contribute to the typical piquant flavors of Kopanisti cheese.

Keywords: Lactobacillus, genome, pseudogene, motility, horizontal gene transfer, phage, probiotic, metabolism

### INTRODUCTION


The genus Lactobacillus constitutes a diverse group of bacteria comprising more than 200 species and subspecies<sup>1</sup> that are ubiquitous and frequently found in a variety of nutrient-rich ecological niches (Pot et al., 2014; Sun Z. et al., 2015). Lactobacilli produce lactic acid as the main end-product of carbohydrate fermentation, allowing them to prevail in microbial ecosystems. This attribute, along with their safety profile and their ability to shape organoleptic characteristics of the final product, is the central reason for their extensive use in artisanal or industrial food fermentations (Bernardeau et al., 2008; Sun Z. et al., 2015; Reginensi et al., 2016). Apart from food-related lactobacilli, the genus includes many commensals of the human, animal and plant microbiota (Cannon et al., 2005; Duar et al., 2017). The available genomes for Lactobacillus species and the close phylogenetic relationship among food- and host-related strains offer a wealth of information that underpins specialized mechanisms of bacterial adaptation to different environments (Sun Z. et al., 2015).

Lactobacillus acidipiscis is a salt-tolerant species originally isolated from fermented fish (Tanasupawat et al., 2000). The species has been also found in a variety of cheeses, i.e., Halloumi (Lawson et al., 2001; Naser et al., 2006; Kim et al., 2011), Cotija (Morales et al., 2011), Minas (Perin et al., 2017), and double cream cheese (Morales et al., 2011; Melgar-Lalanne et al., 2013), as well as in fermented fish (An et al., 2010; Tsuda et al., 2012; Thamacharoensuk et al., 2017), fermented meat (Nguyen et al., 2013), sake (Koyanagi et al., 2016), pickles (Arasu et al., 2015), grasses (Tohno et al., 2012; Khota et al., 2016), mulberry silage (Altaher et al., 2015), sweet paste (Mao et al., 2017), the traditional Chinese fermented vegetable Sichuan paocai (Cao et al., 2017) and table olives (Randazzo et al., 2017). Moreover, L. acidipiscis has also been found in vinegar (Li P. et al., 2014) and soy sauce, where it is considered to be a spoiler (Tanasupawat et al., 2002; Cheng et al., 2014; Li N. et al., 2014). L. acidipiscis ACA-DC 1533 was isolated from traditional Greek Kopanisti cheese and, together with Lactobacillus rennini, constituted the dominant microbiota of this cheese (Asteri et al., 2009). Interestingly, both of them produced alcohols and carbonyl compounds, most probably via amino acid catabolism, that may contribute to the typical piquant flavors of Kopanisti cheese (Yvon and Rijnen, 2001; Asteri et al., 2009; Donnelly, 2016).

Phylogenetic analysis of L. acidipiscis places the bacterium in the Lactobacillus salivarius clade. The L. salivarius clade is the second largest group of lactobacilli with 27 recognized species following that of Lactobacillus delbrueckii (29 species; Pot et al., 2014). The L. salivarius clade consists mainly of commensal isolates and to a lesser degree of strains found in fermented foods (Cousin et al., 2015). Several strains belonging to the clade exhibit putative probiotic traits (Neville and O'Toole, 2010). Therefore, comparative genomics among members of the L. salivarius clade may reveal important aspects, such as niche adaptation, technological potential, and probiotic properties (Forde et al., 2011; Raftis et al., 2011; Sun Z. et al., 2015). So far, there are eight genomes with fully sequenced chromosomes in the L. salivarius clade publicly available in the NCBI database, i.e., six from L. salivarius (Claesson et al., 2006; Jimenez et al., 2010; Raftis et al., 2014; Chenoll et al., 2016), one from Lactobacillus ruminis (Forde et al., 2011) and one from L. acidipiscis (Kazou et al., 2017). Furthermore, L. acidipiscis JCM 10692<sup>T</sup> and DSM 15836<sup>T</sup> isolated from fermented fish as well as L. acidipiscis DSM 15353 and KCTC 13900 isolated from Halloumi cheese have been partially sequenced (Kim et al., 2011; Sun Z. et al., 2015). In fact, strains JCM 10692<sup>T</sup> and DSM 15836<sup>T</sup> are replicas of the same strain<sup>2</sup>,<sup>3</sup> and the same applies for strains DSM 15353 and KCTC 13900<sup>4</sup>,<sup>5</sup> .

The genome sequence of L. acidipiscis ACA-DC 1533 has been published (Kazou et al., 2017) and the current study aims to examine aspects of the evolution, physiology, metabolism and technological properties of the species according to the available L. acidipiscis genomes. Furthermore, we perform comparative genomics among the species with fully sequenced genomes in the L. salivarius clade to shed light on niche adaptation (host or food related, or both). Our analysis reveals technological properties of L. acidipiscis ACA-DC 1533 that may support the potential use of the isolate in food fermentations.

#### MATERIALS AND METHODS

#### Chromosome-Plasmid Sequences and Annotations

Species/strains employed in phylogenetic analysis and comparative genomics are presented in **Supplementary Table S1**. All annotated sequences were derived from RefSeq version 86, with the exception of plasmids pLAC2 and pLAC3 of L. acidipiscis ACA-DC 1533, which have not been included in RefSeq yet, so we used their GenBank/ENA versions (Kazou et al., 2017). In the table we present all relevant information to help the reader assess whether differences or similarities in gene content among the strains analyzed may be influenced by differences in sequencing technologies and/or tools used for sequence assembly and annotation.

<sup>1</sup>http://www.bacterio.net/lactobacillus.html

**Abbreviations:** ABC, ATP-binding cassette; ANI, average nucleotide identity; CAZy, Carbohydrate Active EnZymes; CBMs, Carbohydrate-Binding Modules; CEs, Carbohydrate Esterases; COG, Clusters of Orthologous Groups; CRISPRs, Clustered Regularly Interspaced Short Palindromic Repeats; dBBQs, dataBase of Bacterial Quality scores; dbCAN, DataBase for automated Carbohydrate-active enzyme Annotation; EPS, Exopolysaccharide; GHs, Glycoside Hydrolases; GIs, Genomic islands; GTs, Glycosyl Transferases; HGT, Horizontal Gene Transfer; HK, Histidine Kinase; ISs, insertion sequences; KEGG, Kyoto Encyclopedia of Genes and Genomes; LCBs, locally collinear blocks; MFS, major facilitator superfamily; OCSs, One-component Systems; ODPs, Other DNA-binding Proteins; PBMCs, peripheral blood mononuclear cells; PEP-PTS, Phosphoenolpyruvate Phosphotransferase System; P2RP, Predicted Prokaryotic Regulatory Proteins; r2cat, Related Reference Contig Arrangement; RM, Restriction-Modification; RPs, Regulatory Proteins; RR, Response Regulator; RSM, reconstituted skim milk; SFs, Sigma Factors; TA, toxin-antitoxin; TCSs, Two-component Systems; TFs, Transcription Factors; TRs, Transcriptional Regulators.

<sup>2</sup>http://www.jcm.riken.jp/cgi-bin/jcm/jcm\_number?JCM=10692

<sup>3</sup>https://www.dsmz.de/catalogues/details/culture/DSM-15836.html

<sup>4</sup>https://www.dsmz.de/catalogues/details/culture/DSM-15353.html

<sup>5</sup>http://kctc.kribb.re.kr/En/jsearch/j\_sview.aspx?sn=13900

### Phylogenetic Analysis


A whole genome phylogenetic tree based on the core genes among representative strains of all species in the L. salivarius clade using L. acidipiscis ACA-DC 1533 as the reference genome was constructed with the EDGAR software (Blom et al., 2009). It should be noted that whenever available, sequences of type strains were preferred. Core gene sets were aligned using MUSCLE, the individual alignments were concatenated and the resulting genome alignment was used as input for the construction of the phylogenetic tree with the neighbor-joining method as implemented in the PHYLIP package. Weissella kandleri DSM 20593<sup>T</sup> and Lactobacillus delbrueckii subsp. bulgaricus ATCC 11842<sup>T</sup> were used as outgroups.
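The concatenation step described above can be illustrated with a minimal sketch. It is not the EDGAR, MUSCLE or PHYLIP implementation: the toy alignments, the strain labels `A`, `B`, `C` and the use of a simple p-distance (proportion of differing sites) are assumptions made purely for illustration. The resulting pairwise distances are the kind of input a neighbor-joining method takes.

```python
from itertools import combinations


def concatenate_alignments(alignments):
    """Join per-gene alignments (dict: strain -> aligned sequence) into one supermatrix."""
    strains = sorted(alignments[0])
    for aln in alignments:
        if sorted(aln) != strains:
            raise ValueError("every core-gene alignment must contain the same strains")
        if len({len(seq) for seq in aln.values()}) != 1:
            raise ValueError("sequences within one alignment must have equal length")
    return {s: "".join(aln[s] for aln in alignments) for s in strains}


def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(supermatrix):
    """Pairwise p-distances between all strains in the concatenated alignment."""
    return {
        (s, t): p_distance(supermatrix[s], supermatrix[t])
        for s, t in combinations(sorted(supermatrix), 2)
    }


# Hypothetical toy data: two aligned "core genes" for three strains.
toy_alignments = [
    {"A": "ATGC", "B": "ATGA", "C": "TTGA"},
    {"A": "GG-C", "B": "GGAC", "C": "GCAC"},
]
supermatrix = concatenate_alignments(toy_alignments)
distances = distance_matrix(supermatrix)
```

In a real analysis the distance model and the tree-building step would be those of the phylogeny package used; the sketch only shows why all core-gene alignments must contain the same set of strains before they can be concatenated.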

#### Comparative Genomic Analysis

To confirm the clonal relation among sequenced strains of L. acidipiscis as these are deduced from different databases, we used an ANI heat map as calculated with the EDGAR tool. The completeness of partial genome sequences of L. acidipiscis strains was assessed using the dBBQs (Wanchai et al., 2017). Preliminary evaluation of the presence of plasmids in the partially sequenced L. acidipiscis strains was performed with the r2cat tool (Husemann and Stoye, 2010), using as templates the three pLAC plasmid sequences of strain ACA-DC 1533. The circular map of L. acidipiscis ACA-DC 1533 was constructed by the DNAPlotter software (Carver et al., 2009). Pan/core-genome and singleton analysis were conducted with EDGAR. Comparison of the motility gene clusters among L. acidipiscis ACA-DC 1533 and KCTC 13900 as well as Lactobacillus curvatus NRIC 0822 was performed with the Easyfig comparison tool (Sullivan et al., 2011). The GenBank accession numbers for the motility operons of L. acidipiscis KCTC 13900 and L. curvatus NRIC 0822 are KM886858 and KM886863, respectively (Cousin et al., 2015). The EggNOG server version 4.5 was used for COG annotation (Huerta-Cepas et al., 2016). COG frequency heat maps with double hierarchical clustering were generated using the RStudio and the package "gplots"<sup>6</sup> . GIs, ISs, putative prophages, CRISPRs, RM systems, TA systems and putative antimicrobial peptides were predicted using the IslandViewer 4 web-based resource (Bertelli et al., 2017), the ISsaga platform (Varani et al., 2011), the PHASTER web server (Arndt et al., 2016), the CRISPRFinder web tool (Grissa et al., 2007), the REBASE database (Roberts et al., 2015), the TAfinder (Xie et al., 2018) and the BAGEL (van Heel et al., 2013), respectively. The glycobiome profile was investigated using the dbCAN (Yin et al., 2012) against the CAZy database (Lombard et al., 2014). Furthermore, transporters were determined using the TransportDB database (Elbourne et al., 2017). 
Pathways were assigned with the KEGG database (Kanehisa et al., 2016). Regulatory proteins including TCSs, TFs, and ODPs were detected with the P2RP web server (Barakat et al., 2013). Full-length chromosome alignments were created by progressiveMAUVE (Darling et al., 2010). Finally, the carbohydrate fermentation profile of L. acidipiscis ACA-DC 1533 was determined using API 50 CHL strips (bioMérieux, Marcy-l'Etoile, France).
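As a rough illustration of the input to a COG frequency heat map, the sketch below converts per-protein COG category assignments into percentages per genome. It is not the eggNOG or "gplots" workflow used in the study: the function name, the toy category list and the assumption of exactly one category letter per protein are hypothetical simplifications.

```python
from collections import Counter


def cog_percentages(categories):
    """Return the percentage of proteins assigned to each COG category letter."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no COG assignments supplied")
    return {cat: 100.0 * n / total for cat, n in sorted(counts.items())}


# Hypothetical assignments for eight proteins of one genome.
toy_assignments = ["L", "L", "G", "K", "M", "L", "G", "S"]
profile = cog_percentages(toy_assignments)
```

Computing such a profile for every genome gives one row per strain and one column per category, which is the matrix that hierarchical clustering would then operate on.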

### RESULTS AND DISCUSSION

### Whole Genome Phylogeny of the L. salivarius Clade

The phylogenetic relationship among the species of the L. salivarius clade was determined based on whole genome sequences. Analysis with the EDGAR software revealed two major clusters containing 12 and 14 species, respectively (**Figure 1**). L. acidipiscis was grouped together with Lactobacillus pobuzihii in a cluster, which also included L. salivarius. The strains employed in the phylogenetic analysis of the L. salivarius clade exhibited a pan-genome of 13,470 genes, while the core-genome consisted of 349 genes. Moreover, proteins of the species belonging to the L. salivarius clade were distributed into various COG functional categories with a relatively distinct profile for each species. Interestingly, hierarchical clustering of the COG frequency heat map (**Figure 2**) revealed two clusters, which were very similar to the two clusters mentioned above that were obtained in the whole genome phylogenetic tree (**Figure 1**). It should be noted that L. acidipiscis ACA-DC 1533 was placed separately from these two clusters, most probably due to an increased percentage of genes in the replication, recombination and repair (L) COG category. This difference could arise from a higher number of transposases in the ACA-DC 1533 genome, but the number of transposases in the partial genomes employed during this analysis may be severely skewed. Nevertheless, L. acidipiscis also exhibited a higher number of transposases when compared to the complete genome sequences of L. salivarius and L. ruminis (see below). Both whole genome phylogeny and COG analysis can be influenced by the partial nature of some of the sequences employed as well as differences in pipelines used for genome assembly and annotation. However, the whole genome phylogenetic tree is similar in overall topology to the 16S rRNA phylogenetic tree of the entire Lactobacillus genus published by Pot et al. (2014), which is independent of genome completeness and annotation. 
The same applies when we compared our whole genome phylogenetic tree to the tree based on the concatenated amino acid sequences of 16 marker genes published by Sun Z. et al. (2015).

#### General Genomic Features of L. acidipiscis Strains

To date, there are five sequenced strains of L. acidipiscis, i.e., ACA-DC 1533, KCTC 13900, DSM 15353, JCM 10692<sup>T</sup> and DSM 15836<sup>T</sup>. As mentioned above, strains KCTC 13900 and DSM 15353 as well as JCM 10692<sup>T</sup> and DSM 15836<sup>T</sup> are replicas. Since this is not always obvious in the respective literature (Kim et al., 2011; Sun Z. et al., 2015), the relatedness among the two pairs of L. acidipiscis strains was also assessed by ANI analysis performed with EDGAR (**Supplementary Figure S1**). The results confirmed the clonal relationship among the strains. To evaluate the level of completeness between the L. acidipiscis genomes in each of the two pairs of replica strains, we used the genome quality scores from the dBBQs based on the sequence completeness, the tRNA and rRNA score, as well as the number of essential genes predicted in the genome sequence (Wanchai et al., 2017). According to these results, strains KCTC 13900 and JCM 10692<sup>T</sup> were found to be more complete than strains DSM 15353 and DSM 15836<sup>T</sup>, respectively (**Supplementary Table S2**). For this reason, strains KCTC 13900 and JCM 10692<sup>T</sup> were employed for further analysis.

<sup>6</sup>http://www.rstudio.org

The characteristics of the L. acidipiscis ACA-DC 1533 genome were described previously (Asteri et al., 2010; Kazou et al., 2017). The complete chromosomal sequence of the strain was recently re-annotated in RefSeq revealing a total of 2,455 genes including 2,199 protein-coding genes and 172 potential pseudogenes mostly due to frame shifting and internal stop codons (**Figure 3**). Among pseudogenes, hypothetical proteins and mobile elements, such as ISs and transposases, were the most common (**Supplementary Table S3**). The genome also includes six rRNA operons distributed throughout the genome and 63 tRNA genes with the majority located around the five rRNA operons (data not shown).

The additional L. acidipiscis assemblies of strains JCM 10692<sup>T</sup> and KCTC 13900 are fragmented and thus do not allow the determination of their accurate chromosomal size as well as the evaluation of their plasmid content. Nevertheless, in these assemblies we could detect plasmid sequences after analysis with the r2cat tool, using as templates the three pLAC plasmid sequences (data not shown). Strain ACA-DC 1533 exhibits 2,288 protein-coding genes versus 2,126 and 1,969 for the JCM 10692<sup>T</sup> and KCTC 13900 strains, respectively. Analysis with EDGAR revealed that the pan-genome consists of 2,722 genes, with 1,569 and 411 genes belonging to the core- and the dispensable genomes, respectively (**Figure 4A** and **Supplementary Tables S4A,B**). Furthermore, the analysis revealed that singletons represented approximately 18% of the pan-genome. Strain JCM 10692<sup>T</sup> carries the highest number of singletons (n = 197) followed by strains ACA-DC 1533 (n = 157) and KCTC 13900 (n = 136) (**Supplementary Table S4C**). However, such differences may not be readily explained given the differences in completeness among these genomes. We would like to mention that the total number of genes for each strain presented in **Figure 4A** is somewhat lower than the total number of genes annotated for the strain. The missing genes are not part of the 3-genome or 2-genome cores, but also do not appear among the strictly calculated singletons, because they have some second-best BLAST hits or non-reciprocal BLAST hits, or in general show some similarity to other genes in the dataset that rules them out as singletons as calculated by the EDGAR tool. The distribution of proteins into the COG functional categories is shown in a heat map for the three L. acidipiscis strains (**Figure 4B**). Despite their differences in completeness, the three genomes present very similar percentages in each of the COG categories. The only exception was the replication, recombination and repair (L) category, in which strain ACA-DC 1533 appears to have 15.4% compared to 10.3% and 8.6% for strains JCM 10692<sup>T</sup> and KCTC 13900, respectively. As mentioned above, this higher percentage of proteins in the L category for strain ACA-DC 1533 was also evident in the comparison of all species within the L. salivarius clade (**Figure 2**). This difference may again reflect the fragmented nature of the L. acidipiscis JCM 10692<sup>T</sup> and KCTC 13900 genome assemblies.
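The pan-genome terminology used above can be made concrete with a small set-based sketch. This is not how EDGAR computes orthology (it relies on reciprocal BLAST hits, as noted in the text); the sketch assumes that genes have already been assigned to ortholog groups, and the strain labels and group identifiers are hypothetical. The last lines only recompute the singleton share from the counts reported in the text.

```python
def pan_core_singletons(genomes):
    """Split ortholog groups into pan-genome, core, dispensable and per-strain singletons.

    genomes: dict mapping strain name -> set of ortholog group identifiers.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)           # present in at least one strain
    core = set.intersection(*gene_sets)     # present in every strain
    singletons = {
        strain: genes - set().union(*(g for s, g in genomes.items() if s != strain))
        for strain, genes in genomes.items()
    }                                       # present in exactly one strain
    dispensable = pan - core - set().union(*singletons.values())
    return pan, core, dispensable, singletons


# Hypothetical toy genomes described by ortholog group identifiers.
toy_genomes = {
    "strain1": {"g1", "g2", "g3", "g4"},
    "strain2": {"g1", "g2", "g5"},
    "strain3": {"g1", "g3", "g6", "g7"},
}
pan, core, dispensable, singletons = pan_core_singletons(toy_genomes)

# Singleton share recomputed from the counts reported in the text.
reported_singletons = 197 + 157 + 136      # JCM 10692T, ACA-DC 1533, KCTC 13900
singleton_share = 100 * reported_singletons / 2722
```

With the reported counts, 490 singletons out of 2,722 pan-genome genes correspond to approximately 18%, in line with the figure given in the text.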

COG functional classification of the singletons is shown in **Figure 5**. We could find singletons of the three strains distributed in all COG categories with the majority associated with replication, recombination and repair (L), cell wall/membrane/envelope biogenesis (M), carbohydrate transport and metabolism (G) and transcription (K). The high prevalence of proteins in the L COG category appears again, this time in all three strains, especially strains ACA-DC 1533 and KCTC 13900. Strain JCM 10692<sup>T</sup> appears to have approximately half the singletons in the L COG category, but this may be an artifact deriving from its partial sequence. It is unclear whether genes involved in information storage and processing might have technological implications. It could be suggested though, that the efficiency of central cellular mechanisms like those of the L, M, and K COG categories may provide the strain/species with a competitive advantage in a complex ecosystem. On the contrary, carbohydrate transport and metabolism can have a direct impact on the diversity of ecological niches in which the bacterium can grow.

### In Silico Evaluation of Motility of L. acidipiscis Strains

From a microbial ecology point of view, motile species may have competitive benefits against non-motile species, regarding e.g., niche colonization and biofilm formation (Neville et al., 2012). Currently, 16 motile Lactobacillus species have been recognized in the entire genus, all belonging to the L. salivarius clade with the exception of L. curvatus, which is a member of the Lactobacillus sakei clade (Cousin et al., 2015). Motility of L. acidipiscis has been recently described in strain KCTC 13900 revealing that the 54 proteins involved in flagellum regulation, synthesis, export and chemotaxis are organized in a single operon (Cousin et al., 2015). Annotation of ACA-DC 1533 identified 51 motility genes (LAC1533\_RS09635-RS09885) producing a functional flagellar apparatus as also observed by in vivo experiments (data not shown). Core-genome analysis revealed that the motility operon is also present in strain JCM 10692<sup>T</sup> and flanked by the same genes (**Supplementary Table S4B**). As shown in **Figure 6**, alignment of the motility operons of L. curvatus NRIC 0822 and L. acidipiscis strains KCTC 13900 and ACA-DC 1533 revealed that they are conserved.

### GIs Found in L. acidipiscis Genomes

HGT is one of the main processes responsible for genome evolution. Genomic fragments acquired by HGT events are characterized as GIs and may have a direct impact on genome plasticity (Juhas et al., 2009). Here, we focused our analysis on the 13 GIs of the ACA-DC 1533 chromosome identified by the IslandViewer software tool (**Supplementary Figure S2**). Of note, GI 9 contains the genome's array of ribosomal proteins (**Supplementary Table S5**). This is most probably a false positive result, as genes encoding ribosomal proteins differ in sequence composition from regular protein-coding genes (Fernández-Gómez et al., 2012) and are thus wrongly detected by IslandViewer as part of a GI. For this reason, GI 9 was excluded from further analysis. The remaining 12 putative GIs contain a total of 229 genes, with lengths ranging from 4,677 to 36,954 bp. Many of these genes are involved in carbohydrate, lipid and amino acid metabolism as well as in membrane transport systems. According to the pan-genome analysis, GIs 3, 7, and 8 are unique to strain ACA-DC 1533, while GIs 1, 4, and 6 are common to all three L. acidipiscis strains, indicating acquisition early in the evolution of the species. It is interesting to note that GI 5 is present in strains ACA-DC 1533 and JCM 10692<sup>T</sup> but absent in KCTC 13900. Other GIs are shared among the L. acidipiscis strains to a variable degree (**Supplementary Table S5**).
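The presence/absence reasoning in this paragraph can be illustrated with a minimal Python sketch. It covers only the GIs whose distribution is stated explicitly in the text; the dictionary layout, the category labels and the `classify` helper are illustrative assumptions and are not part of the original analysis pipeline (EDGAR and IslandViewer).

```python
# Presence of selected genomic islands (GIs) across the three strains,
# transcribed from the paragraph above. GIs whose distribution is given
# only in Supplementary Table S5 are deliberately left out.
STRAINS = frozenset({"ACA-DC 1533", "JCM 10692T", "KCTC 13900"})

gi_presence = {
    1: set(STRAINS),
    3: {"ACA-DC 1533"},
    4: set(STRAINS),
    5: {"ACA-DC 1533", "JCM 10692T"},
    6: set(STRAINS),
    7: {"ACA-DC 1533"},
    8: {"ACA-DC 1533"},
}


def classify(strains_with_gi):
    """Label a GI by how many strains carry it (hypothetical labels)."""
    if strains_with_gi == STRAINS:
        return "common to all strains"
    if len(strains_with_gi) == 1:
        return "strain-specific"
    return "shared by a subset"


# Group GI numbers by category, keeping them in ascending order.
summary = {}
for gi, present in sorted(gi_presence.items()):
    summary.setdefault(classify(present), []).append(gi)

for label, gis in summary.items():
    print(f"{label}: GIs {gis}")
```

Under these assumptions, GIs found in every strain are the candidates for early acquisition, while strain-specific GIs point to more recent events.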

### Prophage Sequences, CRISPR-Cas Systems, RM Systems and TA Systems of L. acidipiscis Strains

PHASTER allowed the identification of one intact (1,228,777-1,272,253 bp, from now on called phage 1) and two incomplete prophage regions (1,575,825-1,586,105 bp and 1,802,666-1,830,756 bp) in the ACA-DC 1533 chromosome. Phage 1 contains 53 CDSs, most of which encode hypothetical proteins (approximately 41.5%). Furthermore, phage tail proteins, capsid proteins and attL/attR sites flanking the prophage DNA were also identified. Phage 1 is related to several prophages most of which can be found in Lactobacillus genomes. Strain JCM 10692<sup>T</sup> carries a phage region similar to phage 1 (from now on called phage 2) sharing 30 out of 53 proteins (**Supplementary Table S6A**). Furthermore, strain KCTC 13900 seems to have an intact prophage region (from now on called phage 3) of 40.8 Kbp length related also to Lactobacillus phages (**Supplementary Table S6A**).
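As a quick worked check of the figures quoted above, the following Python sketch derives the size of each prophage region from its coordinates and the approximate number of hypothetical proteins in phage 1. The coordinates are assumed to be 1-based and inclusive, and the labels "incomplete region A/B" are hypothetical names introduced here only for illustration.

```python
# Prophage regions reported for the ACA-DC 1533 chromosome, as (start, end) in bp.
# Assumption: coordinates are 1-based and inclusive.
regions = {
    "phage 1 (intact)": (1_228_777, 1_272_253),
    "incomplete region A": (1_575_825, 1_586_105),
    "incomplete region B": (1_802_666, 1_830_756),
}


def region_length(start, end):
    """Length in bp of a 1-based, inclusive interval."""
    return end - start + 1


lengths = {name: region_length(*coords) for name, coords in regions.items()}
for name, bp in lengths.items():
    print(f"{name}: {bp:,} bp")

# About 41.5% of the 53 CDSs of phage 1 encode hypothetical proteins.
total_cds = 53
hypothetical_cds = round(total_cds * 0.415)
print(f"hypothetical proteins in phage 1: about {hypothetical_cds} of {total_cds}")
```

Under this convention, phage 1 spans roughly 43.5 kbp, comparable to the 40.8 Kbp reported for phage 3 of strain KCTC 13900.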

Three CRISPR sequences (i.e., CRISPR1, 2, and 3) were identified only in strain KCTC 13900 (**Supplementary Table S6B**). BLASTN analysis of all the spacers identified in these three CRISPR-Cas systems showed that several of them, namely spacers 9, 11, 13, 14, 19, 20, and 21 in CRISPR 1 and spacers 5, 14 and 21 in CRISPR 2, had hits in the Lactobacillus plantarum virulent phage phiJL-1. Moreover, spacers 22 and 26 in CRISPR 2 had hits in L. salivarius plasmids. Since L. salivarius strains carrying such plasmids are related to the host environment, this may suggest that L. acidipiscis has occupied this niche as well. Most importantly, spacers 1, 3, 5, 6, and 7 in CRISPR 1 and spacer 35 in CRISPR 2 had hits against phage 1 and/or phage 2 genes. Spacers in CRISPRs can reveal aspects of the evolutionary history of their host (Papadimitriou et al., 2014). Thus, it could be hypothesized that strain KCTC 13900 has also been exposed to phage 1 or phage 2 but was able to acquire immunity through its CRISPR-Cas systems. Our findings may indicate that phages 1 or 2 are abundant in the ecological niches occupied by different L. acidipiscis strains or that, despite the different origins of isolation, the three L. acidipiscis strains were present in the same ecological niche sometime in the past. Moreover, the presence of phages 1 and 2 in the ACA-DC 1533 and JCM 10692<sup>T</sup> genomes, respectively, is consistent with the lack of CRISPR systems in the two strains. However, the presence of prophages in the genomes of L. acidipiscis strains may protect them from superinfection by other phages or plasmids (Bondy-Denomy et al., 2016).

Bacterial defense mechanisms against foreign DNA include RM and TA systems (Darmon and Leach, 2014). Strain ACA-DC 1533 has a type I system that seems to be complete, as it contains the DNA-methyltransferase subunit M (LAC1533\_RS04765), the specificity subunit S (LAC1533\_RS04770) and the restriction subunit R (LAC1533\_RS04775), as well as a second type I system (LAC1533\_RS01110-RS01130) that is possibly inactivated, since the restriction subunit R is a potential pseudogene (LAC1533\_RS01130). According to the REBASE database, the strain also carries three putative type II RM systems (LAC1533\_RS03065, LAC1533\_RS05790 and LAC1533\_RS08450-RS08455) and two type IV RM systems (LAC1533\_RS02780 and LAC1533\_RS04790) (**Supplementary Figure S3**). Plasmid pLAC3 also carries an AvaI RM system. Finally, we looked into TA systems. We concentrated our search on type II TA systems, for which the TAfinder prediction tool is available. In strain ACA-DC 1533 we found nine TA systems in the chromosome and one in the pLAC2 plasmid (**Supplementary Table S7**).

#### Comparative Genomics of L. acidipiscis Against L. salivarius and L. ruminis

To further investigate the lifestyle and/or the technological traits of L. acidipiscis ACA-DC 1533, we performed comparative genomic analysis against L. salivarius UCC118 and L. ruminis ATCC 27782. L. salivarius UCC118 was chosen as the representative strain of the species since it is the first sequenced and presumably the best characterized strain of the clade (Harris et al., 2017). The comparison was performed initially at the chromosome level since the chromosomes of all three strains are completely sequenced. L. salivarius UCC118 was isolated from the human ileal-caecal region and comprises a chromosome of 1.8 Mbp and three plasmids, one of which is a megaplasmid of 242 Kbp (Claesson et al., 2006). L. ruminis ATCC 27782, isolated from the bovine rumen, has a chromosome size of 2.1 Mbp with no plasmids (Forde et al., 2011). As mentioned above, L. acidipiscis ACA-DC 1533 has a chromosome of 2.6 Mbp, which is the largest among the three species. L. acidipiscis ACA-DC 1533 and L. ruminis ATCC 27782 exhibited the highest number of potential pseudogenes, i.e., 7.3 and 9.0%, respectively, in contrast to the 2.8% of L. salivarius UCC118. However, other complete L. salivarius chromosomes exhibit a variable percentage of potential pseudogenes, up to 6.6% (**Supplementary Table S8**). Taking this observation into account, it seems that the percentage of pseudogenes may not be constant among strains of the same species, and thus the existence of only one complete chromosomal sequence each for L. acidipiscis and L. ruminis is not enough to comment on their overall genome decay at the species level. Nevertheless, L. acidipiscis ACA-DC 1533 and L. ruminis ATCC 27782 appear to have undergone genome decay to an extent that is relatively restricted, at least when compared to the genome decay of highly specialized dairy lactobacilli like L. delbrueckii subsp. bulgaricus (van de Guchte et al., 2006).

Our analysis also revealed that the number of common proteins among the three species is 813, higher than that calculated for the entire L. salivarius clade as analyzed above (**Figure 7A** and **Supplementary Table S9A**). L. acidipiscis ACA-DC 1533 seems to carry the highest number of unique genes (n = 847), mostly encoding hypothetical proteins, transposases, ABC transporters, PEP-PTS and membrane transport proteins (**Supplementary Table S9B**). As in **Figure 4A**, the total number of genes presented for each strain is somewhat lower than the total number of genes annotated for the strain, since some genes can be assigned neither to the singletons nor to the 3-genome or 2-genome cores, for the reason presented above. Furthermore, there is no extensive synteny among the three species, as observed in full-length chromosome alignments created by progressiveMAUVE (**Supplementary Figure S4**). The analysis revealed a high number of LCBs with a quite short average length. Several studies based on comparative genomics among Lactobacillus species have established the genomic diversity of the Lactobacillus genus, which is higher compared to that of a typical bacterial family (Sun Z. et al., 2015; Martino et al., 2016).

The distribution of proteins into the COG functional categories for the three species is shown in **Figure 7B**. As expected, the L. acidipiscis ACA-DC 1533 chromosome contained more proteins in the L COG category than L. salivarius UCC118 and L. ruminis ATCC 27782, owing to an inflated number of transposases and reverse transcriptases. Inspection of each of the two categories of gene products revealed that they may contain in some instances identical paralogs, but this is not always the case. The biological reason behind this observation is not clear. However, considering that both L. salivarius UCC118 and L. ruminis ATCC 27782 chromosomes are completely sequenced and that both L. acidipiscis ACA-DC 1533 and L. ruminis ATCC 27782 are annotated with the same pipeline in RefSeq, the possibility that this difference is some type of artifact is rather unlikely. Another obvious difference was the absence of proteins in the cell motility (N) COG category from the L. salivarius UCC118 chromosome. In all other COG categories, the distribution of proteins was at a comparable level among the three strains.

We also compared plasmid sequences of L. acidipiscis ACA-DC 1533 and L. salivarius UCC118. It has been shown for the latter that important housekeeping genes may be carried in its plasmids (Harris et al., 2017). In the case of L. acidipiscis plasmids most of the proteins were hypothetical. Nevertheless, we were able to identify some genes encoding proteins that may be important for the physiology, metabolism and/or the technological properties of the strain. For example, we determined the presence of carbohydrate and ion transporters (**Supplementary Table S10**), putative carbohydrate metabolizing enzymes (please see below), and an arsenate reductase. In addition and as mentioned above, plasmids of L. acidipiscis ACA-DC 1533 carry an AvaI RM system and a type II TA system (**Supplementary Table S7**).

#### Glycobiome Analysis of L. acidipiscis, L. salivarius, and L. ruminis

The glycobiomes of L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and L. ruminis ATCC 27782 were investigated using dbCAN. According to the analysis, L. acidipiscis ACA-DC 1533 had the largest glycobiome with 85 enzymes involved in carbohydrate metabolism, followed by L. salivarius UCC118 and L. ruminis ATCC 27782 with 78 and 68 enzymes, respectively (**Supplementary Table S11**). Among the 85 enzymes, 37 were identified as GHs, 21 as GTs, 13 as CEs and 14 as CBMs. Compared to the 37 GHs of L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and L. ruminis ATCC 27782 contained 27 and 26 GHs, respectively. Among the GH families identified in the L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and L. ruminis ATCC 27782 genomes, GH 13 was the most pronounced, containing mainly enzymes with plant substrate specificity (Crost et al., 2013). Indeed, the carbohydrate fermentation profiles of L. acidipiscis ACA-DC 1533, determined with API 50 CHL strips (**Supplementary Table S12**), and of L. salivarius UCC118 (Li et al., 2006) showed that the two strains were able to ferment a number of carbohydrates of plant origin, i.e., L-arabinose, D-ribose, D-cellobiose, D-trehalose, D-glucose, D-fructose, D-mannitol, D-sorbitol, and D-saccharose. Furthermore, several GH families, namely GH 35, GH 38, GH 46, GH 70, and GH 76, were unique to the L. acidipiscis ACA-DC 1533 genome, indicating that the bacterium presumably requires these enzymes in its ecological niche, which might be different to that of L. salivarius UCC118 and L. ruminis ATCC 27782. Interestingly, the beta-galactosidase (GH 35) and the two 6-phospho-beta-galactosidase genes (GH 1) present in the L. acidipiscis ACA-DC 1533 genome could probably be required for growth in milk. The L. acidipiscis ACA-DC 1533 genome also seems to contain the highest number of CBM modules in family 50 compared to the L. salivarius UCC118 and L. ruminis ATCC 27782 genomes. CBM 50 modules are commonly found in bacterial lysins having a peptidoglycan binding function and a contribution to cell division (Visweswaran et al., 2013). Similarly to what has been reported previously for L. salivarius UCC118 (Harris et al., 2017) and according to our analysis, part of the glycobiome of both L. salivarius and L. acidipiscis ACA-DC 1533 resides in their plasmids. Specifically for L. acidipiscis, we found two GT 4 enzymes in plasmid pLAC2. It seems plausible that the plasmid glycobiome in strains of L. salivarius is considerably more diverse than that of L. acidipiscis, perhaps due to the presence of the megaplasmid. Moreover, analysis using the TransportDB database identified 47 potential sugar-specific PTS transport proteins in the L. acidipiscis ACA-DC 1533 genome (3 on pLAC2) and 25 and 16 potential PTS transport proteins for the L. salivarius UCC118 and L. ruminis ATCC 27782 genomes, respectively (**Supplementary Table S10**).
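The counts quoted in this section can be cross-checked with a short Python sketch. It only re-tabulates numbers stated in the text; the variable names and the derived GH fraction are illustrative additions and were not part of the original dbCAN analysis.

```python
# CAZyme class counts reported for L. acidipiscis ACA-DC 1533.
cazy_acidipiscis = {"GH": 37, "GT": 21, "CE": 13, "CBM": 14}

# Total glycobiome sizes and GH counts reported for the three strains.
glycobiome_size = {
    "L. acidipiscis ACA-DC 1533": 85,
    "L. salivarius UCC118": 78,
    "L. ruminis ATCC 27782": 68,
}
gh_count = {
    "L. acidipiscis ACA-DC 1533": 37,
    "L. salivarius UCC118": 27,
    "L. ruminis ATCC 27782": 26,
}

# The four classes should add up to the reported total for L. acidipiscis.
class_total = sum(cazy_acidipiscis.values())
assert class_total == glycobiome_size["L. acidipiscis ACA-DC 1533"]

# Fraction of each glycobiome made up of glycoside hydrolases (derived here).
gh_fraction = {s: gh_count[s] / glycobiome_size[s] for s in glycobiome_size}
for strain, fraction in gh_fraction.items():
    print(f"{strain}: {gh_count[strain]} GHs, {fraction:.0%} of the glycobiome")
```

The class counts are internally consistent, and L. acidipiscis ACA-DC 1533 carries both the highest absolute number and the highest fraction of GHs among the three strains.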

#### Proteolytic System of L. acidipiscis, L. salivarius, and L. ruminis

The proteolytic system of lactic acid bacteria consists of cell-wall bound proteinases, which initiate the degradation of caseins, peptide and amino acid transport systems and a pool of intracellular peptidases, which further degrade the peptides to shorter peptides and free amino acids (Liu et al., 2010). The proteolytic system of the three L. acidipiscis strains, L. salivarius UCC118 and L. ruminis ATCC 27782 was investigated according to the scheme of Liu and co-workers (Liu et al., 2010) (**Supplementary Table S13**). The cell-wall bound proteinase (PrtP), the aminopeptidase A (PepA), the endopeptidases PepE/PepG and the proline peptidase PepL were missing from all strains. It is worth mentioning that the PrtP gene is intact in plasmid pR1 of L. salivarius strain Ren (Sun E. et al., 2015). The rest of the peptidases were found in up to three copies per genome. Furthermore, L. acidipiscis ACA-DC 1533 and L. ruminis ATCC 27782 carried one oligopeptide ABC transport system (Opp), which was missing from the L. salivarius UCC118 genome. Interestingly, the Opp operon is present in L. acidipiscis ACA-DC 1533 and JCM 10692<sup>T</sup> but absent in KCTC 13900. In contrast, a di/tripeptide ABC transport system (Dpp) and a DtpT transporter of di- and tri-peptides were found in the three species (including all three L. acidipiscis strains). However, it is worth noting that the gene encoding DppD in L. acidipiscis KCTC 13900 is a potential pseudogene, which may inactivate the entire Dpp system and deserves further investigation. Moreover, the L. acidipiscis ACA-DC 1533 chromosome seems to contain 17 amino acid ABC transport proteins, while the L. salivarius UCC118 and L. ruminis ATCC 27782 chromosomes contain only 11 and 10, respectively. Even though the five Lactobacillus chromosomes and/or genomes carry a number of peptide and amino acid transporters as well as several intracellular peptidases, the absence of PrtP indicates that the strains may not directly hydrolyze large protein molecules, but they may take advantage of peptides and free amino acids already available in their ecological niche.

### Miscellaneous Genomic Features Deriving From the Comparison Among L. acidipiscis ACA-DC 1533, L. salivarius UCC118, and L. ruminis ATCC 27782

We also focused our analysis on IS elements, which may contribute to bacterial genome evolution, on transport proteins, which allow the transport of substances in and out of the cell, and on RPs, which control gene expression.

IS elements of L. salivarius UCC118 and L. ruminis ATCC 27782 have been previously identified (Claesson et al., 2006) but we have updated the analysis using the latest version of ISsaga and the most recent annotation files for the two strains. In the chromosomes of L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and L. ruminis ATCC 27782, a total of 53, 10 and 30 IS elements were predicted with ISsaga, respectively (**Supplementary Table S14**). The higher number of IS elements in the chromosome of L. acidipiscis ACA-DC 1533 may suggest a higher potential for genome plasticity compared to the L. salivarius UCC118 and L. ruminis ATCC 27782 chromosomes. The majority of IS elements in the L. acidipiscis ACA-DC 1533 chromosome belong to the ISL3 and IS982 families, which were also previously identified in food-related lactobacilli like Lactobacillus delbrueckii subsp. bulgaricus and Lactobacillus helveticus, respectively (Germond et al., 1995; Callanan et al., 2005).

Furthermore, the L. acidipiscis ACA-DC 1533 genome contains 287 transport proteins compared to 240 and 238 in the L. salivarius UCC118 and L. ruminis ATCC 27782 genomes, respectively. They mainly belong to the ABC superfamily and to the MFS (**Supplementary Table S10**). Additional analysis of the L. acidipiscis ACA-DC 1533 genome revealed 17 potential glycine/betaine transport proteins organized in at least five distinct genomic loci. The glycine/betaine transport system may be necessary to overcome osmotic stress, since L. acidipiscis is a salt-tolerant species with strains able to grow in the presence of even 12% NaCl (our unpublished results; Tanasupawat et al., 2000; Romeo et al., 2003; Pot et al., 2014).

RPs include TCSs and TFs. TCSs are the most abundant phosphorylation-dependent signal transduction systems in prokaryotes and typically comprise a membrane-bound HK and a RR (Barakat et al., 2013). On the other hand, TFs contain TRs, OCSs, RRs and SFs. Analysis of L. acidipiscis ACA-DC 1533 and L. salivarius UCC118 identified six HKs and seven RRs for both strains. Analysis of L. ruminis ATCC 27782 chromosome revealed seven HKs and 10 RRs. Furthermore, the L. acidipiscis ACA-DC 1533 chromosome contained the highest number of TFs among the three strains analyzed, including 68 TRs, 28 OCSs, five RRs, six SFs and 19 ODPs, most of which were unclassified (**Supplementary Table S15**). The higher number of TFs in the L. acidipiscis compared to the other two species may suggest a more intricate regulation of gene expression and perhaps an increased interaction with the environment.

### Assessing the Probiotic and Technological Properties of L. acidipiscis ACA-DC 1533

Initially, we investigated the probiotic potential of L. acidipiscis ACA-DC 1533 based on the available information for L. salivarius UCC118, which has been extensively studied as a probiotic strain (Neville and O'Toole, 2010). The L. salivarius UCC118 genome contains a bile-salt hydrolase (Claesson et al., 2006) and two EPS clusters associated with the strain's probiotic activity (Harris et al., 2017). These traits were absent from the L. acidipiscis ACA-DC 1533 genome. In addition, proteins that may play a role in the interaction of L. salivarius UCC118 with the host may include mucus-, collagen-, salivary agglutinin- and epithelial-binding proteins, as well as enterococcal surface proteins (van Pijkeren et al., 2006; O'Shea et al., 2012). All these proteins are sortase-dependent surface proteins, which were either absent from the L. acidipiscis ACA-DC 1533 genome or were characterized as potential pseudogenes. The only exception identified was a fibrinogen/fibronectin-binding protein, similar to that of L. salivarius UCC118 (Collins et al., 2012), that was also present in the L. acidipiscis ACA-DC 1533 genome. Furthermore, analysis of the L. acidipiscis ACA-DC 1533 genome with the BAGEL tool did not predict any bacteriocin gene, in contrast to L. salivarius UCC118, which produces the two-component class II bacteriocin Abp118 (Flynn et al., 2002). BAGEL also predicted in L. acidipiscis JCM 10692<sup>T</sup> three potential structural genes coding for pediocin-, sakacin P- and carnocin-like bacteriocins (the last being a potential pseudogene) and some accessory genes (e.g., immunity, transfer, and maturation); further experimental testing of their production needs to be performed.

We then investigated aspects of the technological potential of L. acidipiscis ACA-DC 1533, taking into account that Asteri and co-workers showed that the major volatile/flavor metabolites produced by this strain when grown in RSM and MRS were 3-methylbutanal, 3-methylbutanol, benzaldehyde and acetoin (Asteri et al., 2009). The majority of the aforementioned metabolites produced by L. acidipiscis ACA-DC 1533 are degradation products of amino acids (**Figure 8**). In particular, benzaldehyde can be formed from two aromatic amino acids, namely phenylalanine and tyrosine, using an enzymatic and a non-enzymatic step (Nierop Groot and de Bont, 1998; Fernandez and Zuniga, 2006). Moreover, 3-methylbutanal and 3-methylbutanol are catabolic products of the branched-chain amino acid leucine (Fernandez and Zuniga, 2006). The α-ketoacid decarboxylase and the alcohol dehydrogenase involved in the leucine catabolism pathway were found to be present in the three L. acidipiscis genomes but absent from L. salivarius UCC118 and L. ruminis ATCC 27782. In contrast, aspartate aminotransferase, which catalyzes the transamination of phenylalanine and tyrosine, was present in all the Lactobacillus genomes analyzed. Many studies have shown that amino acid degradation products, especially those deriving from the branched-chain, aromatic and sulfur-containing amino acids, are significant flavor compounds in several cheese varieties (Ardö, 2006; Liu et al., 2008; Afzal et al., 2017). Furthermore, acetoin, which was produced by L. acidipiscis ACA-DC 1533, can be formed from pyruvate using two alternative pathways. Pyruvate, which derives from glycolysis, is converted into α-acetolactate by α-acetolactate synthase (LAC1533\_RS03500). α-Acetolactate is then catabolized either to acetoin by α-acetolactate decarboxylase (LAC1533\_RS03505) or to diacetyl in the presence of oxygen. Finally, diacetyl/acetoin dehydrogenase (LAC1533\_RS01560) catalyzes the conversion of diacetyl to acetoin (Celinska and Grajek, 2009). It should be mentioned that diacetyl was not detected as a volatile metabolite of L. acidipiscis ACA-DC 1533 in the work of Asteri et al. (2009). However, the presence of diacetyl/acetoin dehydrogenase in the ACA-DC 1533 genome could mean that by the time of sampling diacetyl had been fully converted into acetoin. Given that L. acidipiscis ACA-DC 1533, along with L. rennini, were the only species found in Kopanisti cheese, the production of the above-mentioned metabolites by L. acidipiscis ACA-DC 1533 via amino acid catabolism may contribute to the characteristic piquant flavor of Kopanisti cheese (Yvon and Rijnen, 2001; Asteri et al., 2009; Donnelly, 2016).
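The two alternative routes from pyruvate to acetoin described above can be written down as a small reaction graph. The Python sketch below is only an illustration of that description: the data structure and the `routes` helper are assumptions introduced here, and the step from α-acetolactate to diacetyl is labeled with the condition given in the text because no enzyme is named for it.

```python
# Reactions described in the text as (substrate, product, catalyst or condition).
# Locus tags are those quoted for L. acidipiscis ACA-DC 1533.
reactions = [
    ("pyruvate", "alpha-acetolactate", "alpha-acetolactate synthase (LAC1533_RS03500)"),
    ("alpha-acetolactate", "acetoin", "alpha-acetolactate decarboxylase (LAC1533_RS03505)"),
    ("alpha-acetolactate", "diacetyl", "in the presence of oxygen (no enzyme named)"),
    ("diacetyl", "acetoin", "diacetyl/acetoin dehydrogenase (LAC1533_RS01560)"),
]

# Adjacency list: substrate -> list of (product, catalyst).
graph = {}
for substrate, product, catalyst in reactions:
    graph.setdefault(substrate, []).append((product, catalyst))


def routes(graph, start, goal, path=()):
    """Enumerate all cycle-free routes from start to goal by depth-first search."""
    path = path + (start,)
    if start == goal:
        return [path]
    found = []
    for product, _catalyst in graph.get(start, []):
        if product not in path:
            found.extend(routes(graph, product, goal, path))
    return found


acetoin_routes = routes(graph, "pyruvate", "acetoin")
for route in acetoin_routes:
    print(" -> ".join(route))
```

The search returns one direct route through α-acetolactate decarboxylase and one longer route through diacetyl, which matches the suggestion that diacetyl may have been fully converted to acetoin by the time of sampling.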

## CONCLUSION

The L. salivarius clade includes mainly commensal species and it has been suggested that several strains may have probiotic properties (Neville and O'Toole, 2010). In this study, we analyzed the available genomes of L. acidipiscis, a species within the L. salivarius clade that to date has mainly been isolated from fermented foods of dairy or other origin. We wanted to examine whether L. acidipiscis is also a commensal that is transferred to the ecosystem of fermented foods accidentally from the host. Furthermore, we wanted to investigate the probiotic and/or the technological potential of the species. We sequenced the genome of strain ACA-DC 1533, originally isolated from Kopanisti, a traditional spread-type cheese that is highly salted and particularly piquant (Kazou et al., 2017). Our investigation suggested that L. acidipiscis has a relatively large genome compared to other species of the L. salivarius clade (at least those of L. salivarius and L. ruminis) with a relatively restricted percentage of pseudogenes. These findings, along with the observation that L. acidipiscis possesses a high number of glycobiome enzymes, may indicate an ability to occupy versatile environments and suggest that the species has not evolved toward a specific ecological niche. Perhaps adaptation to a nutrient-rich niche would have been more consistent with a smaller genome size, more like that of L. salivarius. Interestingly, L. acidipiscis strains ACA-DC 1533 and JCM 10692<sup>T</sup> appear to be more closely related to each other than to strain KCTC 13900, based on the presence/absence distribution of a number of genetic traits like prophages and CRISPRs. L. acidipiscis ACA-DC 1533 does not seem to present any evident probiotic trait at the genomic level. Besides the absence of several genes that have been related to probiotic properties of L. salivarius UCC118, preliminary experiments with L. acidipiscis ACA-DC 1533 on human PBMCs did not reveal an increased interleukin-10/interleukin-12 ratio, indicative of a potential to stimulate a Treg response (our unpublished results). Some probiotic properties have been suggested for specific L. acidipiscis strains, like an antiproliferative effect against Caco-2 cells (Thamacharoensuk et al., 2017) or the improvement of feed conversion efficiency in broiler chickens (Altaher et al., 2015). Thus, further in silico and/or experimental assessment of the probiotic properties of L. acidipiscis may be required. In addition, L. acidipiscis is able to grow in the dairy environment since it can ferment lactose, it possesses a complete proteolytic system for the degradation of milk proteins (without carrying a cell-envelope proteinase) and it can produce volatile compounds during the catabolism of amino acids that may contribute to the flavor of the final product. Intriguingly, L. acidipiscis is also considered a spoiler in vinegar and soy sauce. For this reason, technological steps to prevent its growth in certain fermented foods need to be devised. Further research is needed with different species of the L. salivarius clade, like the newly sequenced Lactobacillus agilis, to better appreciate the mechanisms underlying the adaptation to the host and/or the food environment. Sequencing of more strains/species would provide invaluable information about the ecology of this important clade within the Lactobacillus genus.

### AUTHOR CONTRIBUTIONS

MK and VA performed genome analysis and participated in the writing of the manuscript. JB and BP performed genome analysis. ET conceived the project and participated in the writing of the manuscript. KP conceived the project, performed genome analysis, and participated in the writing of the manuscript. All authors read and approved the final manuscript.

### FUNDING

This work was co-financed by the European Social Fund and the National resources EPEAEK and YPEPTH through the Thales project.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.01244/full#supplementary-material

FIGURE S1 | Heat map of ANI values among the five sequenced L. acidipiscis strains.

FIGURE S2 | Circular map of the L. acidipiscis ACA-DC 1533 chromosome as generated by IslandViewer 4. Highlighted regions correspond to GIs. GIs are colored within the circular map according to the prediction method used: GIs in orange were predicted with SIGI-HMM, GIs in blue with IslandPath-DIMOB and the integrated GIs are presented on the periphery in red. The black line plot corresponds to the GC content (%) of the chromosomal sequence.

FIGURE S3 | Circular map of the L. acidipiscis ACA-DC 1533 chromosome highlighting the predicted RM systems by the REBASE database. The symbols are color coded as indicated at the bottom of the figure.

FIGURE S4 | Chromosome alignments among the L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and L. ruminis ATCC 27782 strains generated by progressiveMAUVE. Locally collinear blocks (LCBs) of conserved sequences are presented by the same color (white corresponds to the strain-specific regions).

TABLE S1 | General information of the strains analyzed in this study.

TABLE S2 | Level of completeness among the four L. acidipiscis strains as calculated by dBBQs.

TABLE S3 | Pseudogenes identified in the L. acidipiscis ACA-DC 1533 chromosome.

TABLE S4 | (A) Pan-genome analysis among the three L. acidipiscis strains calculated with the EDGAR software. (B) Core-genome among the three L. acidipiscis strains calculated with the EDGAR software. (C) Singletons of the three L. acidipiscis strains calculated with the EDGAR software.

TABLE S5 | Genes within genomic islands of L. acidipiscis ACA-DC 1533. Common genes within the rest of L. acidipiscis strains were predicted from the pan-genome analysis with the EDGAR software.

TABLE S6 | (A) Prophage regions identified in the three L. acidipiscis strains using the PHASTER software. (B) CRISPR systems identified in the L. acidipiscis KCTC 13900 genome using the CRISPRFinder tool.

TABLE S7 | TA systems predicted in the chromosome and pLAC2 plasmid of L. acidipiscis ACA-DC 1533 genome.

TABLE S8 | Percentage of potential pseudogenes identified in the chromosomes of the completed sequenced genomes of L. salivarius clade.

TABLE S9 | (A) Core-genome among the L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and L. ruminis ATCC 27782 chromosomes as calculated with the EDGAR software. (B) Singletons among the L. acidipiscis ACA-DC 1533, L. salivarius UCC118 and L. ruminis ATCC 27782 chromosomes as calculated with the EDGAR software.

TABLE S10 | Transporters of L. acidipiscis ACA-DC 1533 chromosome and pLAC2 plasmid identified by TransportDB.

TABLE S11 | Putative enzymes involved in carbohydrate metabolism of L. acidipiscis ACA-DC 1533, L. salivarius UCC118, and L. ruminis ATCC 27782 chromosomes.

TABLE S12 | Acid production by L. acidipiscis ACA-DC 1533 using API 50 CHL strips.

TABLE S13 | The proteolytic system of L. acidipiscis, L. salivarius, and L. ruminis chromosomes and/or genomes.

TABLE S14 | Insertion sequences (ISs) identified among the L. acidipiscis ACA-DC 1533, L. salivarius UCC118, and L. ruminis ATCC 27782 chromosomes using the ISsaga platform.

TABLE S15 | Regulatory proteins identified in the chromosomes of L. acidipiscis ACA-DC 1533, L. salivarius UCC118, and L. ruminis ATCC 27782 using P2RP web server.

#### REFERENCES

islands in marine bacteria. BMC Genomics 13:347. doi: 10.1186/1471-2164-13-347



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Kazou, Alexandraki, Blom, Pot, Tsakalidou and Papadimitriou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Puzzling Over the Pneumococcal Pangenome

#### N. Luisa Hiller<sup>1,2</sup>\* and Raquel Sá-Leão<sup>3,4</sup>

<sup>1</sup> Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA, United States, <sup>2</sup> Center of Excellence in Biofilm Research, Allegheny Health Network, Pittsburgh, PA, United States, <sup>3</sup> Laboratory of Molecular Microbiology of Human Pathogens, Instituto de Tecnologia Química e Biológica António Xavier, Universidade Nova de Lisboa (ITQB NOVA), Oeiras, Portugal, <sup>4</sup> Departamento de Biologia Vegetal, Faculdade de Ciências, Universidade de Lisboa, Lisbon, Portugal

The Gram positive bacterium Streptococcus pneumoniae (pneumococcus) is a major human pathogen. It is a common colonizer of the human host, and in the nasopharynx, sinus, and middle ear it survives as a biofilm. This mode of growth is optimal for multi-strain colonization and genetic exchange. Over the last decades, the far-reaching use of antibiotics and the widespread implementation of pneumococcal multivalent conjugate vaccines have posed considerable selective pressure on pneumococci. This scenario provides an exceptional opportunity to study the evolution of the pangenome of a clinically important bacterium, and has the potential to serve as a case study for other species. The goal of this review is to highlight key findings in the studies of pneumococcal genomic diversity and plasticity.

#### Edited by:

Kimberly Kline, Nanyang Technological University, Singapore

#### Reviewed by:

Stuart C. Clarke, University of Southampton, United Kingdom Jason W. Rosch, St. Jude Children's Research Hospital, United States

#### \*Correspondence:

N. Luisa Hiller lhiller@andrew.cmu.edu

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 13 July 2018 Accepted: 09 October 2018 Published: 30 October 2018

#### Citation:

Hiller NL and Sá-Leão R (2018) Puzzling Over the Pneumococcal Pangenome. Front. Microbiol. 9:2580. doi: 10.3389/fmicb.2018.02580

Keywords: Streptococcus pneumoniae (pneumococcus), pangenome, genomic diversity, genomic plasticity, horizontal gene transfer, competence, vaccine, antibiotics

#### PNEUMOCOCCAL VACCINES AND ANTIBIOTICS

Despite current vaccines and antibiotics, Streptococcus pneumoniae remains a leading cause of morbidity and mortality worldwide (O'Brien et al., 2009; Drijkoningen and Rohde, 2014). Pneumococcal infection can take several forms, such as acute otitis media, pneumonia, bacteremia and meningitis (Pneumococcal Disease | Clinical | Streptococcus pneumoniae | CDC, 2017). Pneumococcal infection is preceded by colonization of the upper respiratory tract (mainly the nasopharynx), which is frequent and asymptomatic (Simell et al., 2012). As colonizers, pneumococci live in biofilms, an ideal environment for strain co-existence and horizontal gene transfer (Hall-Stoodley et al., 2006; Oggioni et al., 2006; Sanderson et al., 2006; Hoa et al., 2009; Marks et al., 2012a; Blanchette-Cain et al., 2013).

Most pneumococci have a polysaccharide capsule, an anti-phagocytic structure that surrounds the cell. There are nearly one hundred diverse capsular types (Bentley et al., 2006; Geno et al., 2015). Current strategies for preventing pneumococcal infection include the use of multivalent pneumococcal conjugate vaccines, which target a subset of all capsular types selected based on their association with the most common and/or virulent isolates in circulation. The first multivalent conjugate vaccine targeted seven capsular types (PCV7) and was widely implemented in the United States in 2000, and subsequently in many countries across the globe. Currently, there are two vaccines (PCV10 and PCV13) being used worldwide. For a detailed review see Geno et al. (2015). These vaccines are effective in preventing disease and decreasing colonization due to vaccine serotypes (Pneumococcal Disease | Clinical | Streptococcus pneumoniae | CDC, 2017).

Due to the limited valency of these vaccines, however, serotype replacement can occur (Weinberger et al., 2011). Disease replacement has been reported with different magnitudes depending on local epidemiology. Colonization replacement is extensive, with no overall net decrease in prevalence (Lee et al., 2014; Nunes et al., 2016). As such, although the overall benefit of limited-valency pneumococcal conjugate vaccines is unquestionable, this benefit is expected to be eroded over time (Weinberger et al., 2011).

Management of pneumococcal disease includes the use of antibiotics. Although isolates of pneumococci in the preantibiotic era were susceptible to many antimicrobial agents, resistant isolates have been described since the introduction of these agents into clinical use (Frisch et al., 1943; Hamburger et al., 1943). In the 1960s, resistance to tetracycline (Evans and Hansman, 1963), macrolides (Dixon, 1967; Kislak, 1967), and penicillin was reported (Hansman and Bullun, 1967). In the 1970s, multidrug-resistant strains, i.e., strains resistant to three or more classes of antibiotics, were described (Jacobs et al., 1978). By the late 1980s and early 1990s, penicillin-resistant pneumococci, often multidrug-resistant, had spread globally, achieving extremely high incidence in some countries both as colonizing and disease-causing agents (Sá-Leão et al., 2000; McGee et al., 2001).

Ever since its first description in 1881, S. pneumoniae has been extensively studied leading to seminal scientific discoveries such as the putative use of polysaccharide antigens as vaccines (Avery et al., 1917), the ability of polysaccharides to induce antibodies (Heidelberger and Avery, 1923), bacterial gene transfer (Griffith, 1928), the first bacterial autolytic enzyme (Dubos, 1937), the isolation and chemical characterization of the first polysaccharide antigen (Goebel and Adams, 1943), the identification of the "transforming principle" (later named DNA) as the genetic material (Avery et al., 1944), the therapeutic efficacy of penicillin (Tillet et al., 1944), the role of bacterial capsule in resistance to phagocytosis (Felton et al., 1955), and the first bacterial quorum sensing factor (Tomasz, 1965).

In the genomic era, pneumococci continue to be intensively investigated. The first pneumococcal genome was published in 2001 (Tettelin et al., 2001). Currently, the genomes of over 8,000 strains are publicly available, and the number is constantly increasing<sup>1</sup>. This scenario provides an exceptional opportunity to study the evolution of the pneumococcal pangenome under multiple selective pressures.

#### THE PNEUMOCOCCAL PANGENOME

The genes of the pneumococcus expand beyond those encoded within a single strain. Instead, they are distributed over the pneumococcal population, providing this species with an expanded set of genes to draw from for its own adaptation and evolutionary success (Hiller et al., 2007; Donati et al., 2010). As illustrated by the sequencing of the TIGR4 strain in 2001, pneumococcal genomes are approximately 2 megabases in length and encode an estimated 2200 coding sequences (Hoskins et al., 2001; Tettelin et al., 2001; Lanie et al., 2007; Ding et al., 2009).

<sup>1</sup>https://pubmlst.org/

Over twenty percent of the coding sequences of any single pneumococcal isolate are not encoded in all strains, but instead are part of an accessory genome unevenly distributed across isolates of this species (**Figure 1**; Hiller et al., 2007; Donati et al., 2010). The pneumococcal core genome is estimated to encode 500–1100 clusters of orthologues (Donati et al., 2010; Croucher et al., 2013a; Gladstone et al., 2015; Tonder et al., 2017). In contrast, the pneumococcal pangenome is estimated to encode 5000–7000 clusters of orthologues (see **Table 1** for definitions) (Donati et al., 2010; Croucher et al., 2013a; Gladstone et al., 2015; Tonder et al., 2017). In this manner, about three quarters of pneumococcal genes are differentially distributed across strains (**Figure 1**). Notably, in a study of isolates from the Maela refugee camp in Thailand, the pangenome structure was distinct from other studies: the pangenome was estimated at approximately 13,000 orthologous clusters and the core at approximately 400 (Tonder et al., 2017).
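The relationship between core, accessory, and pangenome can be made concrete with a small sketch. The example below is illustrative only: the strain names and cluster identifiers are hypothetical, and it assumes a strict definition of the core (clusters present in every strain), whereas published estimates often rely on softer presence thresholds and on dedicated orthologue-clustering software.

```python
from typing import Dict, Set, Tuple


def partition_pangenome(
    strain_clusters: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split orthologous clusters into pangenome, core and accessory sets.

    strain_clusters maps a strain name to the set of cluster IDs it encodes.
    Assumption: "core" means present in 100% of the strains supplied.
    """
    if not strain_clusters:
        raise ValueError("at least one strain is required")
    gene_sets = list(strain_clusters.values())
    pangenome = set().union(*gene_sets)      # clusters seen in any strain
    core = set.intersection(*gene_sets)      # clusters seen in every strain
    accessory = pangenome - core             # clusters missing from some strain
    return pangenome, core, accessory


def accessory_fraction_of_strain(strain: Set[str], core: Set[str]) -> float:
    """Fraction of a single strain's clusters that lie outside the core."""
    return len(strain - core) / len(strain)


# Hypothetical toy data; the cluster IDs are invented for illustration.
toy = {
    "strainA": {"c1", "c2", "c3", "c4", "c5"},
    "strainB": {"c1", "c2", "c3", "c6"},
    "strainC": {"c1", "c2", "c3", "c4", "c7", "c8"},
}
pan, core, acc = partition_pangenome(toy)
print(len(pan), len(core), len(acc))
print(accessory_fraction_of_strain(toy["strainA"], core))
```

In this toy set the accessory share of the pangenome is larger than the accessory share of any single strain, which mirrors the pattern described above: a minority of each genome is accessory, yet most clusters in the pangenome are differentially distributed.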

The accessory genome, with its diversity of genes and alleles, is a collection of genetic material. Thus, in a multi-strain biofilm, pneumococcal isolates can exchange DNA, generating strains with novel gene combinations, which may be beneficial to outcompete colonizers, evade host immunity, and escape human interventions, such as vaccines and antibiotics (Ehrlich, 2001; Ehrlich et al., 2005).

The origin of the accessory genome extends beyond this species (**Figure 2**). S. mitis also colonizes the human nasopharynx, and is the main reservoir of pneumococcal genetic diversity outside the species. Further, other colonizers of the upper respiratory tract, such as S. pseudopneumoniae, S. oralis and S. infantis, are substantial contributors to the pneumococcal pangenome (Kilian et al., 2008; Donati et al., 2010). Finally, mobile elements from distantly related streptococci and other genera are additional sources of pneumococcal diversity (Courvalin and Carlier, 1986).

#### TABLE 1 | Definition of terms used in this review.

FIGURE 2 | Schematic of pneumococcal genomic plasticity. Schematic is built on a phylogenetic tree of a subset of streptococcal species (adapted from Antic et al., 2017). Arrows indicate gene transfer events discussed in this article. (A) Pneumococcal genomes, shaped by recombination across strains. The rates of gene exchange vary across lineages. Non-encapsulated strains display high rates of transfer, and may serve as recombination highways. The PMEN1 lineage (type 81) has impacted the genomes of multiple lineages as a common gene donor. The PMEN31 lineage (serotype 3 lineage from clonal complex 180) has relatively stable genomes. (B) A distinct phyletic branch captures strain differentiation. It is composed of the classical non-typable strains, which colonize the nasopharynx and can cause conjunctivitis. Gene exchange between this branch and all other pneumococci may prevent speciation. (C) Gene transfer between S. pneumoniae and S. mitis has played a critical role in penicillin resistance via formation of mosaic PBPs. (D) Gene transfer between S. suis and S. pneumoniae, illustrated by the acquisition of an adhesin by pneumococci.

#### MECHANISMS OF PNEUMOCOCCAL PLASTICITY

The pneumococcus has evolved a striking propensity for genetic exchange. When combined with the extensive variability in pneumococcal genome content, these high rates of recombination can generate novel gene combinations within short time frames, such as during a single chronic infection (Hiller et al., 2010; Golubchik et al., 2012). Pneumococci spread from person-to-person via secretions from the upper respiratory pathway (Weiser et al., 2018). Thus, gene transfers that are incurred in the nasopharynx are the most likely to be transmitted, and the main source of long-term variation in the species. In general, bacteria employ transformation, conjugation and transduction for horizontal gene transfer; the pneumococcus makes use of all these mechanisms.

The pneumococcus is a classic example of natural competence. Via quorum sensing, bacterial cells produce and sense a peptide pheromone (competence stimulating peptide) that induces a state of competence (Tomasz and Hotchkiss, 1964; Havarstein et al., 1995). Competence leads to the regulation of over one hundred genes, including the apparatus for DNA uptake and transformation (Dagkessamanskaia et al., 2004; Peterson et al., 2004). While transformation frequencies vary across strains and conditions, transformation in pneumococcus can be highly efficient. In biofilm-grown cells in a bioreactor, transformation reaches rates on the order of 10<sup>−3</sup> (Lattar et al., 2018). It is noteworthy that in these conditions recombination between strains may be unidirectional, although the mechanism behind this intriguing observation remains to be uncovered (Lattar et al., 2018). In animal models of dual-strain carriage, transformation reaches rates on the order of 10<sup>−2</sup> (Marks et al., 2012b). The representative length of a recombinant fragment resembles that of a gene, at approximately 1 Kb (Donati et al., 2010). However, under selective pressure and in environments that involve cell-to-cell contact (such as an in vitro biofilm or the nasopharynx) much larger recombinant fragments are observed, often with lengths substantially over 10 Kb (Trzciński et al., 2004; Brueggemann et al., 2007; Hiller et al., 2010; Golubchik et al., 2012; Cowley et al., 2018).

Multiple factors contribute to pneumococcal transformation. First, pneumococcal biofilms provide abundant DNA. The biofilm matrix is DNA-rich and pneumococci encode competence-induced bactericidal molecules that lyse neighboring cells, exposing DNA from divergent strains (Håvarstein et al., 2006; Dawid et al., 2007; Hall-Stoodley et al., 2008). Second, pneumococci can protect internalized ssDNA. Cells produce high amounts of single-stranded DNA-binding protein (SsbB), which protects up to half a genome equivalent of intracellular DNA. These ssDNA-SsbB complexes serve as DNA reservoirs that can be used in multiple recombination events (Attaiech et al., 2011). Third, the recombination machinery does not require long stretches of identical sequence for recombination; this allows for the incorporation of highly divergent alleles or novel genes (Prudhomme et al., 2002). Moreover, human interventions can also influence transformation. Specifically, multiple classes of antibiotics can activate competence (Prudhomme et al., 2006; Stevens et al., 2011; Slager et al., 2014; Domenech et al., 2018). Three independent mechanisms have been described. First, aminoglycosides trigger an increase in misfolded proteins, which are substrates for the HtrA protease that normally degrades the competence stimulating peptide (CSP). This substrate competition increases the pool of available CSP (Stevens et al., 2011). Second, fluoroquinolones and HPUra stall replication, and in doing so increase the dosage of the competence genes that are located near the origin of replication (Slager et al., 2014). Third, aztreonam and clavulanic acid drive cell chaining, increasing local concentrations of CSP and promoting competence (Domenech et al., 2018). In conclusion, the pneumococcus has evolved multiple strategies to ensure high rates of transformation, and human intervention can further increase these rates.

Transduction is the horizontal transfer of genes via bacteriophages. These phages display lysogenic and lytic cycles. In the lysogenic phase, prophages are integrated into the host genome and vertically transmitted. In the lytic phase, phages excise from the host genome, undergo replication, lyse the host cell and are horizontally transmitted through infection of other bacterial cells. Prophages are ubiquitous within pneumococcal genomes, and there is clear evidence of phage activity in the species (McDonnell et al., 1975; Bernheimer, 1977; Ramirez et al., 1999; Romero et al., 2009a,b; Brueggemann et al., 2017). The majority of pneumococcal prophages can be organized into five groups based on sequence similarity. Furthermore, the phage integrase defines the genomic location where a phage will integrate into the genome, and most pneumococcal prophages are localized in one of five conserved locations (Brueggemann et al., 2017). Pneumococcal prophages display a highly variable strain distribution, consistent with extensive gene transfer, gain and loss, as well as restriction modification systems that limit acquisition of certain phages in subsets of strains (Croucher et al., 2014b). Nonetheless, in the midst of such plasticity, there are associations between specific prophages and pneumococcal lineages, as exemplified by the φMM1 phage. This phage is integrated in strains of the PMEN1 lineage, a well-studied pandemic multidrug penicillin-resistant lineage also referred to as the Spain 23F (ST81) clone, where it has persisted for decades (Muñoz et al., 1991; Corso et al., 1998; Ramirez et al., 1999; Sá-Leão et al., 2000; McGee et al., 2001; Obregón et al., 2003; Brueggemann et al., 2017). Studies on a φMM1-like phage have shed light on the potential role of phages in pneumococcal biology: φMM1-like phages promote adherence to inert surfaces and pharyngeal cells in multiple genomic backgrounds (Loeffler and Fischetti, 2006).
The link between phage proteins and adherence is not specific to this group of phages; the phage tail protein PblB (not encoded by the MM1 group) also promotes adherence to human epithelial cells, as well as nasopharyngeal and lung infection in a murine model of pneumococcal disease (Hsieh et al., 2015). Finally, the Spn1 prophage offers an example of the impact of a phage directly on pneumococcal physiology. Spn1 causes a defect in LytA-mediated autolysis and an increase in chain length, decreases fitness in the mouse model of asymptomatic colonization, and affords increased resistance to lysis by penicillin (DeBardeleben et al., 2014). Together these findings are consistent with the hypothesis that at least some groups of prophages modify the fitness of their pneumococcal hosts and provide an evolutionary framework to study the long-term persistence of subsets of phages within this bacterial species. The vast majority of phage-encoded coding sequences have unknown functions, and their role in pneumococcal biology and evolution remains an exciting open question.

Pneumococcal genomes also encode phage-related chromosomal islands (PRCIs) (Croucher et al., 2014b; Javan and Brueggemann, 2018). Unlike prophages, PRCIs are not capable of lytic cycles; instead, they are presumed to hitchhike with prophages. PRCIs encode genes that enable use of the phage-reproduction machinery of other phages; they can be conceptualized as phage parasites (Novick et al., 2010). The PRCI distribution across lineages appears much less diverse than that of prophages, perhaps due to limited lateral transfer or due to high specificity between PRCI and subsets of prophage hosts. The pneumococcal PRCIs can be organized into four major groups (Javan and Brueggemann, 2018). The molecular machinery of pneumococcal PRCIs and their role in genomic plasticity and pneumococcal biology are largely unknown.

Conjugation is employed by pneumococci for horizontal gene transfer of integrative and conjugative elements (ICEs), which consist of conjugative transposons or integrative plasmids. Conjugation requires direct contact between donor and recipient cells. When inserted into the bacterial genome these autonomous mobile elements undergo vertical transfer. Alternatively, they can be excised from the genome, undergo horizontal gene transfer across strains, and integrate into the recipient's genome (Burrus et al., 2002). Many of the best-characterized pneumococcal ICEs are related to Tn5253, and correspond to composite transposons. These elements are found in variable locations within the bacterial genome, consistent with independent insertion events (Ayoubi et al., 1991; Croucher et al., 2009; Chancey et al., 2015). ICEs are particularly important to pneumococcal biology given their propensity to carry drug resistance genes, including resistance to macrolides, tetracycline, and chloramphenicol (Ayoubi et al., 1991; Croucher et al., 2009; Chancey et al., 2015). ICEs can carry multiple kilobases of genomic material, and as such allow for substantial genomic diversity to emerge from a single transfer event. A well-studied instance of an ICE that carries multiple drug resistance determinants is ICESp23FST81, a Tn5253-related element found within many strains of the PMEN1 lineage. This element encodes multiple drug resistance genes, and has gained additional ones via integration of the Tn916 element into the original Tn5252 ICE. Further, it also encodes the TprA2/PhrA2 quorum sensing signaling system, which signals to itself as well as to a homologous quorum sensing system widely distributed in the species (TprA/PhrA). Downstream of TprA2, and negatively regulated by it, is a lanthipeptide cluster that promotes disease in a murine model of pneumonia (Kadam et al., 2017). While ICESp23FST81 is almost exclusive to PMEN1, it is also found outside this lineage in an ST156 strain, depicting an instance of ICE transfer across diverse lineages (Chancey et al., 2015). ICEs and PRCIs display mosaic patterns that are consistent with variation via transformation events. Thus, these mobile elements reveal how transduction or conjugation cooperate with transformation to generate pneumococcal diversity (Croucher et al., 2014b; Chancey et al., 2015).

In summary, transformation, transduction and conjugation drive the evolution of pneumococcus. Recombination plays a pivotal role in generating vaccine escape strains via capsular switching and drug resistant strains via the acquisition of resistance alleles or genes.

#### GENE TRANSFER IN THE PNEUMOCOCCUS: IMPLICATIONS FOR THE PANGENOME AND PNEUMOCOCCAL BIOLOGY

Pneumococcus researchers have the privilege of studying a species with an abundance of genomic data (Jolley and Maiden, 2010; Riley et al., 2012). There are multiple genome-wide studies, including detailed annotations of single genomes and comparative analyses of lineages, serotypes, and populations (Dopazo et al., 2001; Hoskins et al., 2001; Tettelin et al., 2001; Hiller et al., 2007; Lanie et al., 2007; Croucher et al., 2009, 2011, 2013a, 2014b; Ding et al., 2009; Donati et al., 2010; Everett et al., 2012; Wyres et al., 2012; Chewapreecha et al., 2014; Chaguza et al., 2016b, 2017; Lees et al., 2017; Makarewicz et al., 2017; Aprianto et al., 2018; Azarian et al., 2018). The selective pressure imposed by multivalent pneumococcal conjugate vaccines provides a unique opportunity to observe population dynamics and investigate the pangenome under an instance of dramatic selection. The GPS (**G**lobal **P**neumococcal **S**equencing **p**roject) is a remarkable ongoing global initiative, one of several that aim to employ genome-wide studies to understand the impact of the vaccine on the pangenome<sup>2</sup>.

The origin of non-vaccine type strains post-vaccination is of great importance for vaccine implementation and public health. In these circumstances serotype replacement occurs mainly due to expansion of non-vaccine types that existed before vaccine introduction (Simões et al., 2011; Frazão et al., 2013; Chaguza et al., 2017). A study in Portugal observed that the overall prevalence of drug-resistant pneumococci among carriers, commonly associated with vaccine serotypes in the pre-vaccine era, did not change following PCV7 use despite extensive serotype replacement (Simões et al., 2011). This was due to expansion of drug-resistant lineages expressing non-vaccine types that were already in circulation, albeit at lower frequency, in the pre-vaccine era. Perhaps contrary to expectations, post-vaccine capsular switches of vaccine types to non-vaccine types did not contribute significantly to the pool of post-vaccine drug-resistant isolates. Nonetheless, there is at least one example of serotype switch leading to maintenance, in the post-vaccine era, of a clone previously associated with a vaccine type. The PMEN14 lineage, also referred to as Taiwan 19F, was a highly prevalent lineage and a major contributor to pneumococcal disease in the PCV7 era. While the 19F capsule was targeted by PCV7, the 19A capsule was not, and variants of this lineage that had switched to serotype 19A spread in the United States in the PCV7 era (Geno et al., 2015). A large comparative analysis of PMEN14 isolates detected multiple instances of serotype switching to 19A (Croucher et al., 2014a). Notably, a study of strains from non-vaccinated volunteers also captured multiple instances of gene transfer into the PMEN14 lineage (Chewapreecha et al., 2014; Croucher et al., 2014a). These studies support the model where gene exchange within a lineage generates a mixed population, which is situated to withstand selective pressures such as those imposed by the vaccines (Croucher et al., 2014a).
Once selective pressure is imposed, it results in the spread of the genotypes with a fitness advantage; the subsequent competition across these genotypes shapes the population.

Akin to fitness, linkage disequilibrium is another player in the outcome of recombination in pneumococci. A comparative genomic analysis combined with mathematical models suggests that pneumococci display clear groups of alleles, termed metabolic types (Watkins et al., 2015). This analysis categorized allelic differences in 876 metabolic/transport genes; it demonstrated a highly non-random distribution of genes and higher linkage disequilibrium within this set when compared to control sets of genes. These metabolic types represent a fitness peak, which may be decreased by recombination events. In a more recent work, it is proposed that epistatic interaction between the groEL chaperone and other genes may be a strong driving force in this process (Lourenço et al., 2017). Moreover, metabolic types are associated with serotypes and thus are likely to influence the fitness of serotype switches. It seems probable that the 19F–19A switches described above do not disrupt metabolic profiles. The authors propose that genes within a metabolic type have co-evolved, and represent a highly successful set of alleles that are well adapted to a particular metabolic niche. In conclusion, gene exchange plays a critical role in the evolution of pneumococci, but may be limited by decreased fitness of recombinant strains and linkage disequilibrium.

<sup>2</sup>https://www.pneumogen.net/gps/project\_outline.html
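For readers less familiar with linkage disequilibrium, the following sketch shows one standard way to quantify it between two biallelic loci in a haploid population, using the coefficient D and the squared correlation r². It is a generic textbook calculation, not the specific method used in the studies cited above; the 0/1 allele coding and the toy strain vectors are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def linkage_disequilibrium(
    locus_a: Sequence[int], locus_b: Sequence[int]
) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci coded 0/1 per strain.

    Because bacteria are haploid, each strain directly gives one haplotype,
    so haplotype frequencies can be counted without phasing.
    """
    if len(locus_a) != len(locus_b) or len(locus_a) == 0:
        raise ValueError("loci must be non-empty and of equal length")
    n = len(locus_a)
    p_a = sum(locus_a) / n                      # frequency of allele 1 at locus A
    p_b = sum(locus_b) / n                      # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in zip(locus_a, locus_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                        # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


# Hypothetical allele calls across four strains.
linked = linkage_disequilibrium([1, 1, 0, 0], [1, 1, 0, 0])
unlinked = linkage_disequilibrium([1, 1, 0, 0], [1, 0, 1, 0])
print(linked, unlinked)
```

An r² near 1 indicates that alleles at the two loci travel together across strains, as would be expected for co-adapted alleles within a metabolic type, whereas an r² near 0 indicates that recombination has shuffled them independently.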

### Evidence of in vivo Recombination: Characterization, Tempo, and Barriers

The first documented example of in vivo recombination is the Griffith (1928) experiment. In mice, a non-encapsulated strain incorporated DNA from an encapsulated isolate, leading to its conversion from avirulent to virulent (Griffith, 1928). Building on this eminent experiment, the pneumococcal community has captured and quantified in vivo recombination.

Pneumococcal gene transfer is common across strains in the same lineage. This can be exemplified by exchange across strains of the pandemic PMEN1 lineage. A study comparing 240 PMEN1 isolates, recovered across the globe over 25 years, captured over 700 recombination events (Croucher et al., 2011). These events were concentrated in the capsular locus, regions encoding surface-exposed molecules implicated in host interactions (pspA, psrP, and pspC), as well as the ICESp23FST81 and the φMM1-2008 prophage. While recombination across PMEN1 isolates is widespread, these strains have a low propensity to receive heterologous DNA (Wyres et al., 2012). Ninety-five percent of the coding sequences from modern PMEN1 strains are highly similar to those of their common ancestor, represented by an isolate from 1967. It has been proposed that the DpnIII restriction modification system, which is encoded in the vast majority of PMEN1 strains and rare outside the lineage, provides one mechanism to decrease the frequency of heterologous recombination in this group and, in doing so, fosters genomic stability in the PMEN1 lineage (Eutsey et al., 2015). In contrast, strains in the PMEN1 lineage display a high propensity to donate DNA and have substantially impacted the genomic composition of many lineages (Wyres et al., 2012). The genes that encode a subset of the penicillin binding proteins (PBPs) confer penicillin non-susceptibility; the alleles from the PMEN1 lineage are widely distributed across the pneumococcal population (Sá-Leão et al., 2002; Wyres et al., 2012). In addition, the PMEN3 and CGSP14 lineages have acquired 5.3 and 9.5% of their genomes from the PMEN1 lineage, respectively (Wyres et al., 2012). Combined, these studies capture recombination in vivo, describe the uneven transfer of genes across lineages, and illustrate an example of extensive gene exchange within a single pneumococcal lineage.

New gene combinations can be generated in the time frame of a chronic mucosal infection or a colonization event. A study of a single, chronic and polyclonal infection captured the progressive accumulation of recombinations (Hiller et al., 2010). Specifically, it compared a set of six clinical strains, isolated over a 7-month period from a single child. This set contains the major parental strain, the DNA donor, and the recombinant, and thus allowed the reconstruction of inter-lineage in vivo recombination replacements. A substantial percentage of the genome, over seven percent of the dominant lineage, was recombined over the course of this single infection. The transferred regions were distributed over the genome, consistent with 23 recombination events. A separate study captured multiple genes donated by a single donor in a population analysis of vaccine escape strains (Golubchik et al., 2012). It documented transfers, ranging from 0.04 to 44 Kb in size, in the capsular locus and across the genome. Together, these studies capture multiple instances of in vivo transfer between strains. Their findings are consistent with multiple transfers during a single competence event and with consecutive sequential events between donor and recipient strains during co-existence in a biofilm.
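Statements such as "over seven percent of the genome was recombined" rest on simple interval arithmetic: recombinant segments are located on the recipient genome, overlapping segments are merged, and the covered length is divided by the genome length. The sketch below illustrates that calculation. The coordinates are hypothetical and were chosen only so that the toy result lands near seven percent of a 2-megabase genome; they are not data from the studies cited above.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) coordinates in base pairs


def merge_intervals(segments: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching segments so no base is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(segments):
        if end <= start:
            raise ValueError("each segment must have end > start")
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def recombined_fraction(segments: Iterable[Interval], genome_length: int) -> float:
    """Fraction of the genome covered by at least one recombinant segment."""
    covered = sum(end - start for start, end in merge_intervals(segments))
    return covered / genome_length


# Hypothetical recombinant segments on a 2 Mb recipient genome.
toy_segments = [
    (100_000, 110_000),
    (105_000, 120_000),      # overlaps the first segment
    (500_000, 530_000),
    (1_200_000, 1_290_000),
]
merged_segments = merge_intervals(toy_segments)
fraction = recombined_fraction(toy_segments, 2_000_000)
print(merged_segments, fraction)
```

Merging before summing matters when sequential recombination events overlap, since the number of events and the number of distinct replaced regions can then differ.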

#### Inter-Species Gene Transfer

A prominent example of inter-species gene exchange is documented in the evolution of penicillin resistance (**Figure 2**, gene exchange between C and A). The main genetic determinants of penicillin resistance are variable alleles of the penicillin-binding proteins PBP1A, PBP2B and PBP2X. Resistance occurs when alleles of these cell wall synthesis proteins display decreased binding affinity to penicillin (Hakenbeck et al., 1980; Zighelboim and Tomasz, 1980; Grebe and Hakenbeck, 1996; Smith and Klugman, 1998). The genes coding for these low-affinity PBPs are mosaic genes formed by recombination between strains, in many cases between pneumococcus and other streptococcal species (Dowson et al., 1989; Sibold et al., 1992; Chi et al., 2007; Sauerbier et al., 2012; Mousavi et al., 2017). A recent longitudinal study in a cystic fibrosis patient captured a gene exchange event between a strain of S. mitis and pneumococcus (Rieger et al., 2017). An additional example of a functional class of genes that undergoes exchange across species is peptides implicated in cell-cell communication. These coding sequences can display allelic distributions that do not match that of the species tree, suggestive of intra- and inter-species gene exchange (Hoover et al., 2015; Cuevas et al., 2017; Kadam et al., 2017). Their presence and distributions outside pneumococcus are consistent with both gene exchange and coordinated interactions across species boundaries.

#### Differences in Plasticity Across Lineages

Pneumococcal lineages differ in their plasticity. Non-encapsulated strains appear to have the highest recombination frequencies. A study of over 3,000 carriage genomes, from residents of the Maela refugee camp in Thailand, observed the highest frequency of gene donors and recipients in non-encapsulated strains (Chewapreecha et al., 2014). These findings are highly relevant to public health as they suggest that strains beyond the scope of the vaccines serve as highways of recombination. In contrast, the lineage encompassing the serotype 3 strains from clonal complex 180 stands out as a lineage with a very stable genome and rare transfer events (Croucher et al., 2013b). This example is particularly significant as serotype 3 is targeted by PCV13, yet serotype 3 isolates are prevalent in some populations post PCV13 (Horácio et al., 2016; Slotved et al., 2016; Lapidot et al., 2017; Silva-Costa et al., 2018). The explanation for the possible reduced efficacy of the vaccine toward serotype 3 strains may lie in the fact that, unlike most capsules, their mucoid capsule is not covalently linked to the cell wall (Caimano et al., 1998; Cartee et al., 2000). Studies in the murine model suggest that type 3 capsule can detach from the bacteria and serve as a decoy, diverting the antibody response away from the bacteria and in doing so decreasing vaccine efficacy (Choi et al., 2016). These type 3 strains are highly virulent and display a relatively short duration of carriage (Sleeman et al., 2006). Thus, these extreme differences in plasticity may be attributed, at least in part, to differences in virulence potential and duration of carriage, which are associated with transformation efficiency (Chaguza et al., 2016a).

#### Balancing Selection of Accessory Genes

What is the consequence of vaccination on the pangenome composition? Given that accessory genes are often associated with lineages, what happens to the distribution of accessory genes associated with lineages of vaccine serotypes? One expectation is that as lineages decrease in frequency due to vaccine pressure, so do the accessory genes associated with such lineages. However, recent theoretical work and population studies suggest an alternative, where the frequency of accessory genes within a population returns to a baseline frequency after disruption by the vaccine (Corander et al., 2017; Azarian et al., 2018).

The mechanism proposed for balancing accessory genes within the pneumococcal population is negative frequency-dependent selection (Corander et al., 2017). A comparison of multiple pneumococcal populations suggests that the frequency of many accessory genes in a pneumococcal population is maintained even after major disruptions in the primary lineages via vaccination (Corander et al., 2017). The same trend was observed in a large genomic study that compared over 900 genomes from two Native American communities (Azarian et al., 2018). Strains were collected over a period of 14 years from three groups: a vaccine-naïve population, a population soon after introduction of the vaccine, and a population four to six years post-vaccine. Over time, accessory genes trended toward the frequencies observed in the naïve population (Azarian et al., 2018). Together, these studies provide growing evidence that selection from the human immune system and/or other microbes serves to equilibrate the composition of the pneumococcal pangenome. When combined with the notion of metabolic types, this observation has profound implications regarding our ability to predict the influence of vaccines and therapies on the pangenome composition and consequently on our ability to design effective long-term tools for prevention, diagnosis, and treatment of pneumococcal disease.

#### Strain Diversification: A Distinct Phyletic Group Among the Pneumococcus

Within the genetically diverse pneumococci there is a clear example of strain diversification. A subset of non-encapsulated strains has long been identified as distinct in clinical studies (Porat et al., 2006; Williamson et al., 2008; Zegans et al., 2009; Norcross et al., 2010; Haas et al., 2011). More recently, four comparative genomic studies have observed that a subset of the non-encapsulated strains is clustered into a distinct phyletic branch (Croucher et al., 2014b; Hilty et al., 2014; Valentino et al., 2014; Antic et al., 2017). These are referred to as the classical non-encapsulated strains (Keller et al., 2016). Not all non-encapsulated strains are part of this distinct phyletic branch. The sporadic non-encapsulated strains belong to the major pneumococcal phylogenetic branch. For more detail, a comprehensive review of non-encapsulated strains is available (Keller et al., 2016). In this manner, nonencapsulated pneumococci can be classified into two highly distinct phylogenetic groups based on their core genome sequence, where the classical set represents a distinct phyletic branch (**Figure 2**, branch "B").

The classical non-encapsulated strains encode a distinct accessory genome (Croucher et al., 2014b; Hilty et al., 2014; Valentino et al., 2014; Antic et al., 2017). This discrete set of genes has biological implications, as the vast majority of isolates that cause pneumococcal conjunctivitis are part of this group (Hilty et al., 2014; Valentino et al., 2014). The adhesin SspB exemplifies a gene product that is widely distributed in classical non-encapsulated strains and absent outside this set. SspB enhances attachment to ocular epithelial cells and may play a role in tissue tropism by extending the pneumococcal niche to the conjunctiva (Antic et al., 2017). A theoretical study has used this distinct pneumococcal lineage to model speciation. The study suggests that frequent recombination between the classical non-encapsulated strains and other pneumococci maintains this branch at a constant genetic distance and may serve as a force to prevent speciation (Marttinen and Hanage, 2017). In summary, the classical non-typable strains illustrate strain diversification in the pneumococcus and serve as the most extreme example of genetic diversity within this species.

### PERSPECTIVES

The pneumococcal community is poised for a new set of challenges and discoveries, with genomic data, in vitro and in vivo evolution studies, and omics technologies developing at unprecedented speeds. The use of antibiotics and the implementation of pneumococcal vaccines have played a crucial role in decreasing mortality due to pneumococcal infections. However, pneumococcus remains a major human pathogen, with high rates of carriage, multi-drug resistance, and serotype replacement (Huang et al., 2009; O'Brien et al., 2009; Rodrigues et al., 2009; Lee et al., 2014; WHO | Pneumococcal Disease, 2014; Nunes et al., 2016). There is significant progress in the development of a universal serotype-independent vaccine. These efforts are focused on protective and widespread protein antigens as well as on the development of whole-cell vaccines (Miyaji et al., 2013; Rosch, 2014; Moffitt and Malley, 2016; Wilson et al., 2017). In parallel, experimental evolution in in vitro biofilms provides insights into adaptation in vivo (Churton et al., 2016; Cowley et al., 2018). Current studies suggest the revolutionary idea that we may be in a position to predict the genomic composition of dynamic pneumococcal populations and their responses to vaccines and therapies. Moreover, in keeping with pneumococcus as a model for genomic studies, these findings may be applicable to the evolution of pangenomes in other bacterial species. When integrated with future advances in the understanding of virulence determinants and master regulators of pathogenesis, such information should permit exquisitely effective vaccines and therapy implementation. Perhaps, one day, we may have the required knowledge to manipulate bacterial pangenomes to transform opportunistic pathogens into human commensals.

#### AUTHOR CONTRIBUTIONS

NH and RS-L wrote the review.

#### FUNDING

This work was supported by the Department of Biological Sciences at Carnegie Mellon, the Carnegie Mellon University Article Processing Charge (APC) Fund, the ONEIDA project (LISBOA-01-0145-FEDER-016417) co-funded by FEEI – "Fundos Europeus Estruturais e de Investimento" from "Programa Operacional Regional Lisboa 2020", and by Portuguese national funds from Fundação para a Ciência e a Tecnologia.

#### ACKNOWLEDGMENTS

We would like to thank Dr. Angela Brueggemann, Ms. Karina Müller-Brown, and Dr. Arthur Huen for their productive comments and suggestions. We dedicate this review to the Great Prof. Alexander Tomasz, who has contaminated us with his contagious fascination for the pneumococcus.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer JR declared a past co-authorship with one of the authors NH to the handling Editor.

Copyright © 2018 Hiller and Sá-Leão. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Sucrose-Induced Proteomic Response and Carbohydrate Utilization of *Lactobacillus sakei* TMW 1.411 During Dextran Formation

Roman M. Prechtl<sup>1</sup>, Dorothee Janßen<sup>1</sup>, Jürgen Behr<sup>2†</sup>, Christina Ludwig<sup>2</sup>, Bernhard Küster<sup>2</sup>, Rudi F. Vogel<sup>1</sup> and Frank Jakob<sup>1\*</sup>

<sup>1</sup> Lehrstuhl für Technische Mikrobiologie, Technische Universität München, Freising, Germany, <sup>2</sup> Bavarian Center for Biomolecular Mass Spectrometry, Freising, Germany

#### *Edited by:*

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### *Reviewed by:*

Diego Mora, Università degli Studi di Milano, Italy Michael Gänzle, University of Alberta, Canada

> *\*Correspondence:* Frank Jakob frank.jakob@wzw.tum.de

#### *†Present Address:*

Jürgen Behr, Leibniz-Institut für Lebensmittel-Systembiologie, Technische Universität München, Freising, Germany

#### *Specialty section:*

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

*Received:* 31 August 2018 *Accepted:* 31 October 2018 *Published:* 23 November 2018

#### *Citation:*

Prechtl RM, Janßen D, Behr J, Ludwig C, Küster B, Vogel RF and Jakob F (2018) Sucrose-Induced Proteomic Response and Carbohydrate Utilization of Lactobacillus sakei TMW 1.411 During Dextran Formation. Front. Microbiol. 9:2796. doi: 10.3389/fmicb.2018.02796

Lactobacillus (L.) sakei belongs to the dominating lactic acid bacteria in indigenous meat fermentations, while diverse strains of this species have also been isolated from plant fermentations. We recently showed that L. sakei TMW 1.411 produces a high molecular weight dextran from sucrose, indicating its potential use as a dextran-forming starter culture. However, the general physiological response of L. sakei to sucrose as carbohydrate source has not been investigated yet, especially upon simultaneous dextran formation. To address this lack of knowledge, we sequenced the genome of L. sakei TMW 1.411 and performed a label-free, quantitative proteomics approach to investigate the sucrose-induced changes in the proteomic profile of this strain in comparison to its proteomic response to glucose. In total, 21 proteins were found to be differentially expressed at the applied significance criteria (FDR ≤ 0.01). Among these, 14 were associated with the carbohydrate metabolism, including several enzymes that enable sucrose and fructose uptake as well as their subsequent intracellular metabolization. The plasmid-encoded, extracellular dextransucrase of L. sakei TMW 1.411 was expressed at high levels irrespective of the carbohydrate present and was predominantly responsible for sucrose consumption in growth experiments using sucrose as sole carbohydrate source, while the fructose released by the dextransucrase reaction was taken up and intracellularly metabolized in preference to sucrose. Genomic comparisons revealed that operons coding for uptake and intracellular metabolism of sucrose and fructose are chromosomally conserved among L. sakei, while plasmid-located dextransucrase genes are present only in few strains. In accordance with these findings, all 59 different L. sakei strains of our strain collection were able to grow on sucrose as sole carbohydrate source, while eight of them exhibited a mucous phenotype on agar plates indicating dextran formation from sucrose. Our study therefore highlights the intrinsic adaptation of L. sakei to plant environments, where sucrose is abundant, and provides fundamental knowledge regarding the use of L. sakei as starter culture for sucrose-based food fermentation processes with in-situ dextran formation.

Keywords: *Lactobacillus sakei,* proteomics, genomics, sucrose, dextran, metabolism

## INTRODUCTION

The species Lactobacillus (L.) sakei is typically isolated from fermented meat products, which have been suggested as its main habitat, since it belongs to the dominating species in spontaneous meat fermentations (Hammes et al., 1990; McLeod et al., 2010). Hence, L. sakei spp. are frequently exploited as starter cultures for the manufacturing of fermented meat products (e.g., raw-fermented sausages) (Zagorec and Champomier-Vergès, 2017). Since the main nutrients available for growth in meat products are (purine) nucleosides, certain amino acids (e.g., arginine), glucose and ribose (Champomier-Vergès et al., 2001; Chaillou et al., 2005; Rimaux et al., 2012), the adaptation of L. sakei spp. to these carbon sources, as well as the associated proteomic profiles and metabolic pathways, have been the subject of several studies, including growth experiments and -omics approaches (Hammes et al., 1990; Hüfner et al., 2007; Fadda et al., 2010; McLeod et al., 2010, 2011).

However, L. sakei has also been isolated from various plant fermentations such as silage and sauerkraut, and it was originally isolated from rice wine, namely the traditional Japanese beverage sake (Vogel et al., 1993; Torriani et al., 1996; Champomier-Vergès et al., 2001; Amadoro et al., 2015; Prechtl et al., 2018). In addition, some strains of L. sakei express glucansucrases that synthesize high molecular weight dextrans from sucrose, which have been discussed as contributing to biofilm formation and as potential fish prebiotics in aquaculture (Gänzle and Follador, 2012; Nácher-Vázquez et al., 2015, 2017a; Prechtl et al., 2018). Thus, the species L. sakei seems to be adapted to plant-based environments as well, where sucrose is usually the most abundant carbon source.

Despite its frequent use as a starter culture for the manufacturing of fermented meat products, the capability of some strains to produce exopolysaccharides such as dextran from sucrose has not been commercially exploited yet. In such applications, sucrose would have to be added as carbon source instead of the usually applied glucose, enabling in-situ dextran synthesis and thereby the manufacturing of "clean label" products with improved properties (Hilbig et al., unpublished data).

However, little is known about the general physiological response of dextran-forming L. sakei to this carbohydrate, and especially about the metabolic pathways that allow intracellular sucrose consumption and could hence compete with dextran formation. To address this lack of knowledge, we sequenced the genome of the dextran-producing strain L. sakei TMW 1.411 and performed a label-free proteomic approach to identify metabolic pathways that are upregulated during growth on sucrose. Moreover, we determined the metabolites produced and consumed by L. sakei TMW 1.411 during dextran formation, analyzed sucrose metabolic pathways among diverse L. sakei strains via comparative genomics, and finally correlated the obtained results with the general outcome of the proteomic study.

### MATERIALS AND METHODS

#### Chemicals

Chemicals used for growth media and buffers, as well as solutions for dextran quantification and sample preparation, were purchased from Carl Roth GmbH (Karlsruhe, Germany), Merck KGaA (Darmstadt, Germany), and GERBU Biotechnik GmbH (Heidelberg, Germany).

#### Bacterial Strain and Growth Conditions

Lactobacillus sakei TMW 1.411, originally isolated from sauerkraut, was obtained as a cryo-culture from our strain collection. To recover the cells from cryo-cultures, a modified MRS (mMRS) medium (Stolz et al., 1995) supplemented with glucose (5 g/L), fructose (5 g/L), and maltose (10 g/L) was used for the preparation of agar plates (1.5%) and liquid precultures. Incubation was carried out for 36–48 h at 30°C under micro-aerobic conditions in sealed tubes (15 mL; Sarstedt AG & Co, Germany) without shaking. Depending on the experiment, the sugars were replaced by other carbohydrates in the working cultures as indicated. To determine viable cell counts (CFU/mL), 50 µL of appropriate dilutions in saline (0.9% NaCl) were spread on mMRS agar with sterile glass beads (2.7 mm, Carl Roth GmbH, Karlsruhe, Germany) and incubated at 30°C for at least 48 h.
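The viable count follows from the colony number, the plated volume, and the dilution. A minimal sketch of that arithmetic, assuming a single countable plate; the function name `cfu_per_ml` and the colony count in the example are hypothetical and not taken from the study:

```python
def cfu_per_ml(colonies: int, dilution: float, plated_volume_ul: float = 50.0) -> float:
    """Viable count in CFU/mL from one countable plate.

    colonies          -- colonies counted on the plate
    dilution          -- dilution of the plated suspension relative to the
                         culture (e.g. 1e-5 for a 10^-5 dilution)
    plated_volume_ul  -- volume spread on the plate, in microlitres
                         (50 µL as stated in the text)
    """
    if colonies < 0 or dilution <= 0 or plated_volume_ul <= 0:
        raise ValueError("colonies must be >= 0; dilution and volume must be > 0")
    plated_volume_ml = plated_volume_ul / 1000.0
    return colonies / (plated_volume_ml * dilution)


# Hypothetical plate: 65 colonies from 50 µL of a 10^-5 dilution.
print(f"{cfu_per_ml(65, 1e-5):.2e} CFU/mL")
```

In practice, counts from several plates and dilutions would usually be averaged; that step is left out here.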

#### Genome Sequencing and Annotation

High-molecular-weight DNA was isolated from liquid cultures (late exponential growth phase) in mMRS and purified as previously described (Kafka et al., 2017). To obtain the whole genome shotgun sequence (WGS), the Illumina MiSeq® sequencing technology was applied in combination with the SPAdes 3.9 assembly algorithm. Afterwards, the WGS was annotated with PGAP (prokaryotic genome annotation pipeline) and RAST (rapid annotations using subsystems technology), which included SEED subsystem analysis (Aziz et al., 2008; Tatusova et al., 2016). The RAST annotated genome and open reading frames (ORF) are deposited as online supplementary material, whereas the WGS project has been published (DDBJ/ENA/GenBank) under the accession QOSE00000000. The version described in this paper is version QOSE01000000.

#### Proteomic Analysis: Experimental Setup

To investigate the proteomic shift in response to sucrose as sole carbon source, 4 × 15 mL precultures (four biological replicates) of L. sakei TMW 1.411 were prepared in mMRS as described above (section Bacterial Strain and Growth Conditions) and used to inoculate 4 × 100 mL cultures in mMRS (20 g/L glucose) to a final OD<sub>600</sub> of 0.1. The cultures were grown to the mid-exponential growth phase (pH ∼5.0, determined in preliminary experiments), which had given good results in previous experiments (Schott et al., 2017), and subsequently distributed to 50 mL sealed tubes (eight tubes in total). Afterwards, the cultures were pelleted (5,000 × g, 10 min) and washed once in fresh mMRS. Next, the suspensions were pelleted again and resuspended in an equal volume of mMRS supplemented with either glucose or sucrose (20 g/L each), followed by incubation at 30°C for 2 h. Subsequently, 2.5 mL of cooled trichloroacetic acid (100%) were added to 40 mL of glucose/sucrose treated cultures (6.25% w/v final concentration), and the suspensions were immediately transferred to pre-cooled 50 mL tubes and incubated on ice for 10 min. After centrifugation (5,000 × rpm, 10 min, 4°C), the pellets were washed twice with 10 mL cold acetone (−20°C) (2,000 rpm, 10 min, 4°C), and the supernatants were discarded carefully. Finally, the pellets were frozen in liquid nitrogen and stored at −80°C until protein isolation and peptide preparation (section Peptide Preparation, Separation, and Mass Spectrometry). In addition, aliquots were taken from each of the four precultures, as well as from the eight batches after 2 h incubation, to determine pH values and the viable cell count in CFU/mL on agar plates.

#### Peptide Preparation, Separation, and Mass Spectrometry

Cell pellets were resuspended in lysis buffer [8 M urea, 5 mM EDTA disodium salt, 100 mM NH<sub>4</sub>HCO<sub>3</sub>, 1 mM dithiothreitol (DTT) in water, pH = 8.0] and disrupted mechanically using glass beads (G8772, 425–600 µm, Sigma, Germany). A Bradford assay (Bio-Rad Protein Assay, Bio-Rad Laboratories GmbH, Munich, Germany) was performed to determine the total protein concentration in the lysate. Afterwards, 100 µg protein extract of each sample were used for in-solution digestion: after reduction (10 mM DTT, 30°C, 30 min) and carbamidomethylation (55 mM chloroacetamide, 60 min in the dark), trypsin was added to the samples, and the solutions were incubated overnight at 37°C. Next, the digested protein samples were desalted using C18 solid phase extraction with Sep-Pak columns (Waters, WAT054960) following the manufacturer's protocol. Finally, the purified peptide samples were dried with a SpeedVac device and dissolved in an aqueous solution of acetonitrile (2%) and formic acid (0.1%) at a final concentration of 0.25 µg/µL.

Peptide analysis was performed on a Dionex Ultimate 3000 nano LC system, which was coupled to a Q-Exactive HF mass spectrometer (Thermo Scientific, Germany). At first, the peptides were loaded on a trap column (75 µm × 2 cm, self-packed, Reprosil-Pur C18 ODS-3 5 µm resin, Dr. Maisch, Ammerbuch) at a flow rate of 5 µL/min in solvent A<sub>0</sub> (0.1% formic acid in water). Next, the separation was performed on an analytical column (75 µm × 40 cm, self-packed, Reprosil-Gold C18, 3 µm resin, Dr. Maisch, Ammerbuch) at a flow rate of 300 nL/min applying a 120 min linear gradient (4–32%) of solvent B (0.1% formic acid, 5% DMSO in acetonitrile) and solvent A<sub>1</sub> (0.1% formic acid, 5% DMSO in water).

The mass spectrometer was operated in the data-dependent mode to automatically switch between MS and MS/MS acquisition. The MS1 spectra were obtained in a mass-to-charge (m/z) range of 360–1,300 m/z using a maximum injection time of 50 ms and an AGC target value of 3e6. Up to 20 peptide ion precursors were isolated with an isolation window of 1.7 m/z (max. injection time 25 ms, AGC value 1e5), fragmented by higher-energy collisional dissociation (HCD) applying 25% normalized collision energy (NCE), and finally analyzed at a resolution of 15,000 in a scan range from 200 to 2,000 m/z. Singly charged and unassigned precursor ions, as well as charge states >6+, were excluded.
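The precursor selection rules described above can be pictured as a simple filter followed by a top-N pick. The sketch below is an illustration only, not the instrument's acquisition logic: `Precursor` and `select_precursors` are hypothetical names, and real data-dependent acquisition applies further criteria (for example dynamic exclusion) that the text does not specify.

```python
from typing import List, NamedTuple, Optional, Tuple


class Precursor(NamedTuple):
    mz: float               # mass-to-charge ratio seen in the MS1 scan
    intensity: float        # MS1 signal intensity
    charge: Optional[int]   # None means the charge state could not be assigned


def select_precursors(
    candidates: List[Precursor],
    top_n: int = 20,                                # "up to 20 peptide ion precursors"
    min_charge: int = 2,                            # singly charged ions are excluded
    max_charge: int = 6,                            # charge states > 6+ are excluded
    mz_range: Tuple[float, float] = (360.0, 1300.0),  # MS1 scan range
) -> List[Precursor]:
    """Return the most intense eligible precursors of one MS1 scan."""
    eligible = [
        p for p in candidates
        if p.charge is not None
        and min_charge <= p.charge <= max_charge
        and mz_range[0] <= p.mz <= mz_range[1]
    ]
    # Most intense first, truncated to the top-N limit.
    return sorted(eligible, key=lambda p: p.intensity, reverse=True)[:top_n]
```

The default values mirror the settings quoted in the text; everything else about the function is an assumption made for illustration.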

#### Protein Identification and Quantification

Both identification and quantification of peptides and proteins were performed with the software MaxQuant (v. 1.5.7.4) by searching the MS2 data against all protein sequences predicted for the reference genome of L. sakei TMW 1.411 by the RAST annotation pipeline (section Genome Sequencing and Annotation; GenBank QOSE01000000) using the embedded search engine Andromeda (Cox et al., 2011). Carbamidomethylation of cysteine was set as a fixed modification, whereas oxidation of methionine and N-terminal protein acetylation were set as variable modifications. Up to two missed Trypsin/P cleavage sites were allowed, and precursor and fragment ion tolerances were set to 10 and 20 ppm, respectively. Label-free quantification (Cox et al., 2014) and data matching between consecutive analyses were enabled within the MaxQuant software. The search results were filtered with a minimum peptide length of 7 amino acids and a 1% peptide and protein false discovery rate (FDR), together with common contaminants and reverse identifications.
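To make the mass tolerances concrete, the following sketch shows what a tolerance expressed in ppm means for a single m/z comparison. It is not MaxQuant or Andromeda code; the function names and the example m/z values are hypothetical.

```python
PRECURSOR_TOL_PPM = 10.0  # precursor tolerance quoted in the text
FRAGMENT_TOL_PPM = 20.0   # fragment ion tolerance quoted in the text


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass deviation in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz: float, theoretical_mz: float, tol_ppm: float) -> bool:
    """True if the observed m/z lies within +/- tol_ppm of the theoretical m/z."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm


# Hypothetical ion with a theoretical m/z of 500.000:
# an observed value of 500.004 deviates by about 8 ppm,
# an observed value of 500.006 deviates by about 12 ppm.
print(within_tolerance(500.004, 500.0, PRECURSOR_TOL_PPM))
print(within_tolerance(500.006, 500.0, PRECURSOR_TOL_PPM))
print(within_tolerance(500.006, 500.0, FRAGMENT_TOL_PPM))
```

Because the tolerance is relative, the absolute window widens with m/z: 10 ppm corresponds to 0.005 at m/z 500 and to 0.01 at m/z 1,000.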

#### Data Processing and Statistical Analysis

The Perseus software (version 1.6.0.7) was used to process the MaxQuant output file (proteinGroups.txt) and conduct statistical analyses (Tyanova et al., 2016). After filtering of the protein groups (removal of identified-by-site hits, reverse identifications, and contaminants), the LFQ intensity data were log<sub>2</sub>-transformed, whereas the IBAQ intensities were log<sub>10</sub>-transformed. To improve the validity of the statistical analysis, only proteins that had been identified (i) by at least two unique peptides and (ii) in all four replicates of at least one group (glucose/sucrose treated cells) were considered, and missing values after log transformation were imputed from a normal distribution (width: 0.2; down shift: 1.8). The log<sub>2</sub>-transformed LFQ data were used for a stringent t-Test analysis, using a Benjamini-Hochberg FDR of 0.01 for truncation, and proteins with an absolute log<sub>2</sub> fold change (FC) of ≥1 were further discussed in the present study. To estimate absolute protein abundances under a given condition, the transformed IBAQ intensities were averaged and ranked in descending order for each group. The results of the t-Test analysis are available online as **Supplementary Material**.
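The filtering and imputation steps can be sketched in a few lines. This is not Perseus code: the function names are hypothetical, and the sketch assumes the common convention that width and down shift are expressed in units of the standard deviation of the observed values of one sample (one column).

```python
import math
import random
import statistics
from typing import List, Optional, Sequence


def log2_or_none(intensity: Optional[float]) -> Optional[float]:
    """log2-transform an LFQ intensity; zero or missing values become None."""
    if intensity is None or intensity <= 0:
        return None
    return math.log2(intensity)


def passes_filter(unique_peptides: int,
                  group_a: Sequence[Optional[float]],
                  group_b: Sequence[Optional[float]],
                  min_unique: int = 2) -> bool:
    """Criteria (i) and (ii): enough unique peptides, and quantified in
    every replicate of at least one of the two groups."""
    def complete(group: Sequence[Optional[float]]) -> bool:
        return all(v is not None for v in group)
    return unique_peptides >= min_unique and (complete(group_a) or complete(group_b))


def impute_column(values: Sequence[Optional[float]],
                  width: float = 0.2,
                  down_shift: float = 1.8,
                  rng: Optional[random.Random] = None) -> List[float]:
    """Replace missing log-intensities of one sample by draws from a normal
    distribution that is narrowed (width) and shifted downwards (down_shift)
    relative to the observed values. Needs at least two observed values."""
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    return [v if v is not None else rng.gauss(mean - down_shift * sd, width * sd)
            for v in values]
```

The down shift places imputed values in the low-intensity tail, reflecting the assumption that a protein is usually missing because its signal fell below the detection limit.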

#### Monitoring of Growth, Metabolite, and Dextran Formation in mMRS Medium

Precultures in mMRS were prepared from single colonies (two biological replicates) on agar plates as described (section Bacterial Strain and Growth Conditions) and subsequently used to prepare growth series (inoculum: OD<sub>600</sub> = 0.1) in mMRS medium supplemented with sucrose (50 g/L) as sole carbon source. After various incubation times, characteristic growth parameters such as viable cell count (CFU/mL; section Bacterial Strain and Growth Conditions) and pH (Knick 761 Calimatic, Knick, Germany) were determined. Culture supernatants were prepared by centrifugation (5,000 × g, 10 min, 4°C) and subsequently stored at −20°C until metabolite and dextran quantification. Sugar and organic acid concentrations were determined with an HPLC system (Dionex Ultimate 3000, Thermo Fisher Scientific, United States) coupled to a Shodex refractive index (RI) detector (Showa Denko Shodex, Germany), with 20 µL of each prepared sample being injected. For sample preparation, supernatants were either filtered (0.2 µm nylon filters, Phenomenex, Germany) and diluted (analysis of sugars) or treated as follows (analysis of organic acids): 50 µL perchloric acid (70%) were added to 1 mL of supernatant, mixed thoroughly and incubated overnight (4°C). Afterwards, the samples were centrifuged (13,000 × g, 30 min, 4°C) and filtered (0.2 µm). The sugars were measured with a Rezex™ RPM Pb<sup>2+</sup> column at a flow rate of 0.6 mL/min (85°C) using filtered (0.2 µm) deionized water as eluent, whereas organic acids were measured with a Rezex™ ROA H<sup>+</sup> column (both Phenomenex, Germany) at a flow rate of 0.7 mL/min (85°C) with 2.5 mM H<sub>2</sub>SO<sub>4</sub> (prepared with filtered, deionized water). Metabolites were identified and quantified by means of appropriate standard solutions with the Chromeleon™ software (v. 6.8; Dionex, Germany).
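Quantification against standard solutions typically relies on an external calibration line. The sketch below illustrates that idea with an ordinary least-squares fit; it does not reproduce the Chromeleon workflow, the function names are hypothetical, and the standard concentrations and peak areas are invented for the example.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard concentrations must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration(peak_area: float, slope: float, intercept: float,
                  dilution_factor: float = 1.0) -> float:
    """Back-calculate the concentration of the undiluted sample."""
    return (peak_area - intercept) / slope * dilution_factor


# Invented standards: concentrations in g/L and the RI peak areas they gave.
standards_g_per_l = [1.0, 2.0, 5.0, 10.0]
peak_areas = [10.0, 20.0, 50.0, 100.0]
slope, intercept = fit_line(standards_g_per_l, peak_areas)

# Invented sample: peak area 35 measured after a 1:10 dilution.
print(concentration(35.0, slope, intercept, dilution_factor=10.0))
```

The dilution factor matters here because the sugar samples were diluted before injection, whereas the calibration line describes the injected solution.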

Dextran quantification was performed from dialyzed supernatants (3.5 kDa cut-off dialysis tubings, MEMBRA-CEL®, Serva Electrophoresis GmbH, Germany) using the phenol sulfuric acid (PSA) method as described previously (Prechtl et al., 2018). To ensure the reliable removal of sucrose, dialysis was performed over 48 h (4°C) with continuous stirring, and the water (5 L exchange volume, ∼100 × dilution factor) was changed at least five times. As blank value, non-inoculated mMRS medium was dialyzed in the same way, subjected to PSA quantification and finally subtracted from the amount of dextran quantified in the samples.
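The effect of repeated water changes can be estimated with a short calculation. This is an idealized sketch, assuming that small solutes such as sucrose pass the membrane freely and that each exchange reaches full equilibrium; the function name is hypothetical.

```python
def residual_after_dialysis(initial_g_per_l: float,
                            dilution_per_exchange: float = 100.0,
                            exchanges: int = 5) -> float:
    """Concentration of a freely diffusing solute left inside the tubing
    after a number of complete, fully equilibrated water exchanges."""
    if dilution_per_exchange <= 0 or exchanges < 0:
        raise ValueError("dilution must be > 0 and exchanges >= 0")
    return initial_g_per_l / dilution_per_exchange ** exchanges


# Upper bound for the starting sucrose concentration: 50 g/L, as in the medium.
print(residual_after_dialysis(50.0))  # five exchanges at ~100-fold dilution each
```

Under these assumptions, five exchanges correspond to an overall dilution of 100^5, which is why residual sucrose should be negligible compared with the retained high molecular weight dextran. Real dialysis may equilibrate incompletely, so the figure is best read as a theoretical lower limit.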

#### Data Deposition

An additional file containing the protein sequences with corresponding FIG identifiers (from RAST annotation), as well as, relevant proteome tables with assigned SEED categories and the results of the t-Test evaluation are deposited online as supplementary data. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD011417 (http://proteomecentral.proteomexchange.org).

### RESULTS

#### General Features of the Genome of *L. sakei* TMW 1.411

Prior to performing the proteomic experiment, the genome of L. sakei TMW 1.411 was sequenced and annotated as described in the materials and methods section (Genome Sequencing and Annotation).

The obtained WGS of L. sakei TMW 1.411 comprises 41 contigs. A genome size of ca. 1.9 Mb with a GC content of 41.0% was predicted, which is in the usual range of different genome-sequenced strains of L. sakei. Additional information including the genome coverage, the number of predicted coding sequences (CDS) and RNAs is presented in **Table 1**. The contigs seq28, seq32, and seq36 could be circularized due to sequence overlaps, and manual BLASTn analysis of the processed sequences confirmed a high nucleotide sequence identity (90–99%) with known plasmids of L. sakei and L. curvatus species (**Table 2**).

TABLE 1 | General features of the sequenced genome of L. sakei TMW 1.411.


Among these plasmids, seq28 (denoted as plasmid p-1.411\_1) harbored a 5.3 kb ORF encoding a predicted glucansucrase of the glycoside hydrolase (GH) family 70 (gene locus DT321\_09485). The corresponding amino-acid sequence was almost identical (>95% identity and >95% coverage) to those of known dextransucrases (Dsr) in L. sakei and L. curvatus species, such as DsrLS (ac. no. ATN28243) and GtfKg15 (ac. no. AAU08011.1), or Gtf1624 (ac. no. CCK33643) and Dsr11928 (ac. no. AXN36915.1), respectively. The main difference between the amino-acid sequences was the length of an alanine-rich amino acid repeat, which formed a putative linker segment between the GH70 domain and the C-terminal cell-wall anchor motif (LPxTG).

Ca. 14% of all SEED category assignments accounted for the metabolism of carbohydrates, including mono-, di- and oligosaccharides as well as amino sugars and sugar alcohols (**Table S1**). Among these, the genes associated with sucrose and fructose metabolism were considered most relevant for growth of L. sakei TMW 1.411 on sucrose as sole carbon source, since fructose is released by dextransucrases concomitantly with dextran polymerization. In both cases, the corresponding genes were arranged in an operon (**Table 3**).

Manual BLASTn analysis revealed that, among all 38 currently genome-sequenced L. sakei strains, solely L. sakei LK-145 lacks the sucrose operon (**Table S2**), while each deposited L. sakei genome contains the fructose operon. In contrast, only four of these strains carried a dextransucrase gene (L. sakei FLEC01, MFPB19, J112, J156). In all these strains, the genes were encoded on plasmids, whose nucleotide sequences were nearly identical (**Table 2**). Metabolization of sucrose by L. sakei was further confirmed in preliminary growth experiments on agar plates, since all 59 tested L. sakei strains of our strain collection were able to grow on sucrose as sole carbon source. Moreover, eight strains including L. sakei TMW 1.411 produced mucous substances (most likely EPS) from sucrose.

#### Generation and Evaluation of the Proteomic Data Set

A scheme of the chosen approach, which describes the performed experimental steps to investigate the proteomic changes of L. sakei TMW 1.411 after a switch to sucrose as sole carbon source, is presented in **Figure S1**. To evaluate the plausibility of the generated data set and exclude a bias for distinct protein categories during the sequential data filtering steps, the in-silico proteome of L. sakei TMW 1.411 was compared with the protein sub-groups created during data filtering with respect to protein numbers and corresponding SEED categories, which had been derived from RAST annotations (**Figure 1**).

TABLE 2 | WGS sequence contigs of L. sakei TMW 1.411 with assigned plasmid names, identified homologs, and related information.

TABLE 3 | Sucrose and fructose operons of L. sakei TMW 1.411 with associated genes and predicted functions according to RAST annotation.

The gene names were assigned according to the strain L. sakei ssp. sakei 23K. The gene loci refer to the deposited WGS (GenBank ac. no. QOSE00000000), whereas the FIG identifiers refer to the RAST annotation which was used for the evaluation of proteomic data (file is provided as online supplement).

Compared to the in-silico proteome of L. sakei TMW 1.411, which comprised 1912 putative proteins according to the number of coding sequences (CDS) predicted by RAST annotation, 1017 proteins could be identified based on the applied quality criteria (section Data Processing and Statistical Analysis). This resulted in a proteome coverage of ∼53%, which is in the typical range of label-free quantitative proteomics approaches (Liu et al., 2014). To further increase the accuracy of statistical evaluation, only proteins detected in all four replicates of a group (glucose or sucrose) were considered for statistical analysis (section Data Processing and Statistical Analysis). This amounted to a subset of 911 proteins, whose expression levels were compared between both groups. The SEED category distributions were similar for all protein sub-groups (except for the differentially expressed proteins, which are addressed in section Comparison of the Proteomic States Associated With Growth in Glucose and Sucrose). Therefore, a potential bias for any protein category during the data filtering steps could be excluded (**Figure 1B**).
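The coverage figure can be re-derived from the protein counts given above. A small sketch of that arithmetic; the helper `fraction` is a hypothetical name, and only the three counts are taken from the text:

```python
def fraction(part: int, whole: int) -> float:
    """Share of `part` in `whole` as a value between 0 and 1."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


IN_SILICO_PROTEINS = 1912   # CDS predicted by the RAST annotation
IDENTIFIED_PROTEINS = 1017  # proteins passing the quality criteria
TESTED_PROTEINS = 911       # proteins entering the statistical comparison

print(f"identified / in silico: {fraction(IDENTIFIED_PROTEINS, IN_SILICO_PROTEINS):.1%}")
print(f"tested / in silico:     {fraction(TESTED_PROTEINS, IN_SILICO_PROTEINS):.1%}")
print(f"tested / identified:    {fraction(TESTED_PROTEINS, IDENTIFIED_PROTEINS):.1%}")
```

The first ratio evaluates to roughly 53%, matching the coverage stated in the text; the other two show that close to half of the predicted proteome, and about nine in ten of the identified proteins, entered the statistical test.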

Another important measure to ensure the validity of the proteomic experiment was to confirm equal viable cell counts and pH values in both groups (glucose and sucrose) prior to protein isolation, as any variation could have been the result of non-uniform cell growth/death or acidification during the 2 h of incubation, which might have influenced the proteomic profiles as well. Thus, the average viable cell count was determined in both batch groups and the values (glucose: 1.3 ± 0.2 × 10<sup>8</sup> CFU/mL; sucrose: 1.6 ± 0.5 × 10<sup>8</sup> CFU/mL) were demonstrated to be statistically not different (t-Test, p = 0.05). As for the pH values, the cells incubated in sucrose containing mMRS medium (pH = 4.17 ± 0.02) showed a slightly weaker acidification compared to the reference batch in glucose containing mMRS medium (pH = 4.10 ± 0.01). Although the mean difference was only 0.07 pH units, statistical analysis (two sample t-test, **Table S3**) revealed it to be significant (p = 0.01).

#### Comparison of the Proteomic States Associated With Growth in Glucose and Sucrose

To compare the proteomic profiles of the cultures incubated in glucose and sucrose, respectively, the log<sub>2</sub> transformed LFQ intensities of 911 proteins (**Figure 1**) were compared between both groups applying a stringent statistical analysis (t-Test with Benjamini-Hochberg FDR ≤ 0.01, 2.4.4). The results are visualized in a volcano-plot (**Figure 2**).
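For readers unfamiliar with the Benjamini-Hochberg correction named above, the following is a minimal sketch of the standard step-up procedure. The p-values are hypothetical, and the code illustrates the general method only; it is not the software pipeline used in the study.

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure.
# The p-values below are hypothetical and only illustrate the mechanics.

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.0001, 0.001, 0.03, 0.2, 0.5]
q = benjamini_hochberg(p)

# Proteins passing an FDR threshold of 0.01, as applied in the text.
significant = [qi <= 0.01 for qi in q]
print(q)
print(significant)
```

Each q-value is the smallest FDR at which the corresponding test would still be called significant, so thresholding at 0.01 mirrors the criterion stated above.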

At the applied statistical criteria (section Data Processing and Statistical Analysis), 21 proteins were found to be differentially expressed in cells incubated with glucose or sucrose as sole carbon source, of which 16 displayed an absolute log<sub>2</sub> FC of >1 and will be further discussed in this study. As reflected by the SEED category distribution of the differentially expressed proteins (**Figure 1B**), ∼60% of the assigned categories were associated with the metabolism of carbohydrates. This included the gene products of the sucrose and fructose operons, which were up-regulated in sucrose incubated cells. The highest log<sub>2</sub> FC (7.1 and 5.8) were observed for the characteristic enzymes of the sucrose metabolic pathway, namely the PTS sucrose transporter subunit and the sucrose-6-phosphate hydrolase (**Figure 2** and **Table 4**). Interestingly, although being significantly upregulated in the sucrose treated cells, the proteins of the fructose operon showed a relatively high abundance in the glucose treated cells as well, as suggested by the IBAQ intensities (**Figure 3**), which can be used to estimate absolute proteome-wide protein abundances (Schwanhäusser et al., 2011; Ahrné et al., 2013).
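To put the reported log<sub>2</sub> FC values on a linear scale, the short sketch below converts them to fold changes and applies the two selection criteria mentioned above (FDR ≤ 0.01 and an absolute log<sub>2</sub> FC of >1). Function names and the example q-values are assumptions made for illustration.

```python
# Minimal sketch (hypothetical function names): converting log2 fold changes
# to linear fold changes and applying the selection criteria from the text.

def linear_fold(log2_fc):
    """Linear fold change corresponding to a log2 fold change."""
    return 2.0 ** log2_fc

def is_discussed(q_value, log2_fc, fdr=0.01, min_abs_log2_fc=1.0):
    """True if a protein passes both the FDR and the fold-change cutoff."""
    return q_value <= fdr and abs(log2_fc) > min_abs_log2_fc

# The two highest log2 FC values reported in the text.
for value in (7.1, 5.8):
    print(f"log2 FC {value} -> roughly {linear_fold(value):.0f}-fold")

# Hypothetical q-values, used only to show how the criteria combine.
print(is_discussed(q_value=0.001, log2_fc=7.1))   # passes both cutoffs
print(is_discussed(q_value=0.001, log2_fc=0.6))   # significant, but small change
print(is_discussed(q_value=0.04, log2_fc=1.2))    # large enough, but fails FDR
```

A log<sub>2</sub> FC of 1 corresponds to a doubling, so the cutoff of >1 retains only proteins whose abundance changed more than two-fold in either direction.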

Three enzymes associated with the catabolism of deoxynucleosides (**Figure 2**, orange), as well as the trehalose-phosphate hydrolase and a predicted hydrolase, also showed an increased expression in sucrose treated cells. The enzymes of the arginine-deiminase pathway were found to be more abundant in glucose treated cells, which suggested either a sucrose-induced downregulation or a glucose-mediated upregulation (**Figure 2**, red).

Apart from that, the expression levels of the dextransucrase Dsr1411 were compared for both carbon sources, since sucrose is the natural substrate of this enzyme and thus could have a positive impact on its expression. However, this enzyme was not differentially expressed (**Figure 2**, green). Moreover, an evaluation of the IBAQ intensities pointed at relatively high amounts of this enzyme within the cellular proteome irrespective of the present carbon source (**Figure 3**, green).

To demonstrate the validity of the experiment, the t-Test results for the expression of five common housekeeping proteins (GroEL/ES, RpoD, DNA Gyrase Subunits A/B), which was expected to be independent of the present carbon source, were highlighted in the Volcano-Plot (**Figure 2**, black descriptors). Additional **Supplementary Information** about the differentially expressed proteins and a detailed summary of the t-Test evaluation are provided in **Table 4** and as online supplementary data (**Table S3**).

### Monitoring of Sugar Consumption, as Well as, Lactate and Dextran Formation During Growth on Sucrose

The proteomic experiment (3.2 + 3.3) gave insights into the basic response of L. sakei TMW 1.411 to sucrose at an early stage of growth (after 2 h incubation in sucrose containing mMRS). In this way, the differential expression of sucrose-metabolizing pathways could be detected. To further investigate sucrose utilization under common EPS production conditions (Prechtl et al., 2018), metabolite and dextran concentrations were monitored during growth in mMRS medium over 48 h (**Figure 4**).

The CFU of L. sakei TMW 1.411 increased after ca. 6 h, which was accompanied by dextran synthesis, sucrose consumption and lactate formation during the exponential growth phase (**Figure 4A,B**). Fructose was detectable for the first time after 9 h of cultivation and reached a maximum after 10 h (ca. 1.6 g/L; **Figure 4B**). Afterwards, the fructose concentration decreased until it was depleted after 24 h, whereas the sucrose concentration stayed more or less constant at 39 g/L between 10 and 24 h, and finally showed another slight decrease to 37 g/L after 48 h. The fructose concentrations always remained below the dextran concentrations, although the amount of fructose released during dextran synthesis and the amount of dextran produced, expressed in glucose equivalents (glc. equ.), should be stoichiometrically identical if the released fructose reflected the total dextransucrase activity. In total, ca. 10 g/L sucrose were consumed during the 48 h of fermentation, whereas about 3 g/L dextran were produced. Considering the theoretical maximum possible amount of ∼5 g/L dextran (<50% of the consumed sucrose due to one released fructose + water molecule per transferase reaction), this resulted in a dextran yield of roughly 60%. Glucose, which could possibly be released by the hydrolysis activity of dextransucrases, as well as fermentation products such as acetate or ethanol, could not be detected in the culture supernatant.
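The yield estimate above can be retraced with a small mass balance. The sketch assumes that each consumed sucrose molecule could at most contribute one anhydroglucose unit to dextran and that the reported 3 g/L refers to polymer mass; the molar masses are standard values, while the variable names are chosen for illustration only.

```python
# Minimal sketch of the dextran yield estimate, assuming one anhydroglucose
# unit (C6H10O5) is incorporated into dextran per sucrose molecule consumed.

M_SUCROSE = 342.30         # g/mol, C12H22O11
M_ANHYDROGLUCOSE = 162.14  # g/mol, C6H10O5 (glucose unit within the polymer)

def max_dextran(sucrose_consumed_g_per_l):
    """Theoretical maximum dextran (g/L) from the consumed sucrose (g/L)."""
    return sucrose_consumed_g_per_l * M_ANHYDROGLUCOSE / M_SUCROSE

def dextran_yield_percent(dextran_g_per_l, sucrose_consumed_g_per_l):
    """Measured dextran as a percentage of the theoretical maximum."""
    return 100.0 * dextran_g_per_l / max_dextran(sucrose_consumed_g_per_l)

# Approximate values from the text: ca. 10 g/L sucrose consumed, ca. 3 g/L dextran.
theoretical = max_dextran(10.0)
yield_pct = dextran_yield_percent(3.0, 10.0)

print(f"theoretical maximum: {theoretical:.1f} g/L")
print(f"dextran yield: {yield_pct:.0f}%")
```

Under these assumptions the mass fraction is about 47%, which gives a theoretical maximum slightly below 5 g/L and a yield slightly above 60%, consistent with the rounded figures in the text.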

#### DISCUSSION

### Genetic Adaption of *L. sakei* to Plant Environments

Although L. sakei belongs to the dominating species in indigenous meat fermentations, where glucose and ribose are the predominating carbohydrates (Champomier-Vergès et al., 2001), the operons for both sucrose and fructose utilization are strikingly conserved among L. sakei spp. as demonstrated in the present work (section General Features of the Genome of L. sakei TMW 1.411). Furthermore, a predicted glucan-1,6-α-glucosidase (DexB) is located within the sucrose operon, which was demonstrated to hydrolyze isomaltose (degradation of starch) in L. acidophilus (Møller et al., 2012). Since these carbohydrates are commonly available in plants, this suggests an adaption of this species to plant-based environments, including the digestive organs of plant feeding organisms. Thus, it is not surprising that several species have been isolated from fermented plant products (Vogel et al., 1993; McLeod et al., 2011), including L. sakei TMW 1.411, which had originally been isolated from a sauerkraut fermentation.

TABLE 4 | Log<sub>2</sub> FCs, p-values (−log<sub>10</sub>) and related information of differentially expressed proteins (Benjamini-Hochberg FDR ≤ 0.01).

Negative log<sub>2</sub> FC values indicate higher abundance in glucose treated cells, whereas positive values indicate higher abundance in sucrose treated cells. The predicted functions and assigned SEED subsystems were derived from RAST annotation (FIG identifiers). The gene loci refer to the deposited WGS sequence (accession number QOSE00000000).

Furthermore, genomic analyses revealed that several L. sakei strains, including TMW 1.411, harbor a highly homologous ∼11 kb plasmid encoding a cell wall-bound dextransucrase (section General Features of the Genome of L. sakei TMW 1.411; **Table 2**), which uses sucrose as substrate for polymer synthesis. Interestingly, the same plasmid was found in a strain of the closely related species L. curvatus. As dextran formation is responsible for biofilm formation in diverse LAB species (Leathers and Cote, 2008; Walter et al., 2008; Zhu et al., 2009; Leathers and Bischoff, 2011; Nácher-Vázquez et al., 2017a; Fels et al., 2018; Xu et al., 2018), its production could protect L. sakei against desiccation and could enable surface adhesion, providing an advantage in the colonization of plants (Cerning, 1990; Badel et al., 2011; Zannini et al., 2016). In a meat-based environment, however, the expression of dextransucrases is unlikely to provide any advantages, since sucrose is not available and, moreover, only a small fraction of L. sakei strains carries the corresponding plasmid (**Table S2**). Nevertheless, it was reported to be a stably inherited low-copy number plasmid, which was attributed to its repA/B based replication mechanism and a possible toxin-antitoxin system. The dextransucrase gene has been proposed to have integrated through a transposition process (Nácher-Vázquez et al., 2017b). This might once have led to a selective advantage in habitats where sucrose is the predominant carbon source, such as in plants or in, e.g., plant sap sucking insects.

### Sucrose Independent Expression of the Dextransucrase Dsr1411

While in Leuconostoc spp. and Weissella spp. the expression of glucansucrases (e.g., dextransucrases) has been shown to be most often specifically induced by sucrose, many other LAB express glucansucrases independently of sucrose (Kralj et al., 2004; Arsköld et al., 2007; Bounaix et al., 2010; Gänzle and Follador, 2012; Harutoshi, 2013; Nácher-Vázquez et al., 2017b).

The switch of the carbon source from glucose to sucrose performed within the present study neither had an inducing nor any observable stimulating effect on the abundance of the dextransucrase Dsr1411 (**Figure 2**). This clearly suggested a sucrose-independent expression and agreed with previous experiments, where dextran synthesis by L. sakei TMW 1.411 in buffer solutions was possible even if the cells had been previously cultivated in glucose containing mMRS medium (Prechtl et al., 2018). Furthermore, it was demonstrated by gene expression analyses that the plasmid-encoded dextransucrase DsrLS of L. sakei MN1 was constitutively expressed and connected to replication and maintenance functions of the plasmid pMN1 (Nácher-Vázquez et al., 2017b). Since pMN1 and p-1.411\_1 were shown to be more or less identical (**Table 2**), these results should be transferable to the expression of Dsr1411 in L. sakei TMW 1.411.

Apart from that, analysis of the IBAQ intensities of identified proteins suggested a surprisingly high abundance of the dextransucrase Dsr1411 in the cellular proteome, which was even comparable to those of common housekeeping proteins, such as the RNA polymerase sigma factor RpoD (**Figure 3** and **Table S1**). The structural characterization of the dextran produced by Dsr1411 was published only recently by our group (Prechtl et al., 2018), and the results (e.g., molecular weight) were similar to those published for the dextrans synthesized by DsrLS (L. sakei MN1) and Gtf1624 (L. curvatus TMW 1.624) (Ruhmkorf et al., 2012; Nácher-Vázquez et al., 2015).

#### Global Proteomic Response of *L. sakei* TMW 1.411 to Sucrose

After the performed switch to sucrose as sole carbon source, the upregulation of both the sucrose and the fructose operon was detected. The upregulation of the fructose operon, which contains a fructose uptake system, indicates the active uptake of extracellular fructose at an early stage of growth. This could be explained by the simultaneous secretion of active dextransucrases (section Sucrose Independent Expression of the Dextransucrase Dsr1411), which extracellularly release fructose. The corresponding metabolic pathways are described in **Figure 5**.

Furthermore, the glucan-1,6-α-glucosidase DexB was upregulated in the sucrose treated cells, suggesting induced dextran degradation. Although DexB of L. acidophilus NCFM (60% amino acid identity) was demonstrated to be active on dextran (Møller et al., 2012), neither of the two DexB variants contains an N-terminal signal peptide targeting its secretion into the extracellular environment according to SignalP analysis (Petersen et al., 2011). This is consistent with the general assumption that high molecular weight EPS does not primarily serve as a carbon reserve for the producer strains (Zannini et al., 2016), as active uptake of such high molecular weight polymers has not been reported to our knowledge. The upregulation of this enzyme might rather be indicative of uptake and metabolization of short-chain isomaltooligosaccharides (IMO), which could be produced by dextransucrases in addition to high molecular weight dextran. However, their possible import mechanism remains unclear, while it was reported that fructooligosaccharides (FOS) are efficiently imported by the PTS sucrose transport system in L. plantarum (Saulnier et al., 2007).

Apart from the enzymes accounting for the intracellular utilization of sucrose and fructose, three further enzymes were upregulated upon growth on sucrose, which are involved in the catabolism of deoxyribose-nucleosides (**Figure 2**, orange). This pathway includes three major steps: (i) release of 2-deoxyribose-1-phosphate from purine/pyrimidine-deoxynucleosides by the corresponding phosphorylases (EC 2.4.2.1/2.4.2.2); (ii) interconversion of 2-deoxyribose-1-phosphate and 2-deoxyribose-5-phosphate by phosphopentomutase (EC 5.4.2.7); (iii) formation of acetaldehyde and the glycolysis intermediate glyceraldehyde-3-phosphate by deoxyribose-phosphate aldolase (EC 4.1.2.4) (Tozzi et al., 2006). All enzymes of this pathway were significantly upregulated after sucrose treatment (**Figure 2**), except for the phosphopentomutase (gene locus DT321\_08540), whose upregulation (log<sub>2</sub> FC = 1.2) was only significant at less stringent t-Test criteria (FDR ≤ 0.05; **Table S3**). Interestingly, the same enzymes were shown to be upregulated in some L. sakei strains after a switch of the carbon source from glucose to ribose by a transcriptomic approach (McLeod et al., 2011). As the upregulation of these proteins can currently not be related to sucrose metabolism, their differential expression could be interpreted as a general response to the change of the carbon source, e.g., to maintain glycolytic reactions during starvation until the organism has adapted to the new carbon source. However, further experimental analyses are necessary to confirm this hypothesis, since other factors such as glucose mediated carbon catabolite repression (CCR) might play a role as well.

Only three proteins were detected that were more abundant in the presence of glucose (and thus downregulated after sucrose treatment). These proteins belonged to the arginine-deiminase (ADI) pathway (**Figure 2**, red), which involves three enzymes encoded in the arc operon, namely (i) arginine deiminase (arcA, EC 3.5.3.6), (ii) ornithine carbamoyltransferase (arcB, EC 2.1.3.3) and (iii) carbamate kinase (arcC, EC 2.7.2.2). This pathway enables the synthesis of ATP from arginine upon formation of NH<sub>3</sub>, CO<sub>2</sub> and ornithine, and was therefore supposed to provide a metabolic advantage in nutrient-poor, meat-based environments (Rimaux et al., 2011). Apart from that, several other physiological functions of the ADI pathway have been discussed, including de novo pyrimidine synthesis and the cytoplasmic alkalization by NH<sub>3</sub> as protection against acid stress (Arena et al., 1999; Rimaux et al., 2011).

The arc operon has been shown to be subject to CcpA/HPr mediated carbon catabolite repression (CCR), which is initiated at high concentrations of ATP and fructose-1,6-bisphosphate (FBP) in the presence of a preferred carbon source (Montel and Champomier, 1987; Deutscher et al., 1995; Fernández and Zúñiga, 2006; Görke and Stülke, 2008; Landmann et al., 2011). The mechanism involves regulatory cre sites in promoter regions, which are targeted by the CcpA/HPr complex, and both cre sites identified upstream of the arcA gene in L. sakei 23K (Zúñiga et al., 1998) are present in L. sakei TMW 1.411 as well (positions −124 and −44 from the start codon of arcA, gene locus DT321\_05025). However, the ADI pathway should be downregulated in the glucose treated cultures, if glucose were the preferred carbon source of L. sakei TMW 1.411 for energy generation and concomitant lactate production.

With respect to the alkalizing function of the ADI pathway, the pH values measured prior to protein isolation indeed suggested a slightly stronger acidification by the cells incubated in glucose containing mMRS medium (section Comparison of the Proteomic States Associated With Growth in Glucose and Sucrose), which might be explained by the lack of a metabolic switch to sucrose utilization. Although the difference in the pH values of both batches was only small (0.07 pH units), it was statistically significant (p = 0.01) and might point to an increased lactate formation from glucose. Hence, it is possible that the ADI pathway was upregulated in the glucose treated cells to compensate for a faster lactate formation in the presence of this carbohydrate. However, this hypothesis cannot be proven by the available data and has to be examined in future experiments, as it is beyond the scope of the present work.

#### Carbohydrate Utilization of *L. sakei* TMW 1.411 Upon Simultaneous Dextran Formation

Evaluation of the proteomic data had revealed the upregulation of the fructose operon after 2 h of sucrose exposure, which suggested a utilization of this carbon source already at an early stage (section Global Proteomic Response of L. sakei TMW 1.411 to Sucrose). Monitoring of the fructose concentration during growth on sucrose containing mMRS medium confirmed this finding, since fructose was detectable in the supernatant only after 9 h, whereas dextran formation and thus release of fructose had already started after 6 h of cultivation (**Figure 4**). If no fructose utilization had occurred, its concentration curve would have been expected to overlap with the dextran curve due to the stoichiometry of dextran synthesis. However, the concentrations of fructose were lower than the theoretically released amounts at any time during the cultivation. Furthermore, fructose seemed to be the only utilized carbohydrate after 10 h. Its depletion was observed after 24 h, suggesting its preferential use by L. sakei TMW 1.411 compared to sucrose, whose concentration was stagnating within this time period. Since the utilization of fructose requires fewer enzymes to be translated for energy generation than the utilization of sucrose (**Figure 5**), a preferential metabolization of fructose could indeed be energetically beneficial. As a consequence, the constitutive production and secretion of dextransucrases might provide another advantage for a life in a sucrose-dominated environment, as it not only facilitates biofilm formation, but simultaneously provides a favorable carbon source.

GH70: dextransucrase of L. sakei TMW 1.411, glycoside hydrolase 70 family; PEP, phosphoenolpyruvate; PTS II, phosphotransferase-system subunit II; PGI, phosphoglucose-isomerase; PFK, phosphofructokinase.

The induction of the sucrose operon (as suggested by the proteomic data) indicates that L. sakei TMW 1.411 can metabolize this carbon source intracellularly. However, solely the slight decrease of the sucrose concentration between 24 and 48 h points toward an active utilization of sucrose after fructose depletion, as no significant increases in dextran amounts could be detected. It is thus difficult to infer the amount of intracellularly metabolized sucrose in the exponential or early stationary growth phase. Although the calculated dextran yield was only 60% compared to the consumed amount of sucrose, this might have been the result of the intrinsic hydrolysis activity of dextransucrases as well (van Hijum et al., 2006; Leemhuis et al., 2013). However, glucose was never detected throughout the cultivation, which was also observed for the dextran forming strain L. sakei MN1 upon growth on sucrose (Nácher-Vázquez et al., 2017a). Yet, it is most likely that any released glucose was immediately taken up into the cytoplasm and subsequently metabolized. Thus, more detailed experiments will be needed to comprehensively resolve the fine-regulated sucrose utilization of L. sakei TMW 1.411 during dextran formation, whereas transcriptomic analyses might help to resolve complex up-/downregulation events of the operons throughout the cultivation.

#### AUTHOR CONTRIBUTIONS

RP performed and planned the main experimental work presented in this manuscript and wrote the main text of the manuscript. DJ was involved in some experimental work. JB, CL, and BK were involved in conducting and evaluating the proteomic experiments. RV was involved in planning the experimental setup and writing the manuscript. FJ was involved in planning the experimental setup and in writing the manuscript.

#### FUNDING

This work was supported by the German Research Foundation (DFG) and the Technische Universität München within the funding program Open Access Publishing. Part of this work was supported by the German Federal Ministry for Economic Affairs and Energy via the German Federation of Industrial Research Associations (AiF) and the Research Association of the German Food Industry (FEI), project numbers AiF 18357 N and AiF 19690 N.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.02796/full#supplementary-material

Figure S1 | Overview of the experimental steps for the analysis of sucrose-induced changes in the proteomic profile of L. sakei TMW 1.411. This figure is partly based on Figure 1 of Schott et al. (2017).

Table S1 | Proteome tables and IBAQ values.

Table S2 | Operon and dextransucrase conservation.

Table S3 | t-Test evaluation.

Table S4 | RAST annotated ORFs of the L. sakei TMW 1.411 genome.

#### REFERENCES

Lactobacillus paracasei subsp. paracasei F19. J. Proteome Res. 16, 3816–3829. doi: 10.1021/acs.jproteome.7b00474


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Prechtl, Janßen, Behr, Ludwig, Küster, Vogel and Jakob. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Occurrence and Dynamism of Lactic Acid Bacteria in Distinct Ecological Niches: A Multifaceted Functional Health Perspective

Fanny George<sup>1</sup> , Catherine Daniel<sup>2</sup> , Muriel Thomas<sup>3</sup> , Elisabeth Singer<sup>1</sup> , Axel Guilbaud<sup>1</sup> , Frédéric J. Tessier<sup>1</sup> , Anne-Marie Revol-Junelles<sup>4</sup> , Frédéric Borges<sup>4</sup> and Benoît Foligné<sup>1</sup> \*

<sup>1</sup> Université de Lille, Inserm, CHU Lille, U995 – LIRIC – Lille Inflammation Research International Center, Lille, France, <sup>2</sup> Université de Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 – UMR 8204 – CIIL – Center for Infection and Immunity of Lille, Lille, France, <sup>3</sup> Micalis Institute, INRA, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France, <sup>4</sup> Laboratoire d'Ingénierie des Biomolécules, École Nationale Supérieure d'Agronomie et des Industries Alimentaires – Université de Lorraine, Vandœuvre-lès-Nancy, France

#### Edited by:

Jan Kok, University of Groningen, Netherlands

#### Reviewed by:

Cristian Botta, Università degli Studi di Torino, Italy Chiara Montanari, Università degli Studi di Bologna, Italy

> \*Correspondence: Benoît Foligné benoit.foligne@univ-lille.fr

#### Specialty section:

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Received: 28 June 2018 Accepted: 12 November 2018 Published: 27 November 2018

#### Citation:

George F, Daniel C, Thomas M, Singer E, Guilbaud A, Tessier FJ, Revol-Junelles A-M, Borges F and Foligné B (2018) Occurrence and Dynamism of Lactic Acid Bacteria in Distinct Ecological Niches: A Multifaceted Functional Health Perspective. Front. Microbiol. 9:2899. doi: 10.3389/fmicb.2018.02899

Lactic acid bacteria (LAB) are representative members of multiple ecosystems on earth, displaying dynamic interactions within the animal and plant kingdoms and in relation to other microbes. This highly heterogeneous phylogenetic group has coevolved with plants, invertebrates, and vertebrates, establishing either mutualism, symbiosis, commensalism, or even parasitism-like behavior with their hosts. Depending on their location and environmental conditions, LAB can be dominant or sometimes in the minority within ecosystems. Whatever their origins and relative abundance in specific anatomic sites, LAB exhibit multifaceted ecological and functional properties. While some resident LAB permanently inhabit distinct animal mucosal cavities, others are provided by food and may transiently occupy the gastrointestinal tract. It is accepted that the overall gut microbiome has a deep impact on health and diseases. Here, we examined the presence and the physiological role of LAB in the healthy human and several animal microbiomes. Moreover, we also highlighted some dysbiotic states and related consequences for health, considering both the resident and the so-called "transionts" microorganisms. Whether LAB-related health effects act collectively or follow a strain-specificity dogma is also addressed. Besides the highly suggested contribution of LAB to interplay with immune, metabolic, and even brain-axis regulation, the possible involvement of LAB in xenobiotic detoxification processes and metal equilibrium is also tackled. Recent technological developments such as functional metagenomics, metabolomics, high-content screening and the design of in vitro and in vivo experimental models now open new horizons for LAB as markers applied for disease diagnosis, susceptibility, and follow-up. Moreover, identification of general and more specific molecular mechanisms based on antioxidant, antimicrobial, anti-inflammatory, and detoxifying properties of LAB currently extends their selection and promising use, either as probiotics, in traditional and functional foods, for dedicated treatments and mostly for maintenance of normobiosis and homeostasis.

Keywords: lactic acid bacteria, gut microbiota, fermented food, probiotics, ecosystems, ecological niches

## INTRODUCTION


A historical metabolic-based (and somewhat pleonasmic) consensus definition of lactic acid bacteria (LAB) is a broad group of bacteria characterized by the formation of lactic acid as a sole or main (over 50%) end product of carbohydrate utilization. However, from a taxonomic point of view, LAB more strictly correspond to members of the order Lactobacillales. The considerable taxonomic and physiological diversity of LAB representatives is, however, inconvenient when addressing specific ecological niches, roles, and applications of LAB. Indeed, LAB adapt to various conditions and change their metabolism accordingly; they cover a varied range of genera including species of lactobacilli, enterococci, lactococci, pediococci, streptococci, tetragenococci, vagococci, leuconostocs, oenococci, carnobacteria, and weissella. LAB thus constitute a very heterogeneous group and are often misleadingly restricted to lactobacilli only. In contrast, other microbes used in the making of fermented dairy products or claimed as probiotics, such as bifidobacteria, propionibacteria, and even brevibacteria, which belong to the anaerobic actinomycetales, are falsely included in or assimilated to this group. This is partly due to their overlapping habitat or common properties together with unclear species identity.

Recent efforts have been undertaken to identify lactobacilli and related species (Sun et al., 2015). From eighty identified Lactobacillus species 15 years ago (Satokari et al., 2003), the count has now reached more than 200, with continuous new discoveries, e.g., Lactobacillus timonensis (Afouda et al., 2017) or Lactobacillus metriopterae (Chiba et al., 2018), owing to next-generation sequencing (NGS), clustered regularly interspaced short palindromic repeats (CRISPR)-based methods (Sun et al., 2015), and culturomics (Lagier et al., 2015). This has justified the proposal for a reclassification of the genus (Claesson et al., 2008). According to the last taxonomic update of the lactobacilli, a dozen clades have been recently organized (Salvetti et al., 2012). The current knowledge on evolution of the genus Lactobacillus, its environmental niches, and the degree of host specificity was recently completed (Duar et al., 2017). Ecological differentiation of the genus Carnobacterium was recently established based on comparative genomic analysis (Iskandar et al., 2017).

The genus Enterococcus encompasses more than 50 species that can also be found in diverse environments, from the soil to the intestine of animals and humans, including the hospital environment, which raises concerns (Dubin and Pamer, 2014). In addition, enterococci are present as spoilage microflora of processed meats, but, on the other hand, they are important for aroma development and ripening of traditional products such as certain cheeses and sausages (Franz et al., 2003). Depending on their origins and evolution (Lebreton et al., 2013), they may act as both commensals and pathogens, and strains are consequently of clinical importance, harboring virulence factors (Ali et al., 2017) as well as possibly being used as probiotics (Holzapfel et al., 2018).

The various genera, species and even strains of LAB inhabit and cope with specific environments in order to exert dedicated or multiple specific functions according to their structural determinants and/or metabolic pathways. With the exception of enterococci as opportunistic pathogens and the case of pathogenic streptococci, there are only a few reported cases of bacteremia due to LAB (Goldstein et al., 2015; Kamboj et al., 2015). The LAB are regarded as generally recognized as safe (GRAS) because of their ubiquitous use in food and their unique role in the healthy microflora of human mucosal surfaces. To date, 50 LAB strains have obtained the qualified presumption of safety (QPS) status from the European Food Safety Authority (Ricci et al., 2017), again comprising mostly lactobacilli (n = 37). However, enterococci have been classified in risk group 2 in the European Directive 93/88.

The LAB greatly differ in morphology, optimal growth and tolerance temperature, salt and pH tolerance, metabolism, and surface and secreted molecules (Lebeer et al., 2010; Pessione, 2012; Gänzle, 2015). They may secrete effector proteins, produce exopolysaccharides (EPS) and generate biofilms, or adhere to abiotic and/or biological surfaces, depending upon genera, species, or strain. Accordingly, the origins, quantities, diversity, and abundances of LAB strains in complex ecosystems may greatly affect the intended effect(s), product(s), and functionality. As a striking example of broad effects, the presence of psychrotrophic Lactobacillus spp. as the prevailing spoilage organisms in packaged cold-stored meat or fish products is unwanted (Andreevskaya et al., 2018). In contrast, dominance of only a few Lactobacillus species inhabiting human vaginal cavities is essential to maintain a low microbial diversity and prevent further vaginosis (Borges et al., 2014; Vaneechoutte, 2017b). Consequently, LAB, in general, and lactobacilli, in particular, cannot be considered as a whole, and have to be stratified depending on their intrinsic characteristics and potential applications. For instance, up to 25 species are defined as fructophilic LAB that inhabit fructose-rich niches in nature, e.g., flowers, fruits, and fermented foods. These LAB can also be isolated from the gut of bees and flies (Endo, 2012), and the latter have coevolved to determine microbe–host mutualism (Matos and Leulier, 2014). A motile phenotype has been characterized in nearly 15 distinct Lactobacillus species, while motility genes have also been detected in other closely related strains. Motility may both contribute to the selection of particular ecological niches and, through flagellin signaling, sustain the immune potential of such bacteria (Cousin et al., 2015).

It has long been recognized that LAB are fascinating and useful bacteria (Tannock, 2004), and recent advances in technologies now make it possible to address the specific mechanisms of the important bacteria–host interactions in appropriate models. The uses of LAB as biotherapeutic agents and probiotics are based on multiple molecules and modes of action that are currently being deciphered. Here, we present a short overview of the overall occurrence of LAB in the environment and some of their contributions to health, focusing on strain-dependent effects together with consideration of individual hosts and experimental models.

### LAB IN THE ENVIRONMENT AND RAW AND FERMENTED FOODS

The LAB can be found nearly everywhere, although their total load and relative abundance in microbial ecosystems are extremely diverse and depend on the specific environment. The selective pressure exerted by these environments is a key driver of the genomic diversity among LAB strains derived from distinct habitats (McAuliffe, 2018). LAB have been identified in Japanese lakes, with viable cell counts ranging from 1 to 3 log per ml and a clear seasonal variation (Yanagida et al., 2007). LAB could also be isolated from soils following enrichment protocols (Chen et al., 2005). Although soils do not contain large amounts of LAB, they are somewhat more abundant at the plant–soil interface, such as the rhizosphere of some trees and edible fungi. Some of the strains obtained from these niches showed antimicrobial properties (Fhoula et al., 2013). The abundance and diversity of LAB in soils greatly depend on carbon richness, which is greater, e.g., under fruit trees, in soils associated with anthropic activities or free-range farming, and after the use of manure. Notably, many halotolerant LAB, known to be able to survive and even grow in dry environments, can be recovered from soils, and soil acidity may also contribute to their occurrence. It is interesting to note that LAB are also used in agriculture as biofertilizers, safe biocontrol agents for bacterial and fungal phytopathogens, regulators of abiotic plant stress, and biostimulant agents to ameliorate plant growth (recently reviewed by Lamont et al., 2017). Understanding the phytomicrobiome, including the role of LAB therein, is an emerging field with great potential to mitigate plant stress and promote plant resistance and production.

The LAB only represent a subdominant part of the microorganisms of raw vegetables and fruits (between 2 and 4 log cfu per g.), while the autochthonous microbial population varies between 5 and 7 log cfu per g. (Di Cagno et al., 2013); hetero-fermentative and homo-fermentative species belonging to the Leuconostoc, Lactobacillus (mostly L. plantarum), Weissella, Enterococcus, and Pediococcus genera are those most frequently identified as epiphytes within the microbiota, depending on the plant species. We should also consider the low but relevant endophytic LAB representatives when regarding adaptation of LAB to plant niches. Indeed, the capacity to adapt to the intrinsic features of the raw plant matrices and persist stably as endophytes throughout plant phenological stages represents an additional criterion for selecting robust LAB candidates (Filannino et al., 2018). Some LAB found in bakery sourdough are there partly due to contaminating flour and may originate from milling, representing a part of the endophytic microbial community of wheat at very low levels. Moreover, LAB diversity in sourdoughs may also originate from external layers of wheat plant organs (epiphytic) and the bakery environment (the bakers, insects, and pest animals) (Minervini et al., 2015). The ability of LAB to produce organic acids and other antimicrobial substances has made them essential in the preservation of plant-based foods, while they are also the most important microbes promoting significant positive changes in the health-related properties of plant foods. Their metabolism throughout fermentation contributes to lowering some toxic and antinutritional factors and promoting bioavailable bioactive compounds (Di Cagno et al., 2013; Marco et al., 2017; Filannino et al., 2018). Following either spontaneous fermentation or intentional inoculation of food products with LAB as starters, the final LAB load in plant products can reach 8 to 9 log per g. 
Such a high concentration of LAB contributes to the biocontrol of pathogens (Gram-negative bacteria, Listeria monocytogenes) in food as well as to their possible use in probiotic/functional food.

The natural origin of LAB used in traditional fermented foods, such as sauerkraut, pickles, cheese, sausage, fish, fish and soy sauces, sourdough bread, and animal silage, is the corresponding matrix (fruits, cereals, milk, and animal farm environment) or the associated (wooden) material (Lortal et al., 2014). A plant-based origin for dairy lactococci was nicely demonstrated using genome evolution studies (McAuliffe, 2018). Starters and ferments have then been isolated, selected, and domesticated over past centuries to control the fermentation processes and standardize the taste and quality of the final products. The final count of LAB in fermented products is highly diverse and ranges from 4 to 5 log to over 9 log of bacteria, depending on product types, fermentation dynamics, and the overall microbial ecology of the products. Moreover, the abundance of non-LAB bacteria, yeasts, and molds in these complex ecosystems, especially in cheeses, needs to be considered. The microbial diversity of fermented food together with some functional properties, including the contribution of some LAB therein, has been extensively reviewed elsewhere (Tamang et al., 2016a,b; Linares et al., 2017; Marco et al., 2017). The health benefits of fermented foods and the added value of ingesting LAB in order to either prevent or treat some specific diseases are now generally accepted and beyond discussion. Regardless of their traditional or industrial origins, fermented foods are active sources of LAB, among other microbes, that will enter the digestive tract and possibly exert a positive influence (or even a negative one) on the host.

### OCCURRENCE OF LAB IN MUCOSAL NICHES

#### The Human Vaginal Tract as a Unique Example of LAB Dominance and Functionality

Dominance of LAB in the vaginal niche is a characteristic of healthy women, reaching from 70 to up to 90% of resident bacteria, whereas LAB colonization of the genital tract of other mammal species, including primates, is anecdotal and generally below 1% of the relative abundance (Miller et al., 2016). This is unique to humans, where vaginal fluid generally contains 7 to 8 log of lactobacilli per mL, represented by a dozen or so most frequently found species (Borges et al., 2014). However, the normal vaginal flora is usually dominated by one or two of the major species of lactobacilli (Vaneechoutte, 2017b). In addition, occurrence of other rare Lactobacillus species and other LAB representatives such as Pediococcus acidilactici, Weissella spp. (W. kimchi, W. viridescens), Streptococcus anginosus, and Leuconostoc mesenteroides has also been reported. These observations greatly depend on ethnic groups, individuals, and time, according to pregnancy or menopausal status (Borges et al., 2014). Notably, not all Lactobacillus-deficient vaginal microbiotas are adverse (Doyle et al., 2018). The main benefits of lactobacilli in the vaginal sphere are essentially that they have anti-infectious properties and can prevent and target bacterial vaginosis, vaginal candidiasis, and sexually transmitted viral and bacterial pathogens. The normobiosis of the vaginal microbial community is based on low pH maintenance (ranging from 3.5 to 4) due to lactate production following glycogen consumption, which is characteristic of humans and not necessarily found in other animals. Another key factor is hydrogen peroxide production. While 95% of lactobacilli of vaginal origin from healthy women are H2O2 producers, this proportion can drop to 6–20% in the context of vaginosis. Moreover, dominance of L. iners, a non-H2O2 producer, has been associated with poor protection and even considered a risk factor in vaginal dysbiosis (Vaneechoutte, 2017a). However, the specific role of L. iners, which is uncultivable in standard LAB culture media and unable to produce D-lactate, is still under debate (Petrova et al., 2017). Finally, bacteriocin production is also involved in direct killing of microorganisms by lactobacilli in the vagina, although most vaginal LAB isolates do not exhibit bactericidal activity (Spurbeck and Arvidson, 2011). In any case, bacteriocin production is an optional criterion in selecting appropriate strains and sustaining the role of probiotics in maintaining vaginal health (Borges et al., 2014). As a more indirect mechanism, a recent study has demonstrated that the overall Lactobacillus-associated anti-inflammatory properties in the vaginal mucosa could contribute to lower HIV infectiveness (Gosmann et al., 2017). Collectively, only a few LAB species are adapted to dominate the human vaginal cavity in order to preserve health.

#### Heterogeneity of LAB Abundance Within the Gut Microbiota

Very recently, Heeney et al. (2018) raised the question of the importance of Lactobacillus in the intestine of mammals by reviewing the occurrence of LAB in the gastrointestinal tract of distinct vertebrates. In contrast to rodents where LAB can represent 20% to up to 60% of all bacteria, the estimated proportion of LAB in the human proximal small intestine is always subdominant (6%), while their relative abundance in the colon is mostly below 0.5% or non-detectable. LAB have also been detected in substantial amounts in the stomach of humans although the actual numbers are subject to huge variations depending on individuals and their health status, the methodology used, and the type of samples. Depending on genetic backgrounds, the gut microbiota of laboratory mice is highly diverse and the proportion of LAB can range from undetectable to near 100%, ranging from 4.7 to 10.6 log (Friswell et al., 2010; Nguyen et al., 2015). The gut microbiome of wild wood mice comprises an average of 30% of Lactobacillales, although it dramatically changes with the seasons. Indeed, the proportion of Lactobacillus spp. ranges from 60% in spring to 20% in the fall (Maurice et al., 2015). The abundance of LAB in the intestinal microbiota of laboratory mice is highly influenced by the diet and can range from 10% Lactobacillales on a normal fiber control diet to 0.5% with low-fiber diet and 40% with a high-fiber diet (Trompette et al., 2014). Similarly, LAB abundances also vary according to circadian rhythm disturbances (Voigt et al., 2014).

The mode of delivery (vaginal versus C-section) in humans is important, leading, respectively, to high or low levels of Lactobacillales in babies. These differences persist during the first 2 years of life and progressively disappear (Dominguez-Bello et al., 2010). Indeed, lactobacilli and enterococci are first colonizers and dominant bacteria, then become subdominant at toddler and adult ages. A metagenomic study identified nearly 60 distinct species of lactobacilli in human fecal samples, not exceeding 0.04% of all bacteria present, corresponding to near 8.5 log cells per g. from an average bacterial load of 12 log (Rossi et al., 2016). Within this diversity, one or two major species (L. rhamnosus and L. acidophilus) were estimated at 8 log. Of the six most represented species, those estimated from 7.5 log were identified as L. rhamnosus, L. ruminis, L. acidophilus, L. delbrueckii, L. casei, and L. plantarum, while others were detected below 5 log cells per g., close to the detection threshold. In this study, the abundance of enterococci was similar to that of lactobacilli, but commensal streptococci were two to five times more abundant. Turroni et al. (2014) reported dominance of L. gasseri, L. casei, L. namurensis, L. rogosae, and L. murinus in human fecal samples. Clearly, proportions of LAB and even Lactobacillus species are highly diverse and vary between individuals (Booijink et al., 2010). This may explain the difficulty of comparing microbiome data from healthy controls and patients with distinct pathologies when aiming to uncover a possible protective or even deleterious general role of lactobacilli. Many studies investigating the role of the human gut microbiota in distinct immune-mediated inflammatory diseases have reported either increased, decreased, or unchanged levels of lactobacilli (Forbes et al., 2016; Heeney et al., 2018). 
All in all, no consistent marker for any pathology or a healthy state is simply defined by a specific proportion of Lactobacillus, or even based on the follow-up of the Lactobacillus load for an individual in time. Interestingly, gut remodeling due to a restrictive surgery and leading to short bowel syndrome was accompanied by enrichment of lactobacilli (Joly et al., 2010; Boccia et al., 2017), which may explain the D-lactate acidosis in some subgroups of patients (Joly et al., 2010; Mayeur et al., 2016; Boccia et al., 2017). The lactobacilli-enriched microbiota in patients suffering from short bowel syndrome may also favor energy recovery occurring after resection (Gillard et al., 2017). The abundance of enterococci in the human gut, mostly E. faecium and E. faecalis, is estimated between 4 and 6 log bacteria per g. wet weight (Layton et al., 2010). It, thus, represents nearly 1% of the relative abundance, which is 10–100 times higher than that of lactobacilli. Although commensal LAB are somewhat subdominant, they clearly play a role in gut physiology with consequences for health, and qualitative (species and strain specificity) rather than quantitative aspects (genera and family abundance) should be considered. The interplay between LAB and non-LAB within the gut has not been completely revealed yet. For instance, several authors have highlighted the possible indirect role of LAB in increasing the butyrate content in the feces. This capability is attributed to an initial cleavage of fibers in the large intestine into more fermentable compounds, which are then further converted to butyrate by the butyrogenic non-LAB present. Moreover, exogenous LAB, sourced after oral administration of fermented foods or dietary supplements, may reach numbers similar to those of commensals (0.1–1% of the total microbiota present in the gastrointestinal tract, so 5–8 log per g.), depending on dietary habits and geographic areas (Plé et al., 2015). Such transitory or "visiting" living microorganisms, also named pseudocommensals (Rook, 2013) or transionts, cannot durably colonize the host. However, provided these bacteria are regularly or even occasionally ingested in sufficient amounts, they can act as symbionts or pathobionts, modulate the gut microbiota, and exert various health effects.
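The abundance figures in this section alternate between log counts per gram and relative abundance, so it can help to check how the two relate. The following is a minimal sketch, not a method from the cited studies: the function names are illustrative assumptions, and the only inputs taken from the text are the 0.04% share and the 12 log total load reported for fecal lactobacilli (Rossi et al., 2016).

```python
import math


def log_load_from_percent(total_log: float, percent: float) -> float:
    """Log10 count per gram of a taxon, given the total log10 load per gram
    and the taxon's relative abundance in percent (illustrative helper)."""
    if percent <= 0:
        raise ValueError("percent must be positive")
    return total_log + math.log10(percent / 100.0)


def percent_from_log_load(total_log: float, taxon_log: float) -> float:
    """Relative abundance in percent, given total and taxon log10 loads per gram."""
    return 100.0 * 10 ** (taxon_log - total_log)


# Figures quoted in the text: 0.04% of an average load of 12 log per g.
lactobacilli_log = log_load_from_percent(12.0, 0.04)
print(round(lactobacilli_log, 1))  # 8.6, in line with "near 8.5 log" in the text
```

Because the scale is logarithmic, each drop of 1 log corresponds to a tenfold lower relative abundance, which is why taxa that differ by a few log units can differ by orders of magnitude in their share of the microbiota.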

#### The Lung as an Emerging Example of LAB Occurrence and Functionality?

The exploration of the microbiota in the lung is in its infancy when compared with the microbiomes from the gut or the vaginal cavity. Using 16S metagenomics, the microbiota of the respiratory tract has been described since 2010; it is mainly made up of two phyla (Firmicutes and Bacteroidetes) in humans and four phyla (Proteobacteria, Firmicutes, Bacteroidetes, and Actinobacteria) in mice (Singh et al., 2017). The load of lung microbiota is estimated to reach around 10<sup>3</sup>–10<sup>4</sup> cultivable bacteria per gram of lung homogenates in mice (Remot et al., 2017). In humans, it has been estimated as a mean of 3 log bacterial genomes per cm<sup>2</sup> surface in the upper lobe (Hilty et al., 2010). The lung microbiota plays a key role in promoting tolerance (Gollwitzer et al., 2014) and is a determinant of the maintenance of homeostasis. It remains difficult to define a healthy lung microbiota as of yet, due to its high inter-individual variability as a function of age, diet, or environment, but LAB (Streptococcus, Lactobacillus, and Enterococcus) are prominent members and seem to be actors of respiratory symbiosis and health. As an example, the presence of Enterococcus faecalis decreases with the severity of asthma in the human lung microbiota (Turturice et al., 2017) and the decrease of the bacteria improves the outcomes of asthma in a preclinical mouse model (Remot et al., 2017). Nasal administration of Lactobacillus rhamnosus GG protects against influenza virus infection (Harata et al., 2010) and oral supplementation of Lactobacillus spp. also modifies the lung ecosystem. Overall, these data indicate that LAB are major members of the lung microbiota and might be useful in future preventive strategies or new therapeutics for respiratory health.

### REGARDLESS OF THEIR ORIGIN, LAB CAN EXHIBIT MULTIFACETED FUNCTIONAL PROPERTIES

#### Immune Properties

Anti-inflammatory properties of LAB have been extensively studied in rodents and to a lesser extent in humans (Hevia et al., 2015; Papadimitriou et al., 2015). This is attested by many studies supporting the protective role of commensal LAB and the benefits brought by exogenous LAB through nutritional and/or probiotic interventions. The overall anti-inflammatory value of LAB is also evidenced by the rarefaction of lactobacilli in the gut of inflammatory bowel disease patients. Moreover, the drop of lactobacilli observed in the fecal microbiota of aging populations can be related to the low-grade inflammation theory, i.e., 'inflamm-aging' and frailty (van Tongeren et al., 2005; Franceschi et al., 2017). Recently, it was shown that diet-induced exacerbation of experimental colitis is associated with a reduction in Lactobacillus sp. and a lower production of protective short-chain fatty acids, including butyrate (Miranda et al., 2018). Nevertheless, whether the overall rate of LAB or even a minimal threshold of lactobacilli may or may not represent an indicator for health is not yet clear. The prevalence and richness of Lactobacillales and lactobacilli may either increase or decrease depending on immune disease types and studies (Forbes et al., 2016; Heeney et al., 2018), and such changes can be linked to causal events or adaptive processes to counteract the injury. Again, besides genus and species, strain-level attributes also matter when considering immunomodulatory properties of LAB (Sanders, 2007). This has been well documented during the last decades, showing anti-inflammatory probiotic properties of specific lactobacilli based on multiple distinct mechanisms of reducing colitis symptoms in mice (Sanders et al., 2018). For example (non-exhaustive), an L. plantarum strain was shown to be beneficial against inflammation because of a specific teichoic acid structure (Grangette et al., 2005), while the anti-inflammatory effect of a strain of L. 
salivarius was dependent on peptidoglycan (Macho Fernandez et al., 2011). In contrast, the alleviation of colitis was attributed to S-layer proteins of L. acidophilus (Konstantinov et al., 2008), pili for an L. rhamnosus strain (Lebeer et al., 2012), and EPS for another L. plantarum (Górska et al., 2014). In line with these observations, the structurally different EPS from resident lactobacilli generate different immune responses by dendritic cells while, upon gut inflammation, specific bacterial molecular motifs are absent from lactobacilli isolated from IBD (Górska et al., 2016), providing tools for further application based on strain selection (Oleksy and Klewicka, 2018). The positive role of H2O2 production in lowering inflammation has been reported for L. crispatus and L. rhamnosus (Voltan et al., 2008; Lin et al., 2009). The colitis-alleviating property of L. bulgaricus was related to activation of the aryl hydrocarbon receptor pathway in colon cells (Takamura et al., 2011); similarly, the control of inflammation by an L. reuteri intervention was associated with the production of histamine followed by activation of a host epithelial cell receptor (Gao et al., 2015). An L. casei strain inhibited the secretion of the proinflammatory mediator IP-10 protein at the post-translational level (Hoermannsperger et al., 2009); this inhibition is based on the role of a specific protein secreted by the bacteria (von Schillde et al., 2012). Specific bacterial DNA motifs may also drive some of the immune-stimulatory effects in a toll-like receptor 9-dependent manner (Iliev et al., 2008). Interestingly, the distinct immunological activities through TLR5 signaling caused by flagellins isolated from motile lactobacilli are presumed to be a consequence of adaptation to commensalism (Kajikawa et al., 2016). Pro-inflammatory LAB, e.g., L. crispatus, were inconsistently reported in murine models of colitis (Zhou et al., 2012). Although these observations are rare, one has to keep in mind that such undesirable (and embarrassing) results are difficult to publish and that they may be underestimated. LAB other than lactobacilli can also exert immune effects and a strain-dependent impact on the release of pro-inflammatory cytokines and colitis. Strain diversity in such anti-inflammatory properties has been reported for food-derived pediococci and oenococci (Foligné et al., 2010), carnobacteria (Rahman et al., 2014), and enterococci (Wang et al., 2014). Notably, several strains of Enterococcus spp. are marketed as probiotics for many health purposes both in humans and pets, alone or in combination with other LAB and/or bifidobacteria, but most of the Enterococcus species are believed to have no prominent beneficial effect on inflammation or on overall human health.

No single and unique mechanism of anti-inflammatory effects of LAB can be generalized. The overall combination of anti-inflammatory and anti-oxidative properties with the occurrence of some immune-stimulatory signals of a single LAB strain has to be integrated by the host. Moreover, when interpreting the host's health, one also should consider the interplay of these specific molecular players with those of other (non-)LAB strains from the microbiota. Consequently, the anticipated effects of a promising strain can either be boosted or diluted, depending on its microbial microenvironment in the gut. In addition, the contribution of the matrix should not be neglected (Burgain et al., 2014; Lee et al., 2015). Together with a high variability in host immune reactivity, it seems difficult to fully predict the performance of individual LAB strains. It is essential to screen LAB strains with valuable properties (Foligné et al., 2013; Papadimitriou et al., 2015), but integrative endpoints are necessary to fully characterize the consequences of LAB for health; relevant anti-inflammatory strain selection should thus be based on specific mechanisms and how they may interplay within the microbiota (Lebeer et al., 2018).

#### Metabolic Properties

Clear links have been established between gut microbiota, metabolism, and the nutritional status of distinct animals, including farm animals, whose growth performance was empirically boosted either by antibiotics or probiotics for several decades. The key role of gut microbes in the metabolic physiology of the host, throughout evolution, has been demonstrated in Drosophila (Leulier and Royet, 2009), in fish (Egerton et al., 2018), and in rodents and humans (Ley et al., 2006; Gérard, 2016). The molecular mechanisms involve multiple signaling pathways such as microbial production of short-chain fatty acids, the control of epithelial integrity (and endotoxemia), and modulation of chronic inflammation, which have been reviewed elsewhere (Maruvada et al., 2017; Dao and Clément, 2018). Microbiota-derived metabolites can also interplay with the regulation of appetite and satiety. They act particularly on intestinal food intake mediators such as GLP-1, leptin, and ghrelin. In mammals, disruption of the homeostasis of gut microbiota (dysbiosis), resulting from an imbalance of bacterial strains, may induce physio-pathological processes leading to chronic obesity or metabolic disorders such as type 2 diabetes or metabolic syndromes (Carding et al., 2015).

Gnotobiotic rodents have been used to study the health effects on germ-free (axenic) animals of treatment with specific bacterial strains. The glycolytic activity of Streptococcus thermophilus is improved once inside the digestive tract of mono-associated rats (Rul et al., 2011). Colonization of the gut of germ-free mice by microbiota from obese mice significantly increases their total body fat compared with colonization by microbiota from lean mice (Turnbaugh et al., 2006). In addition, inoculation of both obese mice and humans with microbiota from lean mice or humans, respectively, improves symptoms of metabolic syndrome (Vrieze et al., 2012; Kulecka et al., 2016; Ji et al., 2018). Some Lactobacillus species are associated with weight gain, while others are associated with protection against obesity (Drissi et al., 2014). Compared to lean patients with a normal body mass index, abundance of lactobacilli was higher in obese and lower in anorexic individuals (Ley et al., 2005; Armougom et al., 2009). In contrast, a recent study established a relationship between high oral Lactobacillus counts and protection against further weight gain, while a lack or a low level of oral lactobacilli may increase the risk of obesity (Rosing et al., 2017). Higher proportions of lactobacilli were associated with type 2 diabetes (Larsen et al., 2010; Karlsson et al., 2013). Moreover, some lactobacilli were also reported to limit undernutrition and to have growth-promoting effects in mice (Schwarzer et al., 2016), while a strain of L. reuteri could contribute to preventing cachexia (Bindels et al., 2016). Data dealing with the occurrence of Lactobacillus spp. in obesity and type 2 diabetes are thus inconsistent, which is most probably related to their low and variable quantity in the gut microbiota. 
Comparative genomic analyses have shown that Lactobacillus species linked to weight loss had specific arsenals of genes associated with anti-microbial activities such as bacteriocins (Drissi et al., 2014). In contrast, weight gain-associated Lactobacillus spp. harbored enzymes involved in lipid metabolism. Besides the species level, the importance of properties at the strain level was revealed using interventions in experimental rodent models.

Significant research efforts over the recent decades have been devoted to the development of effective treatments for obesity and metabolic disorders, using probiotics to mitigate dysbiosis and its impact on metabolism. Several studies have shown that ingestion of LAB by rodents reduced weight gain and improved the metabolic profile (blood glucose level, insulin, leptin), oxidative stress, and hepatic inflammation in various models such as mice fed with a high fat diet (HFD) (Alard et al., 2016; Park et al., 2017), Lepr<sup>db/db</sup> mice (lacking the functional, full-length Ob-Rb leptin receptor) (Yun et al., 2009), streptozotocin (STZ) induced diabetic mice fed an HFD (Pei et al., 2014), STZ-diabetic rats (Tabuchi et al., 2003), and rats fed with a diet high in fructose (Hsieh et al., 2013). These experiments mostly underlined the role of L. rhamnosus, L. plantarum, L. gasseri, L. casei, L. mali, L. fermentum, and L. reuteri strains alone or in combination with other strains. Again, not all lactobacilli are able to control obesity and adiposity and some, e.g., an L. salivarius strain known as a probiotic for its anti-inflammatory properties in mice, could not alleviate diet-induced obesity and insulin resistance, while a strain of L. rhamnosus did (Alard et al., 2016). Similar results have also been observed in clinical trials. For example, glycemia and cholesterol levels were reduced in elderly subjects after a month of daily consumption of a combination of L. acidophilus and B. bifidum strains (Moroti et al., 2012), while a reduction of oxidative stress and glycemia levels was observed in type 2 diabetes patients after 6 weeks of daily consumption of a probiotic dairy product containing L. acidophilus and B. lactis (Ejtahed et al., 2012). The beneficial effects of various species of lactobacilli on obesity throughout clinical interventions have not yet been demonstrated or reviewed (Crovesy et al., 2017). One possible explanation for the alleviation of metabolic disorders is that probiotics, and especially LAB, could reduce the absorption and conversion of food into useable energy and subsequent fat storage, resulting in anti-obesity, anti-inflammatory, and anti-diabetic effects in mice and humans (Kerry et al., 2018). The current molecular hypothesis of anti-obesity mechanisms involves the specific role of short-chain fatty acids and the host cell receptors FFAR2 and FFAR3, the contribution of bile-salt hydrolases from LAB, and further bile signature signaling of FXR and TGR5 receptors as well as LAB metabolites as antagonists of AhR (Lamas et al., 2016). However, no clear explanation has been reached yet.

In addition, LAB strains that either lower pathogen or even pathobiont numbers, those that strengthen the gut barrier and reduce LPS-endotoxemia, and the anti-inflammatory strains are also valuable candidates against obesity.

Lactic acid bacteria are also able to produce conjugated linoleic acids and gamma-aminobutyric acid (GABA), and may contribute to signaling to the neuroendocrine and vascular systems. More studies are necessary to determine the best strains, optimal dose, and treatment time to achieve beneficial outcomes for obesity, type 2 diabetes, and non-alcoholic fatty liver disease (NAFLD), and to decipher the corresponding mechanism(s). Collectively, those data should shed light on selected probiotic strains as important tools to prevent and treat metabolic disorders and cardiovascular diseases.

#### Antimicrobial Properties

Besides the widely known lactic acid and H2O2-mediated antibacterial, anti-viral, and anti-fungal properties of LAB, specific bacteriocin-based mechanisms can control bacterial growth in distinct environments. Bacteriocins are ribosomally synthesized peptides or proteins that exhibit bactericidal or bacteriostatic activities (Leroy and De Vuyst, 2004; Cotter et al., 2005). The classification of bacteriocins is a long-standing matter of debate. Lately, a large scientific consortium proposed a new classification in which post-translationally modified bacteriocins of less than 10 kDa are considered as Ribosomally synthesized and Post-translationally modified Peptides (RiPPs) (Arnison et al., 2013). This new classification was slightly modified to embrace all bacteriocins, leading to three classes, where class I are RiPPs, class II are unmodified bacteriocins (less than 10 kDa), and class III are large bacteriocins (Alvarez-Sieiro et al., 2016).
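The three-class scheme described above can be written as a simple decision rule. The sketch below is only an illustration of that rule as stated in the text: the function name and parameters are hypothetical, and the handling of a peptide of exactly 10 kDa, or of a large modified protein, is an assumption, since the text does not address these cases.

```python
def bacteriocin_class(mass_kda: float, post_translationally_modified: bool) -> str:
    """Assign a bacteriocin to class I, II, or III using the 10 kDa threshold
    and modification status described in the text (illustrative rule only)."""
    if mass_kda <= 0:
        raise ValueError("mass_kda must be positive")
    # Assumption: anything at or above 10 kDa is treated as a large (class III)
    # bacteriocin, regardless of modification status.
    if mass_kda >= 10.0:
        return "III"
    # Below 10 kDa: modified peptides are RiPPs (class I), unmodified are class II.
    return "I" if post_translationally_modified else "II"


# Hypothetical inputs, not measurements from the cited studies.
print(bacteriocin_class(3.5, True))    # I
print(bacteriocin_class(4.5, False))   # II
print(bacteriocin_class(30.0, False))  # III
```

In practice, class assignment relies on biochemical and genetic characterization rather than on mass alone; the rule above only mirrors the size and modification criteria summarized in this paragraph.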

Bacteriocins are found in almost every examined taxon. Their likely ubiquitous nature suggests they play a major role in shaping bacterial communities: they may serve as anticompetitors preventing invasion of a bacterium into an established community and could reciprocally allow a bacterium to invade a community (Riley and Wertz, 2002). Intuitively, their activity should lead to diversity reduction; however, experimental and modeling data suggest that under certain circumstances, bacteriocin production can promote diversity, provided that taxonomic diversity is mirrored by bacteriocin-encoding gene diversity (Abrudan et al., 2012).

There are three major established or future applications of bacteriocins. They can be used as biopreservative agents, as probiotic-promoting factors, and as antibiotics. Although two bacteriocins, nisin and pediocin, are allowed as food preservatives, bacteriocins are mainly used indirectly through producer strains. These strains are included in ready-to-eat food in order to produce bacteriocins in situ and are mainly used to target Listeria monocytogenes (Alvarez-Sieiro et al., 2016). Bacteriocin-producing LAB are also promising probiotic candidates for humans and animals (Dobson et al., 2012). Proof of concept was demonstrated by showing that bacteriocin production by L. salivarius UCC118 protects mice from L. monocytogenes infection (Corr et al., 2007). Several studies revealed that bacteriocin production results in changes in the gut microbiota structures (Murphy et al., 2013; Umu et al., 2016), leading to the idea of using bacteriocins as tools for targeted manipulation of gut microbiota (Murphy et al., 2013).

Besides biopreservation and probiotic-promoting factors, bacteriocins are considered as a means to fight against emerging multidrug resistant pathogens (Cotter et al., 2013). A promising strategy consists in combining the use of bacteriocin with other antimicrobials to reduce the frequency of resistant variants appearing and/or to increase the antimicrobial potency (Mathur et al., 2017).

All these possible applications fuel active research aimed at identifying new bacteriocins. Despite the extensive literature describing bacteriocins, new ones with novel structures and unique activities are still being discovered. The challenge now is to design strategies that avoid rediscovering already reported bacteriocins. The traditional approach, in which a strain exhibiting antagonistic activity is isolated and the active compound is purified prior to characterization, is time-consuming, expensive, and tedious. Therefore, new workflows have been developed that include steps dedicated to assessing the novelty of bacteriocin candidates. One strategy combines liquid chromatography/mass spectrometry with principal component analyses of the antimicrobial spectrum of each bacteriocin-producing LAB strain (Perez et al., 2014). There are also in silico strategies based on genome mining that allow the identification of candidate genes specifying putative bacteriocin proteins with novel structures (Collins et al., 2017).

#### Detoxifying Properties of LAB

The LAB, among other bacteria, have been suggested as tools for detoxification of several xenobiotics and pollutants such as pesticides, toxins, and heavy metals. Whereas bioremediation has been actively used in the environmental industry for many years, in vivo applications of LAB to reduce contaminant bioavailability are only in the early stages of development. Proof of concept has already been established, however, using distinct animal models and pilot clinical studies (Trinder et al., 2015; Wang Y.S. et al., 2016). For example, specific Lactobacillus spp. and Leuconostoc isolated from kimchi and possessing an organophosphorus hydrolase gene were able to degrade pesticides such as chlorpyrifos and parathion (Islam et al., 2010). In this context, resistance to insecticides (chlorpyrifos, fipronil) was associated with the presence of Lactobacillales in the midgut of insects (Xia et al., 2013). Recent data support the use of an L. rhamnosus strain to reduce absorption and subsequent toxicity of organophosphate pesticides and neonicotinoids to Drosophila melanogaster (Trinder et al., 2016; Daisley et al., 2017, 2018). These encouraging results might be helpful in the fight against the current worldwide environmental threat of bees dying from these toxic compounds. They also suggest further examining the supplementation of human, livestock, or apiary foods with selected probiotic microorganisms.

Lactic acid bacteria have also shown potential in mitigating toxic effects of distinct mycotoxins such as aflatoxins (B1, F1 and M1), patulin, ochratoxin A, and deoxynivalenol in food and feed (Fuchs et al., 2008; Ahlberg et al., 2015). Certain Pediococcus spp. have mycotoxin-adsorbing and -degrading properties (Martinez et al., 2017). Reduced aflatoxin bioavailability was reported in rats (Hernandez-Mendoza et al., 2011; Nikbakht Nasrabadi et al., 2013) and in mice (Jebali et al., 2015), but no clinical data are yet available on the use of LAB as detoxifying probiotics in humans. Other toxic substances can also be handled by LAB. The acrylamide-binding ability of LAB was found to be both concentration- and strain-dependent in vitro (Serrano-Nino et al., 2014) as well as in a gastric digestion simulator (Rivas-Jimenez et al., 2016); the binding was mostly based on the teichoic acid properties of specific strains. These results are promising for further actions and research to reduce exposure to, and bioavailability of, such carcinogenic compounds.

Bioremediation of heavy metals (HMs) and other hazardous and toxic metals such as chromium (CrVI) and aluminum, and, to a lesser extent, copper and cobalt, is challenging. Bioremediation is based on specific capacities of microorganisms to immobilize and/or inactivate pollutants by various passive (biosorption, complexation) or active transport (internalization, efflux/uptake ratio) mechanisms, including distinct bioprocesses (oxido-reduction, demethylation, e.g., for CH<sub>3</sub>-mercury) to lower HM bioavailability. Efforts have mostly focused on LAB and related species, and these have shown that selected lactobacilli (and bifidobacteria) can sequester lead (Pb) and cadmium (Cd) divalent cations and mercury (Hg) salts (Ibrahim et al., 2006; Halttunen et al., 2007, 2008; Bhakta et al., 2012). A key point is the huge variation in binding capacity among LAB species and even strains (Kinoshita et al., 2013). This variability is partly caused by the high diversity of ionisable compounds on the surface of Gram-positive bacteria (Lebeer et al., 2010). For example, EPS and teichoic acids comprise negatively charged groups (hydroxyl, carboxyl, phosphate) as potential ligands for divalent metal cations (such as Pb<sup>2+</sup>, Cd<sup>2+</sup>, Hg<sup>2+</sup>). In contrast, peptidoglycan and a mosaic of specific surface proteins (e.g., S-layer proteins) have positive charges, which can explain affinity toward arsenate As(V) (Schär-Zammaretti and Ubbink, 2003). Over the last decade, bioremediation of HMs and tolerance of humans to HMs through microbial processes using food-grade microorganisms have been highlighted (Monachese et al., 2012; Kumar et al., 2018). This concept can also be applied in vivo, inside the digestive tract. Commensal intestinal microorganisms play a positive role in the interactions with HMs, as demonstrated by us and others (Nakamura et al., 1977; Breton et al., 2013). Provided that microbial species meet criteria for safe dietary use, transitory microorganisms may do at least as well as, or even better than, the resident flora (Plé et al., 2015). Various selected food microbes can thus prevent the absorption of HMs by the body and remove them upon defecation. Promising proof of concept of efficiency was demonstrated in preclinical models of acute and chronic HM toxicity in mice for cadmium (Zhai et al., 2013, 2014), chromium (Younan et al., 2016), and aluminum (Yu et al., 2017). Likewise, a first clinical pilot study showed reduced circulating levels of toxic HMs in pregnant women and children living in a contaminated area: consumption of a yogurt supplemented with a selected probiotic L. rhamnosus strain reduced the bioaccumulation of mercury and arsenic (Bisanz et al., 2014). This work clearly illustrates the promising concept of using probiotics in a nutritional strategy against xenobiotics.

### Other Attributes and Emerging Properties of LAB

Opportunities to use LAB for health are almost unlimited, as the role of the intestinal microbiota has now been clearly highlighted in various diseases and pathologies, including those affecting organs at distant sites, cancers, and neurologic and psychological disorders. For example, abdominal hyperalgesia and pain can be controlled by specific strains of lactobacilli that target nociceptive signals, and these strains have been proposed for the treatment of irritable bowel syndrome (Ringel-Kulka et al., 2014; Perez-Burgos et al., 2015). Interestingly, in the 2014 clinical study, an L. acidophilus strain was clinically effective but failed to attenuate pain in patients when co-administered with another bacterial strain (Bifidobacterium lactis). The results suggest a diversion effect, and caution should be taken when using mixtures of strains. In cancer, a higher enterococci count was correlated with a lower risk of colorectal cancer development. Some reports showed an inverse correlation of fecal enterococci with colon adenomas (Kawano et al., 2018), while another study suggested a deleterious role of E. faecalis in colorectal cancer development (Huycke et al., 2002). Selected LAB, such as Enterococcus hirae, can also be used to boost cancer chemotherapy (Daillère et al., 2016), while appropriate strains may also play a role in immune cell regulation and exhibit anti-oxidative and antigenotoxic effects.

Other examples of innovative use of LAB as probiotics are in modulating central nervous system functions (Wang H. et al., 2016) and behavioral symptoms such as chronic fatigue, depression, and anxiety (Slykerman et al., 2017), although clinical data are not yet fully convincing (Pirbaglou et al., 2016; Ng et al., 2018). The results have led to the definition of 'Psychobiotics' as a novel class of psychotropic treatments employing bacteria with neuroendocrine and behavioral properties (Sarkar et al., 2016). Functional magnetic resonance imaging has been used for the first time to measure brain activation triggered by probiotic LAB (Bagga et al., 2018). A deeper understanding of the relationships between LAB (resident in the gut or ingested) and the host, including whether such relationships really exist and what their mechanisms are, is required to develop microbial-based therapeutic strategies for brain disorders. Indeed, emerging applications are still under examination, partly because of the lack of consistent studies and appropriate study designs and the difficulty of selecting the proper strain(s).

### GENERAL CONCLUSION

The LAB are multifaceted microorganisms that have existed on earth for several millions of years, with tens of thousands of years of shared history with animals and humans. They have been used for the production of fermented foods for centuries, and more or less actively developed as probiotics for several decades (**Figure 1**). LAB may be an important part of the health concept in livestock rearing and in food and feed production. An outstanding effort has been made in recent years, using extended omics approaches, to build the knowledge and tools needed to elucidate the contribution of LAB to health and disease. We are now facing the daunting task of integrating all this information for general application as well as for individual use ("my personalized LAB"). Indeed, individual LAB can contribute to inducing either Th1 or Th2 immune responses; they may also induce specific or non-specific regulatory T cells, which may or may not be required by the host. Similarly, LAB have the potential to favor either weight loss or weight gain. LAB abundance is diminished in some diseases and increased in others. The central role of LAB within the microbiota, providing antimicrobials, also raises the question of the control of ecological niches, which can be advantageous or not (Berstad et al., 2016; Hegarty et al., 2016). Berstad et al. argued that 'we should stop thinking of LAB as always being friendly.' Indeed, few data exist on the long-term impact of LAB, considering their possible capacity to destabilize the microbiota and the "paradox" of using them empirically in multiple pathologies and combined (metabolic, immune, psychological) disorders. Nevertheless, LAB are our obligate partners and we have to cope with these microorganisms. Dissociating the common and specific interactions of LAB strains, species, and genera within the microbiota of which they are part is still necessary to identify the regulatory mechanisms involved in distinct organs, systems, and hosts. Integrative systems biology approaches are clearly required to achieve the ultimate goal of applying LAB for personalized medicine. This comprises applying omics technologies to the LAB as well as to the host, including foodomics and nutrigenomics (Kussmann and Van Bladeren, 2011; Bordoni and Capozzi, 2014), together with appropriate basic and integrative models and tools (Fritz et al., 2013; Daniel et al., 2015; Papadimitriou et al., 2015) to appraise the overall functionality of LAB.

#### AUTHOR CONTRIBUTIONS

BF, FG, MT, and FB prepared the manuscript and all co-authors contributed to editing and critical reviewing thereof.

#### ACKNOWLEDGMENTS

We are grateful for the support of the DigestScience Foundation, whose aim is to encourage the research dedicated to digestive diseases and nutrition. We would also like to thank Mr. Basile Laqueteaux for his help.







**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 George, Daniel, Thomas, Singer, Guilbaud, Tessier, Revol-Junelles, Borges and Foligné. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Lactobacillus Dominate in the Intestine of Atlantic Salmon Fed Dietary Probiotics

Shruti Gupta<sup>1</sup>, Adriána Fečkaninová<sup>2</sup>, Jep Lokesh<sup>1</sup>, Jana Koščová<sup>3</sup>, Mette Sørensen<sup>1</sup>, Jorge Fernandes<sup>1</sup> and Viswanath Kiron<sup>1</sup>\*

<sup>1</sup> Faculty of Biosciences and Aquaculture, Nord University, Bodø, Norway, <sup>2</sup> Department of Food Hygiene and Technology, University of Veterinary Medicine and Pharmacy in Košice, Košice, Slovakia, <sup>3</sup> Department of Microbiology and Immunology, University of Veterinary Medicine and Pharmacy in Košice, Košice, Slovakia

Probiotics, the live microbial strains incorporated as dietary supplements, are known to provide health benefits to the host. These live microbes manipulate the gut microbial community by suppressing the growth of certain intestinal microbes while enhancing the establishment of others. Lactic acid bacteria (LAB) have been widely studied as probiotics; in this study we have elucidated the effects of two fish-derived LAB types (RII and RIII) on the distal intestinal microbial communities of Atlantic salmon (Salmo salar). We employed high-throughput 16S rRNA gene amplicon sequencing to investigate the bacterial communities in the distal intestinal content and mucus of Atlantic salmon fed diets coated with the LAB or diets without added microbes. Our results show that supplementation with the microbes shifts the intestinal microbial profile differentially. LAB supplementation did not cause any significant alterations in the alpha diversity of the intestinal content bacteria, but RIII feeding increased the bacterial diversity in the intestinal mucus of the fish. Beta diversity analysis revealed significant differences between the bacterial compositions of the control and LAB-fed groups. Lactobacillus was the dominant genus in LAB-fed fish. A few members of the phyla Tenericutes, Proteobacteria, Actinobacteria, and Spirochaetes were also found to be abundant in the LAB-fed groups. Furthermore, the bacterial association network analysis showed that the co-occurrence patterns of bacteria differed among the three study groups. Dietary probiotics can modulate the composition and interaction of the intestinal microbiota of Atlantic salmon.

Keywords: fish, Salmo salar, feed additive, probiotics, intestinal bacteria, Lactobacillus, microbiota, amplicon sequencing

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Atte Von Wright, University of Eastern Finland, Finland; Carmen Wacher, National Autonomous University of Mexico, Mexico

> \*Correspondence: Viswanath Kiron kiron.viswanath@nord.no

#### Specialty section:

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Received: 06 October 2018; Accepted: 14 December 2018; Published: 11 January 2019

#### Citation:

Gupta S, Fečkaninová A, Lokesh J, Koščová J, Sørensen M, Fernandes J and Kiron V (2019) Lactobacillus Dominate in the Intestine of Atlantic Salmon Fed Dietary Probiotics. Front. Microbiol. 9:3247. doi: 10.3389/fmicb.2018.03247

### INTRODUCTION

The ecological community of microorganisms that reside (Marchesi and Ravel, 2015) in the gastrointestinal tract (GIT) of an organism is referred to as the gut microbiota (Lozupone et al., 2012). The GIT of a healthy human harbors a dense (Kelsen and Wu, 2012; Marchesi et al., 2016) and diverse population (Lozupone et al., 2012) of commensal microorganisms, which offer many benefits to the host, including immune homeostasis and health maintenance (Sommer and Bäckhed, 2013). These commensal gut bacteria are also known to aid in amino-acid production (Lin et al., 2017), nutrient metabolism and absorption (Morowitz et al., 2011; Semova et al., 2012), vitamin and bioactive metabolite synthesis (Cummings and Macfarlane, 1997; LeBlanc et al., 2013), and pathogen displacement (Kamada et al., 2013). An imbalance in the gastrointestinal microbial composition can lead to immune-mediated diseases (Petersen and Round, 2014). A healthy gut bacterial assembly is essential for the well-being of host organisms, including fish, the microbiome of which is shaped by environment- and host-related factors (Wong and Rawls, 2012; Eichmiller et al., 2016; Lokesh et al., 2018).

Probiotics are "living bacteria" that, when administered as supplements in the right amount, can confer health benefits to humans (FAO and WHO, 2006) by targeting, among other things, intestinal health through stimulation of intestinal epithelial cell proliferation and differentiation, fortification of the intestinal barrier, and immunomodulation (Gareau et al., 2010; Thomas and Versalovic, 2010; Hemarajata and Versalovic, 2013). Probiotics also have both direct and indirect effects on the intestinal microbial composition and diversity, and on global host metabolic functions (Scott et al., 2015). These live bacteria produce antimicrobial compounds that suppress the growth of other microorganisms, and they compete with those microorganisms for receptors and binding sites (Spinler et al., 2008; O'Shea et al., 2012), thus altering the gut microbiota (Collado et al., 2007). Members of the genera Lactobacillus and Bifidobacterium are the most commonly used probiotic organisms for humans (O'Toole and Cooney, 2008).

Lactic acid bacteria (LAB) maintain intestinal health by producing lactic acid that can be utilized by short-chain fatty acid (SCFA)-producing microorganisms. SCFAs (particularly acetate, propionate and butyrate) contribute to host health maintenance; for example, butyrate is used as an energy source by the intestinal epithelial cells and also has anti-inflammatory effects on the host cells (Louis et al., 2014). LAB that are generally found in the GIT of endothermic animals have been extensively investigated, and their benefits have been reviewed by many researchers (Pavan et al., 2003; Masood et al., 2011; Yang et al., 2015; Karamese et al., 2016). The importance of fish gut-dwelling LAB in aquaculture has been described in other reviews (Ringø and Gatesoupe, 1998; Gatesoupe, 2008). Lactobacillus that colonize the intestinal regions of fish are able to evoke immune responses and impart protection against diseases (He et al., 2017).

Feeding diets supplemented with beneficial bacteria such as LAB is being considered as an alternative approach to control diseases in farmed fish (Martínez Cruz et al., 2012; Fečkaninová et al., 2017; Rodriguez-Nogales et al., 2017). Few studies in fish have employed high-throughput sequencing techniques to understand the changes in bacterial communities following LAB feeding. In this study, we examined the ability of Lactobacillus to modulate the distal intestinal microbiota of Atlantic salmon, a farmed salmonid fish. In addition, we describe the differences in the topology of co-occurrence networks associated with the intestinal bacteria of Atlantic salmon offered feeds with and without Lactobacillus.

## MATERIALS AND METHODS

#### Ethics Statements

This study was approved by the Norwegian Animal Research Authority, FDU (Forsøksdyrutvalget ID-7898). Fish handling and sampling procedures were in compliance with the description in LOVDATA. The rearing water was treated with UV rays to remove substances that could be harmful to the fish. Optimum values for water salinity, oxygen and nitrogen concentration were maintained in the rearing tanks. The temperature of the fish rearing hall was kept stable during the entire feeding experiment.

### Test Probiotics, Feed Type, and Design

Two species of Lactobacillus (RII and RIII) that were previously isolated from the intestinal content of healthy farmed juvenile rainbow trout (commercial fish farm, Rybárstvo Požehy s.r.o., Slovak Republic) were employed in this study. Antimicrobial susceptibility of the microorganisms was assessed based on the "Guidance on the assessment of bacterial susceptibility to antimicrobials of human and veterinary importance" provided by the European Food Safety Authority. Sensitivity or intrinsic resistance of the isolated organisms to a recommended set of antibiotics makes them safe for use as probiotics in aquaculture. Both RII and RIII showed antagonistic activity against the salmonid pathogens Aeromonas salmonicida subsp. salmonicida CCM 1307 and Yersinia ruckeri CCM 6093 (Fečkaninová, 2017). Furthermore, a high level of tolerance to different pH values, bile, and temperatures, as well as high growth properties, of the two species was confirmed through in vitro studies (Fečkaninová, 2017). The test probiotics were coated on commercial salmon feeds. Briefly, a pure culture of each probiotic bacterium, grown on de Man, Rogosa and Sharpe (MRS) agar plates (HiMedia, Mumbai, India) for 48 h, was inoculated into 1,000 ml of MRS broth and incubated for 18 h at 37°C. The culture was centrifuged at 4,500 rpm for 20 min at 4°C in a cooling centrifuge (Universal 320 R, Hettich, Germany). The resulting cell pellets were washed twice and resuspended in 30 ml of 0.9% (w/v) sterile saline. The feed (Spirit Supreme, Skretting AS, Norway; batches of 1,800 g) was thoroughly coated with the bacterial suspensions using a vacuum coater (Rotating Vacuum Coater F-6-RVC, Forberg International AS, Norway). The bacterial counts on the feeds were ∼10<sup>8</sup> cells g<sup>−1</sup> (RII/RIII), as determined by spread plating on MRS agar plates and incubating for 48 h at 37°C. The control feeds were coated with 0.9% sterile saline alone. The coated feeds were stored at 4°C until they were offered to Atlantic salmon.

### Experimental Fish, Feeding Regime, and Environmental Parameters

Atlantic salmon of average weight 522 ± 68 g were maintained in 800 L tanks in a flow-through seawater system, described earlier in Sørensen et al. (2017). A 20-day feeding trial was conducted at the research station, Nord University, Bodø, Norway. Three groups of fish (n = 45 fish/tank; 3 replicate tanks per group) were offered feeds with probiotics (RII at ∼10<sup>8</sup> cells g<sup>−1</sup>, group RII; RIII at ∼10<sup>8</sup> cells g<sup>−1</sup>, group RIII) or without probiotics (Control, group C). The fish were fed ad libitum; the feeds were dispensed twice a day, between 08.00–09.00 and 14.00–15.00, using automatic feeders (Arvo-Teck, Huutokoski, Finland). The water flow rate, temperature, salinity and O<sub>2</sub> level in the tanks were 800 L/h, 6.7–7.1°C, 33 ppt, and >85% saturation (measured at the outlet), respectively. A photoperiod of 24:0 LD was maintained throughout the feeding trial.

#### Collection of the Intestinal, Tank Biofilm, and Rearing Water Samples

First, the fish were euthanized using 160 mg/L of MS222 (tricaine methanesulfonate; Argent Chemical Laboratories, Redmond, WA, USA). Thereafter, the body surface of the fish was wiped with 70% ethanol. The fish were then dissected to aseptically remove the GIT from the abdominal cavity. The distal intestinal (DI) region was separated from the GIT, and the content and surface mucus samples from the DI were collected (n = 18 for each group; 6 fish/tank) using sterile forceps and sterile glass slides, respectively. In addition to these fish samples, we collected environmental samples: water from the main inlet to the rearing hall (inlet water, n = 1), water from the rearing tanks (n = 3) and biofilm from the walls of the rearing tanks (n = 3). From the 3 tanks of each group, one liter of rearing water was filtered using 0.2 µm pore-size filters (Pall Corporation, Hampshire, United Kingdom) and the filter paper was stored at −80°C. The biofilm samples were scraped from the walls of the 3 tanks of each group. The fish and biofilm samples were collected in cryotubes, snap-frozen in liquid nitrogen and stored at −80°C.

The sample abbreviations reported in this article are: (i) fish samples–Control distal intestine content (CDC), RII distal intestine content (RIIDC), RIII distal intestine content (RIIIDC), Control distal intestine mucus (CDM), RII distal intestine mucus (RIIDM), RIII distal intestine mucus (RIIIDM); (ii) environmental samples– Control tank water (CW), RII tank water (RIIW), RIII tank water (RIIIW), inlet water (IW), Control tank biofilm (CB), RII tank biofilm (RIIB), RIII tank biofilm (RIIIB).

### DNA Extraction and PCR Amplification of Bacterial 16S rRNA Gene for Illumina MiSeq Amplicon Sequencing

Genomic DNA was extracted from the content, mucus and biofilm samples using the Quick-DNA™ Fecal/Soil Microbe 96 kit (Zymo Research, Irvine, CA, USA) following the manufacturer's protocol. The Metagenomic DNA Isolation Kit for Water (Epicentre Biotechnologies, Madison, WI, USA) was employed to extract the genomic DNA from the water samples. The quality of the extracted DNA was checked on a 1.2% (w/v) agarose gel. A Qubit 3.0 fluorometer (Life Technologies, Carlsbad, USA) was employed to quantify the concentration of DNA.

To describe the changes in the intestinal bacteria under the influence of LAB, we amplified the V3–V4 region of the bacterial 16S rRNA gene employing a dual-index sequencing strategy described by Kozich et al. (2013). The PCR reactions were carried out in triplicate; each 25 µl reaction contained 12.5 µl of Kapa HiFi Hot Start PCR Ready Mix (KAPA Biosystems, Woburn, USA), 1.5 µl each of the forward and reverse primers (at a final concentration of 100 nM), 3.5 µl of DNase- and nuclease-free water (Merck, Darmstadt, Germany) and 6 µl of DNA template or, for the negative PCR control, 6 µl of control. The thermocycling conditions included initial denaturation at 95°C for 5 min, followed by 35 cycles of denaturation at 98°C for 30 s, annealing at 58°C for 30 s, and extension at 72°C for 45 s, with a final extension at 72°C for 2 min. After the PCR, the resulting amplicon triplicates were pooled and visualized on a 1.2% (w/v) agarose gel stained with SYBR® Safe (Thermo Fisher Scientific, Rockford IL, USA), and the amplicon size was compared to a 1 kb DNA ladder (Thermo Fisher Scientific, Inc.). No amplification was observed in the negative PCR control. Only the amplicons (∼550 bp) with clearly visible bands were selected, purified using the ZR-96 Zymoclean™ Gel DNA Recovery Kit (Zymo Research) and eluted in 15 µl of elution buffer. The eluted amplicon library (sequencing library) was quantified by qPCR using the KAPA Library Quantification Kit (KAPA Biosystems). After quantification, each amplicon library was normalized to an equimolar concentration (3 nM) and validated on the TapeStation (Agilent Biosystems, Santa Clara, USA) prior to sequencing. The normalized library pool was further diluted to 12 pM and spiked with equimolar 10% PhiX control, and paired-end sequencing was then performed using the 600 cycle v3 sequencing kit on the Illumina MiSeq Desktop sequencer (Illumina, San Diego, CA, United States) in 2 runs with inter-run calibrators to reduce possible differences between sequencing runs.
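The reaction composition and library dilution described above can be checked with simple arithmetic. The following Python sketch is illustrative only and is not part of the original workflow; all variable and function names are assumptions, and the primer stock concentration it derives is merely what the stated volumes and final concentration imply (via C1·V1 = C2·V2), not a value reported by the authors.

```python
# Per-reaction volumes (µl) as listed in the PCR description.
components_ul = {
    "Kapa HiFi Hot Start PCR Ready Mix": 12.5,
    "forward primer": 1.5,
    "reverse primer": 1.5,
    "nuclease-free water": 3.5,
    "DNA template or negative control": 6.0,
}

# The components should add up to the stated 25 µl reaction volume.
total_ul = sum(components_ul.values())


def implied_stock_nM(final_nM, final_volume_ul, added_volume_ul):
    """Stock concentration implied by C1 * V1 = C2 * V2 (hypothetical helper)."""
    return final_nM * final_volume_ul / added_volume_ul


# Primer stock implied by 1.5 µl giving 100 nM in the full reaction.
primer_stock_nM = implied_stock_nM(100.0, total_ul, 1.5)

# Fold dilution from the 3 nM normalized library to the 12 pM loading pool.
dilution_factor = (3.0 * 1000.0) / 12.0  # 3 nM expressed in pM, divided by 12 pM

print(f"total reaction volume: {total_ul} µl")
print(f"implied primer stock: {primer_stock_nM:.1f} nM")
print(f"library dilution: {dilution_factor:.0f}-fold")
```

Such a tally is mainly useful as a sanity check when scaling the reaction to a master mix for many samples.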

### 16S rRNA Gene Sequence Data Processing

Sequence data quality check, processing and analyses: The sequence quality of the raw reads generated from the Illumina MiSeq machine was checked using FastQC (Andrews, 2010). The forward reads (R1), corresponding to the V3 region, were employed for subsequent analyses because they were of better quality than the reverse reads (R2), corresponding to the V4 region [Phred quality score (Q) ≤ 15]. Sequence processing was performed using the UPARSE (USEARCH version 9.2.64) software by Edgar (2013); this step included quality filtering and operational taxonomic unit (OTU) clustering. FastQ files were used as the input for the UPARSE pipeline. The raw reads were truncated to 240 bp and quality-filtered. The reads were truncated to remove the low-quality base pairs at the 3′-end and to make the sequences of all samples the same length. Furthermore, chimeric sequences were removed using the UCHIME algorithm (Edgar et al., 2011). The quality-filtered sequences were clustered into OTUs at the 97% sequence similarity level. For taxonomy prediction, we employed the 16S rRNA Ribosomal Database Project (RDP) training set with species names v16. This RDP training set was used as a reference database because large 16S databases such as SILVA, Greengenes, or the full RDP database may give unreliable annotations of short 16S rRNA tags (Edgar, 2018). Taxonomic ranks were assigned to the OTUs using the SINTAX algorithm (Edgar, 2016) with a bootstrap cutoff value of 0.5. Afterwards, OTUs with a confidence score <1 at the domain level and OTUs belonging to the phyla Cyanobacteria and Chlorophyta were removed to exclude plant-related sequences from the microbiota analysis. After constructing the OTU table, the counts were rarefied to the lowest number of sequences per sample to obtain an even sampling depth and thereby facilitate comparisons between the treatment groups. The OTU count data were divided into 4 sets based on the sample type, namely the DI content, DI mucus, tank water and tank biofilm samples. The downstream analyses were performed separately on these 4 sets. Furthermore, to ensure that we employed content and mucus data from the same fish, only 14 fish from each group were considered for the downstream analyses. In total, 103 samples were used for the downstream analyses, including the tank water and biofilm samples. The raw 16S rRNA gene sequence data from this study have been deposited in the European Nucleotide Archive (ENA) under the accession number ERP110004.
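Rarefying to the lowest sequencing depth amounts to drawing the same number of reads from every sample at random and without replacement. The Python sketch below illustrates the idea on toy data; it is not the authors' implementation (the study used R packages), and the function name, sample names, OTU labels and counts are all hypothetical.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample an OTU count table row to `depth` reads without replacement.

    `counts` maps OTU label -> read count for one sample (hypothetical layout).
    """
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand the counts into one label per read, then draw `depth` reads.
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    picked = rng.sample(reads, depth)
    rarefied = {otu: 0 for otu in counts}
    for otu in picked:
        rarefied[otu] += 1
    return rarefied


# Toy OTU table with uneven sequencing depth (hypothetical values).
samples = {
    "sample_A": {"OTU_1": 50, "OTU_2": 30, "OTU_3": 20},
    "sample_B": {"OTU_1": 10, "OTU_2": 5, "OTU_3": 45},
}

# Use the lowest per-sample read count as the common depth.
min_depth = min(sum(c.values()) for c in samples.values())
even = {name: rarefy(c, min_depth) for name, c in samples.items()}

for name, c in even.items():
    print(name, c, "total =", sum(c.values()))
```

Because the draw is random, rarefied counts for samples deeper than the common depth depend on the seed; the sample that defines the minimum depth is returned unchanged.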

Analyses of microbial diversity and composition: R code was executed in RStudio v3.5.0 (RStudio Team, 2016), and the functions of the R packages "iNEXT" v2.0.12 (Hsieh et al., 2016), "phyloseq" v1.22.3 (McMurdie and Holmes, 2013) and "ggplot2" v2.2.1 (Wickham, 2016) were used to make the rarefaction curves for the species richness, to calculate and visualize diversity indices, and to prepare the abundance plots. Another R package, "microbiome" v1.0.2 (Lahti et al., 2017), was used to make core and rare microbiota (relative abundance of core taxa) plots. Alpha diversities were calculated based on the formulae suggested by Jost (2006) for Shannon diversity (effective number of common OTUs) and Simpson diversity (effective number of most abundant OTUs). Beta diversity was examined by principal coordinates analysis (PCoA) based on the weighted UniFrac distance metric (for fish samples) and by double principal coordinates analysis (DPCoA, for water and biofilm samples) (Fukuyama et al., 2012).
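The effective numbers referred to here are commonly computed as Hill numbers: the exponential of the Shannon entropy and the inverse of the Simpson concentration. The Python sketch below shows these two conversions under that assumption; it is an illustration rather than the authors' R code, and the function name and example counts are hypothetical.

```python
import math


def effective_diversities(counts):
    """Return richness and the Shannon- and Simpson-based effective numbers of OTUs.

    Assumes the standard conversions: exp(Shannon entropy) and
    1 / sum(p_i^2), where p_i are relative abundances.
    """
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    shannon_entropy = -sum(p * math.log(p) for p in props)
    simpson_concentration = sum(p * p for p in props)
    return {
        "richness": len(props),
        "shannon_effective": math.exp(shannon_entropy),
        "simpson_effective": 1.0 / simpson_concentration,
    }


# A perfectly even community and a strongly dominated one (hypothetical counts).
even_community = effective_diversities([10, 10, 10, 10])
dominated_community = effective_diversities([97, 1, 1, 1])

print(even_community)
print(dominated_community)
```

For an even community all three values coincide with the number of OTUs, whereas dominance by one OTU pulls the Shannon- and, more strongly, the Simpson-based values toward 1, which is why the latter is described as weighting the most abundant OTUs.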

The feeding design, sample processing and sequencing, and analyses are shown in **Figure 1**.

#### Statistical Analysis of the Bacterial 16S rRNA Gene Amplicon Data

Statistical analysis was also performed in RStudio v3.5.0. The Kruskal-Wallis test followed by Dunn's test was employed to detect differences in alpha diversity, and we report statistically significant differences at p < 0.05 and statistical trends at p ≤ 0.15. Betadisper was used to check the assumption of homogeneity of dispersions; after that, Adonis (PERMANOVA) followed by pairwise comparisons was employed (999 permutations) to identify significant dissimilarities between the communities. "ANCOM" v1.1–3 (Mandal et al., 2015) was used to detect the differentially abundant OTUs in the treatment groups, and the "Boruta" v5.3.0 R package (Kursa and Rudnicki, 2010) was employed to find the relevant OTUs that caused the differences in the intestinal bacteria of the three fish groups.
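For readers unfamiliar with the Kruskal-Wallis test, the sketch below computes its tie-corrected H statistic from ranks in plain Python. It is a minimal illustration, not the authors' R analysis: the function names and the diversity values are hypothetical, Dunn's post hoc test is not implemented, and the p-value helper relies on the chi-square approximation, using the fact that with three groups (2 degrees of freedom) the upper tail probability equals exp(−H/2).

```python
import math


def kruskal_wallis(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic and its degrees of freedom."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    rank_of = {}   # value -> average rank (tied values share the mean rank)
    tie_term = 0   # sum of t^3 - t over groups of tied values
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
    correction = 1 - tie_term / (n_total**3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


def p_value_three_groups(h):
    """Chi-square upper tail for 2 degrees of freedom (valid for three groups only)."""
    return math.exp(-h / 2)


# Hypothetical alpha-diversity values for the three feeding groups.
control = [4.1, 3.8, 4.5]
rii = [5.2, 5.9, 5.5]
riii = [7.3, 6.8, 7.9]

h_stat, df = kruskal_wallis([control, rii, riii])
print(f"H = {h_stat:.3f}, df = {df}, p ≈ {p_value_three_groups(h_stat):.4f}")
```

With very small groups such as these, the chi-square approximation is rough, and exact or permutation-based p-values would normally be preferred.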

## Microbial Network Construction and Comparison of Topology

We used "SPIEC-EASI" v0.1.4 R package (SParse InversE Covariance Estimation for Ecological Association Inference) for generating the single-domain bacterial network. SPIEC-EASI is a statistical method that assumes the underlying microbial interaction networks to be sparse (Kurtz et al., 2015). In this study, we employed the neighborhood selection (MB) method on the sequenced 16S rRNA gene (V3 region) data of both DI content and mucus samples to understand the community organization.

We explored the co-occurrence networks to uncover the probable biological interactions occurring within the microbial communities. We used the top 200 OTUs for network construction, since it is advised to avoid extremely rare OTUs or OTUs with a large number of zeros (Banerjee et al., 2018). The co-occurrence microbial networks were constructed and analyzed using the functions of the R package "igraph" v1.2.1 and customized ggplot2 commands. A network consists of a set of vertices (commonly called nodes) and a set of edges. The degree of a node is the number of connections it has with the other nodes in the network. Betweenness estimates the number of shortest paths that pass through a node in the network, and the assortativity coefficient quantifies the extent to which nodes with the same label are selectively connected (Kolaczyk and Gábor, 2014). We compared the topology of the networks of the content and mucus samples separately by analyzing the node degrees and betweenness of the control and LAB-fed groups using the Kruskal-Wallis test followed by Dunn's test.
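To make the two topology measures concrete, the sketch below computes node degree and unnormalized betweenness for a small undirected, unweighted network using Brandes' algorithm. The study obtained these values with igraph in R; the edge list, OTU labels, and function names here are hypothetical, and the assortativity coefficient is not covered.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (node, node) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def node_degrees(adj):
    """Number of neighbours of each node."""
    return {node: len(neigh) for node, neigh in adj.items()}

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}   # -1 marks unvisited nodes
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies back towards s
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected path is found from both of its endpoints, so halve.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical chain of four OTUs: OTU_1 - OTU_2 - OTU_3 - OTU_4.
toy_edges = [("OTU_1", "OTU_2"), ("OTU_2", "OTU_3"), ("OTU_3", "OTU_4")]
toy_adj = build_adjacency(toy_edges)
toy_degree = node_degrees(toy_adj)
toy_betweenness = betweenness(toy_adj)
```

In this chain the two inner OTUs have degree 2 and lie on every shortest path between the ends, whereas the terminal OTUs have degree 1 and zero betweenness.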

## RESULTS

We analyzed the V3 region amplicons of the 16S rRNA gene that was sequenced on our high-throughput sequencing platform. A total of 28,747,884 high-quality reads were clustered into 1,823 OTUs at a 97% identity threshold. These reads were rarefied, based on sample size, to 12,855 reads/sample; this allowed us to assess most of the underlying microbial diversity (**Supplementary Figures 1A,B**).

The differences in the DI bacterial communities of the LAB-fed fish compared to the control fish are explained based on the following diversity metrics: overall microbial richness (i.e., counts of individual OTUs), effective number of OTUs (counts of common and dominant OTUs), taxonomic composition, and relative abundances of the bacterial taxa. Furthermore, we present the significant and relevant bacterial communities of the DI microbiota. We also describe the topology of the networks of the bacterial communities in the 3 fish groups.

#### Differences in the Microbial Diversity and Composition of the Intestinal and Environmental Microbiota

LAB feeding did not affect the species richness of the bacterial community in the DI content (**Figure 2A**). However, this was not the case for bacteria in the DI mucus; the species richness was found to be higher in the mucus of the RIII-fed group (p = 0.004 for RII vs. RIII and p = 0.071 for RIII vs. C) (**Figure 3A**). We observed differences in the effective number of common and dominant OTUs in the mucus of LAB-fed groups (p = 0.109 and p = 0.146 for RII vs. RIII; **Figures 2B**, **3B** and **Figures 2C**, **3C**). Comparison of Faith's phylogenetic diversity (PD) of the DI content did not reveal any significant differences (**Figure 2D**).

For the DI mucus, differences were observed between the PD associated with the three fish groups (p = 0.004 for RII vs. RIII and p = 0.079 for RIII vs. C; **Figure 3D**). It is noteworthy that the median alpha diversities of RII lie below the corresponding values of C, although we did not detect a trend or statistically significant difference between the feed groups. PCoA based on the weighted UniFrac distance matrix revealed the beta diversity of the bacterial communities; the differences between the control and LAB-fed groups were statistically significant (**Figure 4A**: F statistic = 9.215, R <sup>2</sup> = 0.320, p < 0.001; and **Figure 4B**: F statistic = 3.114, R <sup>2</sup> = 0.137, p < 0.002).

The beta diversity of the bacterial communities in the rearing tank water and biofilm samples was also analyzed. The bacterial communities in the water of the 3 study groups were not different (**Supplementary Figure 2A**, F statistic = 0.753, R <sup>2</sup> = 0.273, p = 0.684), as was the case with the bacteria in the biofilm (**Supplementary Figure 3A**, F statistic = 0.681, R <sup>2</sup> = 0.185, p = 0.574). On the other hand, the bacterial communities in the water were significantly different from those of the fish (**Supplementary Figures 2B–G**). Although we did not observe any significant differences between the bacterial communities of the tank biofilm and the intestinal mucus bacteria of the LAB-fed fish (**Supplementary Figures 3B–D,F–G**), the biofilm and mucus bacteria of the control group were different (**Supplementary Figure 3E**, F statistic = 16.29, R <sup>2</sup> = 0.520, p = 0.003).

#### Intestinal Bacterial Composition Under the Influence of LAB

Bacteria belonging to 23 phyla were present in the DI content and mucus (**Figures 5A**, **6A**). Firmicutes, Proteobacteria, Spirochaetes, Tenericutes, and Actinobacteria were found to be dominant in the intestine of the three study groups (**Supplementary Figures 4A,C**). Firmicutes were found to be more abundant than the rest, in both the content and mucus of the LAB-fed fish (**Figures 5A,B** and **Figures 6A,B**). The abundance of the phylum Tenericutes (content and mucus) was higher in the RII-fed fish than in the RIII-fed fish (**Figures 5A,B** and **Figures 6A,B**). Proteobacteria (content and mucus) decreased in the LAB-fed groups compared to the control group (**Figures 5A,B**; **Figures 6A,B** and **Table 1**). The abundance of Spirochaetes was higher in the DI mucus of RIII-fed fish and lower in the RII-fed fish (**Figures 6A,B**). The abundant phyla in water are shown in **Supplementary Figure 5A**. The dominant phyla in water were Bacteroidetes and Proteobacteria (**Supplementary Figure 5B**). The changes in the abundance of most bacterial taxa in both DI content and mucus of the LAB-fed groups compared to the control group are shown in **Table 1**.

At the genus level, Lactobacilli (Lactobacillus fermentum and Lactobacillus paraplantarum) were found to be the most dominant bacteria in the content and mucus of LAB-fed fish (**Figure 5B** and **Supplementary Figures 4B–D**), and Mycoplasma was also found to be dominant in the DI mucus of LAB-fed fish (**Figure 6B**).

#### Core Bacterial Communities of the Intestinal Microbiota

We identified the core microbiota, i.e., the members of the bacterial communities that were commonly shared among 99% of the samples. The common core taxa, at a prevalence (bacterial community population frequency) of 99% and an abundance detection threshold of 20%, are shown in **Figures 7A,B**. In the DI content, the abundant genera in the LAB-fed fish, namely Lactobacillus, Ralstonia (L. paraplantarum, R. pickettii) and Mycoplasma, were noted to be among the core bacterial members. Bradyrhizobium, Photobacterium, Phyllobacterium, Brevinema, Methylobacterium (B. jicamae, P. phosphoreum, P. myrsinacearum, B. andersonii, M. fujisawaense), and Sphingomonas were also among the shared core taxa in the content (**Figure 7A**). In the DI mucus, the genera that had higher abundance in the RIII-fed fish, viz. Brevinema and Pelomonas (B. andersonii, P. saccharophila), were observed among the core bacterial members. Photobacterium, Ralstonia, Aquabacterium, Bradyrhizobium, Methylobacterium, Phyllobacterium (P. phosphoreum, R. pickettii, A. parvum, B. jicamae, M. fujisawaense, P. myrsinacearum), Sphingomonas, and Mycoplasma were also among the shared core taxa of the intestinal mucus (**Figure 7B**).
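The idea of selecting core taxa by combining a detection threshold with a prevalence threshold can be sketched as follows. This is only an illustration: the exact way the two thresholds were applied in the study is not restated here, so the comparison rules (strictly above the detection threshold, prevalence at or above the cutoff), the threshold values, the taxon names, and the function name `core_taxa` are all assumptions.

```python
def core_taxa(rel_abund, detection, prevalence):
    """Taxa detected above `detection` in at least `prevalence` of the samples.

    rel_abund: dict of sample name -> dict of taxon -> relative abundance (0-1).
    """
    n_samples = len(rel_abund)
    taxa = sorted({taxon for sample in rel_abund.values() for taxon in sample})
    core = []
    for taxon in taxa:
        # Count the samples in which the taxon exceeds the detection threshold.
        detected = sum(
            1 for sample in rel_abund.values() if sample.get(taxon, 0.0) > detection
        )
        if detected / n_samples >= prevalence:
            core.append(taxon)
    return core

# Hypothetical relative abundances for three samples.
toy_abund = {
    "sample_1": {"taxon_A": 0.60, "taxon_B": 0.30, "taxon_C": 0.10},
    "sample_2": {"taxon_A": 0.55, "taxon_B": 0.05, "taxon_C": 0.40},
    "sample_3": {"taxon_A": 0.70, "taxon_B": 0.25, "taxon_C": 0.05},
}
strict_core = core_taxa(toy_abund, detection=0.20, prevalence=0.99)
relaxed_core = core_taxa(toy_abund, detection=0.20, prevalence=0.60)
```

Raising either threshold shrinks the core set, so the reported core membership should always be read together with the thresholds used to define it.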

The DPCoA indicated differences in the core members of the LAB-fed and the control group (content: F-statistic: 3.879, R <sup>2</sup> = 0.165, p = 0.004; mucus: F-statistic: 5.844, R <sup>2</sup> = 0.219, p = 0.001; **Supplementary Figures 6A,B**).

#### Significantly Abundant and Relevant Bacterial Taxa of the Intestinal Microbiota

ANCOM analysis detected the significantly abundant bacterial OTU in the DI content, which turned out to be L. fermentum in RIII-fed fish (**Table 1**). However, this bacterium was not detected as a significant feature in the DI mucus.

Boruta analysis gave 9 and 8 relevant OTUs in the intestinal content and mucus, respectively. In the DI content, L. fermentum, L. paraplantarum, Streptococcus sobrinus, Corynebacterium simulans, Lactococcus plantarum, W. cibaria, C. amphilecti, and bacterial taxa belonging to Streptococcus and Xanthomonadales were the relevant bacteria. L. paraplantarum was found to be abundant in the RII-fed group, whereas L. fermentum and Xanthomonadales were found to be abundant in the RIII-fed group. S. sobrinus, C. simulans, L. plantarum, W. cibaria, and C. amphilecti were reduced in abundance in the LAB-fed groups. In the mucus, Lewinella antarctica, L. paraplantarum, L. fermentum, Salinisphaera, Colwellia aestuarii and bacteria belonging to Gammaproteobacteria, Rhodobacteraceae, and Clostridiales were found to be the relevant bacterial taxa (most of them were abundant in the mucus of the RIII-fed fish; **Table 1**).

#### Association Network of OTUs

#### The DI Content Bacteria

The single-domain bacterial (SDB) network derived from the DI content of the 3 groups consisted of one giant connected component (**Supplementary Figure 7**). The significantly abundant and relevant OTUs were labeled based on their membership in different modules (**Figures 8A–C**). The connectivity pattern of the significantly abundant and relevant OTUs in the phylum-level co-occurrence network is shown in **Supplementary Figures 9A–C**. The average node degrees were 4.27 (SD: 3.44), 3.71 (SD: 1.52), and 4.06 (SD: 2.48) for the control, RII- and RIII-fed fish, respectively. Similarly, the values for betweenness were 370 (SD: 369), 396 (SD: 351), and 388 (SD: 391). The average node degrees and betweenness of the three groups were not significantly different. The assortativity coefficients of the phylum-level networks associated with the three groups (control, RII- and RIII-fed fish) were 0.09, 0.19, and 0.10, respectively. The significantly abundant and relevant OTUs belonged to different phyla and modules (**Figures 8A–C** and **Supplementary Figures 9A–C**). The degree distribution of the microbial network (for all OTUs) of the study groups (**Supplementary Figure 11A**) revealed that there are many highly connected hub nodes in the bacterial network of the RII-fed fish and that the hubs of the control group have higher node degrees.

#### The DI Mucus Bacteria

The SDB network derived from the DI mucus of the control, RII, and RIII groups consisted of one giant connected component (**Supplementary Figure 8**). In the bacterial network of RII-fed fish, we observed a singleton (C. aestuarii), a dyad (2 OTUs of Mycoplasma), and a triad (L. paraplantarum, W. cibaria, and P. piscicola) with no connection to the main network (**Supplementary Figure 8**). As for the RIII-fed group, there were 3 dyads (Sphingobacteriales + Myxococcales, 2 OTUs of Mycoplasma, and Xanthomonadales + Gammaproteobacteria) with no connection to the main network (**Supplementary Figure 8**). The significantly abundant and relevant OTUs were labeled based on their membership in different modules (**Figures 9A–C**).

RIIIDC). Principal coordinate analysis plot (B) shows the differences in the composition of the bacterial communities in the intestinal mucus (Control, CDM; RII, RIIDM; RIII, RIIIDM).

The connectivity pattern of the significantly abundant and relevant OTUs in the phylum-level co-occurrence network is shown in **Supplementary Figures 10A–C**. The average node degrees were 4.12 (SD: 2.20), 2.29 (SD: 2.09), and 2.74 (SD: 1.19) for the control, RII- and RIII-fed fish, respectively. The values for betweenness of the control, RII- and RIII-fed fish were 505 (SD: 664), 481 (SD: 596), and 613 (SD: 766), respectively. Dunn's test identified significant differences between the LAB-fed groups, and between the control and RIII-fed fish, for node degree (p = 0.0002 and p = 0.003, respectively) but not for edge betweenness (p = 0.08 and p = 0.07, respectively). The assortativity coefficients of the phylum-level networks for the three groups (control, RII- and RIII-fed fish) were −0.01, −0.07, and 0.13, respectively. The degree distribution of the microbial network (for all OTUs) of the three groups is shown in **Supplementary Figure 11B**. The node degree histogram showed that the hubs of the RII-fed group have higher node degrees than those of the other groups.

The main results of this study are summarized in **Figure 10**.

#### DISCUSSION

Probiotics are live microbes that can impart health-benefiting effects on host organisms. For instance, feeding of some species belonging to genera Lactobacillus and Bifidobacterium can elicit positive effects on host health (Wang et al., 2015; Bagarolli et al., 2017). Probiotics alter the gut microbiota and interact with them to produce several types of metabolites, vitamins, and antimicrobial agents that affect the host physiology (Saulnier et al., 2011; O'Shea et al., 2012; LeBlanc et al., 2017). In the present study, we investigated the intestinal microbiota changes in Atlantic salmon after feeding them with dietary supplements of two Lactobacillus spp., named RII and RIII. To understand the differences in the microbial community associated with the content and mucus of the DI, the bacteria in the two samples were analyzed separately because the microbial niche in the DI mucus is distinct compared to the intestinal contents.

Feeding LAB to the fish may facilitate their establishment in the intestine, although a significant difference was noted for the abundance of only one of the two LAB species. The feed-delivered organisms also altered the diversity and composition of the DI bacteria differently. RIII supplementation caused a significant increase in the species richness and phylogenetic diversity of the bacterial community in DI mucus. Furthermore, both RII and RIII caused a shift in the community composition; bacteria belonging to different genera were altered in the two feed groups. The co-occurrence networks indicated differential bacterial associations in the control and LAB-fed groups.

Water bacterial communities may have an effect on the microbiota of fish. To clarify this, we compared the microbial community composition in the intestinal and environmental samples. Notwithstanding the fact that different extraction methods can cause small variations in the microbial profile (Wagner Mackenzie et al., 2015), studies have shown that rearing water has a minor effect on the GI microbiota in fish (Giatsis et al., 2015; Uren Webster et al., 2018). Betiku et al. (2018) have demonstrated that recirculating water systems have a more diverse microbial composition compared to the flow-through system. However, similar to other reports (Yan et al., 2016; Lokesh et al., 2018; Gupta et al., under review), water bacterial communities might not have affected the intestinal bacterial profile in our study. Also, none of the dominant OTUs of water were detected in the DI of fish, suggesting that host-specific gut microbial species selection is modulated by the host gut habitat and the host's genotype (Giatsis et al., 2015).

#### LAB Increases the Microbial Diversity in the Intestinal Mucus

Corresponding to our observation on the content bacteria, a few previous studies have also shown that LAB supplementation does not alter the intestinal bacterial diversity (Chao1 and Shannon diversities) in humans (Van Zanten et al., 2014) and in mice with colon cancer (Mendes et al., 2018). On the other hand, species richness, Shannon and Simpson diversities, and PD of the bacteria in the DI mucus were higher in the RIII-fed fish. In the case of mucus bacteria of RII-fed fish, we noted a slight decrease (p > 0.05) in alpha diversity compared to the control fish. Previous studies have shown that Lactobacillus can increase the bacterial PD in the gut of mice (Usui et al., 2018) and weaning piglets (Zhao et al., 2016). On the contrary, offering LAB in combination with Bifidobacterium breve and Bifidobacterium longum did not result in greater bacterial species diversity (Chao1, Shannon index and PD) in mice that received antibiotics (Grazul et al., 2016).

### LAB Promotes the Abundance and Dominance of Intestinal Lactobacillus and Other Firmicutes

L. paraplantarum (LP) is related to L. plantarum (Curk et al., 1996). It was dominant in the RII-fed group and L. fermentum (LF) was found dominant in the RIII-fed group. Lactobacilli are a group of gram-positive ubiquitous LAB that produce organic acids as end products of their metabolic activity linked to carbohydrate fermentation (Bernardeau et al., 2006). LP is known to produce bacteriocins, which are antimicrobial peptides produced as a defense response (Tulini et al., 2013). A Lactobacillus isolate (LP 11-1) stimulated the innate immune system and induced tolerance against the pathogenicity of Pseudomonas aeruginosa in silkworm (Nishida et al., 2017). LF has been found to restore the expression of markers associated with the maintenance of intestinal barrier function, and recover the SCFAs- and lactic acid-producing bacterial populations in mouse suffering from colitis (Rodriguez-Nogales et al., 2017).

Lactobacillus is part of the normal intestinal flora of fish (Ringø et al., 1995; Spanggaard et al., 2000; Ringø and Olsen, 2003). In zebrafish, probiotic Lactobacillus helps to overcome infection (He et al., 2017). In Nile tilapia (Oreochromis niloticus), LF is known to improve fish immune response (Nwanna and Bamidele, 2014). LF (LbFF4 strain) along with L. plantarum (LbOG1 strain) exhibit in vitro antibacterial activities against fish pathogens in Clarias gariepinus (Adenike and Olalekan, 2009). The higher abundance of intestinal Lactobacillus members and the altered bacterial abundance in the LAB-fed fish confirms that LAB feeding can change the intestinal microbial composition.

Arrows indicate changes in abundance (blue arrow: increase, red arrow: decrease, bold black line: no change, TND: taxon not dominant).

Enterococcus cecorum was also found to be dominant in the content of the RII-fed group compared to the control group (**Table 1**). Enterococcus spp. isolated from the intestine of rainbow trout (Oncorhynchus mykiss) are used as probiotics due to their antimicrobial activity against fish pathogens (Carlos et al., 2015). The functional potential of E. cecorum in Atlantic salmon has not yet come to light, although one particular strain is known to cause infections in broilers (Herdt et al., 2009).

Clostridiales (belonging to Firmicutes) were higher in the mucus of salmon offered diets with RIII. Commensal Clostridiales are known to promote gut health by modulating gut homeostasis and taking part in immune activation (Lopetuso et al., 2013).

#### LAB Favors Certain Members of Tenericutes, Spirochaetes, and Actinobacteria

LAB significantly aided in altering the abundance of the genus Mycoplasma (Tenericutes) and B. andersonii (Spirochaetes) in the mucus, which are the common core members in the DI content of Atlantic salmon (**Figure 7A**). Mycoplasma has consistently been isolated from salmon intestine (Holben et al., 2002; Zarkasi et al., 2014) and its presence as a core microbiota suggests that it may be a commensal organism in the intestinal ecosystem. B. andersonii has been reported in the intestinal microbiota of flatfish, Solea senegalensis (Tapia-Paniagua et al., 2010). Although B. andersonii is known to digest lignocellulose and fix nitrogen in termite guts (Kudo, 2009), its functional importance needs to be elucidated. The abundance of the genus Micrococcus (M. luteus), a member of Actinobacteria, was higher in the DI content of the RIII-fed group (**Table 1**). Though M. luteus is known to be a pathogen for rainbow trout (Oncorhynchus mykiss) and brown trout (Salmo trutta L.) (Pękala et al., 2018), an in vivo feeding study has suggested that it can enhance the growth and health of Nile tilapia (Abd El-Rhman et al., 2009).

### LAB Largely Decreased the Abundance of Proteobacteria

Proteobacteria is the most abundant phylum in many marine and freshwater fishes (Yan et al., 2016; Lokesh et al., 2018) and it is also known to dominate the gut microbiota of Atlantic salmon (Gajardo et al., 2016; Lokesh et al., 2018). Therefore, it was surprising to find this taxon in low abundance in the LAB-fed and the control fish. A general decrease in the abundance of intestinal Proteobacteria has also been reported in farmed Atlantic salmon that were transferred to seawater (Rudi et al., 2018). Taxa belonging to Proteobacteria are involved in metabolic pathways that participate in carbon and nitrogen fixation and in the stress response regulatory system (Vikram et al., 2016). They are also important in the digestive process in fish (Romero et al., 2014). P. phosphoreum, a known gut symbiont of marine fish, helps in chitin digestion and uses luciferase to reoxidize reduced coenzymes and other molecules for metabolism (Nealson and Hastings, 1979). N. sediminicola and P. myrsinacearum are known as nitrogen-fixing bacteria (Gonzalez-Bashan et al., 2000; Muangthong et al., 2015). On the other hand, R. pickettii, formerly known as Burkholderia pickettii, has genes to biodegrade aromatic hydrocarbons (Ryan et al., 2007). In the current and in our recent (Gupta et al., under review) studies we found that P. myrsinacearum and R. pickettii are part of the core gut microbiota of Atlantic salmon; N. sediminicola was also significantly abundant in the intestinal mucus of the fish fed oligosaccharide. Functions of the aforementioned bacteria are not yet reported in fish.

#### LAB Affects the Microbial Association

We inferred single-domain networks using the SPIEC-EASI framework, and highlighted the significantly abundant and relevant OTUs in the intestinal microbiota. For DI mucus, the inferred SDB network for RII-fed fish showed lower overall connectivity. The node degree histograms also communicate interesting information about the network; the mucus bacteria of the RII-fed group had hubs with higher node degrees. However, the lower average node degree and lower selective linking of the RII-fed group indicate fewer interactions among the gut bacteria. Cooperative microbial communities are known to provide microbiome stability because of their functional dependence. Studies have shown that the stability declines with an increase in microbial diversity and proportion of cooperative interactions (Coyte et al., 2015). However, higher cooperating microbial communities can cause a runaway effect that can collapse the competing microbial population due to overrepresentation of the most stable community (McNally and Brown, 2016).

The dyads in the mucus bacterial networks of LAB-fed fish were different, the exception being the one constructed with 2 OTUs of Mycoplasma, which had higher abundance in the RII-fed fish and lower abundance in the RIII-fed fish. This result could suggest that intestinal Mycoplasma in the LAB-fed fish was not associated with other gut bacterial communities. In the content of LAB-fed fish, most of the labeled OTUs (except OTU 8) were present in their respective modules (**Figures 8B,C**). In the mucus of RIII-fed fish, OTUs belonging to C. aestuarii, L. paraplantarum and Clostridiales were found to exist in one module. Clostridiales and Rhodobacteraceae, which had the same module membership in the network of the control fish, were no longer closely associated after LAB feeding. So was the case with L. fermentum and C. aestuarii. Members affiliated to Rhodobacteraceae are known for their denitrification properties, and Kraft et al. (2014) have shown that Clostridiales indirectly participate in nitrate respiration by providing fermentation substrates (e.g., acetate, formate, or hydrogen) to Rhodobacteraceae-like denitrifiers. Our findings suggest that the taxa belonging to the same module can be functionally dependent, but the alteration of their membership after LAB feeding has to be further investigated.

The mucus bacteria of RIII-fed fish had higher species richness and PD, and the significantly abundant and relevant OTUs belonged to different modules. For the RIII-associated network, 2 OTUs each belonging to two modules (Rhodobacteraceae and L. fermentum; C. aestuarii, and Clostridiales) had higher abundances compared to the control group. In addition, significantly abundant and relevant bacteria had higher abundance in the RIII-fed fish compared to the control group. This abundance pattern does not indicate negative feedback loops (Coyte et al., 2015). These results of bacterial networks have to be validated through culture-based studies.

## CONCLUSION

In summary, LAB feeding promoted the dominance of intestinal Lactobacillus (Firmicutes) and certain members of the phyla Tenericutes, Spirochaetes, and Actinobacteria. Although the abundances of many members of Proteobacteria were decreased, the phylum remained dominant in the distal intestine of Atlantic salmon. Dietary supplementation with the two LAB strains shifted the intestinal bacterial community composition. Furthermore, the co-occurrence networks of the intestinal bacteria were also different for the LAB-fed fish. Taken together, our results show that the LAB influences the gut microbiota of Atlantic salmon. This information will help in future studies that explore the microbial interactions between LAB-modulated gut microbiota and the host.

### AUTHOR CONTRIBUTIONS

MS and VK procured the funding for the study. VK, MS, JK, AF, and SG designed the study. JK provided the probiotics. AF and SG conducted the feeding experiment. SG performed the 16S rRNA sequencing studies. SG, VK, and JF analyzed the data. SG wrote the manuscript with the guidance of VK. All authors read, revised and approved the manuscript.

## FUNDING

The study was undertaken as part of the project Bioteknologi– en framtidsrettet næring (FR-274/16), funded by the Nordland County Council, Norway.

#### ACKNOWLEDGMENTS

The Lactobacillus strains employed in this study are the property of The University of Veterinary Medicine and Pharmacy in Košice, Košice, The Slovak Republic. We thank Professors Peter Popelka (Department of Food Hygiene and Technology) and Dagmar Mudroňová (Department of Microbiology and Immunology), The University of Veterinary Medicine Košice for providing the microorganisms for this study. We are thankful to Ghana Vasanth for her assistance in sample collection, Martina Kopp for her technical help in sequencing the libraries, and Nord University research station staff for their help during the period of fish sampling. Special thanks to Bisa Saraswathy for her support in data analysis, scientific input, helpful discussions and preparation of the manuscript. The authors acknowledge the open access publication funding provided by Nord University.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.03247/full#supplementary-material

Supplementary Figure 1 | Sample-size-based rarefaction curves for the reads obtained from the intestinal content (A) and mucus (B). The shaded portion around each line represents the 95% confidence interval. Color code for the feed groups: green lines- control, orange lines- RII, pink lines- RIII. Codes for content samples: CDM-control, RIIDC-RII, RIIIDC-RIII. Codes for mucus samples: CDM-control, RIIDM-RII, RIIIDM-RIII.

Supplementary Figure 2 | Double principal coordinate analysis plots showing the beta diversity of the bacterial communities. Tank and inlet water (A), control intestinal content and control tank water: F-statistic = 4.035, R <sup>2</sup> = 0.211, P = 0.01 (B), RII intestinal content and RII tank water: F-statistic = 2.375, R <sup>2</sup> = 0.136, P = 0.07 (C), RIII intestinal content and RIII tank water: F-statistic = 5.006, R <sup>2</sup> = 0.250, P = 0.002 (D), Control intestinal mucus and control tank water: F-statistic = 16.291, R <sup>2</sup> = 0.520, P = 0.003 (E), RII intestinal mucus and RII tank water: F-statistic = 2.934, R <sup>2</sup> = 0.163, P = 0.051 (F), RIII intestinal mucus and RIII tank water: F-statistic = 3.910, R <sup>2</sup> = 0.206, P = 0.03 (G).

Supplementary Figure 3 | Double principal coordinate analysis plots showing the beta diversity of the bacterial communities. Tank biofilm bacteria (A), Control intestinal content and control tank biofilm: F-statistic = 2.061, R <sup>2</sup> = 0.120, P = 0.082 (B), RII intestinal content and RII tank biofilm: F-statistic = 1.915, R <sup>2</sup> = 0.113, P = 0.015 (C), RIII intestinal content and RIII tank biofilm: F-statistic = 4.171, R <sup>2</sup> = 0.217, P = 0.043 (D), Control intestinal mucus and control tank biofilm: F-statistic = 5.807, R <sup>2</sup> = 0.1279, P = 0.002 (E), RII intestinal mucus and RII tank biofilm: F-statistic = 1.476, R <sup>2</sup> = 0.09, P = 0.146 (F), RIII intestinal mucus and RIII tank biofilm: F-statistic = 2.078, R <sup>2</sup> = 0.121, P = 0.076 (G).

Supplementary Figure 4 | Barplots showing the dominant bacterial phyla and species in the intestinal content (A,B) and mucus (C,D).

Supplementary Figure 5 | Barplots showing the abundance of the bacterial phyla (A), dominant phyla (B) in the tank water. The height of each bar segment represents the abundance of individual operational taxonomic units (OTUs) stacked in order from largest to smallest, and separated by a thin black border line. Color codes: Proteobacteria, green; Bacteroidetes, light blue.

Supplementary Figure 6 | DPCoA showing the differences in the composition of the core members of the intestinal content (A) and mucus (B) samples of the control and LAB-fed groups.

Supplementary Figure 7 | The single-domain network graph of the bacteria in the intestinal content. Nodes represent different phyla shown in different colors. The three panels represent the three feed groups: Control (A), RII (B), RIII (C).

Supplementary Figure 8 | The single-domain network graph of the bacteria in the intestinal mucus. Nodes represent different phyla shown in different colors. The three panels represent the three feed groups: Control (A), RII (B), RIII (C).

Supplementary Figure 9 | Network association graph showing the connectivity pattern of the significantly abundant and relevant OTUs in the intestinal content of the Control (A), RII (B), and RIII (C) groups.

Supplementary Figure 10 | Network association graph showing the connectivity pattern of the significantly abundant and relevant OTUs in the intestinal mucus of the Control (A), RII (B), and RIII (C) groups.

Supplementary Figure 11 | Histograms showing the degree distribution of the bacterial networks associated with the intestinal content (A) and mucus (B).

### REFERENCES

and realistic safety assessments. FEMS Microbiol. Rev. 30, 487–513. doi: 10.1111/j.1574-6976.2006.00020.x


and short chain fatty acids production in the gut-intestinal tract of weaning piglets. Wei. Sheng Wu Xue Bao. 56, 1291–1300. Available online at: https://europepmc.org/abstract/med/29738199

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Gupta, Fečkaninová, Lokesh, Koščová, Sørensen, Fernandes and Kiron. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Corrigendum: Lactobacillus Dominate in the Intestine of Atlantic Salmon Fed Dietary Probiotics

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Atte Von Wright, University of Eastern Finland, Finland

\*Correspondence:

Viswanath Kiron kiron.viswanath@nord.no

#### Specialty section:

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Received: 08 March 2019 Accepted: 30 April 2019 Published: 15 May 2019

#### Citation:

Gupta S, Fečkaninová A, Lokesh J, Koščová J, Sørensen M, Fernandes J and Kiron V (2019) Corrigendum: Lactobacillus Dominate in the Intestine of Atlantic Salmon Fed Dietary Probiotics. Front. Microbiol. 10:1094. doi: 10.3389/fmicb.2019.01094

Shruti Gupta<sup>1</sup>, Adriána Fečkaninová<sup>2</sup>, Jep Lokesh<sup>1</sup>, Jana Koščová<sup>3</sup>, Mette Sørensen<sup>1</sup>, Jorge Fernandes<sup>1</sup> and Viswanath Kiron<sup>1</sup>\*

<sup>1</sup> Faculty of Biosciences and Aquaculture, Nord University, Bodø, Norway, <sup>2</sup> Department of Food Hygiene and Technology, University of Veterinary Medicine and Pharmacy in Košice, Košice, Slovakia, <sup>3</sup> Department of Microbiology and Immunology, The University of Veterinary Medicine and Pharmacy in Košice, Košice, Slovakia

Keywords: fish, Salmo salar, feed additive, probiotics, intestinal bacteria, Lactobacillus, microbiota, amplicon sequencing

#### **A Corrigendum on**

#### **Lactobacillus Dominate in the Intestine of Atlantic Salmon Fed Dietary Probiotics**

by Gupta, S., Fečkaninová, A., Lokesh, J., Koščová, J., Sørensen, M., Fernandes, J., et al. (2019). Front. Microbiol. 9:3247. doi: 10.3389/fmicb.2018.03247

In the original article, there were mistakes in **Table 1** as published. NA was stated as "not present" in the footnote of Table 1 of the original article. For some of the taxa in the table this was not true. Therefore, we replaced NAs with up or down arrows, and indicated TND as taxon not dominant. The corrected **Table 1** appears below. The authors apologize for this error and state that this does not change the scientific conclusions of the article in any way. The original article has been updated.

Copyright © 2019 Gupta, Fečkaninová, Lokesh, Koščová, Sørensen, Fernandes and Kiron. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.



Arrows indicate changes in abundance (blue arrow: increase, red arrow: decrease, bold black line: no change, TND: taxon not dominant).

# Comparative Genome Analysis of Lactococcus lactis Indicates Niche Adaptation and Resolves Genotype/Phenotype Disparity

Michiel Wels<sup>1,2</sup>, Roland Siezen<sup>2,3,4</sup>, Sacha van Hijum<sup>1,2,3</sup>, William J. Kelly<sup>5</sup> and Herwig Bachmann<sup>1,2,6</sup>\*

<sup>1</sup> NIZO Food Research B.V., Ede, Netherlands, <sup>2</sup> TI Food and Nutrition, Wageningen, Netherlands, <sup>3</sup> Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Center, Nijmegen, Netherlands, <sup>4</sup> Microbial Bioinformatics, Ede, Netherlands, <sup>5</sup> AgResearch Ltd, Palmerston North, New Zealand, <sup>6</sup> Systems Bioinformatics, Vrije Universiteit Amsterdam, Amsterdam, Netherlands

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Milan Kojic, University of Belgrade, Serbia Douwe Van Sinderen, University College Cork, Ireland

\*Correspondence: Herwig Bachmann Herwig.Bachmann@nizo.com

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 26 October 2018 Accepted: 07 January 2019 Published: 31 January 2019

#### Citation:

Wels M, Siezen R, van Hijum S, Kelly WJ and Bachmann H (2019) Comparative Genome Analysis of Lactococcus lactis Indicates Niche Adaptation and Resolves Genotype/Phenotype Disparity. Front. Microbiol. 10:4. doi: 10.3389/fmicb.2019.00004

Lactococcus lactis is one of the most important micro-organisms in the dairy industry for the fermentation of cheese and buttermilk. Besides the conversion of lactose to lactate it is responsible for product properties such as flavor and texture, which are determined by volatile metabolites, proteolytic activity and exopolysaccharide production. While the species Lactococcus lactis consists of the two subspecies lactis and cremoris, their taxonomic position is confused by a group of strains that, despite a cremoris genotype, display a lactis phenotype. Here we compared and analyzed the (draft) genomes of 43 L. lactis strains, of which 19 are of dairy and 24 are of non-dairy origin. Machine-learning algorithms facilitated the identification of orthologous groups of protein sequences (OGs) that are predictors for either the taxonomic position or the source of isolation. This allowed the unambiguous categorization of the genotype/phenotype disparity of ssp. lactis and ssp. cremoris strains. A detailed analysis of phenotypic properties including plasmid-encoded genes indicates evolutionary changes during niche adaptations. The results are consistent with the hypothesis that dairy isolates evolved from plant isolates. The analysis further suggests that genomes of cremoris phenotype strains are so eroded that they are restricted to a dairy environment. Overall the genome comparison of a diverse set of strains allowed the identification of niche and subspecies specific genes. This explains evolutionary relationships and will aid the identification and selection of industrial starter cultures.

Keywords: Lactococcus lactis, comparative genomics, gene-trait matching, niche adaptation, dairy fermentation

## INTRODUCTION

Lactococcus lactis is a non-motile, non-spore-forming, low G+C, Gram-positive lactic acid bacterium. It is used as a starter culture in the production of a wide range of fermented dairy products where it contributes to food preservation, flavor and texture formation (Smid and Kleerebezem, 2014). Lactococcus lactis has been isolated from a variety of environments where its primary role is the initiation of fermentation by the rapid utilization of available carbohydrates to produce lactic acid. Wild-type plant- and animal-associated strains are able to ferment a wide range of mono-, di- and tri-saccharides. Members of this species have also acquired the ability to ferment lactose with some strains rapidly transitioning to become specialists adapted to the homogeneous milk environment. The industrial importance of L. lactis is demonstrated by a global cheese production of close to 2 × 10<sup>7</sup> tons in 2015 (Bulletin of the International Dairy Federation 2016), and based on that we estimate that over 10<sup>20</sup> lactococci are being consumed by humans annually.

The taxonomy of L. lactis is currently phenotypically based (Schleifer et al., 1985; van Hylckama Vlieg et al., 2006; Rademaker et al., 2007) with two main subspecies (lactis and cremoris) and one biovar (lactis biovar diacetylactis). On the basis of their 16S rRNA gene sequence, these two subspecies are estimated to have diverged some 17 million years ago (Bolotin et al., 2004). Several studies have shown that two genotypic lineages (also called subspecies cremoris and lactis) exist, but that strains with the ssp. cremoris genotype can show either phenotype making accurate identification confusing (van Hylckama Vlieg et al., 2006; Kelly et al., 2010; Cavanagh et al., 2015). Attempts to explain this disparity have been made by detailed analysis of selected genes (Godon et al., 1992; Beimfohr et al., 1997; Pu et al., 2002) and comparative genomics (Kelleher et al., 2017), but to date the genes underlying this disparity are unknown. Although both subspecies are used as starters by the dairy industry, defined strains with the ssp. cremoris phenotype are preferred for Cheddar cheese production because of their desirable properties especially in relation to acid production, flavor development and bacteriophage resistance (Limsowtin et al., 1996). Two additional phenotypic subspecies (hordniae and tructae) have been described (Latorre Guzman et al., 1977; Perez et al., 2011), although at the genome level these cluster within subspecies lactis and cremoris, respectively.

Strains with the lactis phenotype show high genetic diversity (Rademaker et al., 2007; Passerini et al., 2010) and have been isolated from a variety of sources, most commonly fresh or fermenting plant material or milk and dairy environments. The cremoris phenotype is only found in dairy environments and many attempts to isolate similar cultures from environmental sources have not been successful (Klijn et al., 1995; Salama et al., 1995). These cremoris phenotype strains cluster closely together (Rademaker et al., 2007) and are likely to have a common origin. There is a considerable body of evidence from comparative and experimental studies that suggests that dairy isolates have evolved from plant isolates (van Hylckama Vlieg et al., 2006; Kelly et al., 2010; Siezen et al., 2010b; Bachmann et al., 2012; Kelleher et al., 2017), and many of the properties that permit rapid growth in milk (lactose utilization, proteolysis, peptide transport) are located on plasmids (Ainsworth et al., 2014b). As many as 12 plasmids (van Mastrigt et al., 2018) and plasmid sizes up to 140 kb have been reported in L. lactis (Kojic et al., 2005). Numerous industrially important traits (bacteriophage resistance, bacteriocin production, exopolysaccharide production) are also frequently plasmid located.

In a previous study comparative genome hybridization was used to assess the gene content of 38 lactococci and the data were matched with 207 phenotypes, which gave insight into niche-specific differences (Siezen et al., 2011). The limitation of comparative genome hybridization is that only genes present on a multi-strain DNA microarray can be detected, while new genes cannot. This can only be overcome by using genome sequencing data. The draft genomes of the L. lactis strains used by Siezen et al. (2011) were recently sequenced (Backus et al., 2017; Wels et al., 2017) and here we analyzed and compared these strains plus several publicly available fully sequenced L. lactis genomes. We used machine-learning algorithms (Bayjanov et al., 2013) to identify orthologous groups of protein sequences (OGs) that are predictors for either the taxonomic position or the source of isolation. Genome comparison provides evidence of the genetic events that have shaped the evolution of L. lactis and explains the current genotype/phenotype disparity of ssp. lactis and ssp. cremoris.

### MATERIALS AND METHODS

#### Strains

All genome sequences of the strains used in this study (**Table 1**) were obtained from the public databases and re-annotated with the same annotation pipeline (Seemann, 2014) to prevent gene calling and function annotation differences between genomes caused by the use of different gene calling software or annotation pipelines.

### Genome Comparison

All protein sequences encoded in the genomes were compared using OrthoMCL (Li et al., 2003). OrthoMCL uses all-against-all protein BLAST and subsequently performs MCL clustering, grouping proteins with high sequence similarity. In this way orthologs (genes in different species that evolved from a common ancestral gene by speciation) are predicted.

The OrthoMCL output matrix containing OGs, i.e., gene families, was used as an input to generate a database (in MS Excel) in which the information about the location (contig-level coordinates) and length of orthologous proteins of all L. lactis genomes were aligned. Multiple sequence alignment (MSA) files were made where the nucleotide and protein sequences within OGs were aligned, to facilitate identification of pseudogenes (encoding incomplete proteins), and to identify correct start codons.
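As an illustration of how an OG table of this kind can be organized, the following minimal Python sketch turns OrthoMCL-style group lines into a per-strain copy-number matrix. It assumes the common `group_id: strain|gene strain|gene ...` line layout; the gene identifiers and the assignment of genes to OGs are invented for the example and do not come from the study, which used a spreadsheet database rather than this code.

```python
from collections import defaultdict


def parse_groups(lines):
    """Parse OrthoMCL-style group lines into {og_id: {strain: [gene_ids]}}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        og_id, members = line.split(":", 1)
        per_strain = defaultdict(list)
        for member in members.split():
            # Each member is assumed to be written as "strain|gene".
            strain, gene = member.split("|", 1)
            per_strain[strain].append(gene)
        groups[og_id.strip()] = dict(per_strain)
    return groups


def copy_number_matrix(groups, strains):
    """Return {og_id: [copy number per strain]} in the given strain order."""
    return {
        og: [len(members.get(s, [])) for s in strains]
        for og, members in groups.items()
    }


# Toy input: hypothetical gene IDs, not real annotations.
example = [
    "OG_1: KF147|g0001 IL1403|g0101 SK11|g0201",
    "OG_2: KF147|g0002 KF147|g0003 IL1403|g0102",
    "OG_3: SK11|g0202",
]
strains = ["KF147", "IL1403", "SK11"]
groups = parse_groups(example)
matrix = copy_number_matrix(groups, strains)
for og, row in matrix.items():
    print(og, row)
```

A row of all ones marks a single-copy OG shared by every strain, which is the kind of OG used for the core-genome phylogeny.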

Detailed examination of specific genome locations and functional analysis was performed within the Integrated Microbial Genome platform (IMG-ER<sup>1</sup>) (Chen et al., 2017).

### Core and Pan Genome Determination

All OGs with a single gene copy in each of the strains were selected from the OrthoMCL output. The nucleotide sequences of these OGs were aligned using Muscle (Edgar, 2004) and alignment positions with nucleotide differences were isolated and stored in a single artificial sequence. This sequence was used as a basis to generate a whole-genome-based phylogeny using FastTree (Price et al., 2009). Re-rooting of the tree was performed using one of the two subspecies (e.g., ssp. cremoris) as an outgroup. The pangenome was scored by counting all OGs within the complete set of studied genomes. Core and pan genome sizes were calculated by determining the average increase or decrease of the pan/core genome, including the standard variation.

<sup>1</sup>https://img.jgi.doe.gov/

#### TABLE 1 | Strains used in this study.

### Subspecies, Niche and Phenotype Matching With the Genotype

For the matching of the phenotypes, described in Bayjanov et al. (2013), as well as the identification of genes associated with subspecies and niche, we used the Phenolink scripts (Bayjanov et al., 2012), running on a local Linux operating system. For the phenotype matching, pseudogenes were regarded as an intermediate state between presence and absence. In this way, the pseudogene could be regarded as both present and absent in the genome. Heatmap figures were prepared with the OGs identified by genotype-phenotype matching as most discriminating for either the subspecies or the isolation source, or for plasmid-derived OGs. Data visualization was done using R<sup>2</sup>. Heatmaps were generated with the heatmap function, using Manhattan or Euclidean distance matrices, complete hierarchical clustering and data scaling.
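The clustering behind such heatmaps can be sketched in a few lines. The example below is a plain-Python illustration of complete-linkage hierarchical clustering on Manhattan distances, not the R code used in the study; the strain names and presence/absence profiles are hypothetical.

```python
def manhattan(a, b):
    """Manhattan (city-block) distance between two equal-length profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_linkage(profiles):
    """Agglomerative clustering with complete linkage.

    profiles: {name: [values]}. Returns the list of merges as
    (members_of_cluster_1, members_of_cluster_2, distance).
    """
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # pairwise distance between their members.
                d = max(
                    manhattan(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical OG presence (1) / absence (0) profiles for four strains.
profiles = {
    "strain_A": [1, 1, 0, 0],
    "strain_B": [1, 1, 0, 1],
    "strain_C": [0, 0, 1, 1],
    "strain_D": [0, 0, 1, 0],
}
merges = complete_linkage(profiles)
for left, right, distance in merges:
    print(left, right, distance)
```

The order of merges defines the dendrogram that is drawn alongside the heatmap rows or columns; swapping `manhattan` for a Euclidean distance function changes only the distance matrix, not the linkage step.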

### Plasmid Analysis

For plasmid isolation, strains were individually cultured in GM17 medium and pools of equal volumes of fully grown cultures were prepared. Plasmid DNA isolation was performed by an alkaline lysis method as described previously (Vos et al., 1989) followed by phenol chloroform extraction and DNA precipitation (Sambrook and Russell, 2001). The plasmid sequencing was performed at BaseClear B.V. (Leiden, Netherlands) with 100-bp paired-end libraries on an Illumina HiSeq 2000. The sequence reads that resulted from the dedicated plasmid DNA isolation were used to map against the contigs of the genome sequences using breseq (Deatherage and Barrick, 2014). An increased contig coverage was used as the main criterion to determine if contigs were of plasmid origin or not. Manual inspection of these contigs was performed to remove chromosomal repeat elements such as rRNA clusters, transposable elements and phages.
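The coverage criterion can be illustrated with a small sketch. The code below flags contigs whose coverage by plasmid-enriched reads is well above the median contig coverage; the fold threshold of 5, the contig names and the coverage values are all assumptions made for illustration, since the text does not state a numerical cut-off, and the manual inspection step is not modeled.

```python
from statistics import median


def flag_plasmid_contigs(coverage, fold_threshold=5.0):
    """Flag contigs whose coverage is at least fold_threshold times the median.

    coverage: {contig_name: mean read coverage from the plasmid-enriched library}.
    fold_threshold is a hypothetical cut-off, not a value from the study.
    """
    baseline = median(coverage.values())
    if baseline <= 0:
        raise ValueError("median coverage must be positive")
    return {
        contig: (cov / baseline) >= fold_threshold
        for contig, cov in coverage.items()
    }


# Invented coverage values for five contigs of one strain.
coverage = {
    "contig_1": 12.0,
    "contig_2": 9.5,
    "contig_3": 310.0,
    "contig_4": 11.0,
    "contig_5": 150.0,
}
flags = flag_plasmid_contigs(coverage)
candidates = sorted(c for c, is_plasmid in flags.items() if is_plasmid)
print(candidates)
```

High-coverage contigs flagged this way would still need checking for chromosomal repeats such as rRNA clusters, transposable elements and phages, as described above.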

### RESULTS

### Core and Pangenome

We compared the genomes of 44 L. lactis strains (**Table 1**), 36 of which featured in a previous study using comparative genome hybridization and extensive phenotyping (Siezen et al., 2011). The strains were mainly isolated from either a dairy environment or plant material with the exception of one isolate each from human, leafhopper, soil and sink drain water. The draft genomes of 34 strains were recently described (Backus et al., 2017; Wels et al., 2017) while complete genome sequences are available for the other strains.

The average genome size of 43 strains (excluding strain P7266 – see reasoning later) is 2.49 Mb with an average of 2548 protein coding sequences (**Figure 1**). The largest draft genome sequence is that of L. lactis ssp. lactis N42 with a genome size of 2.73 Mb while the smallest is L. lactis ssp. lactis K231 at 2.34 Mb. For strains with complete genome sequences the largest chromosome is from strain L. lactis ssp. lactis KF147 (2.60 Mb) and the smallest from L. lactis ssp. cremoris UC509.9 (2.25 Mb). By comparing the 43 L. lactis genomes using OrthoMCL, 7795 orthologous groups (OGs) were identified (**Supplementary Table 1/Sheet 1**). This set of OGs is often referred to as the pangenome and is considered to be the full complement of genes that can be found in a species. The size of the pangenome of the 43 L. lactis strains levels off at around 7800 OGs and is larger than that found in other LAB (Frese et al., 2011; Siezen and van Hylckama Vlieg, 2011; Dutilh et al., 2013; Smokvina et al., 2013), especially when corrected for genome size. Of the pangenome OGs, about 11% (879) are predicted to be of plasmid origin (see below), which is higher than observed in other species such as Lactobacillus paracasei (Smokvina et al., 2013), where this fraction is ∼5%.

Of the 7795 identified OGs, 1463 were found to be conserved among the 43 strains (**Supplementary Table 1/Sheet 1**). Of these 1463 genes, 70 were identified as being unique to the L. lactis species based on (i) genes in the LaCOG prediction (Makarova and Koonin, 2007) and (ii) conservation among all L. lactis genomes analyzed. Among this set of 70 OGs, 46 were annotated as "hypothetical protein" (**Supplementary Table 1/Sheet 2**). The remainder encoded, amongst other functions, an L-lactate dehydrogenase, a cation transporter, other transporters with unknown specificity, an NADH oxidase, several stress proteins and a late competence protein.
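The way pan and core genome sizes change as genomes are added can be sketched as follows. This is a minimal Python illustration that averages over random strain orderings and reports the mean and standard deviation at each step; the permutation scheme, the number of permutations and the toy OG sets are assumptions for the example and are not taken from the study.

```python
import random
from statistics import mean, stdev


def summarize(rows):
    """Mean and standard deviation for each list of sizes."""
    return [(mean(r), stdev(r) if len(r) > 1 else 0.0) for r in rows]


def accumulation_curves(og_sets, n_permutations=100, seed=0):
    """Pan and core genome size after adding 1..N strains in random order.

    og_sets: {strain: set of OG identifiers present in that strain}.
    Returns (pan_curve, core_curve), each a list of (mean, sd) per step.
    """
    rng = random.Random(seed)
    strains = list(og_sets)
    pan_sizes = [[] for _ in strains]
    core_sizes = [[] for _ in strains]
    for _ in range(n_permutations):
        order = strains[:]
        rng.shuffle(order)
        pan, core = set(), None
        for k, strain in enumerate(order):
            pan |= og_sets[strain]  # union grows the pangenome
            # intersection shrinks the core genome
            core = set(og_sets[strain]) if core is None else core & og_sets[strain]
            pan_sizes[k].append(len(pan))
            core_sizes[k].append(len(core))
    return summarize(pan_sizes), summarize(core_sizes)


# Hypothetical OG content of three strains.
og_sets = {
    "strain_A": {"OG_1", "OG_2", "OG_3"},
    "strain_B": {"OG_1", "OG_2", "OG_4"},
    "strain_C": {"OG_1", "OG_5"},
}
pan_curve, core_curve = accumulation_curves(og_sets)
print("pan:", pan_curve)
print("core:", core_curve)
```

Once all strains are included the order no longer matters, so the last point of each curve has zero spread; a pangenome that "levels off" shows up as a pan curve whose successive means stop increasing.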

### Evolution and Niche Adaptation

Nucleotide variation in the core genes of all 44 strains (including strain P7266), which we defined as OGs occurring with 1 copy per genome, was used to determine the evolutionary relatedness of the strains (**Figure 2** and **Supplementary Figure 1**). The tree was constructed in the same manner as described previously (Smokvina et al., 2013) with the difference that nucleotide sequences were used for alignments of the OGs to obtain a more detailed view of the relationship between genome difference and evolutionary time. **Figure 2** shows that strain P7266 forms an outgroup. Manual inspection of the 16S rRNA gene of this strain showed that it is not an L. lactis, but probably a member of another Lactococcus species, closely related to L. taiwanensis (data not shown). This strain was therefore not included in subsequent analysis.
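The step of collecting variable alignment positions from single-copy OGs into one artificial sequence per strain can be sketched as below. This is an illustrative Python example with invented sequences and strain names; it treats any column with more than one distinct character (gaps included) as variable, which is an assumption rather than a documented detail of the original pipeline.

```python
def variable_columns(alignment):
    """Keep only alignment columns that differ between strains.

    alignment: {strain: aligned sequence}, all of equal length.
    """
    strains = list(alignment)
    length = len(alignment[strains[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if len({alignment[s][i] for s in strains}) > 1
    ]
    return {s: "".join(alignment[s][i] for i in keep) for s in strains}


def concatenate_variable_sites(alignments):
    """Join the variable columns of several OG alignments per strain."""
    combined = {}
    for alignment in alignments:
        for strain, sites in variable_columns(alignment).items():
            combined[strain] = combined.get(strain, "") + sites
    return combined


# Two hypothetical single-copy OG alignments for three strains.
og_a = {"strain_A": "ATGCCA", "strain_B": "ATGCTA", "strain_C": "ATGCTA"}
og_b = {"strain_A": "GGTAAC", "strain_B": "GGTAAC", "strain_C": "GCTAAT"}
snps = concatenate_variable_sites([og_a, og_b])
for strain, sites in snps.items():
    print(strain, sites)
```

The resulting per-strain strings have equal length and can be written out as a multi-sequence alignment for a tree-building program.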

**Figure 2** clearly shows the separation of strains into the ssp. lactis and cremoris clades, which is consistent with earlier genotypic analysis of L. lactis (Rademaker et al., 2007). However, the clear identification of a cremoris clade that overlaps with a ssp. lactis phenotype was not possible with previous genotypic analyses that were either based on small subunit rRNA or on a five-locus Multi Locus Sequence Analysis (MLSA) (Rademaker et al., 2007). The analysis of the 43 genomes suggests that the cremoris genotype/cremoris phenotype strains share a common ancestor with the cremoris/lactis variants and that the cremoris/cremoris strains only evolved once (**Figure 2**). When focusing on the source of isolation, with the exception of KW2, KW10, and N41, all subspecies cremoris strains used in this study are isolated from a dairy environment, while the distribution of isolation source among the subspecies lactis strains is more spread over the dairy and non-dairy niches (**Figure 2**). The dairy isolated ssp. lactis strains cluster together in a clear branch of this subspecies (**Figure 2** from N42 to KLDS). The only L. lactis ssp. hordniae strain (LMG8520) clusters closely to the ssp. lactis (**Figure 2**). The identified L. lactis ssp. lactis biovar diacetylactis strains also cluster within the subspecies lactis clade.

<sup>2</sup>https://cran.r-project.org/bin/windows/base/

While the draft genomes used here do not allow a precise analysis of the number of pseudogenes, an analysis of OGs that occur as pseudogenes in 8 or more strains shows that many more pseudogenes occur in the cremoris/cremoris clade (**Supplementary Table 2**). Consequently, the number of functional genes might be lower in these dairy isolates. The cremoris phenotype strains contain many identical pseudogenes resulting from adaptation to the dairy environment and indicative of their common ancestry.

Most of the differences above are consistent with the hypothesis that the dairy isolates of L. lactis evolved from plant isolates. This is most clearly shown by the genome degradation that has occurred in the dairy ssp. cremoris strains as illustrated by examination of the complement of chromosomal genes that encode glycoside hydrolases (GHs) involved in carbohydrate metabolism. While the wild-type strains have >45 GHs, in the dairy strains many of these have been lost or are now pseudogenes, with HP and FG2 having only 19 intact GH-encoding chromosomal genes. This gene loss is most striking in a large chromosomal region (kw2\_1427-1462) that contains 11 GHs as well as genes involved in carbohydrate transport and metabolism. Much of this region has been lost from the cremoris dairy starter strains. A similar chromosomal region is also found in the ssp. lactis strains (LLKF\_1593-1630) but the lactis dairy strains do not show the same extent of gene loss (**Supplementary Table 3**). This clustering resembles the lifestyle adaptation region reported in Lactobacillus plantarum strains (Siezen and van Hylckama Vlieg, 2011), and several of the genes are likely to be involved in metabolism of plant polysaccharides. In the ssp. cremoris strains the cluster includes a gene encoding a secreted GH11 family endo-1,4-beta-xylanase which is the only example of a xylanase recorded in the lactic acid bacteria. Analysis of the predicted protein sequence shows that it groups with unusual xylanases isolated from uncultured organisms from arthropod guts and with xylanases from rumen bacteria (Brennan et al., 2004). Another feature of this region is the presence of arabinose metabolism genes (including a GH43 alpha-N-arabinofuranosidase) in ssp. lactis strains KF147, KF282, and LMG8526, and as a pseudogene in ssp. cremoris N41.

#### Plasmid Content

Four strains with complete genome sequences (KW2, IO-1, MG1363, and IL1403) contain no plasmids, although it should be noted that MG1363 and IL1403 are plasmid-cured model organisms. To obtain an overview of the plasmid content of the other strains, we performed a dedicated sequencing effort on pooled plasmid-derived DNA. By determining the contig coverage, a prediction was made whether or not the contig was of plasmid origin. Overall, ∼11% (879 OGs) were predicted to be of plasmid origin. Among these OGs, several well-known plasmid-located features of L. lactis were found such as the ability to metabolize lactose, several peptidases, oligopeptide transporters and restriction-modification systems. In addition to these well-known plasmid-derived properties, we also identified 61 OGs that were not found in any of the known L. lactis plasmid sequences available at GenBank, several of which encode potentially industrially relevant functions (**Supplementary Figure 2**). For example, a four-gene cluster (containing OGs 2273, 2274, 2276, and 2277) was identified to encode proteins related to capsular or exo-polysaccharide production (CPS, EPS). This gene cluster is found only in the ssp. cremoris strains B40, AM2, and N41 and might produce a new type of EPS. In addition, an OG annotated as nisin resistance protein (OG\_3336) was found. This OG is present in SK11, SK110, AM2, N42, and ATCC19435.

#### Subspecies Disparity

The taxonomy of L. lactis is currently phenotypically based (Schleifer et al., 1985; van Hylckama Vlieg et al., 2006; Rademaker et al., 2007) with two main subspecies (lactis and cremoris) and one biovar (lactis biovar diacetylactis). The lactis and cremoris phenotypes are differentiated on the basis of arginine utilization, maltose utilization, growth temperature, and salt tolerance, whereas the biovar diacetylactis strains have the additional ability to metabolize citrate. Lactis phenotype strains can produce ammonia from arginine, ferment maltose, tolerate higher temperatures (40°C) and higher levels of NaCl (3%). Genotype-phenotype matching (Bayjanov et al., 2013) was used to identify the most discriminating OGs for the subspecies lactis and cremoris in this dataset (**Figure 3**) and allowed us to predict the subspecies with a 0% error. The analysis of the OGs that are specific for subspecies cremoris/lactis strains showed that many genes code for hypothetical proteins, regulators or transporters with unknown specificity (**Figure 3**).
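To make the idea of a discriminating OG concrete, the sketch below searches for OGs whose presence/absence pattern separates two groups of strains without error, with pseudogenes counted as compatible with both states as described in the Methods. It is a deliberately simplified rule-based illustration in Python and not the Phenolink implementation; the strain names, OG names and states are hypothetical.

```python
def discriminating_ogs(states, labels):
    """Return {og_id: group} for OGs that mark one of two groups perfectly.

    states: {og_id: {strain: "present" | "absent" | "pseudo"}}
    labels: {strain: group_name}, with exactly two distinct group names.
    A pseudogene is treated as compatible with both presence and absence.
    """
    group_a, group_b = sorted(set(labels.values()))
    hits = {}
    for og, per_strain in states.items():
        a = [per_strain[s] for s in labels if labels[s] == group_a]
        b = [per_strain[s] for s in labels if labels[s] == group_b]
        for marked, other, name in ((a, b, group_a), (b, a, group_b)):
            present_ok = all(x != "absent" for x in marked)
            absent_ok = all(x != "present" for x in other)
            # Require real evidence on both sides, not only pseudogenes.
            informative = "present" in marked and "absent" in other
            if present_ok and absent_ok and informative:
                hits[og] = name
    return hits


# Hypothetical strains and OG states.
labels = {"s1": "cremoris", "s2": "cremoris", "s3": "lactis", "s4": "lactis"}
states = {
    "OG_x": {"s1": "present", "s2": "pseudo", "s3": "absent", "s4": "absent"},
    "OG_y": {"s1": "absent", "s2": "absent", "s3": "present", "s4": "present"},
    "OG_z": {"s1": "present", "s2": "absent", "s3": "present", "s4": "absent"},
}
hits = discriminating_ogs(states, labels)
print(hits)
```

In this toy data `OG_x` marks the first group despite one pseudogene, `OG_y` marks the second group, and `OG_z` carries no subspecies signal. A statistical classifier would additionally rank partially discriminating OGs, which this rule does not attempt.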

Production of ammonia from arginine via the arginine deiminase system is the main property used to separate the two phenotypes (Reddy et al., 1969), and it was shown to play a role in acid tolerance (Cotter and Hill, 2003). This system is encoded by a cluster of nine chromosomal genes, and examination of the ssp. cremoris genome sequences shows that all nine genes are intact in the lactis phenotype strains. However, in the cremoris phenotype strains either arginine deiminase (arcA) is a pseudogene or the gene cluster contains a transposon insertion (**Supplementary Table 4**).

Several wild-type strains of L. lactis ssp. cremoris have been shown to produce putrescine from agmatine via the agmatine deiminase (AGDI) pathway (del Rio et al., 2015), and this may also contribute to acid tolerance. These genes are clustered and are present in 4 cremoris/lactis strains but are not found in the genomes of the cremoris/cremoris strains (**Supplementary Table 4**). The three gene glutamate decarboxylase (GAD) system has also been shown to be important for acid tolerance. The system is intact in most of the lactis phenotype strains, but in all the cremoris phenotype strains gadB is a pseudogene (**Supplementary Table 4**). The GAD system is missing from the KW10 genome, although the surrounding glutamate metabolism genes are all present.

Malolactic fermentation (MLF) is a secondary fermentation in which L-malate is converted to L-lactate and CO<sub>2</sub>, and is also believed to contribute to acid tolerance in L. lactis (Poolman et al., 1991). Three genes are involved: a LysR-family transcriptional activator (mleR), the malolactic enzyme (mleS) and a malate transporter (mleP). All of the cremoris phenotype strains contain pseudogenes in these genes (**Supplementary Table 4**). Based on the prevalence of pseudogenes in these systems we predict that the acid tolerance of the cremoris phenotype strains is impaired and suspect that this contributes to the reduced tolerance of these strains to elevated temperatures and NaCl levels. Reduced salt tolerance has been correlated with absence or reduced activity of the betaine transporter BusA (Obis et al., 2001). The Bus operon (busRAB) and a similar gene cluster annotated as an osmoprotectant transport system (choQS) have been lost from strains HP and FG2 but are present in the other cremoris strains. We could not identify a specific OG or group of OGs that might be responsible for the reduced temperature and NaCl tolerance. However, there are several transcriptional regulators unique to the cremoris group, some of which are involved in the regulation of stress response genes. Regulators specific for the ssp. cremoris include 2 Xre-type regulators (OGs 2327 and 2368) which are known to be temperature-sensitive repressors (Wood et al., 1990) and 2 TetR-family repressors (OGs 2301 and 2412) which are described to be involved in response to osmotic stress (Ramos et al., 2005).

While the lack of maltose utilization is a phenotypic determinant of the ssp. cremoris we found that all L. lactis strains analyzed harbor full length maltose utilization genes such as maltose phosphorylase (OG\_1365), trehalose 6-phosphate phosphorylase (OG\_1490) and beta-phosphoglucomutase (OG\_991).

The ssp. cremoris genomes were searched for maltose phosphorylase genes (glycoside hydrolase family 65) and this highlighted three gene clusters potentially involved in maltose transport and metabolism (**Supplementary Table 5**). One of these gene clusters has previously been shown to be a trehalose PTS system (Andersson et al., 2001). Most of the cremoris phenotype strains have pseudogenes in this gene cluster.

Maltose utilization in L. lactis IL1403 has been linked with a 10 gene cluster (Gabrielsen et al., 2012) that contains several glycoside hydrolases (GH13 and GH65 families) and a maltose ABC transporter (malEFG). The substrate-binding protein (malE) belongs to COG2182 which is associated with maltose transport. Most of these genes have been lost from strains HP, FG2, B40 and LMG6897, and UC509.9 contains several pseudogenes. An unusual feature is that while the glycoside hydrolase genes are orthologous, the transporter genes belong to two separate groups which show only ∼60% amino acid identity. One group includes most of the lactis phenotype strains along with 11 lactis/lactis strains, while the other group includes KW10, the cremoris phenotype strains and 17 lactis/lactis strains. Whether this difference is reflected in substrate specificity remains to be determined.

A third system potentially involved in maltose or maltodextrin utilization is found as a seven gene cluster only in the cremoris strains with a lactis phenotype and in three of 28 lactis/lactis strains (KF7, LMG8526, and ATCC19435). The genome context is the same in all these strains suggesting that these genes have been acquired as a single block. The substrate-binding protein for the ABC transporter found in this gene cluster belongs to COG1653 which is usually associated with oligosaccharide transporters.

#### Lactose and Citrate Metabolism

Lactose is taken up and phosphorylated by the lactose PTS system, and then metabolized via the tagatose pathway (de Vos et al., 1990). All genes required for uptake, conversion and regulation are organized in the lacR-lacABCDEFGX gene cluster, which is found to be present and complete in many L. lactis strains, presumably all on plasmids (high coverage contigs). All L. lactis ssp. cremoris strains have this plasmid-encoded gene cluster, except for the plasmid-free strains MG1363, KW2, and KW10. Moreover, these lactose genes are also present on plasmids in L. lactis ssp. lactis strains ATCC19435, DRA4, Li-1, ML8, N42, and UC317. These ssp. lactis strains are isolates from dairy starters, except strains Li-1 and N42 which are grass isolates and may come from a dairy environment (**Figure 4**).

Lactococcus lactis ssp. lactis bv. diacetylactis is widely used in the food industry because it can convert citrate to aroma compounds, such as diacetyl. Citrate uptake is mediated by the pH-sensitive citrate permease P (CitP) (Sesma et al., 1990; Garcia-Quintans et al., 1998; Magni et al., 1999), which exchanges extracellular citrate for intracellular lactate. In the cytosol citrate is split into acetate and oxaloacetate by the citrate lyase complex CitCDEF (Magni et al., 1999; Martin et al., 2004). Oxaloacetate is converted to pyruvate by oxaloacetate decarboxylase (CitM). In L. lactis biovar diacetylactis strains the citrate gene cluster is on the chromosome, immediately downstream of the als gene encoding acetolactate synthase, which converts pyruvate to α-acetolactate, while an intact citP gene is found on a plasmid in a citQRP operon (Passerini et al., 2013; Falentin et al., 2014).

In the present study, all L. lactis strains were found to have the als gene, but only four strains (IL1403, DRA4, M20, and KF67) contain the chromosomal citrate utilization genes (OG3408-OG3414) (**Supplementary Table 6**). In strains IL1403 and DRA4, fragments of a citrate/malate permease (citP/mleP) gene were found between citM and citR while an intact citP gene appears to be on a plasmid in strain DRA4. L. lactis IL1403 is a plasmid-free diacetylactis strain that can no longer utilize citrate, as it has lost the pIL2 plasmid of parent strain IL594 with the citQRP operon (Gorecki et al., 2011). In contrast, strains M20 and KF67 have an intact citP/mleP gene between the citM and citR genes, instead of a plasmid-encoded CitP. Since strain KF67 was isolated from grapefruit juice, this appears to be an excellent example of adaptation to a niche that is rich in citrate. The citrate metabolism genes in KF67 are not found adjacent to acetolactate synthase and they are likely to have been horizontally acquired from a different lactic acid bacterium. As the strains assigned to diacetylactis, except for IL1403 and DRA4, are not monophyletically clustered (**Figure 2**) and the genes involved in citrate transport and metabolism differ between strains, it can be argued that the diacetylactis phenotype was acquired by these strains through different evolutionary events. This observation supports the current description of these strains as a dedicated biovar and not as a designated subspecies.

#### Proteolytic System

The proteolytic system found in dairy starter strains of L. lactis consists of an extracellular, peptidoglycan-bound proteinase which cleaves milk proteins into oligo-, tri- and dipeptides. These then enter the cell via peptide transport systems and are further degraded into amino acids by dedicated peptidases (Grappin et al., 1985; Liu et al., 2010).

Most dairy starter L. lactis strains were found to have the prtP gene (encoding extracellular serine proteinase PrtP) and adjacent prtM gene (encoding its maturase PrtM) located on a plasmid (**Supplementary Figure 3**). An intact prtP gene was found only in strains NCDO763, SK11 (Siezen et al., 2005), SK110, AM2, UC509.9 (Ainsworth et al., 2013), N41, N42, and UC317. In strains FG2, LMG6897, B40, and ML8 the prtP gene is present but was not recognized by the Prokka annotation system, because the prtP gene was fragmented on different small contigs. The prtP gene is known to have repeats near its C-terminus, which could cause sequence assembly problems. In addition, two different chromosome-encoded intracellular serine endoproteinases were found (OG\_4412, OG\_4413): one in strains ATCC19435, N42, and LMG9447, and a different one in strain CV56.

Nearly all intracellular peptidases of known specificity (Liu et al., 2010) are chromosome-encoded; these are present and complete in all L. lactis genomes: PepC, PepN, PepM, PepA (pseudogene in KF147), PepV, PepT, PepXP, PepQ, and PepP. Two paralogs of PepD1, encoding a dipeptidase, are present in all strains, but one variant is a pseudogene in strains AM2, SK110, SK11, and LMG8520. A plasmid-located pcp gene, encoding pyrrolidone-carboxylate peptidase, is only present in most ssp. cremoris strains (except LMG6897, A76, KW2, KW10, and V4), but located on the chromosome of strain MG1363. Absent in all genomes are the peptidases PepE/PepG, PepI, PepR, and PepL, which are more commonly found in other lactic acid bacteria (Liu et al., 2010). Another five peptidases of unknown specificity are present in all strains.

Two paralogs of PepF, encoding oligoendopeptidase F, are found based on differences in amino acid sequence. All strains except ATCC19435 contain a chromosomally encoded PepF (pseudogene in AM2, SK110, and KLDS). An additional plasmid-encoded PepF is found in most ssp. cremoris strains (absent in LMG16897, V4, KW2, KW10; pseudogene in AM2, SK110) and in ssp. lactis ATCC19435. Two paralogs of PepO, encoding neutral oligoendopeptidase O, are found in most strains. They are difficult to distinguish since their amino acid sequences are nearly identical, and that could lead to assembly difficulties. A chromosome-encoded PepO paralog appears to be present in all ssp. cremoris and many ssp. lactis strains, but appears to be a pseudogene in several strains. The presence of an intact pepO gene is generally linked to the presence of a complete oppACBFD operon encoding the ABC transporter for uptake of oligopeptides. A plasmid-located pepO paralog is present in most ssp. cremoris strains (absent in KW10, KW2), and is directly adjacent to an additional opp operon. L. lactis ssp. lactis strains appear to have only one opp operon, either on the chromosome or on a plasmid, and in many cases it is not clear whether this transporter is functional due to putative pseudogenes. All L. lactis strains have a chromosome-encoded di/tripeptide transporter DtpT (pseudogene in ATCC19435, LMG8520). Having multiple genes for certain proteolytic functions is assumed to provide more efficient utilization of peptides derived from milk proteins.

#### Bacteriocins – Nisin

A complete nisin biosynthesis cassette nisABTCIPRKFEG (Kuipers et al., 1993) is present on the chromosome of ssp. lactis strains CV56 and IO-1, and is flanked by transposase fragments. A complete nis gene cluster was found, at the same chromosomal insert position as in strain CV56, in 11 ssp. lactis strains: KF134, KF146, KF196, KF282, K231, KF24, K337, KF67, KF7, Li-1, and LMG8526. All the strains of plant/vegetable origin can produce nisin Z. Four strains (V4, LMG14418, LMG9446, and KF147) have an incomplete chromosomal nis gene cluster and cannot produce nisin, but they have retained some immunity genes (i.e., nisFEG and/or nisI). L. lactis ssp. cremoris strains FG2 and N41 have a fragment of the nis gene cluster encoding only nisP, nisI and a truncated nisC, which should also confer nisin immunity; this organization is typical of plasmid localization (Tarazanova et al., 2016).

#### Novel 2-Component Lantibiotic

One of the contigs of the genome of L. lactis ssp. lactis KF146 is a 37-kb plasmid fragment (high sequence coverage) that contains a 7-gene cluster encoding a class I 2-component lantibiotic. This gene cassette is very similar to the smbM1FTM2GAB cluster in Streptococcus mutans, in which SmbA and SmbB are lantibiotic precursors with similarity to lacticin A1 and A2 (Yonezawa and Kuramitsu, 2005). We have adopted the same nomenclature, whereby the L. lactis bacteriocin genes llbM1 and llbM2 encode lantibiotic biosynthesis proteins involved in dehydration and cyclization of the lantibiotic precursors, while llbF, llbT, and llbG encode components of an ABC transporter, of which LlbG has an N-terminal peptidase domain for cleavage of the pre-peptide and activation of the lantibiotic subunits. Similar gene clusters are present in various strains of S. mutans, S. gallolyticus, S. suis, and S. rattus (Hyink et al., 2005; Hinse et al., 2011). The lantibiotic precursor LlbA is 52% identical to SgbA, and 46% to SmbA found in several S. mutans strains, while LlbB is 65% identical to SgbB, 37% to SmbB, and 68% to the SsbB of S. suis (**Supplementary Figure 4**).
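The percent identities quoted above depend on how identity is counted over an alignment. The following is a minimal sketch, assuming the two sequences are already aligned (gaps written as `-`) and that identity is computed over ungapped columns only; the text does not state which convention was used, and the function name and toy peptides are hypothetical, not real LlbA or SgbA sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns are excluded from the denominator (assumption)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Made-up toy peptides for illustration only.
toy_a = "MKT-LLSAGC"
toy_b = "MKSALL-AGS"
identity = percent_identity(toy_a, toy_b)
print(f"{identity:.1f}% identical over ungapped columns")
```

Other common conventions divide by the full alignment length or by the length of the shorter sequence, which would give lower values for the same alignment.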

### Non-ribosomal Peptide (NRPS)/Polyketide (PKS) Synthesis

A NRPS/PKS gene cluster has been identified in L. lactis KF147 (Siezen et al., 2008, 2010a). It was hypothesized that the NRPS/PKS product in L. lactis functions in microbe–plant interactions (defense or adhesion) or that it facilitates iron uptake from the environment. In the present study, this complete NRPS/PKS gene cluster is found to be present in five L. lactis strains, i.e., the plant strains KF147, KF146, KF134, KF196, and Li-1 (**Supplementary Table 7**). In each strain the gene cluster has been inserted at the same position on the chromosome downstream of the acetolactate synthase gene. It is likely that this region is a hotspot for gene insertion as the citrate metabolism genes are found at the same location in biovar diacetylactis strains. There is little sequence diversity between these NRPS/PKS gene clusters in these L. lactis strains, suggesting that they were recently acquired from the same unknown host.

A highly similar NRPS/PKS gene cluster occurs in many S. mutans strains (Aikawa et al., 2012). The order of the genes is exactly the same as in L. lactis, and the individual proteins are 69–74% identical to those of L. lactis. Therefore, although the function of the NRPS/PKS product in these two species is likely to be the same, the gene cassette was probably not recently horizontally transferred between species. In S. mutans UA140, the NRPS/PKS locus was demonstrated to produce a metabolite that contributes to oxidative stress tolerance and biofilm formation (Wu et al., 2010). As L. lactis KF147 was recently shown to express this locus during growth in association with plants, this could indicate that the end metabolite might confer protection against reactive oxygen species encountered in plant fermentations as well as on living plant surfaces (Golomb and Marco, 2015).

### Exopolysaccharides (EPS)

Lactococcal cell-wall polysaccharides decorate the peptidoglycan network, and in some cases form a thin outer layer termed the polysaccharide "pellicle" (PSP) (Delcour et al., 1999; Kleerebezem et al., 2010). These polysaccharides have been implicated in bacteriophage recognition and attachment (Forde and Fitzgerald, 1999; Dupont et al., 2004; Mahony et al., 2013; McCabe et al., 2015). L. lactis has two known chromosomal loci for cell-wall polysaccharides biosynthesis (called RGP and EPS cluster) and one gene cluster for teichoic acid biosynthesis (Siezen et al., 2011). These regions show a lot of variation in gene order and composition between L. lactis strains. In a few strains an EPS biosynthesis cluster can be found on plasmids. This variability of cell-wall polysaccharide genes suggests a rich variety in structures of the produced exopolysaccharides in these L. lactis strains.

Rhamnose is one of the major components of these exopolysaccharides (Sijtsma et al., 1991; Looijesteijn et al., 1999). L-dTDP-rhamnose is formed in a four-step enzymatic reaction from glucose 1-phosphate, which involves the activities of glucose-1-phosphate thymidylyl transferase, dTDP-glucose-4,6-dehydratase, dTDP-4-keto-L-rhamnose-3,5-epimerase, and dTDP-L-rhamnose synthase, encoded by the genes that are commonly designated rfbABCD, respectively (Shibata et al., 2002).

The RGP gene cluster for biosynthesis of rhamnose-glucose polysaccharides, also called CWPS (cell wall-associated polysaccharide) gene cluster, appears to consist of three subclusters (Dupont et al., 2004; Mahony et al., 2013; Sadovskaya et al., 2017). The first cluster contains the rmlA-D (or rfbA-D) rhamnose biosynthesis genes, which are found to be highly conserved in all L. lactis strains. The second cluster contains the rgpABCDEF genes, which encode polysaccharide biosynthesis and export, including the priming glycosyltransferase RgpA and ABC transporter subunits RgpC and RgpD. The exact function of rgpE is still unclear (Shibata et al., 2002). The rgpA-D genes are also highly conserved in all L. lactis strains. Large differences between strains are found in the third sub-cluster which encodes a variety of different sugar transferases, thereby highlighting the diversity and complexity of lactococcal polysaccharides. Based on the presence and absence of all genes found in these RGP/CWPS clusters, three genotypes (A, B, and C) were first defined (Mahony et al., 2013), while C-genotype strains were further grouped into 5 subtypes (Ainsworth et al., 2014a). Polysaccharide structures of members of CWPS classes A, B, and C have recently been determined (Ainsworth et al., 2014a; Vinogradov et al., 2018a,b). In our present study, the composition of the RGP gene clusters in the 43 L. lactis strains allows a tentative assignment of the produced polysaccharides to the CWPS subgroups A, B, and C (**Supplementary Table 8/Sheet 2** and **Table 9**).

The second biosynthesis cluster is the so-called "EPS cluster" of 13 genes epsXABCDEFGHIJKL, as in the plant-derived strain KF147 (Siezen et al., 2008, 2010a, 2011). In this cluster the epsA and epsB gene products determine the length of the EPS and are essential for the biosynthesis of EPS. The epsC gene product is not necessary for the biosynthesis and plays a role in regulation of phosphorylation of the EpsB protein. EpsD is the priming glycosyltransferase, which is anchored in the membrane and essential for the production of EPS (Boels et al., 2004). EpsI is believed to be responsible for the polymerization, and the epsJ and epsK gene products play a role in the export of EPS.

Most L. lactis strains do not contain an EPS gene cluster. The EPS cluster similar to that of strain KF147 was only found (in a conserved position on the chromosome) in seven other L. lactis strains, all plant isolates. Clusters with similar EPS genes are found in 10 other strains, mainly ssp. cremoris, but are located elsewhere on the chromosomes or on plasmids (**Supplementary Table 8/Sheet 1**). Putative plasmid-located EPS clusters are present in strains B40 [on pNZ4000 (van Kranenburg et al., 1997, 2000)], AM2, and N41. Of the chromosomal EPS cluster the epsX, epsA-D, epsK, and epsL genes are highly conserved in these strains. Only in strain LMG9446 are the epsA-D and the epsX genes absent. The genes epsE-J are highly variable between strains. Only strains KF147, KF146, and KF196 have an identical EPS cluster.

In L. lactis one teichoic acid (TA) biosynthesis gene cluster is known (**Supplementary Table 8/Sheet 3**). This cluster is completely absent in 7 L. lactis strains. In other strains the number of genes varies between 4 and 18; several groups can be made, based on the presence and absence of these genes. Of these groups, the cremoris/cremoris genotype/phenotype strains presumably do not have a functional TA biosynthesis, since most genes are absent or pseudogenes, and the essential export genes tagG and tagH are absent. These appear to have been replaced by transposases.

#### Prophages

Bacteriophages are the leading cause of fermentation problems in the dairy industry, with L. lactis phages and the identification of phage-resistant strains being a focus of study since the 1930s (Lawrence et al., 1978). Many L. lactis strains are known to be lysogenic, and analysis of eight complete genome sequences for P335-type prophages identified seven chromosomal integration locations (Kelly et al., 2013). The presence of prophage or prophage remnants at these locations in the 43 genomes is shown in **Supplementary Table 10**. The 43 genomes were also searched for prophage-specific genes (encoding portal or terminase proteins) which resulted in identification of a further five integration locations (**Supplementary Table 10**). Of the 43 genomes, 42 contain between one and five prophages. The most common location for phage integration is site 6 (23/43 strains) between the sunL and fmt genes, with all subspecies cremoris dairy starter strains having a prophage or a prophage remnant at this location. No prophage sequence could be identified in KW10 which originated from fermented corn. Kelleher et al. (2018) recently examined 30 complete L. lactis genomes and found a similar distribution of prophage sequences, although they did not check their genomic context. It can be concluded that prophages are an integral part of the genome of most L. lactis strains, and that they have co-evolved with their bacterial host. The inability of most L. lactis prophages to be induced (Kelleher et al., 2018) suggests they are stable residents within the genome, although they retain the potential to exchange genes with other P335-type phages or to mediate rearrangements of the bacterial chromosome (Kelly et al., 2013).

#### Protein Secretion Systems

Protein secretion has a key role in determining how bacteria interact with their environment and in lactic acid bacteria the majority of proteins are secreted by the conserved Sec pathway. However, several other bacterial secretion systems are known, and the genome of strain KW2 (isolated from fermented corn) encodes two additional systems which have not previously been described in this species.

SecA2 system: KW2 has a gene cluster (kw2\_0790-\_0804) located between mutY (OG\_1194) and pepV (OG\_0624) that makes up an accessory SecA system similar to those found in streptococci and several other Gram-positive species (Bensing et al., 2014). In KW2 these genes are predicted to encode proteins involved in the glycosylation and export of a large (2338 aa) serine-rich glycoprotein (**Supplementary Figure 5**). The secreted protein has an atypical signal sequence (TIGRFAM number TIGR03715) and an LPXTG-motif C-terminal cell-wall anchor. KW2 is the only strain where this gene cluster appears intact, but remnants of this locus are present at the same location in all the L. lactis ssp. cremoris strains as well as in the ssp. lactis strains M20, KF201, K337 and LMG8520.

Ess (ESX secretion) pathway: ESX protein (type VII) secretion systems were initially identified in Mycobacterium tuberculosis, and have subsequently been found in a variety of Gram-positive bacteria (Unnikrishnan et al., 2017). KW2 has a gene cluster (kw2\_1252-\_1268) that includes the secretion proteins EssA-C and a WXG100 family protein that characterize this category of secretion system (**Supplementary Figure 5B**). In KW2 this gene cluster is found next to the prophage integrated between sufB and fabL. Remnants of this locus are present at the same location in all the L. lactis ssp. cremoris strains, and found next to a prophage in strains A76 and V4. The ssp. lactis strain M20 (isolated from soil) has a similar but shorter (10 genes compared to 17 in KW2) intact ESX system at the same location.

The role of these secretion systems in L. lactis and the proteins they secrete is not known, but in other organisms similar systems are known to be involved in bacterial colonization and adhesion, and in several cases play an important role in bacterial pathogenesis (Bensing et al., 2014; Unnikrishnan et al., 2017). Curiously, there have been a small number of reports of human infections caused by L. lactis ssp. cremoris (Hadjisymeou et al., 2013), some of which also make a link to the consumption of unpasteurized dairy products. None of the infection-associated strains have had their genomes sequenced, but it would be valuable to know if these secretion systems occur in these strains.

#### Sex Factor

The sex factor of L. lactis is a chromosome-located mobile genetic element involved in high-frequency conjugation, which has previously been identified in L. lactis strains NCDO712 and MG1363 (Gasson et al., 1992; Godon et al., 1994; Wegmann et al., 2007). It can excise from the chromosome and was found to form plasmid co-integrates, e.g., with the lactose plasmid pLP712 (Gasson and Davies, 1980; Walsh and McKay, 1982; Godon et al., 1995). Sex factor integration and excision is site-specific, involving an identical 24-bp sequence on both the sex factor and the chromosome. The sex factor encodes a relaxase MobD which can nick duplex DNA and is essential for horizontal transfer of the element. Based on the presence of a mobD gene, excisionase/integrase and a 24-bp repeat sequence at the boundaries, we identified regions resembling a chromosome-located sex factor in L. lactis strains MG1363, NCDO763, ATCC19435, N42, KLDS, E34, and IO-1 (**Supplementary Table 11**). These regions varied considerably in size (about 40–80 kb), chromosomal insertion position, and composition of encoded functions. The sex factor of strain NCDO763 is basically identical to that of strain MG1363, but lacks 12 consecutive genes including a 6-gene ter operon encoding a membrane-associated stress response complex (Anantharaman et al., 2012). The sex factor of MG1363/NCDO763 encodes a cell-membrane-anchored protein CluA that facilitates cell-to-cell contact and can cause a cell-aggregation/clumping phenotype (Godon et al., 1995), but this gene is absent in the other sex factors.
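One of the three criteria above, an identical short repeat at both boundaries of a candidate region, can be illustrated with a simple exact-match scan. This is a sketch under stated assumptions, not the procedure used in the study: the function name, parameters and toy sequence are hypothetical, the example is scaled down (an 8-bp repeat 28 bp apart rather than a 24-bp repeat some 40–80 kb apart), and the mobD and excisionase/integrase criteria would still have to be checked from the annotation.

```python
from collections import defaultdict


def find_flanking_repeats(seq: str, k: int, min_gap: int, max_gap: int):
    """Return sorted (start1, start2, kmer) for identical k-mers whose
    start positions lie between min_gap and max_gap apart (inclusive)."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)  # index every k-mer by start position

    hits = []
    for kmer, starts in positions.items():
        for idx, first in enumerate(starts):
            for second in starts[idx + 1:]:
                if min_gap <= second - first <= max_gap:
                    hits.append((first, second, kmer))
    return sorted(hits)


# Synthetic toy sequence: an 8-bp repeat flanking a 20-bp insert.
toy_seq = "CCCCC" + "GATTACAG" + "T" * 20 + "GATTACAG" + "AAAAA"
hits = find_flanking_repeats(toy_seq, k=8, min_gap=20, max_gap=40)
print(hits)
```

The distance window is what keeps trivial repeats, such as overlapping k-mers inside a homopolymer run, out of the result; on a real chromosome one would set `k=24` and a window matching the expected element size.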

The 71-kb sex factors of strains ATCC19435 and N42 are entirely identical, but share only 10 orthologs with strain MG1363. The 83-kb element of strain KLDS resembles the sex factor of the latter two strains (35 shared orthologs). These sex factors of strains ATCC19435, N42, and KLDS each encode abortive infection proteins AbiGI and AbiGII, and are flanked by genes encoding an excisionase and an integrase, which is typical of a transposon-like element. A putative sex factor on a 66-kb contig of strain LMG8526 shares only 20 orthologs with strain MG1363; this contig has a higher sequence coverage and encodes various plasmid replication proteins, suggesting it could be a plasmid co-integrate instead. Putative sex factors in the plasmid-free strains E34 and IO-1 are smaller but similar and share 25 orthologs.

All these putative chromosomal sex factors carry the essential relaxase mobD gene, which is interrupted by a group II intron (Dunny and McKay, 1999) in strains MG1363, NCDO763, and KLDS, but not in the other strains. The mobD gene with group II intron also appears to be present on plasmids in all dairy L. cremoris strains and in L. lactis UC317. The group II intron encodes a reverse transcriptase/maturase LtrA which plays an enzymatic role in splicing and genetic mobility.

### Phenotype/Genotype Matching

A large number of the strains described in this study have been extensively studied for the presence of different phenotypes related to carbon source utilization, metal resistance and antibiotic resistance (Bayjanov et al., 2013). These phenotypes were previously matched with genotype information based on comparative genome hybridization (CGH) data obtained from microarrays containing the genes of a selected number of reference strains [IL1403, MG1363 and incomplete genomes of KF147 and KF282 (Bayjanov et al., 2009)]. In the original study, several phenotypes could be matched with genetic differences in the strains, explaining the observed phenotype. For example, genes were identified that are associated with the ability of the strains to grow on arabinose, sucrose, lactose and melibiose as well as the resistance to copper and arsenite (Bayjanov et al., 2013).

Here we re-examined the phenotypes from Bayjanov et al. (2013) by matching them with the draft genome sequences. As the genome data are more comprehensive than the CGH data from the previous study, this should allow more genotype/phenotype matches to be found.

#### Carbon Source Utilization

In addition to the phenotypes described by Bayjanov et al. (2013), we identified 5 more phenotype/genotype matches with other carbon sources. For gentiobiose, starch, ribose and salicin, clear matches were found with genes in the genomes that have a link with carbon source utilization. For gentiobiose, which is a glucose disaccharide, several carbohydrate utilization genes were found to be correlated with the phenotype: a uronate isomerase (EC 5.3.1.12), xylulokinase, aldose-1-epimerase and a xylose transporter. Interestingly, apart from the xylose utilization genes, all the genes are located at different positions on the genome. Within the xylose gene cluster, multiple other genes related to xylose degradation were found, including a beta-xylosidase. Whether this gene cluster could also (or primarily) act on gentiobiose would need to be validated.

For both starch and ribose, we identified a ribokinase among the best scoring hits in the gene-trait matching. Interestingly, this ribokinase is actually found in most strains, but is truncated in four strains (AM2, FG2, HP, and LMG6897) and absent in LMG8520. The only strain with a complete copy of the ribokinase that was not positive for growth on ribose was strain SK11, which could point to another gene being absent or disrupted in this strain. In-depth analysis of the ribose metabolism genes showed that the permease component of the ribose ABC transporter (OG\_1649) has been truncated in SK11. The genotype/phenotype relation found in this particular case was not identified in the original study because the truncated gene was still detected as present in the CGH data. Strain KF7 is also unable to grow on starch, although the ribokinase and the ABC transporter are present on the genome.
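The effect of truncated genes on gene-trait matching can be sketched with a toy concordance score. This is not the scoring method used in the study; it is a minimal illustration that assumes a simple fraction of strains in which the gene call agrees with the growth phenotype. The strain labels `s1` to `s6`, the gene states and the phenotypes are all made up.

```python
def concordance(gene_call: dict, phenotype: dict) -> float:
    """Fraction of shared strains where the gene call equals the phenotype."""
    strains = sorted(set(gene_call) & set(phenotype))
    if not strains:
        raise ValueError("no strains in common")
    agree = sum(gene_call[s] == phenotype[s] for s in strains)
    return agree / len(strains)


# Hypothetical gene states and growth phenotypes for six made-up strains.
gene_state = {"s1": "intact", "s2": "intact", "s3": "truncated",
              "s4": "truncated", "s5": "absent", "s6": "intact"}
growth = {"s1": True, "s2": True, "s3": False,
          "s4": False, "s5": False, "s6": True}

# Hybridization-like call: a truncated gene still counts as present.
hybridization_call = {s: state != "absent" for s, state in gene_state.items()}
# Sequence-based call: only an intact gene counts as present.
sequence_call = {s: state == "intact" for s, state in gene_state.items()}

hybridization_score = concordance(hybridization_call, growth)
sequence_score = concordance(sequence_call, growth)
print(f"hybridization-like: {hybridization_score:.2f}, sequence-based: {sequence_score:.2f}")
```

In this toy case the two truncated genes lower the hybridization-like score while the sequence-based call matches the phenotype in every strain, which mirrors the reasoning given above for why the match was missed in the CGH data.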

Growth on salicin is correlated with 15 genes with approximately the same presence/absence pattern. Among these genes is a cluster of sugar ABC transport genes present in combination with an alpha-mannosidase. This transporter could well function as a transport system for salicin, while the mannosidase could cleave the salicyl alcohol from the glucose. Alternatively, salicin could be degraded by GH1 family glycoside hydrolases. Several of these systems have been identified at different locations on the genomes of the different L. lactis strains.

#### Heavy Metal Resistance

The original phenotype/genotype matching paper already describes the presence of copper and arsenite resistance genes in the genomes of several L. lactis strains. Those observations were confirmed when performing the GTM on the genome sequences. In addition, we could identify a set of genes related to resistance to cadmium, another heavy metal that was described in the results of Bayjanov et al. (2013). The highest scoring gene in that gene-trait matching is annotated as a cadmium-transporting ATPase. Combined with an efflux accessory protein, which is also among the top-10 scoring genes, this transport system could function as a highly efficient cadmium resistance system. Manual inspection of the GTM results also showed why this genotype-to-phenotype matching was not successful in the original publication: these transport genes are not found in the reference genomes present on the CGH microarray.

### DISCUSSION

The comparative analysis of 43 Lactococcus lactis genomes revealed ∼7800 orthologous groups (OGs) and highlights the extensive pangenome of this species. This is considerably higher than the 3877 OGs found in an earlier comparative genome hybridization study that used many of the same strains (Bayjanov et al., 2010, 2013; Siezen and van Hylckama Vlieg, 2011), but used arrays based on the sequences of only 5 strains. The pangenome size reported here is also almost 1900 OGs larger than that reported by Kelleher et al. (2017), which was based on the genome sequences of 30 lactococcal strains. Plasmids were not included in their analysis. Out of the 7800 OGs identified in our study, 879 were of plasmid origin, which explains part of the discrepancy between our analysis and that of Kelleher et al. (2017). In their study 22 out of the 30 strains were dairy isolates. Our study includes 24 non-dairy isolates and the larger pangenome size found here is presumably caused by additional OGs found in the non-dairy strains. As the pangenome of L. lactis is much larger than that of other lactic acid bacteria (Smokvina et al., 2013), the question arises whether this is the result of horizontal gene transfer or a relatively broad species definition. The analysis of the pangenome per subspecies showed that their sizes are 4968 and 6545 OGs for ssp. cremoris and ssp. lactis, respectively, arguing for L. lactis having a rather broad species definition. Another difference in the assessment of the pan- and core genome as presented here compared to the Kelleher study is that our pangenome levels off toward a maximum number of OGs, while it seems to keep increasing in the Kelleher paper. One explanation for the difference in pangenome graphs could be that the genomes used here represent a broader population of L. lactis, resulting in more overlap in gene content between individual genomes. Another difference between the analyses in the two studies is that all genomes compared in this study were re-annotated with the same pipeline. This results in less bias in the gene calling between the different genomes and more overlap in shared genes.
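What it means for a pangenome to level off can be shown with a small accumulation curve. The sketch below assumes each genome is represented as a set of orthologous-group identifiers; the function name, the `og1` to `og5` labels and the three toy genomes are hypothetical, and the study's own pipeline is not reproduced here.

```python
def accumulation_curve(genomes: list, order: list) -> list:
    """Return (pangenome size, core genome size) after adding each genome
    in the given order. Each genome is a set of orthologous-group IDs."""
    pan = set()
    core = None
    curve = []
    for idx in order:
        pan |= genomes[idx]  # union: every OG seen so far
        # intersection: OGs present in every genome added so far
        core = set(genomes[idx]) if core is None else core & genomes[idx]
        curve.append((len(pan), len(core)))
    return curve


# Three made-up genomes, for illustration only.
toy_genomes = [
    {"og1", "og2", "og3"},
    {"og1", "og2", "og4"},
    {"og1", "og3", "og4", "og5"},
]
curve = accumulation_curve(toy_genomes, order=[0, 1, 2])
print(curve)
```

A pangenome that levels off shows ever smaller increases in the first value as genomes are added. In practice such curves are usually averaged over many random genome orders, since a single order can be misleading.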

The broad occurrence of plasmid DNA especially in the dairy strains of L. lactis resulted in ∼10% of OGs being associated with plasmids. The observation that plasmids encode several functions that enable growth in milk makes them highly relevant for fermentation applications. While on one hand it makes sense that dairy relevant traits like lactose utilization are on mobile elements it is surprising that they have not been integrated into the genome in any of the analyzed strains. It would be interesting to trace the origin of the genes that mediate lactose fermentation with Enterococcus being the most likely candidate.
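As a rough check on the figures quoted in this Discussion (∼7800 OGs in total, 879 of plasmid origin, and a pangenome almost 1900 OGs larger than that of Kelleher et al.), the plasmid share can be worked out as follows. The rounded values are taken as given, so the last ratio is only indicative.

$$
\frac{879}{7800} \approx 0.113, \qquad 7800 - 879 = 6921, \qquad \frac{879}{1900} \approx 0.46
$$

That is, about 11% of OGs are plasmid-associated, in line with the ∼10% stated above, and plasmid-derived OGs would account for a little under half of the difference between the two pangenome estimates.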

The documented discrepancy between the genotype and the phenotype of the subspecies cremoris and lactis is still unresolved at the genome level, and in a recent paper Kelleher et al. (2017) suggest that resolving it might not be possible without the use of transcriptome and/or metabolome data. Our analysis shows that we were able to fully distinguish the true cremoris strains, which form a subclade of the complete cremoris subspecies in the set of strains used. Genes that are specifically lost (or are considered pseudogenes) in the true cremoris subclade include a maltose transport system and transcriptional regulators that are described to be involved in osmotic and temperature stress, which are properties used to distinguish the lactis and cremoris subspecies at a phenotypic level.

The development of dairying that followed the domestication of ruminant animals has had a major impact on the selection and evolution of sheep, goats and cattle (Larson and Fuller, 2014) and forage plants (Glémin and Bataillon, 2009). Humans have also been affected as evidenced by studies of the spread of genes for lactase persistence (Leonardi et al., 2012). It is believed that Neolithic humans would have been unable to digest the lactose in milk and so fermented dairy products would be an early development (Curry, 2013), with the earliest evidence for cheese making dating from the 6th millennium BC (Salque et al., 2013). Lactococcus lactis is a key organism involved in dairy fermentation and the strains used as industrial cheese starter cultures also show evidence of domestication. This domestication is best shown by L. lactis strains with the cremoris phenotype. These have no documented counterpart in nature, but by comparing the genomes of dairy and non-dairy strains we provide evidence of the genetic events that have shaped their evolution, resulting in highly specialized dairy strains. Generally, the wild relatives of industrially used microbes have not been studied in detail, but there is interest in how these valuable organisms arose and how they may be further improved. The dairy strains have gained the ability to use casein and lactose through acquisition of plasmid-encoded genes, but at the cost of an extensive multiplication of mobile genetic elements resulting in pseudogenes and the loss of numerous functions. Similar expansion of IS elements has been observed in host-restricted pathogens (Mira et al., 2006) where it is accompanied by loss of genes by deletion. The evolutionary outcome is that genes which are no longer needed become dispensable and are lost or degraded, a process described as reductive evolution (Kelleher et al., 2017). In this comparison we have focused on accessory protein secretion systems, carbohydrate utilization, exopolysaccharides and other genes that give rise to the cremoris phenotype. In all categories the genome of the dairy starter strains appears extensively degraded, predominantly as a consequence of insertion and deletion of mobile genetic elements. Not only have insertions and deletions occurred but there are also changes in gene order through inversions or translocations of parts of the lactococcal chromosome (Kelly et al., 2010). Clustering of these strains suggests that the industrially used starter cultures have all evolved from a small number of lineages, as has often been suggested from phage work. The population of cremoris phenotype strains is small and these strains can be seen as true domesticated microbes whose genome is now so degraded that they seem restricted to a man-made environment. The development of starter cultures and their industrial selection and use over the last century has contributed to this specialization. These observations explain why it has not proved possible to isolate novel L. lactis ssp. cremoris strains with properties similar to dairy starters from environmental sources.

Overall, the analysis of full genome sequences of a diverse set of Lactococcus lactis strains allowed us to identify niche- and subspecies-specific genes that could not be identified earlier. Besides uncovering evolutionary relationships, the analysis of functional properties is anticipated to be useful for industrial strain discovery and selection processes.

### DATA AVAILABILITY STATEMENT

The datasets generated for this study can be found in Zenodo, https://doi.org/10.5281/zenodo.1471674.

### AUTHOR CONTRIBUTIONS

MW, RS, and HB conceived the study. All authors analyzed the data. MW, RS, WK, and HB wrote the manuscript.

### FUNDING

The project was funded by TI Food and Nutrition, a public-private partnership on pre-competitive research in food and nutrition. The public partners are responsible for the study design, data collection and analysis, decision to publish, and preparation of the manuscript. The private partners have contributed to the project through regular discussion.

### ACKNOWLEDGMENTS

We would like to thank Bernadet Renckens and Beerd-Jan Eibrink for help with genome analysis.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.00004/full#supplementary-material


**Conflict of Interest Statement:** MW, SvH, and HB are employed by NIZO Food Research B.V.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Wels, Siezen, van Hijum, Kelly and Bachmann. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Lactococcus lactis Pan-Plasmidome

Philip Kelleher<sup>1,2</sup>, Jennifer Mahony<sup>1,2</sup>, Francesca Bottacini<sup>2</sup>, Gabriele A. Lugli<sup>3</sup>, Marco Ventura<sup>3</sup> and Douwe van Sinderen<sup>1,2</sup>\*

<sup>1</sup> School of Microbiology, University College Cork, Cork, Ireland, <sup>2</sup> APC Microbiome Ireland, University College Cork, Cork, Ireland, <sup>3</sup> Laboratory of Probiogenomics, Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma, Italy

Plasmids are autonomous, self-replicating, extrachromosomal genetic elements that are typically not essential for growth of their host. They may encode metabolic capabilities, which promote the maintenance of these genetic elements, and may allow adaptation to specific ecological niches and consequently enhance survival. Genome sequencing of 16 Lactococcus lactis strains revealed the presence of 83 plasmids, including two megaplasmids. The limitations of Pacific Biosciences SMRT sequencing in detecting the total plasmid complement of lactococcal strains are examined, while a combined Illumina/SMRT sequencing approach is proposed to combat these issues. Comparative genome analysis of these plasmid sequences combined with other publicly available plasmid sequence data allowed the definition of the lactococcal plasmidome, and facilitated an investigation into (bio)technologically important plasmid-encoded traits such as conjugation, bacteriocin production, exopolysaccharide (EPS) production, and (bacterio)phage resistance.

#### Edited by:

Kimberly Kline, Nanyang Technological University, Singapore

#### Reviewed by:

Milan Kojic, University of Belgrade, Serbia María de Toro, Centro de Investigación Biomédica de La Rioja, Spain

\*Correspondence:

Douwe van Sinderen d.vansinderen@ucc.ie

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 17 January 2019 Accepted: 20 March 2019 Published: 04 April 2019

#### Citation:

Kelleher P, Mahony J, Bottacini F, Lugli GA, Ventura M and van Sinderen D (2019) The Lactococcus lactis Pan-Plasmidome. Front. Microbiol. 10:707. doi: 10.3389/fmicb.2019.00707

Keywords: lactococcal, plasmid, SMRT sequencing, dairy fermentation, conjugation, lactic acid bacteria

## INTRODUCTION

**Abbreviations:** BLAST, basic local alignment search tool; CDS, coding sequence; HCL, hierarchical clustering; MCL, Markov clustering algorithm; NGS, next generation sequencing; ORF, open reading frame; PFGE, pulsed-field gel electrophoresis; qPCR, quantitative polymerase chain reaction; R-M, restriction modification; SMRT, single molecule real time sequencing.

Lactococcus lactis is globally applied as a starter culture for dairy-based food fermentations, such as those involved in the production of Cheddar, Colby, Gouda and blue cheeses, and from an economic and (food) biotechnological perspective represents one of the most important bacteria (Ainsworth et al., 2014a). It is widely accepted that L. lactis originated from a plant-associated niche (Price et al., 2012; Wels et al., 2019) and, whilst the majority of sequenced lactococcal representatives are isolated from the dairy environment, this is not representative of the presumed diversity of the taxon. It is evident from genome analyses of L. lactis strains isolated from the dairy niche that genome decay (due to functional redundancy) (Makarova et al., 2006; Goh et al., 2011; Ainsworth et al., 2013; Kelleher et al., 2015, 2017), in parallel with the acquisition of novel plasmid-encoded traits, played a significant role in their adaptation to the nutrient-rich environment of milk. Analysis of the plasmid complement has revealed a relatively low abundance of plasmids among lactococcal strains isolated from non-dairy niches (Makarova et al., 2006; Kelly et al., 2010; Ainsworth et al., 2013, 2014c). Since various dairy-associated phenotypes are encoded by plasmids, horizontal acquisition to adapt to the dairy environment is likely to be one of the major drivers of plasmid transfer in L. lactis (Ainsworth et al., 2014c), with dairy strains containing up to twelve plasmids (Van Mastrigt et al., 2018a). Plasmid transfer in L. lactis is believed to be predominantly governed by conjugation and transduction (Ainsworth et al., 2014c), but may also occur as a result of transformation (David et al., 2017; Mulder et al., 2017). Transduction is a process in which DNA transfer is carried out by a (bacterio)phage (i.e., a virus that infects a bacterium) due to unintentional packaging of host DNA, and has previously been observed in L. lactis (Ammann et al., 2008; Wegmann et al., 2012). Conjugation involves the transfer of plasmid material via a conjugative apparatus (Grohmann et al., 2003) and is of particular importance as it represents a natural phenomenon that is suitable for the transfer of genetic traits such as phage resistance systems in food grade processes, bacteriocin production (including nisin), proteinases, and citrate utilization (Neve et al., 1987; Kojic et al., 2005; Mills et al., 2006; Van Mastrigt et al., 2018b).

Extensive research into the technological traits of L. lactis has been carried out in the past with a significant focus on lactose utilization (Van Rooijen and De Vos, 1990; Van Rooijen et al., 1992), casein metabolism (Siezen et al., 2005), citrate metabolism (Drider et al., 2004; Van Mastrigt et al., 2018a), flavor formation (McSweeney and Sousa, 2000; McSweeney, 2004), and phage resistance mechanisms (Labrie et al., 2010), all of which represent properties that are commonly plasmid-encoded. Lactose utilization in L. lactis is governed by the lac operon, which provides dairy strains with the ability to rapidly ferment lactose and grow in milk. The L.
lactis lac operon, which consists of the genes lacABCDEFGX, is generally plasmid-borne and is regulated by a repressor, encoded by the adjacent lacR gene (Van Rooijen and De Vos, 1990; Van Rooijen et al., 1992). Citrate metabolism is conducted by citrate-positive (Cit+) lactococci and is important as it leads to the production of a number of volatile flavor compounds (McSweeney and Sousa, 2000). Citrate uptake and subsequent diacetyl production is governed by the plasmid-encoded citQRP operon in lactococcal species (Drider et al., 2004). Proteolysis also significantly contributes to flavor production in fermented dairy products, although high levels of proteolysis may cause bitterness in cheese (Broadbent et al., 2002). The plasmid-encoded extracellular cell wall proteinase (lactocepin) has been shown to be directly associated with the bitter flavor defect in Cheddar cheese varieties, specifically involving starters which produce lactocepin of the so-called a, e, or h groups, and its characterization is of particular importance when selecting novel starter cultures (Broadbent et al., 2002).

Lactococcal phages are recognized as the main cause of fermentation problems within the dairy industry with concomitant economic problems. Lactococcal strains possess an arsenal of phage defense mechanisms, such as R-M systems and abortive infection (Abi) systems, many of which are plasmid-encoded. In the current study, we assess the genetic content of lactococcal plasmids, define the current pan-plasmidome of L. lactis, and investigate plasmid-encoded (and technologically relevant) traits.

### MATERIALS AND METHODS

#### Sequencing

In total, 83 plasmids (81 plasmids and 2 megaplasmids, the latter defined as plasmids that are >100 Kbp in length) were sequenced in the context of this study (**Table 1**). Sequencing of sixteen lactococcal strains was performed as previously described (Kelleher et al., 2017) utilizing the SMRT sequencing approach on a Pacific Biosciences RS II sequencing platform (executed by GATC Biotech Ltd., Germany). De novo assemblies were performed on the Pacific Biosciences SMRTPortal analysis platform (version 2.3.1), utilizing the RS\_HGAP\_Assembly.2 protocol. Assemblies were then repeated with a reduced minimum coverage threshold adjusted to 15X to ensure all plasmid-associated contigs had been detected.

In parallel with SMRT sequencing, an Illumina-based approach was applied to the sixteen lactococcal strains to identify strains where plasmids were potentially absent from the completed assemblies. Re-sequencing of genomes was performed on an Illumina MiSeq platform (executed by GenProbio S.R.L., Parma, Italy), to an average coverage of ∼100–125×. Sequences obtained were first quality checked using IlluQC.pl from the NGS QC Toolkit (v2.3) (Patel and Jain, 2012) and assembled with ABySS (v1.9.0) (Simpson et al., 2009). Contigs absent from the SMRT assemblies were then identified based on whole genome alignments. Remaining low quality regions and sequence conflicts were resolved by primer walking and Sanger sequencing of PCR products (performed by Eurofins MWG Operon, Germany).

### General Feature Predictions

Annotation of plasmid sequences was performed on both newly sequenced and publicly available plasmid sequences using the following protocol. Prediction of ORFs (an ORF being defined as a continuous stretch of codons without a stop codon) was performed with Prodigal v2.5 prediction software<sup>1</sup> with a general minimum cutoff of >50 bp and confirmed using BLASTX v2.2.26 alignments (Altschul et al., 1990). ORFs were automatically annotated using BLASTP v2.2.26 (Altschul et al., 1990) analysis against the non-redundant protein databases curated by the National Center for Biotechnology Information (NCBI)<sup>2</sup>. The Artemis v16 genome browser and annotation tool was used to manually curate identified ORFs<sup>3</sup> and for the combination and inspection of ORF results. The final ORF annotations were refined where necessary using additional software tools and database searches, such as Pfam (Bateman et al., 2004), Uniprot/EMBL<sup>4</sup> and Bagel3 (Van Heel et al., 2013).
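The ORF definition used above (a stop-free run of codons above a minimum length) can be illustrated with a minimal sketch. This is not how Prodigal works, since Prodigal also models start codons, ribosome binding sites and both strands; the function name `find_orfs` and the restriction to the forward strand are assumptions made purely for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=50):
    """Return (start, end, frame) for stop-free codon runs longer than min_len bp.

    Illustrative only: scans the three forward-strand frames and ignores
    start codons, the reverse strand and any gene-model scoring.
    Coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current run of codons.
                if pos - run_start > min_len:
                    orfs.append((run_start, pos, frame))
                run_start = pos + 3
        # Close the run that reaches the last complete codon of this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - run_start > min_len:
            orfs.append((run_start, end, frame))
    return orfs
```

Candidate ORFs from such a scan would still need confirmation, which in the protocol above is done by BLASTX alignment and manual curation.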

### Pan-Plasmidome Analysis

Pan-plasmidome analysis was performed utilizing the PGAP v1.0 pipeline (Zhao et al., 2012) according to the Heaps' law pan-genome model (Tettelin et al., 2005). The ORF content of each plasmid was organized into functional gene clusters via the Gene Family method. ORFs which produced an alignment with a minimum of 50% sequence identity across 50% of the gene or protein length (both nucleotide and amino acid sequences are applied in parallel) were clustered and a pan-plasmidome profile was subsequently generated (Tettelin et al., 2005).

<sup>1</sup>http://compbio.ornl.gov/prodigal/

<sup>2</sup>https://www.ncbi.nlm.nih.gov/

<sup>3</sup>http://www.sanger.ac.uk/science/tools/artemis

<sup>4</sup>http://www.uniprot.org/

#### TABLE 1 | Characteristics of the plasmids analyzed in this study.

<sup>∗</sup>Plasmids sequenced in the context of the current study (PacBio SMRT). \$Plasmids sequenced in the context of the current study (Illumina MiSeq).
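The clustering criterion of the Gene Family method (at least 50% identity across 50% of the sequence length) can be sketched as follows. The sketch swaps in simple single-linkage grouping with a union-find structure, which is not the algorithm PGAP itself uses; the function name, the input layout and the choice to measure coverage against the longer sequence are all assumptions made for illustration.

```python
def cluster_families(orf_lengths, hits, min_identity=50.0, min_coverage=0.5):
    """Group ORFs into families by single linkage over passing alignments.

    orf_lengths: dict mapping ORF name -> sequence length.
    hits: iterable of (query, subject, percent_identity, alignment_length).
    Coverage is measured against the longer sequence (an assumption).
    """
    parent = {name: name for name in orf_lengths}

    def find(x):
        # Follow parent links to the family representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, aln_len in hits:
        longer = max(orf_lengths[query], orf_lengths[subject])
        if identity >= min_identity and aln_len / longer >= min_coverage:
            parent[find(query)] = find(subject)

    families = {}
    for name in orf_lengths:
        families.setdefault(find(name), set()).add(name)
    # Largest families first; ties broken by the smallest member name.
    return sorted(families.values(), key=lambda f: (-len(f), min(f)))
```

Single linkage tends to chain loosely related sequences into one family, so family counts from this sketch would not be expected to reproduce the published figures.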

#### Comparative Genomics

Tandem Repeats Finder v4.02 (Benson, 1999) was applied to identify nucleotide tandem repeats at a potential plasmid origin of replication. Plasmids were assigned a theta mode of replication where the gene encoding the replication protein is preceded by 3.5 iterations of a 22 bp tandem repeat, with an A/T-rich 10 bp direct repeat located further upstream (Kiewiet et al., 1993). Alternatively, plasmids that replicate by rolling circle replication (RCR) can be identified because they rely on a replication protein and a double-stranded origin of replication (dso). Putative dso replication sites were identified based on nucleotide conservation to previously identified dso's, containing a nic site composed of one or more inverted repeats, and a Rep-binding site consisting of 2–3 direct repeats or an inverted repeat (Del Solar et al., 1993; Mills et al., 2006).
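As a rough illustration of the theta-type iteron signature described above (3.5 iterations of a 22 bp repeat), the following sketch searches for exact tandem repeats of a fixed unit length. Tandem Repeats Finder uses a probabilistic model that tolerates mismatches and indels, so this exact-match scan is a simplified stand-in; the function name and defaults are assumptions, and the upstream A/T-rich direct repeat is not checked.

```python
def find_tandem_repeats(seq, unit_len=22, min_copies=3.5):
    """Return (start, end, copies) for exact tandem repeats of unit_len bp.

    A position belongs to the repeat while it matches the base unit_len
    positions earlier, so partial final copies (e.g. 3.5) are counted.
    Coordinates are 0-based, end-exclusive.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    i = 0
    while i + unit_len < n:
        j = i + unit_len
        # Extend while the sequence stays periodic with period unit_len.
        while j < n and seq[j] == seq[j - unit_len]:
            j += 1
        copies = (j - i) / unit_len
        if copies >= min_copies:
            hits.append((i, j, copies))
            i = j
        else:
            i += 1
    return hits
```

A hit located immediately upstream of a predicted replication gene would then be a candidate theta-type origin to inspect manually.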

All sequence comparisons at protein level were performed via all-against-all, bi-directional BLAST alignments (Altschul et al., 1990). An alignment cut-off value of >50% amino acid identity across 50% of the sequence length was used (with an associated E-value of <0.0001). For analysis and clustering of these results, the MCL was implemented in the mclblastline pipeline v12-0678 (Enright et al., 2002). TM4 MeV, MultiExperiment Viewer v4.9 was used to view MCL clustering data and conduct hierarchical clustering (HCL)<sup>5</sup>. The HCL analysis was exported from TM4 MeV in Newick tree format and visualized using ITOL (Interactive Tree of Life) (Letunic and Bork, 2016).
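The bi-directional filtering step can be sketched as below, assuming BLAST hits in the standard 12-column tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The function name, the separate `lengths` dictionary and the use of the longer sequence for coverage are assumptions; the actual study passed its results to the mclblastline pipeline rather than to code like this.

```python
def reciprocal_hits(blast_lines, lengths, min_identity=50.0,
                    min_coverage=0.5, max_evalue=1e-4):
    """Return protein pairs whose alignments pass the cut-offs in both directions.

    blast_lines: iterable of tab-separated BLAST tabular lines (12 columns).
    lengths: dict mapping protein name -> length in amino acids.
    """
    passing = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # ignore self-hits
        identity = float(fields[2])
        aln_len = int(fields[3])
        evalue = float(fields[10])
        coverage = aln_len / max(lengths[query], lengths[subject])
        if identity > min_identity and coverage >= min_coverage and evalue < max_evalue:
            passing.add((query, subject))
    # Keep a pair only if the reverse comparison also passed.
    return {tuple(sorted(pair)) for pair in passing if (pair[1], pair[0]) in passing}
```

The resulting pairs could be written out as an edge list for a graph clustering tool; the exact input format expected by MCL is not shown here.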

#### Pulsed Field Gel Electrophoresis (PFGE)

Lactococcus lactis subsp. cremoris strains JM1 and JM2 were cultured in M17 broth (Oxoid) supplemented with 0.5% (w/v) lactose at 30°C without agitation overnight. PFGE plugs were then prepared and restricted with S1 nuclease (Thermo Fisher Scientific, Ireland) as previously described (Bottacini et al., 2015).

<sup>5</sup>http://www.tm4.org/mev.html

A 1% (wt/vol) PFGE agarose gel was prepared in 0.5X TBE [89 mM Tris-borate, 2 mM EDTA (pH 8.3)] buffer and the PFGE plugs were melted in and sealed with molten agarose in 0.5X TBE buffer. A CHEF-DR III pulsed-field system (Bio-Rad Laboratories, Hercules, CA, United States) was used to resolve the DNA fragments at 6 V/cm for 18 h in 0.5X TBE running buffer maintained at 14°C with linear increment (interpolation) of pulse time from 3 to 50 s. DNA ladder (Chef DNA lambda) was included in each gel (number 170-3635; Bio-Rad Laboratories). The gels were stained in ethidium bromide (10 mg/ml) (25 µl/500 ml dH<sub>2</sub>O) for 120 min under light-limited conditions and destained in distilled water for 60 min. Gels were visualized by UV transillumination.

#### Bacteriocin Assays

Lactococcal strains were cultured in M17 broth (Oxoid) supplemented with 0.5% (w/v) lactose or glucose (strain-dependent) at 30°C without agitation overnight. Overnight culture (3 µl) was spotted on M17 agar supplemented with 0.5% (w/v) glucose and left at 30°C overnight. Cells that had grown on the spotted areas were inactivated by exposure to UV light for 30 min. Plates were then overlaid with a semi-solid M17 agar (0.4% agarose) containing indicator strain L. lactis HP. Zones of inhibition were visualized and measured after 24 h.

### Genbank Accession Numbers of Applied Strains

Lactococcus lactis subsp. lactis IL1403: AE005176; L. lactis subsp. lactis IO-1: AP012281; L. lactis subsp. lactis 184: CP015895; L. lactis subsp. lactis 229: CP015896; L. lactis subsp. lactis 275: CP015897; L. lactis subsp. lactis UC06: CP015902; L. lactis subsp. lactis UC08: CP015903; L. lactis subsp. lactis UC11: CP015904; L. lactis subsp. lactis UC063: CP015905; L. lactis subsp. lactis UC77: CP015906; L. lactis subsp. lactis UL8: CP015908; L. lactis subsp. lactis C10: CP015898; L. lactis subsp. cremoris SK11: CP000425; L. lactis subsp. cremoris MG1363: AM406671; L. lactis subsp. cremoris NZ9000: CP002094; L. lactis subsp. cremoris A76: CP003132; L. lactis subsp. cremoris UC509.9: CP003157; L. lactis subsp. cremoris KW2: CP004884; L. lactis subsp. cremoris 158: CP015894; L. lactis subsp. cremoris UC109: CP015907; L. lactis subsp. cremoris JM1: CP015899; L. lactis subsp. cremoris JM2: CP015900; L. lactis subsp. cremoris JM3: CP015901; L. lactis subsp. cremoris JM4: CP015909; L. lactis subsp. cremoris 3107: CP031538; L. lactis subsp. cremoris IBB477: CM007353; L. lactis subsp. lactis A12: LT599049; L. lactis subsp. lactis biovar. diacetylactis FM03: CP020604; L. lactis subsp. lactis 14B4: CP028160; and L. lactis subsp. cremoris HP: JAUH00000000.1.

### RESULTS

#### Plasmid Sequencing

In this study the sequences of 83 plasmids were elucidated utilizing a combined PacBio SMRT sequencing and Illumina MiSeq approach; these represent the detected plasmid complement of 16 lactococcal genomes (Kelleher et al., 2017). Initially, 69 plasmids were identified from the SMRT sequencing data by modifying the RS\_HGAP\_assembly protocol in SMRT portal to a reduced minimum coverage cut-off of 15-fold. To ensure complete coverage of the full plasmid complement, the complete genomes of all 16 strains were re-sequenced utilizing an Illumina MiSeq approach, which resulted in the elucidation of a further 14 plasmids (indicated in **Table 1**) that had not been detected based on the original SMRT assemblies. These 14 plasmids ranged in size from 6 to 62 Kbp, indicating that their absence from the SMRT dataset was in the majority of cases not associated with exclusion from the library based on their small size. Therefore, it was hypothesized that the absence of some plasmids from the SMRT dataset was either due to a lower plasmid copy number (SMRT library preparation does not incorporate an amplification step) or due to a bias in the DNA extraction protocol. Conversely, no plasmids present in the SMRT assemblies were absent from the Illumina data. However, Illumina sequencing generated heavily fragmented assemblies (∼100–250 contigs per strain), making elucidation of complete plasmid sequences, particularly for larger plasmids, significantly more challenging if not impossible. The main advantage of SMRT technology is the long read length it achieves. Due to the high frequency of repetitive transposable elements, assembly of lactococcal genomes and plasmids is cumbersome. SMRT sequencing was shown to be very useful in obtaining reliable and accurate assemblies, being particularly beneficial for assembling larger lactococcal plasmids, which frequently possess a mosaic type structure and contain multiple identical IS elements (Ainsworth et al., 2014c). Therefore, a combined sequencing approach is suggested as the most effective strategy for the complete sequencing of lactococcal strains.

#### General Plasmid Features

The sequenced plasmid dataset was combined with a further one hundred and seven plasmids retrieved from the NCBI database (National Center for Biotechnology Information) (**Table 1**). In total, the features of one hundred and ninety plasmids derived from fifty-three lactococcal strains, in addition to seventeen lactococcal plasmids without an assigned strain, were investigated. This extra-chromosomal DNA complement amounts to 4,987 Kbp of DNA and is predicted to represent 4,905 CDSs (i.e., ORFs that encode protein products), thus contributing very substantially to the overall genetic content of L. lactis.

The vast majority of currently sequenced plasmids originate from strains that were isolated from the dairy niche (149 out of 190 analyzed plasmids). These dairy lactococci carry between one and twelve plasmids (the latter in L. lactis biovar. diacetylactis FM03P), accounting for up to 355 Kbp of extra-chromosomal DNA in a given strain (as is the case for L. lactis JM1). The size of individual lactococcal plasmids varies widely from the smallest L. lactis KLDS4.0325 plasmid 2, with a size of 0.87 Kbp, to the two megaplasmids, each maintained by L. lactis JM1 and L. lactis JM2, with a size of 193 and 113 Kbp, respectively. The GC content of lactococcal plasmids ranges from ∼30–38%, whilst the average GC content of previously sequenced chromosomes is more constrained (34–36%). Only three lactococcal plasmids deviate from this range: pWC1 (29.48%), pIL105 (29.79%), and pHP003 (40.05%), where the latter is closer to Streptococcus thermophilus genomic GC content, which ranges from 39 to 40% (Fernández et al., 2011).
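The GC-content comparison above amounts to a simple calculation per plasmid followed by a range check. A minimal sketch follows; the function names are hypothetical, the 30–38% bounds are taken from the approximate range quoted in the text, and the in-range entry in the usage example is invented for illustration only.

```python
def gc_percent(seq):
    """GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def outside_range(gc_by_plasmid, low=30.0, high=38.0):
    """Names of plasmids whose GC content falls outside [low, high] percent."""
    return sorted(name for name, gc in gc_by_plasmid.items()
                  if gc < low or gc > high)


# GC values for the three outliers are those reported in the text;
# "hypothetical_plasmid" is an invented in-range example.
reported = {"pWC1": 29.48, "pIL105": 29.79, "pHP003": 40.05,
            "hypothetical_plasmid": 34.0}
outliers = outside_range(reported)
```

Ambiguous bases such as N are excluded from the denominator here, which is one of several reasonable conventions.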

Lactococcal plasmids are known to replicate via either of two alternative methods, RCR or theta-type replication (Mills et al., 2006; Ainsworth et al., 2014c). Based on predicted plasmid replication proteins/origins it appears that the majority of lactococcal plasmids (174 of the current data-set) replicate via the theta-type mechanism, while only a small proportion appears to utilize RCR (sixteen of the current data-set). The relatively small number of plasmids utilizing RCR may be attributed to a number of factors, such as the fact that RCR plasmids can only support a limited replicon size (<10 Kbp), incompatibility with other RCR type plasmids (Leenhouts et al., 1991), and/or intrinsic structural and segregational instability (Ainsworth et al., 2014c). In three instances, the analysis identified plasmids for which the replication mode could not be clearly determined as the origin of replication of these plasmids did not conform to the typical origin of replication associated with RCR or theta replication.

#### Pan-Plasmidome Calculation

The pan-plasmidome calculation provides an overview of the overall genetic diversity of the L. lactis plasmidome, the latter representing the total plasmid content harbored by (sequenced) members of the L. lactis taxon. To calculate the pan-plasmidome, a pan-genome analysis approach was applied using the PGAP v1.0 pipeline (Zhao et al., 2012). The resultant pan-plasmidome graph (**Figure 1**) displays a curve that rises steadily as each of the one hundred and ninety plasmids included in the analysis is added, until a total pan-plasmidome size of 1,315 CDSs is reached. This trend indicates that the pan-plasmidome remains in a fluid or open state, and that, therefore, continued plasmid sequencing efforts will further expand the observed genetic diversity among lactococcal plasmids. The PGAP pipeline was also used to determine the core genome of the lactococcal plasmid sequence data set. Interestingly, no single CDS is conserved across all plasmids, resulting in an empty core genome.
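The idea behind the pan- and core-plasmidome curves can be sketched as a running union and intersection of gene families as plasmids are added. This is a simplified illustration and not the PGAP procedure: it follows a single addition order rather than sampling many orders, and it does not fit the Heaps' law model. The function name and the toy gene-family sets are assumptions.

```python
def accumulation_curve(families_by_plasmid, order=None):
    """Return [(pan_size, core_size), ...] after adding each plasmid in turn.

    families_by_plasmid: dict mapping plasmid name -> set of gene family IDs.
    order: optional list of plasmid names; defaults to the dict's own order.
    """
    order = list(order) if order is not None else list(families_by_plasmid)
    pan = set()
    core = None
    curve = []
    for name in order:
        families = families_by_plasmid[name]
        pan |= families                      # union grows the pan set
        core = set(families) if core is None else core & families
        curve.append((len(pan), len(core)))  # intersection shrinks the core
    return curve


# Toy data: family names are illustrative, not taken from the study.
toy = {"pA": {"rep", "mob"}, "pB": {"rep", "lacR"}, "pC": {"tra"}}
toy_curve = accumulation_curve(toy)
```

In the toy example the core shrinks to zero once a plasmid sharing no family with the others is added, which mirrors the empty core reported above.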

The L. lactis pan-genome, based on chromosomal sequences only, has previously been calculated to constitute 5,906 CDSs (Kelleher et al., 2017). When compared with the calculated lactococcal plasmidome (1,315 CDSs), it is obvious that the lactococcal plasmidome contributes very substantially to overall lactococcal genetic diversity.

#### MCL Analysis of the Pan-Plasmidome

To explore the genetic content of the one hundred and ninety plasmids employed in this study, all-against-all reciprocal BLASTP analysis and MCL (Markov clustering) was conducted (Altschul et al., 1990; Enright et al., 2002). The plasmidome was determined to comprise 885 protein families, of which 413 represented single member protein families, evidence of the divergent nature of the plasmid sequences. Furthermore, 421 of these families constitute hypothetical protein families, being represented by a total of 1,341 individual proteins. These hypothetical proteins encompass 22.7% of the total CDSs in the lactococcal plasmidome.

The second largest constituent of the lactococcal plasmidome is that represented by transposable elements. Transposable elements encompass 825 CDS, or 15.7% of the plasmidome, with members of the IS6, IS30, IS982, and ISL3 insertion families being among the most dominant genetic elements. These mobile elements are responsible for the transfer and recombination of DNA (Nicolas et al., 2007; Machielsen et al., 2011; Alkema et al., 2016) and are likely to contribute to a fluid lactococcal plasmidome.

Following MCL analysis, HCL of the pan-plasmidome was used to cluster plasmids based on their genetic content (**Figure 2**). The high level of diversity within the pan-plasmidome is demonstrated by the observed disparity within the HCL matrix. HCL analysis resulted in thirteen clusters with three outliers: pMPJM1, pWVO2, and pQA504 (**Figure 2B**). Plasmid pWVO2 encodes a single replication gene, pQA504 contains three CDS (rep gene, mob gene, and hypothetical gene), while pMPJM1 encodes 188 CDS and shares little homology with other lactococcal plasmids. The remaining thirteen clusters did not display subspecies specificity, each cluster containing plasmids from both subsp. lactis and subsp. cremoris hosts.

#### Lactococcal Megaplasmids

Typically, L. lactis plasmids range in size from 1 to 50 Kbp, and, prior to this study, the largest plasmid identified in L. lactis was the self-conjugative megaplasmid of 155,960 bp in L. lactis subsp. lactis bv. diacetylactis S50 (Kojic et al., 2005). L. lactis S50 p7 represents the first lactococcal megaplasmid; it encodes genes for Proteinase PI and lactococcin A, and is part of a larger plasmid complement of 7 plasmids totaling 336 Kbp (Kojic et al., 2005). Recently (May 2018) the plasmid complement of L. lactis subsp. lactis KLDS 4.0325 (Yang et al., 2013) was updated in the public NCBI database with three additional plasmid sequences, the largest plasmid measuring 109 Kbp (plasmid 6). In the current study, whole genome sequencing efforts resulted in the identification of two plasmids that were larger than 100 Kbp, namely pMPJM1 (193 Kbp) and pMPJM2 (113 Kbp) from L. lactis JM1 and L. lactis JM2, respectively, which owing to their size are defined as megaplasmids (Anton et al., 1995; Barton et al., 1995; **Figures 3A,B**). Pulsed field gel electrophoresis also identified bands which would be consistent with plasmids of that size, although unambiguous validation will require Southern hybridization analysis (**Figure 3C**).

The larger of the two megaplasmids, pMPJM1, encompasses 186 CDSs and is presumed to replicate (as expected for such a large replicon) via the theta-type replication mechanism [based on the identification of the origin of replication (ori), comprised of an AT-rich region plus three and a half iterons of 22 bp in length] (Seegers et al., 1994). pMPJM1 encompasses, among others, gene clusters predicted to be responsible for (exo)polysaccharide biosynthesis, conjugation and nisin resistance, while it also specifies an apparently novel type I RM shufflon system (as well as a high proportion of unique/hypothetical CDSs). The overall sequence of the plasmid shows little homology to previously sequenced plasmids in the NCBI databases; however, it shares 24% sequence coverage with 99% nucleotide identity to the other identified megaplasmid, pMPJM2, which indicates that they share a common ancestor. pMPJM2 encodes 123 CDSs, and BLASTN analysis identified sequence identity to a number of different lactococcal plasmids, indicating a mosaic genetic structure commonly seen in large lactococcal plasmids (Ainsworth et al., 2014c). pMPJM2 also encodes a putative conjugation operon and a very close homolog of the type I RM shufflon system of pMPJM1. The third lactococcal megaplasmid, KLDS 4.0325 plasmid 6 (109 Kbp), encodes 119 CDSs including the lac operon and associated opp oligopeptide uptake system.

#### Technological Properties

Strains of L. lactis are commonly used as starter cultures employed by the dairy industry (Beresford et al., 2001), and their dairy adaptations such as citrate metabolism and lactose utilization are frequently plasmid-encoded. In L. lactis, citrate uptake and subsequent diacetyl production is governed by the plasmid-encoded citQRP operon (Drider et al., 2004; Van Mastrigt et al., 2018b). In the current data set, only four plasmids contain the citQRP operon: L. lactis CRL1127 plasmid pCRL1127, L. lactis IL594 plasmid pIL2 (Górecki et al., 2011), L. lactis FM03 plasmid pLD1 and L. lactis 184 plasmid p184F. However, the operon in p184F appears to lack citQ, which encodes a leader peptide. Lactose metabolism is controlled by the lac operon consisting of the genes lacABCDEFGX and is regulated by a repressor, encoded by the adjacent lacR gene (Cords et al., 1974). Both citrate and lactose utilization have previously been described in detail (Cords et al., 1974; Górecki et al., 2011).

The lac operon was found to be present on twenty-four plasmids (in 24 different strains) (**Table 2**). The plasmids analyzed were derived from 53 lactococcal strains in addition to 17 lactococcal plasmids unassigned to a particular strain, and represented the total plasmid complement of 26 such strains. In all cases bar two, the strains were isolated from the dairy environment, the exceptions being L. lactis NCDO1867, isolated from peas, and L. lactis KLDS 4.0325, isolated from fermented food (**Table 1**). Alternative lactose metabolism methods have previously been observed in L. lactis. For example, L. lactis MG1363 does not harbor the lac operon, yet is capable of growth on lactose-supplemented media due to the activity of a cellobiose-specific phosphotransferase system (PTS), which can act as an alternative lactose utilization pathway (Solopova et al., 2012).



Another example of an alternative lactose metabolic pathway is found in the slow lactose fermenter L. lactis NCDO2054, which metabolizes lactose via the Leloir pathway (Bissett and Anderson, 1974). Plasmid integration events have also resulted in the integration of the lac operon in the chromosome of L. lactis SO, where it is located 20 Kbp downstream of an integrated opp operon, sharing significant homology with (the lac operons of) plasmids pCV56B, pSK08, pKF147A, and pNCDO2118 (Kelleher et al., 2017). Due to the lack of sequencing projects that report fully sequenced genomes, defining the true frequency of lactose utilization is challenging. However, of those strains for which complete genome sequencing projects have been described [30 strains in Kelleher et al. (2017)], 22 were found to be capable of metabolizing lactose based on growth in lactose supplemented broth, 19 via plasmid-encoded lac operons, one via a chromosomally encoded lac operon and two by an alternative pathway. This analysis included 12 subsp. cremoris strains, of which all but one possessed genes for a lactose utilization mechanism, the exception being strain KW2, which lacks a plasmid complement.

#### Conjugation

Conjugation and transduction are believed to be the dominant mechanisms of plasmid transfer in L. lactis (Ainsworth et al., 2014c). Particular emphasis has been placed on conjugation as it is considered a naturally occurring DNA transfer process and for this reason may be used in food-grade applications to confer beneficial traits to industrial strains (Mills et al., 2006). Generally, during conjugation the AT-rich, so-called "origin of transfer" or oriT of the conjugative plasmid is nicked by a nickase, and the resulting ssDNA strand is passed to a recipient cell (Grohmann et al., 2003). The tra (transfer) locus is believed to be responsible for the donor-to-recipient DNA transfer process of conjugation, though the precise mechanistic details of the conjugation process in L. lactis have not yet been fully elucidated. Plasmids which do not encode the tra operon may also be co-transferred by conjugation in instances where a plasmid contains an oriT sequence and at least one mobilization gene (mobA, B, C, or D). Additional genes can also be involved in conjugation in L. lactis; an example of this is cluA, which encodes a cell surface-presented protein that is involved in cell aggregation and thought to be essential for high efficiency conjugal transfer (Stentz et al., 2006). Furthermore, a chromosomally associated, so-called sex factor in L. lactis has been shown to facilitate transfer of chromosomal genes during conjugation (Gasson et al., 1995).
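The distinction drawn above between plasmids carrying tra genes and plasmids that rely on an oriT plus at least one mob gene can be expressed as a small rule-based classifier. This is a sketch for illustration, not a validated predictor of transfer ability: the function name and category labels are assumptions, the gene lists contain only the tra and mob genes named in this article, and presence of a tra gene alone does not demonstrate a functional conjugation system.

```python
# Gene names mentioned in this article; not an exhaustive catalogue.
TRA_GENES = {"traA", "traB", "traC", "traD", "traE",
             "traF", "traG", "traJ", "traK", "traL"}
MOB_GENES = {"mobA", "mobB", "mobC", "mobD"}


def transfer_potential(genes, has_oriT):
    """Classify a plasmid from its annotated gene names and oriT status.

    genes: iterable of annotated gene names for one plasmid.
    has_oriT: True if an origin of transfer was identified.
    """
    genes = set(genes)
    if genes & TRA_GENES:
        return "potentially conjugative"
    if has_oriT and genes & MOB_GENES:
        return "potentially mobilizable"
    return "no transfer functions detected"
```

Any call made by such rules would need experimental confirmation, for example by conjugation assays of the kind cited for pAF22, pMRC01 and pNP40.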

The tra locus, which encodes the protein complex responsible for donor-to-recipient DNA transfer, has as yet not been fully elucidated. Previous studies have identified the role of traF as encoding a membrane-spanning protein involved in channel formation and membrane fusion. In addition, the traE and traG genes have been proposed to encode proteins involved in the formation of the conjugal pilus similar to type IV secretion systems (O'Driscoll et al., 2006; Górecki et al., 2011). Typically, the three tra genes (i.e., traE, traF, and traG) are part of a larger gene cluster (consisting of up to 15 genes; **Figure 4**), including traA, which encodes a DNA relaxase. In the current data set, 34 genes with homology to traG were identified on 27 plasmids (present in duplicate on seven plasmids) along with five occurrences of traE/F also being present (in the case of plasmids pIBB477A, pUC08B, pUC11B, pAF22, and pMRC01).

The precise functions for the remainder of the genes in the tra gene cluster have yet to be elucidated, though additional tra-encoded functions have been predicted in a small number of cases, the majority based on homology to the trs operon in Staphylococcus (Sharma et al., 1994). For example, traJ and traL were identified on plasmids pAF22, pIBB477a and pMRC01, and traB, traC, traD, traF (mating channel formation) and traK (P-loop NTPase) on plasmids pUC08B, pIBB477a, pUC11B, pAF22, and pMRC01. Plasmids pAF22, pMRC01, and pNP40 have all previously been demonstrated to be capable of conjugation (Harrington and Hill, 1991; Coakley et al., 1997; O'Driscoll et al., 2006; Fallico et al., 2012). However, the operons involved in conjugation are not well annotated and remain poorly characterized. This problem is compounded by both a lack of sequence conservation and limited synteny within the genes that make up these conjugation-associated genetic clusters (**Figure 4**).

While the tra operon is thought to be responsible for the formation of conjugal pili, previous studies have identified a number of genes believed to play a role in the mobilization of other (non-self-transmissible) plasmids in L. lactis (Mills et al., 2006; O'Driscoll et al., 2006; Millen et al., 2012); principal among these are the mob (mobilization) genes. Mobilization genes are responsible for nicking the plasmid's dsDNA at a particular site and forming a relaxosome, which allows the transfer of a single-stranded template to a recipient cell. Variants of four main mob genes are distributed throughout the lactococcal plasmidome: mobA and mobD encode nickases, while mobB and mobC, whose protein products are thought to form a relaxosome with an associated nickase (either mobA or mobD), are typically present in the genetic configuration mobABC or mobDC. Comparative analysis identified 422 occurrences of mob genes (any of the aforementioned mob genes) distributed across the 190 plasmids assessed in this study, including 15 occurrences of a predicted retron-type reverse transcriptase or maturase (located between mobD and mobC) believed to play a role in DNA recombination. The results indicate that 59.5% of plasmids in the lactococcal plasmidome carry at least one gene encoding a mobilization protein.
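The two summary figures in this paragraph (total mob gene occurrences and the share of plasmids carrying at least one mob gene) are simple tallies over per-plasmid annotations. The following is a minimal sketch of such a tally, assuming annotations are available as lists of gene names per plasmid; the function name `mob_summary` and the toy annotations are hypothetical and do not represent the study's data or pipeline.

```python
MOB_GENES = {"mobA", "mobB", "mobC", "mobD"}

def mob_summary(annotations):
    """Return (total mob gene occurrences, fraction of plasmids with >= 1 mob gene)."""
    occurrences = 0
    carriers = 0
    for genes in annotations.values():
        hits = [g for g in genes if g in MOB_GENES]
        occurrences += len(hits)
        if hits:
            carriers += 1
    return occurrences, carriers / len(annotations)

# Hypothetical toy annotations (not data from the study).
toy = {
    "plasmid_1": ["repB", "mobA", "mobB", "mobC"],
    "plasmid_2": ["repB", "mobD", "mobC", "lacG"],
    "plasmid_3": ["repB", "abiF"],
    "plasmid_4": ["repB", "mobD", "mobD", "mobC"],
}

occurrences, fraction = mob_summary(toy)
print(occurrences, f"{fraction:.1%}")
```

Applied to the figures in the text, 59.5% of 190 plasmids corresponds to approximately 113 plasmids, so the 422 mob gene occurrences are concentrated on a subset of the plasmidome.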

The lactococcal megaplasmids pMPJM1 and pMPJM2 harbor two (16 Kbp) regions putatively involved in conjugation and/or mobilization. In the case of pMPJM2 the predicted region was found to contain homologs of mobC and mobD, encoding a nickase and an associated relaxase near a possible secondary replication origin. However, the presence of five transposase-encoding genes and the lack of predicted tra genes with conserved functions suggest that this plasmid is not capable of autonomous conjugation (though mobilization is possible).

Conversely, analysis of pMPJM1 identified a system more divergent from that typically found in lactococcal plasmids. Three hypothetical proteins were found to contain the PFAM domain usually conserved in conjugation proteins (pfam12846), in addition to a homolog of virB11, whose deduced product acts as a type IV secretory pathway ATPase (pfam00437). Cellular localization analysis of the operon using PsortB was also indicative of a transmembrane complex composed of cytoplasmic, membrane bound, signal and extracellular proteins (**Figure 5**). The divergence of both operons from typical lactococcal conjugative operons suggests that these two megaplasmids have lost their conjugative ability or may possess a conjugation system with very few identifiable similarities to currently known systems.

### Cell Surface Interactions (Adhesion & EPS)

Mucin-binding proteins, i.e., those allowing adhesion to the mucin layer of the gastrointestinal tract, are considered essential for stable and extended gut colonization by LAB (Von Ossowski et al., 2010). While lactococci are typically not associated with the human gut and do not have a growth temperature profile that would be consistent with GIT colonization, instances of such proteins encoded by lactococcal plasmids have been reported (Kojic et al., 2011; Lukić et al., 2012; Le et al., 2013). Mucoadhesive proteins are considered of paramount importance for the efficacy of probiotic bacteria (Von Ossowski et al., 2010) and the presence of such elements in L. lactis may have significant commercial impact for their role in functional foods. Analysis of the plasmids assessed in our study identified a number of strains with predicted novel muco-adhesive elements, similar to those found in pKP1 (Kojic et al., 2011). Plasmid pKP1 encodes two proteins, a mucin-binding domain-containing protein and an aggregation-promoting protein AggL, which promotes its binding to colonic mucosa (Lukić et al., 2012). While no direct homolog of AggL was detected, mucus-binding protein-encoding genes were identified on plasmids p14B4, p275A, p275B, pUC08B, and pUC11B, perhaps reflecting a potential for gastrointestinal persistence conferred to the strains that carry these plasmids. A number of additional proteins predicted to be host cell surface-associated were detected during the analysis. For example, pUC11C encodes two class C sortases, which are commonly involved in pilus biosynthesis (Von Ossowski et al., 2010; Lebeer et al., 2012), while p275A encodes an LPXTG anchor domain, cell surface-associated protein. Interestingly, each of these strains belongs to subspecies lactis and is capable of growth at 37°C, a temperature that would impede growth of their cremoris counterparts, which are generally less thermo-tolerant. L. lactis JM1 is the sole cremoris strain that is predicted to encode proteins directly involved in host cell surface alterations. A plasmid of this strain encodes five putative proteins containing a 26-residue repeat domain found in predicted surface proteins (often lipoproteins) and one collagen-binding domain protein.

The plasmid-encoded lactococcal cell wall-anchored proteinase, PrtP, involved in the breakdown of milk caseins in dairy lactococci, has previously been shown to cause a significant increase in cell adhesion to solid glass and tetrafluoroethylene surfaces (Habimana et al., 2007). More recently, L. lactis subsp. cremoris IBB477 was found to contain two plasmids, pIBB477a and pIBB477b, which encode cell wall-associated peptidases that have been shown to mediate adhesion to bare, mucin-coated and fibronectin-coated polystyrene and to HT29-MTX cells (Radziwill-Bienkowska et al., 2017). Analysis of the current dataset, which contains a large number of dairy-derived plasmids, identified a further 194 CDSs homologous to the cell wall-associated peptidase S8 (PrtP) of IBB477. Whilst extracellular cell wall proteinases have been shown to be directly associated with the bitter flavor defect in Cheddar cheese varieties (Broadbent et al., 2002), a potential role for these peptidases in gut adhesion may present a more positive view of these elements.

Exopolysaccharide production by L. lactis is a characteristic trait of strains isolated from viscous Scandinavian fermented milk products and is widely reported as a plasmid-encoded trait (Vedamuthu and Neville, 1986; Von Wright and Tynkkynen, 1987; Neve et al., 1988; Kranenburg et al., 1997). EPS production by L. lactis strains is of particular importance for functional foods, as the EPS produced by these strains is considered to be a food-grade additive that significantly contributes to properties such as mouth-feel and texture in fermented dairy products (Kleerebezem et al., 1999). The L. lactis EPS biosynthesis gene cluster (eps) contained on pNZ4000 has previously been characterized (Kranenburg et al., 1997) and consists of 14 genes, namely epsRXABCDEFGHIJKL. Comparison of the eps gene cluster from pNZ4000 with all sequenced plasmids in the current dataset identified a further five plasmids harboring eps clusters, namely pUC77D, p229E, pJM3C, p275B, and pMPJM1 (**Figure 6**). In pNZ4000, EPS production is regulated by epsRX, EPS subunit polymerization and export are believed to be executed by the encoded products of epsABIK, while the proteins encoded by epsDEFGH are responsible for the biosynthesis of the EPS subunit (Kranenburg et al., 1997). Homology-based analysis with the five newly identified gene clusters shows that in all cases epsRXABCD are conserved (except in pMPJM1 where epsR is absent), while the remainder of the gene cluster in each case consists of variable genes. These eps gene clusters consist of a highly conserved region at the proximal end of the cluster and a variable distal region, which is not unlike other lactococcal polysaccharide biosynthesis clusters (Mahony et al., 2013; Ainsworth et al., 2014b; Mahony et al., 2015). The conserved epsRX genes are responsible for transcriptional regulation, the products of epsAB are required for EPS export, while the deduced proteins of epsCD are putative glycosyltransferases of which EpsD (priming glycosyltransferase) has previously been demonstrated to be essential for EPS subunit biosynthesis (Kranenburg et al., 1997). The variable region, epsEFGHIJKLP in pNZ4000, encodes predicted or proven functions, such as an acetyltransferase (epsE), glycosyltransferases (epsGHIJ) and a flippase (epsK), together representing the presumed enzymatic machinery responsible for EPS biosynthesis through the addition and export of sugar moieties.

In the case of p229E, the variable eps region is composed of CDSs predicted to encode products with functions similar to those of the chromosomally located cwps gene cluster in strain 229. Plasmid pJM3C contains genes predicted to encode a rhamnosyltransferase, UDP-glucose dehydrogenase, capsular biosynthesis protein and five glycosyltransferases. The p275B variable region is heavily rearranged due to the presence of nine transposase-encoding genes. The megaplasmid pMPJM1 encodes a 9 Kbp predicted EPS region with well conserved functional synteny to that of pNZ4000, although with relatively low homology (**Figure 6**). Plasmid pUC77D appears to contain the shortest eps gene cluster of 7 Kbp due to the absence of epsFGHIJ genes. Further analysis of these plasmid-borne eps gene clusters revealed that in all cases mob elements are present, indicating that they may be mobilizable via conjugation. To assess if these plasmids have a common lineage, nucleotide homology-based analysis was conducted utilizing BLASTN (Altschul et al., 1990). This analysis, however, did not identify significant homology or common hits between the plasmids outside of the conserved region of the EPS gene cluster. Phenotypic analysis of strains L. lactis 275, 229, JM1, JM3, and UC77 indicated a mucoid EPS phenotype in strains 275, 229, and JM3, whereas strains JM1 and UC77 did not show any EPS production, which is probably attributable to the lack of the regulator epsR in strain JM1 and the absence of the epsFGHIJ genes in UC77.
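The lineage check described above amounts to asking whether any significant BLASTN hit between two plasmids falls outside the conserved eps region. A minimal sketch of that filtering step is given below, assuming the comparison was exported in the standard 12-column tabular format (BLAST+ `-outfmt 6`). The function names, example rows, region coordinates and E-value cut-off are illustrative assumptions and not the parameters used in the study.

```python
from typing import List, NamedTuple

class Hit(NamedTuple):
    query: str
    subject: str
    pident: float
    length: int
    qstart: int
    qend: int
    evalue: float

def parse_outfmt6(text: str) -> List[Hit]:
    """Parse BLAST tabular output (the default 12 columns of -outfmt 6)."""
    hits = []
    for line in text.strip().splitlines():
        f = line.split("\t")
        # Columns: qseqid sseqid pident length mismatch gapopen
        #          qstart qend sstart send evalue bitscore
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[6]), int(f[7]), float(f[10])))
    return hits

def hits_outside_region(hits, region_start, region_end, max_evalue=1e-5):
    """Keep significant hits whose query span does not overlap the given region."""
    return [h for h in hits
            if h.evalue <= max_evalue
            and (h.qend < region_start or h.qstart > region_end)]

# Hypothetical example rows; identifiers and coordinates are made up.
example = (
    "plasmidA\tplasmidB\t92.5\t4100\t300\t12\t1200\t5300\t800\t4900\t0.0\t6100\n"
    "plasmidA\tplasmidB\t78.0\t150\t30\t3\t20000\t20149\t300\t449\t2e-3\t60.2\n"
)
hits = parse_outfmt6(example)
outside = hits_outside_region(hits, region_start=1000, region_end=6000)
print(len(hits), len(outside))
```

In this toy case the only strong hit lies within the assumed conserved region and the remaining hit fails the E-value cut-off, which mirrors the outcome reported in the text: no significant homology outside the conserved part of the eps cluster.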

#### Bacteriocins

Bacteriocins are a diverse group of ribosomally synthesized bacterial peptides, which when secreted inhibit growth of other bacteria by interfering with cell wall biosynthesis or disrupting membrane integrity (Dobson et al., 2012). The production of bacteriocins by lactococcal strains has been widely reported, including the strain L. lactis subsp. cremoris 9B4, which contains three separate bacteriocin operons, specifying lactococcins A, B, and M/N, located on one plasmid (Van Belkum et al., 1989, 1991). To investigate bacteriocin production in the lactococcal plasmidome, all the available strains were screened for bacteriocin production against an indicator strain, L. lactis subsp. cremoris HP. In total, six strains were found to produce clearly defined zones of inhibition, indicating bacteriocin production, namely L. lactis subsp. lactis IO-1, 184, UC06, UC08, UC11, and L. lactis subsp. cremoris 158. Analysis of the plasmid complement of each of these strains indicated that strains 158, UC06 and UC08 each possess a plasmid-borne bacteriocin gene cluster, while IO-1, 184, and UC11 contain a bacteriocin gene cluster of chromosomal origin. In each case, these were identified as lactococcin producers: p158A is predicted to be responsible for lactococcin A and B production, pUC08A for lactococcin A production, and pUC06C for lactococcin B biosynthesis. Lactococcins have a narrow spectrum of activity, targeting predominantly closely related lactococcal species (Geis et al., 1983) and, as such, are an important consideration when selecting strains for application in mixed starter cultures.

Sequence analysis of the remaining plasmids in the current study (for which strains were not available for phenotypic analysis) identified additional putative bacteriocin-encoding gene clusters (**Table 3**), which are predicted to be responsible for the production of lactococcin A or B, and in one case (pMRC01) for the lantibiotic lacticin 3147 (Dougherty et al., 1998).

TABLE 3 | Predicted plasmid-encoded antimicrobial peptides.

\$N/A, host strain unavailable to screen phenotypically.

#### Phage-Resistance Systems

Lactococcal strains typically possess a variety of phage defense mechanisms including superinfection exclusion (Sie) systems (encoded by integrated prophages; Kelleher et al., 2018), clustered regularly interspaced short palindromic repeats (CRISPR), restriction-modification (R-M), and abortive infection (Abi) systems. Sie systems are a prophage-encoded defense mechanism (McGrath et al., 2002; Mahony et al., 2008) and have been reviewed extensively in these strains as part of an investigation into lactococcal prophages (Kelleher et al., 2018). CRISPR and CRISPR-associated (cas) genes specify an acquired adaptive immunity system against invading DNA in bacteria (Horvath and Barrangou, 2010). To date, only one such system has been characterized in Lactococcus, on a conjugation-transmissible plasmid, pKLM, which encodes a novel type III CRISPR-Cas system (though it is unable to incorporate new spacers) (Millen et al., 2012). Analysis of plasmid sequences in this study did not detect any further instances of CRISPR systems in lactococci, suggesting that CRISPR systems are not a widespread phenomenon in domesticated lactococci.

Restriction-modification systems are extremely diverse and widespread and are encoded by approximately 90% of all currently available bacterial and archaeal genomes (Roberts et al., 2003). R-M systems are frequently observed in the lactococcal plasmidome and some examples have previously been characterized, including the Type II system LlaDCHI from pSRQ700 (Moineau et al., 1995) and LlaJI from pNP40 (O'Driscoll et al., 2004). The current dataset holds nine apparently complete Type II systems on plasmids pCV56A, p275D, pJM1D, pUC08B, pUC11B, pNP40, pSRQ700, KLDS 4.0325 plasmid 5, and pAF22, along with multiple orphan methylases and solitary restriction endonucleases. The most commonly encountered R-M systems in lactococcal plasmids are Type I systems. These systems are often incomplete and represented by solitary specificity subunits (77 such orphan specificity subunit-encoding hsdS genes were identified in the current analysis). The high frequency of these systems in lactococcal plasmids is indicative of host adaptation as they predominantly act as a host defense mechanism against phage infection.

Abortive infection systems represent an abundant phage defense mechanism in L. lactis (Chopin et al., 2005) and are frequently plasmid-encoded (Mills et al., 2006). To date, 23 Abi systems have been identified in L. lactis, of which 21 are plasmid-encoded (Ainsworth et al., 2014c). Most are typically single-gene systems, with the exception of three multigene systems, AbiE (Garvey et al., 1995), AbiR (Twomey et al., 2000), and AbiT (Bouchard et al., 2002). Analysis of the plasmids in this study identified eight Abi occurrences based on homology, namely AbiF, AbiC, AbiK, AbiQ, and two occurrences of the two-component system AbiEi-AbiEii, alongside twelve predicted uncategorized Abi systems (**Table 4**), based on amino acid homology to unclassified Abi systems in the NCBI database. The relatively low observed abundance of Abi systems in such a large plasmid dataset is surprising and may be the result of the diversity of Abi systems, with the possibility of as yet unidentified systems.

<sup>∗</sup>Uncharacterized Abi, based on amino acid homology to unclassified Abi's in the NCBI database.

### DISCUSSION

The advent of NGS technologies has made genome sequencing much more accessible and has led to a dramatic rise in the number of available genome sequences. In the current study one such technology, SMRT sequencing, was applied for the elucidation of 69 novel lactococcal plasmids. However, during the course of the current study some cautionary notes also emerged. These were predominantly related to smaller plasmids and plasmids with lower average consensus coverage, which could potentially be filtered out under standard assembly parameters. It was found that by performing the assembly with the minimum coverage cut-off reduced to 15-fold, detection of some of these plasmids was possible. In fact, in order to ensure detection of a given strain's total plasmid complement we found it necessary to use a combined sequencing approach. This point is strongly supported by the elucidation of a further 14 plasmids from this dataset using an Illumina MiSeq approach, which were completely absent from the SMRT assemblies.
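The effect of the coverage cut-off on plasmid detection can be illustrated with a simple filter. The sketch below assumes each assembled contig has a mean consensus coverage value; the contig names, coverage values and the "default" cut-off of 30-fold are hypothetical stand-ins, and only the 15-fold value is taken from the text.

```python
def filter_contigs(contigs, min_coverage):
    """Return names of contigs whose mean coverage meets the cut-off."""
    return [name for name, cov in contigs.items() if cov >= min_coverage]

# Hypothetical contigs with mean consensus coverage (fold); not study data.
contigs = {
    "chromosome": 180.0,
    "large_plasmid": 95.0,
    "small_plasmid_a": 22.0,
    "small_plasmid_b": 16.5,
}

default_cutoff = 30   # assumed stand-in for a "standard" assembly setting
reduced_cutoff = 15   # the 15-fold cut-off mentioned in the text

kept_default = filter_contigs(contigs, default_cutoff)
kept_reduced = filter_contigs(contigs, reduced_cutoff)

# Contigs that only become visible once the cut-off is lowered.
recovered = sorted(set(kept_reduced) - set(kept_default))
print(recovered)
```

Lowering the threshold recovers low-coverage contigs at the cost of admitting more potential assembly noise, which is one reason a complementary sequencing approach remains useful.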

The overview of plasmid replication systems presented shows that theta-type replication is the dominant mode of replication used in L. lactis. These plasmids are usually viewed as being intrinsically more stable than RCR-type plasmids. However, a recent study of the dynamics of plasmid copy-number in L. lactis FM03-V1 demonstrated that the theta-type replicating plasmid (pLd10) was lost in a retentostat cultivation, while an RCR plasmid was maintained (Van Mastrigt et al., 2018c). During the course of that study, it was found that the reduced copy number of larger theta-replicating plasmids increased the likelihood of the loss of these plasmids compared to smaller plasmids regardless of replication type (Van Mastrigt et al., 2018c), while the presence of the partition system (parA and parB) on these plasmids should also be considered as it has been shown to contribute to the stability and maintenance of large plasmids without selection (O'Driscoll et al., 2006). Interestingly, of the 16 plasmids not detected by SMRT sequencing in this study, five were theta-replicating plasmids larger than 25 Kbp. This suggests that the lack of an amplification step during library preparation for SMRT sequencing may be a factor in the failure to detect larger plasmids that may have a low copy number.

In the course of this study, the pan-plasmidome of L. lactis was calculated and found to be in a fluid state, making it likely that continued sequencing efforts would expand the diversity of this data set and lead to an increase in the identification of novel plasmid features. At present, the lactococcal plasmidome consists of approximately 5000 Kbp of extra-chromosomal DNA encoding an arsenal of diverse features. Significantly, the current open plasmidome contributes the equivalent of 22.26% of the CDSs contained in the pan-genome of the L. lactis chromosomes, which is in a closed state (Kelleher et al., 2017). BLAST-based analysis of these features identified 885 protein families, of which 413 represented unique families, evidence of the divergent nature of the plasmid sequences. There is, however, a skew in the data set toward the dairy niche, which has arisen due to a number of factors. Primarily, the majority of strains sequenced to date have been sequenced due to their commercial value in the production of fermented dairy products. The impact of these strains on the overall data set is further amplified as these strains generally carry a larger plasmid complement than their non-dairy counterparts (Kelleher et al., 2017), since many desirable dairy-associated traits are typically plasmid-encoded (e.g., lac operon). As such, these features account for a large proportion of the plasmidome. However, as efforts to isolate new starter cultures for the dairy industry continue (Cavanagh et al., 2015), screening of more diverse cultures, particularly from the plant niche, is expected to lead to increased novelty and diversity in the lactococcal plasmidome.
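Whether a pan-plasmidome is "open" is usually judged from how the number of distinct protein families grows as further plasmids are added. The following is a minimal sketch of that idea, assuming each plasmid has already been reduced to a set of protein family labels. The function names and toy family labels are hypothetical, and a real analysis would typically average over many random plasmid orderings and fit a growth model, which is not shown here.

```python
from collections import Counter

def accumulation_curve(plasmid_families):
    """Cumulative number of distinct families as plasmids are added in order."""
    seen = set()
    curve = []
    for families in plasmid_families:
        seen |= set(families)
        curve.append(len(seen))
    return curve

def unique_families(plasmid_families):
    """Families that occur on exactly one plasmid."""
    counts = Counter(f for families in plasmid_families for f in set(families))
    return sorted(f for f, n in counts.items() if n == 1)

# Hypothetical family assignments for four plasmids (labels are made up).
toy = [
    {"rep", "mob", "lac"},
    {"rep", "mob", "opp"},
    {"rep", "abi", "eps"},
    {"rep", "mob", "cad"},
]

print(accumulation_curve(toy))
print(unique_families(toy))
```

A curve that keeps rising with each added plasmid, as in this toy example, is the signature of an open state. For the figures quoted in the text, 413 unique families out of 885 corresponds to roughly 46.7% of all families.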

Megaplasmids have been found in LAB previously, in particular in members of the Lactobacillus genus (Muriana and Klaenhammer, 1987; Roussel et al., 1993; Claesson et al., 2006; Li et al., 2007; Fang et al., 2008). In the current study, sequencing efforts resulted in the identification of two examples of lactococcal megaplasmids (>100 Kbp), with pMPJM1 (193 Kbp) substantially surpassing the size of the previously largest sequenced plasmid in this taxon L. lactis S50 p7 (155 Kbp) (Kojic et al., 2005), and providing further diversity within the plasmidome. While megaplasmids are not expected to be essential for growth of their host, they can encode additional metabolic capabilities. The lactococcal megaplasmids were also examined for the presence of conjugation machinery. A novel gene cluster encoding a number of conjugation-related proteins located in pMPJM1 suggests that this plasmid is or has been involved in conjugal transfer. Further analysis of mob and tra genes across the plasmidome identified a number of genes predicted to encode proteins involved in conjugal transfer. The frequency (422 mob/tra genes across 190 plasmids) of these genes is indicative of the self-transmissible and/or mobilizable nature of lactococcal plasmids.

There has been limited research performed to date in the area of lactococcal gut adhesion as L. lactis is not commonly associated with the human gut. In this study, the lactococcal plasmidome was shown to contain potential gut adhesion factors, which may allow colonization and/or persistence in the gastrointestinal tract. This trait may offer opportunities for the application of L. lactis as a vector for vaccine and biomolecule delivery (Bermúdez-Humarán, 2009; Bermúdez-Humarán et al., 2013). Further technological properties of L. lactis were investigated, including EPS production. Analysis of a large dataset of newly sequenced plasmids facilitated the identification and comparison of a number of novel EPS gene clusters. The major outcome of this work was the definition of "conserved" and "variable" regions within these EPS clusters. The conserved region encodes the transcriptional regulation, export and biosynthesis initiation machinery, while the variable region contains various genes that are predicted to encode glycosyltransferases, which are believed to be responsible for the production of a diverse set of EPS subunits, and thus a polysaccharide with a distinct composition and perhaps different technological properties.

Finally, phage-resistance mechanisms were assessed with particular emphasis on Abi systems. Abi systems confer defense against phage infection and are commonly found in lactococcal strains where they are frequently plasmid-encoded (Mills et al., 2006). Analysis of the plasmid sequences identified 22 plasmid-encoded Abi systems, while further analysis also identified frequent occurrences of these systems within the lactococcal chromosomes (Chopin et al., 2005). The presence of these systems and a range of R-M systems is evidence for the adaptation of these strains toward phage-resistance.

Discovery of the first lactococcal megaplasmids along with a host of novel features is evidence that the diversity of the lactococcal plasmidome represents a significant amount of unexploited genetic diversity, and suggests that continued future sequencing efforts and subsequent functional analysis will increase the observed diversity carried by these elements, potentially leading to new avenues of research and applications. The current plasmidome contributes the equivalent of 22.26% of the CDSs contained in the pan-genome of the L. lactis chromosomes, demonstrating its significant value to this taxon, the importance of which has been built on a long history of use in food fermentations, particularly in the dairy industry. The fact that both the opp and lac operons, which have led to this adaptation, remain largely plasmid-encoded only further demonstrates the fundamental importance of the lactococcal plasmidome in terms of the evolution, adaptation, and application of lactococci.

#### DATA AVAILABILITY

The datasets generated for this study can be found in NCBI Genbank, CP034577, CP034578, CP034579, CP034580, CP034581, CP034582, CP034583, CP034584, CP034585, and CP034586.

#### AUTHOR CONTRIBUTIONS

PK carried out the data analysis with FB. PK performed the experiments. DS and JM provided materials and strains. PK, JM, and DS wrote the manuscript. All authors read and approved the final manuscript.

#### FUNDING

This research was funded by Science Foundation Ireland (SFI) grant numbers 15/SIRG/3430 (JM), 12/RC/2273, and 13/IA/1953 (DS). PK was funded by the Department of Agriculture, Food and the Marine under the Food Institutional Research Measure (FIRM) (Ref: 10/RD/TMFRC/704–"CheeseBoard 2015" project).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Kelleher, Mahony, Bottacini, Lugli, Ventura and van Sinderen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Response of Lactobacillus plantarum WCFS1 to the Gram-Negative Pathogen-Associated Quorum Sensing Molecule N-3-Oxododecanoyl Homoserine Lactone

#### Joseph R. Spangler<sup>1</sup> , Scott N. Dean<sup>1</sup> , Dagmar H. Leary<sup>2</sup> and Scott A. Walper<sup>2</sup> \*

<sup>1</sup> National Research Council Postdoctoral Fellowships, NRC Research Associateship Programs, Washington, DC, United States, <sup>2</sup> United States Naval Research Laboratory, Center for Biomolecular Science and Engineering, Washington, DC, United States

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Mansel William Griffiths, University of Guelph, Canada Sang Jun Lee, Chung-Ang University, South Korea

#### \*Correspondence:

Scott A. Walper scott.walper@nrl.navy.mil

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 16 January 2019 Accepted: 21 March 2019 Published: 05 April 2019

#### Citation:

Spangler JR, Dean SN, Leary DH and Walper SA (2019) Response of Lactobacillus plantarum WCFS1 to the Gram-Negative Pathogen-Associated Quorum Sensing Molecule N-3-Oxododecanoyl Homoserine Lactone. Front. Microbiol. 10:715. doi: 10.3389/fmicb.2019.00715

The bacterial quorum sensing phenomenon has been well studied since its discovery and has traditionally been considered to include signaling pathways recognized exclusively within either Gram-positive or Gram-negative bacteria. These groups of bacteria synthesize structurally distinct signaling molecules to mediate quorum sensing, where Gram-positive bacteria traditionally utilize small autoinducing peptides (AIPs) and Gram-negatives use small molecules such as acyl-homoserine lactones (AHLs). The structural differences between the types of signaling molecules have historically implied a lack of cross-talk among Gram-positive and Gram-negative quorum sensing systems. Recent investigations, however, have demonstrated the ability for AIPs and AHLs to be produced by non-canonical organisms, implying quorum sensing systems may be more universally recognized than previously hypothesized. With that in mind, our interests were piqued by the organisms Lactobacillus plantarum, a Gram-positive commensal probiotic known to participate in AIP-mediated quorum sensing, and Pseudomonas aeruginosa, a characterized Gram-negative pathogen whose virulence is in part controlled by AHL-mediated quorum sensing. Both health-related organisms are known to inhabit the human gut in various instances, both are characterized to elicit distinct effects on host immunity, and some studies hint at the putative ability of L. plantarum to degrade AHLs produced by P. aeruginosa. We therefore wanted to determine if L. plantarum cultures would respond to the addition of N-(3-oxododecanoyl)-L-homoserine lactone (3OC12) from P. aeruginosa by analyzing changes on both the transcriptome and proteome over time. Based on the observed upregulation of various two-component systems, response regulators, and native quorum sensing related genes, the resulting data provide evidence of an AHL recognition and response by L. plantarum.

Keywords: Lactobacillus plantarum, quorum sense, Pseudomonas aeruginosa, transcriptomics, proteomics, homoserine lactone

## INTRODUCTION


Bacteria have been understood to possess a basic level of cell to cell communication for decades (Miller and Bassler, 2001; Waters and Bassler, 2005). The communication phenomenon known as quorum sensing allows organisms to coordinate growth and gene expression efforts, for example, toward a common goal to survive an increasingly harsh environment (Reading and Sperandio, 2006). The traditional model for quorum sensing has been described as the luxR system in multiple Vibrio spp. (Gray et al., 1994; Bassler et al., 1997; Zhu et al., 2002; Hammer and Bassler, 2003; Waters and Bassler, 2006), where a signaling molecule such as an N-acyl homoserine lactone (AHL) is produced and exported into the microenvironment at some undetected basal level. The signaling molecule accumulates over time with proliferation of the producing species until a threshold is reached, whereupon the signal is detected and organisms respond with targeted gene expression. Quorum sensing systems following this model are commonly seen in Gram-negative bacteria such as the aforementioned Vibrio spp. to coordinate luminescent protein production or in pathogens such as Pseudomonas aeruginosa to coordinate virulence and survival (Gray et al., 1994; Storey et al., 1998; Rumbaugh et al., 1999a,b; Winzer et al., 2000; Rampioni et al., 2006, 2007, 2009; Gao et al., 2017; Kariminik et al., 2017). The downstream genetic response to the signal varies among different organisms, but the general model of signal amplification remains consistent and depends on (1) the synthesis and export of the signal molecule into the environment, (2) environmental accumulation of the signal molecule, (3) diffusion of the signal molecule into neighboring cells, and (4) the interaction of the signal molecule with specific transcription factors resulting in the activation of gene expression. 
Synthesis of the signal molecule is generally upregulated by transcription factor activation as part of a positive feedback loop, and the number of promoters subsequently induced varies widely among organisms and the specific quorum sensing system involved such that induction rarely results in the upregulation of a single gene (Brint and Ohman, 1995; Medina et al., 2003a,b; Jensen et al., 2006).
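The four-step model above, together with the positive feedback on signal synthesis, can be illustrated with a toy simulation. This is a minimal sketch rather than a model of any particular organism: the function name `simulate_quorum`, the logistic growth term, and every parameter value are assumptions chosen only to make the threshold behavior visible.

```python
def simulate_quorum(steps=2000, dt=0.01, feedback=True):
    """Toy Euler simulation of signal accumulation in a growing culture.

    All parameter values are illustrative assumptions, not measurements.
    Returns (time at which the threshold was first reached, final signal).
    """
    growth_rate = 1.0   # logistic growth rate (1/h), assumed
    capacity = 1.0      # carrying capacity (arbitrary density units)
    basal = 0.05        # basal signal production per unit density
    induced = 1.0       # extra production once the threshold is reached
    decay = 0.1         # signal loss rate (1/h)
    threshold = 0.2     # signal level that switches on the response

    density, signal = 0.01, 0.0
    crossed_at = None
    for step in range(steps):
        active = signal >= threshold
        if active and crossed_at is None:
            crossed_at = step * dt
        # Positive feedback: detection of the signal boosts its own synthesis.
        production = basal + (induced if (active and feedback) else 0.0)
        d_density = growth_rate * density * (1.0 - density / capacity)
        d_signal = production * density - decay * signal
        density += d_density * dt
        signal += d_signal * dt
    return crossed_at, signal


t_feedback, s_feedback = simulate_quorum(feedback=True)
t_basal, s_basal = simulate_quorum(feedback=False)
print(f"threshold reached at ~{t_feedback:.1f} h")
print(f"final signal with feedback: {s_feedback:.2f}, without: {s_basal:.2f}")
```

In this sketch the threshold is reached at the same time with or without feedback, because the two runs are identical until the signal is detected; the feedback only changes how strongly the signal is amplified afterwards.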

Gram-positive organisms are capable of quorum sensing by a different mechanism (Miller and Bassler, 2001). Rather than using AHLs with varying acyl chain lengths as signal molecules, Gram-positive organisms employ the use of short peptides known as autoinducer peptides (AIPs), some of which contain unconventional bonding between specific amino acids to produce unique structures. Considering the multicharge potential of peptides, specific proteins are devoted to the export of autoinducers into the microenvironment. These peptides do not typically diffuse through the membranes like their lactone counterparts given their charged characteristics and the peptidoglycan layer of Gram-positive organisms, thus signal recognition occurs by the activation of a two-component system (Kleerebezem et al., 1997; Sturme et al., 2005; Sturme et al., 2007). The agr system in Staphylococcus aureus is a well-studied example of peptide-mediated quorum sensing in Gram-positive organisms (Peng et al., 1988; Janzon and Arvidson, 1990; Booth et al., 1995; Papakyriacou et al., 2000; Jensen et al., 2008; Reyes et al., 2011; Marchand and Collins, 2013, 2016), and homologs have been identified in various Lactic Acid Bacteria such as Lactobacillus sakei, Lactobacillus acidophilus, and Lactobacillus plantarum (Kanehisa and Goto, 2000; Sturme et al., 2005, 2007; Fujii et al., 2008). The Lactic Acid Bacteria are known to produce antimicrobial peptides as one response to specific AIP activity as a way to coordinate a defense or fitness mechanism (Diep et al., 1994; Rekhif et al., 1995; Anderssen et al., 1998; Atrih et al., 2001; Maldonado et al., 2002, 2004a,b), while AIP activity in S. aureus is reported to contribute to virulence (Booth et al., 1995; Papakyriacou et al., 2000) similar to P. aeruginosa. 
Similar to AHL-mediated sensing, the baseline of genes affected by AIP is unknown given the diversity of organisms that employ this mechanism, and the induction of a single target gene is likely rare.

Quorum sensing traditionally has been split into two classes consisting of AHL- and peptide-mediated signaling assigned to Gram-negative and Gram-positive bacteria, respectively (Miller and Bassler, 2001; Reading and Sperandio, 2006). However, both Gram-negative and Gram-positive bacteria participate in two-component system quorum sensing using autoinducer-2 (AI-2) class molecules, which are structurally unrelated to AHLs and peptides (Reading and Sperandio, 2006). Autoinducer-2 molecules are furanose derivatives of the coenzyme S-adenosyl methionine (SAM) whose final active structures vary expectedly among organisms, with some examples containing the element boron (Schauder et al., 2001; Taga et al., 2001; Semmelhack et al., 2004; Vendeville et al., 2005; Pereira et al., 2013). The synthesis of AI-2 involves the luxS gene detected in a variety of diverse organisms, leading to the theory that AI-2 molecules represent a universal language for bacterial interspecies communication (Vendeville et al., 2005). The emergence and prevalence of the AI-2 systems seemed to offer a more complete picture of the concept of bacterial communication, where the major bacterial groups of Gram-positive and Gram-negative had their own quorum sensing systems, and a third system existed to facilitate interspecies communication. The existence of such an AI-2 system potentially sidestepped the question of whether Gram-positive or Gram-negative organisms could respond to each other's exclusive signaling molecules. Examples that challenge this presumed exclusivity of AHL and AIP quorum sensing, however, have been identified in the last decade, where AHL production has been noted in Cyanobacteria (Sharif et al., 2008), Archaea (Zhang et al., 2012), and a marine Gram-positive organism from the genus Exiguobacterium (Biswa and Doble, 2013). Such discoveries suggest a more universal recognition and utilization of AHLs within the microbial world than previously assumed.

Indeed, AHL recognition by higher order species has been previously observed (Smith et al., 2002a,b; Mathesius et al., 2003; Ritchie et al., 2003, 2005, 2007; Bauer and Mathesius, 2004; Chun et al., 2004; Wagner et al., 2007; Jahoor et al., 2008; Teplitski et al., 2011; Kariminik et al., 2017) with the rationale that the effects imparted by AHLs are due to their structural and functional resemblance to hormones and phytohormones (Teplitski et al., 2011). Within the bacterial kingdom, investigators have previously noted cross-species effects of a Yersinia enterocolitica AHL on enterohemorrhagic Escherichia coli O157:H7 (Nguyen et al., 2013) and S. aureus-derived peptides on both Enterococcus spp. (Firth et al., 1994) and Lactobacillus reuteri (Lubkowicz et al., 2018). While such responses were mostly unexpected and inexplicable, the latter example could be attributed to the close relation of the Lactobacillus and Staphylococcus organisms that diverge at the Class level and the apparent endogenous recognition of the S. aureus AIP-I despite a lack of annotation for agr homologs in L. reuteri (Kanehisa and Goto, 2000). Previous attempts to engineer the agr system into Firmicutes such as Bacillus megaterium were successful and proved the system to be unique to the host (Marchand and Collins, 2013), but the response of L. reuteri to the S. aureus AIP-I in contrast resulted in repression of an exogenous agr promoter rather than stimulation, implying cognate parts of the agr-like signaling system may exist in the Lactobacillus species used for different purposes than in S. aureus. This unexpected result exemplifies the complexity of interspecies quorum sense recognition despite the established genomic characterization of the involved species (Kanehisa and Goto, 2000).

The effects of specific AHLs are presumably localized to cognate and closely related species, at least in terms of an optimized and targeted response based on the number of studies probing the activity of heterologous luxR and lasR systems in E. coli without notable off-target responses (Seed et al., 1995; Latifi et al., 1996; Collins et al., 2005, 2006; Goodson et al., 2015). However, evidence exists of the potential cross-talk, albeit weak, between different AHL-mediated systems based on activity observed in cell-free experiments (Wen et al., 2017; Halleran and Murray, 2018). Considering the above points, it is hard to say universally whether bacteria can sense and respond to any present signaling cues in the microenvironment. In terms of signal fidelity, it makes sense that a bacterial species would evolve a unique and uninterceptable signaling regime to coordinate its own survival and fitness over others in the community, especially in the context of virulence coordination. P. aeruginosa has been well noted to utilize its lasR-dependent quorum sensing system to establish biofilm formation and antihost immunity measures as well as cyanide production to solidify colonization in the face of both host responses and microbial competition (Storey et al., 1998; Pessi and Haas, 2000; Winzer and Williams, 2001). Furthermore, L. plantarum has been noted to produce the antimicrobial peptide class of Plantaricins as a result of AIP signaling to similarly diminish bacterial populations in its proximity (Anderssen et al., 1998; Atrih et al., 2001), but this seems to be more of an altruistic endeavor in order to cull the prevalence of organisms potentially harmful to the host (Maldonado et al., 2004b). While the driving force behind the P. aeruginosa signaling system may be attributed to cell density given its occurrence in pure cultures (Valdez et al., 2005), the initiation of the aforementioned response by L. plantarum is unknown, and could be due to either cell density or accumulation of some unrelated environmental cue (Maldonado et al., 2003, 2004a,b).

We therefore set out to investigate the potential for quorum sensing cross-talk between the two organisms P. aeruginosa and L. plantarum considering their activity as quorum sensing bacteria, their contrasting roles in the human gut (Matsumoto et al., 1997; Xia et al., 2011), and the effects they elicit on the host immune system (Schultz et al., 2002; Ko et al., 2007; Puertollano et al., 2008; Peral et al., 2010). Furthermore, there has been nominal evidence that L. plantarum is putatively capable of degrading AHLs produced by P. aeruginosa (Valdez et al., 2005; Peral et al., 2009; Ramos et al., 2012; Ramos et al., 2015), suggesting there could be some mechanism for AHL recognition by L. plantarum. The use of deep analytical techniques involving transcriptomics and proteomics allowed us to gather a detailed global picture of the response of L. plantarum to the presence of the P. aeruginosa AHL N-3-oxo-dodecanoyl homoserine lactone (3OC12). The wealth of resulting data provided us the opportunity to speculate on both the intracellular ripple effect and timeline for the interspecies response to a predominantly Gram-negative signal by a Gram-positive commensal.

## MATERIALS AND METHODS

#### Reagents

All reagents were obtained from Sigma unless specified otherwise. N-3-oxododecanoyl homoserine lactone (3OC12, Sigma-Aldrich cat# o9139; manufacturer-reported purity of ≤100%) was maintained as 100 mM stocks in molecular biology grade dimethyl sulfoxide (DMSO, purity ≥99.9%, Sigma-Aldrich) at −20 °C until use.

#### Culture Methods

Lactobacillus plantarum WCFS1 (BAA-793) was obtained from the American Type Culture Collection and maintained in de Man, Rogosa and Sharpe (MRS) medium alone or supplemented with 1.5% agar at 37 °C in air.

#### 3OC<sub>12</sub> Challenge

All experiments were carried out in triplicate. Individual colonies of L. plantarum from MRS agar were grown overnight in 5 mL MRS at 37 °C with shaking under aerobic conditions. The following morning, 1 mL of overnight culture was added to 100 mL fresh MRS and grown similarly for approximately 3 h until an OD<sub>600</sub> of 0.5, when cultures were split into two 45 mL aliquots: treated samples received 3OC<sub>12</sub> to a final concentration of 100 µM, and control samples received 0.1% DMSO (v/v). Samples continued to grow at 37 °C with shaking, and aliquots were removed for processing at +1, +4, and +7 h following treatment. Proteomic samples were archived by removing 0.5 mL culture in quadruplicate, flash freezing decanted cell pellets in liquid nitrogen, and storing at −80 °C before analysis. Transcriptomic samples were archived by pelleting 1.5 mL culture, resuspending in 300 µL RNAlater (Qiagen), and storing at −80 °C before library preparation.
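The dosing arithmetic can be checked with a short calculation. The sketch below assumes the 100 mM DMSO stock described under Reagents was added directly to each 45 mL aliquot and ignores the small volume contributed by the stock itself; the helper names are hypothetical.

```python
def stock_volume_ul(stock_mM, final_uM, culture_mL):
    """Stock volume (µL) satisfying C1*V1 = C2*V2.

    Ignores the small volume added by the stock itself.
    """
    stock_uM = stock_mM * 1000.0
    culture_uL = culture_mL * 1000.0
    return final_uM * culture_uL / stock_uM


def percent_v_v(added_uL, culture_mL):
    """Added volume expressed as a percentage (v/v) of the culture."""
    return 100.0 * added_uL / (culture_mL * 1000.0)


volume = stock_volume_ul(stock_mM=100, final_uM=100, culture_mL=45)
print(f"stock to add: {volume:.0f} µL")
print(f"DMSO carried over: {percent_v_v(volume, 45):.2f}% (v/v)")
```

Under these assumptions, 45 µL of stock per 45 mL aliquot gives 100 µM 3OC<sub>12</sub> and carries over 0.1% (v/v) DMSO, which matches the solvent level used for the control samples.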

#### RNA Preparation and RNAseq

Frozen sample aliquots were thawed at room temperature and pelleted to remove residual RNAlater before treating with 400 µL Lysozyme Solution (1 mg mL<sup>−1</sup> lysozyme, 40 mM EDTA, pH 8) for 1 h at 37 °C. Lysozyme-treated cells were pelleted and resuspended in RNAzol RT (Molecular Research Center) following the manufacturer's protocol for large RNA isolation. Resulting RNA was resuspended in 20 µL pure water and quantified using a NanoDrop 2000 instrument (Thermo Fisher Scientific) based on absorbance at 260 nm. DNase I (Thermo Fisher Scientific) was added to RNA at 0.5 U µg<sup>−1</sup> for 30 min at 37 °C followed by ethanol precipitation. RNA was again resuspended in 20 µL pure water, quantified by NanoDrop and stored at −80 °C. Approximately 1 µg of RNA was subjected to ribosomal RNA depletion using the Ribo-Zero Bacteria kit (Illumina) following the manufacturer's protocol. Samples were ethanol precipitated after rRNA depletion using glycogen as a carrier and resuspended in 10 µL pure water, quantified by NanoDrop and stored at −80 °C. Approximately 100 ng rRNA-depleted RNA was used for library preparation with the NEBNext Ultra Directional RNA Library Prep Kit for Illumina (New England Biolabs), AMPure XP beads (Beckman Coulter, Inc.), and NEBNext Multiplex Oligos for Illumina (New England Biolabs) containing Nextera i7 sequences following the manufacturer's protocol. Resultant libraries were quantified by the Qubit broad-range protocol (Thermo) and visualized by agarose gel before sequencing on an Illumina MiSeq platform (MiSeq Analyzer v2.5.1.3) using the MiSeq Reagent Kit v3 (Illumina) and the Illumina RNAseq protocol with paired end reads of 60 bp. Libraries were analyzed by two replicate MiSeq runs generating 47.3 million raw filtered reads.
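The quantification and DNase dosing steps reduce to two small conversions, sketched below. The factor of 40 ng µL<sup>−1</sup> per A260 unit is the standard conversion for single-stranded RNA at a 10 mm path length; the function names and the example absorbance reading are hypothetical and are not values from this study.

```python
# Standard conversion for single-stranded RNA at a 10 mm path length.
RNA_NG_PER_UL_PER_A260 = 40.0


def rna_conc_ng_per_ul(a260, dilution_factor=1.0):
    """RNA concentration (ng/µL) estimated from absorbance at 260 nm."""
    return a260 * RNA_NG_PER_UL_PER_A260 * dilution_factor


def dnase_units(conc_ng_per_ul, volume_ul, units_per_ug=0.5):
    """DNase I units needed at a fixed dose per microgram of RNA."""
    total_ug = conc_ng_per_ul * volume_ul / 1000.0
    return total_ug * units_per_ug


# Hypothetical reading for a 20 µL preparation.
conc = rna_conc_ng_per_ul(a260=12.5)
print(f"{conc:.0f} ng/µL; DNase I needed: {dnase_units(conc, 20):.1f} U")
```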

#### RNAseq Analysis

Generated MiSeq reads were then analyzed via the RNAseq pipeline described elsewhere (Li et al., 2018). Read quality control was performed with FastQC and Cutadapt as part of Trim Galore (Krueger, 2015), trimming bases with Phred scores <20. EDGE-pro (Estimated Degree of Gene Expression in PROkaryotes) (Magoc et al., 2013) for paired end reads was used with default settings on the remaining reads for alignment to the L. plantarum WCFS1 genome from NCBI<sup>1</sup>, generating RPKM files and count tables. Each count table was read into R, where DESeq2 (Love et al., 2014) was used for differential expression analysis and generating associated statistics as a function of treatment with 3OC<sub>12</sub> and time. The RNA sequence data are available at NCBI GEO under accession GSE124050.
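For readers unfamiliar with the RPKM values mentioned above, the quantity is reads per kilobase of gene per million mapped reads. The sketch below implements the textbook definition with hypothetical numbers; EDGE-pro's own implementation may differ in detail, and DESeq2 operates on the raw count tables rather than on RPKM.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 1,000 bp, in a library of 1 million mapped reads.
print(rpkm(500, 1000, 1_000_000))
```

Because the value is normalized by both gene length and library size, it is suited to comparing expression levels between genes within a sample, whereas count-based models handle between-sample differences with their own size factors.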

#### General Data Analysis

In-house R scripts were used for data analysis and visualization of both RNAseq and proteomics data, including the addition of KEGG annotation (Cock et al., 2009), the generation of boxplots and bar graphs using ggplot2 (Wickham, 2016), and UpSet plots using UpSetR (Lex et al., 2014).

#### Proteomics Analysis

Snap frozen pellets were resuspended in 200 µL of 10% n-propanol in 50 mM ammonium bicarbonate, vortexed and sonicated. Cells were lysed in microtubes with 100 µL caps (Pressure Biosciences, Inc., Easton, MA, United States). The tubes were then placed into the HUB-440 Barocycler (Pressure Biosciences, Inc., Easton, MA, United States) and lysis was performed by 30 s cycles (20 s ON, 10 s OFF) for 60 cycles at 45 kpsi and 25 °C. After lysis, total protein amount in all samples was estimated and adjusted to 12 µg prior to digestion by trypsin in the Barocycler (60 s cycles: 50 s pressure ON, 10 s pressure OFF, 45 kpsi, 50 °C). Digests were dried in a SpeedVac and stored at −20 °C prior to LC-MS/MS analysis. Samples were reconstituted in 30 µL of 0.1% formic acid in water and 0.2 µg of total protein was analyzed by LC-MS/MS using a U3000 LC coupled to an Orbitrap Fusion Lumos mass spectrometer (Thermo Scientific, Waltham, MA, United States). The autosampler loaded sample onto the trap column (PepMap 100, C18, 300 µm ID × 5 mm, 5 µm, 100 Å) via the loading pump at a flow rate of 5 µL min<sup>−1</sup> and 2% solvent B. The analytical pump, set to 300 nL min<sup>−1</sup>, was used to elute peptides from the trap onto the analytical column (Acclaim PepMap RSLC, 75 µm ID × 150 mm, C18, 2 µm, 100 Å). A gradient of 2–60% B in 90 min was used for peptide separation. Solvent A was 0.1% formic acid in water and solvent B was 0.1% formic acid in acetonitrile. Mass spectra were acquired on the Fusion Lumos Orbitrap equipped with a Nanospray Flex Ion Source in data-dependent acquisition mode with 3 s cycle times. A survey scan range of 400–1,600 Da was acquired on the Orbitrap detector (resolution 120 K). Maximum injection time was 50 ms and AGC target was 400,000. The most intense ions with charges of 2–7 were fragmented using HCD (higher-energy collisional dissociation), and ions were excluded for 30 s from subsequent MS/MS submission. The MS/MS detector was the ion trap, with 35 ms injection time and an AGC target of 10,000.
Resulting spectra were extracted, converted into mgf by ProteoWizard software and searched by Mascot (Matrix Science Inc., London, United Kingdom) against a database containing common standards and contaminants, e.g., trypsin and keratin (190 protein sequences), and a database containing all predicted proteins from the L. plantarum genome (3,063 sequences). Oxidation of methionine and deamidation of glutamine and asparagine were selected as variable modifications, enzyme was set to trypsin and 3 missed cleavages were allowed. Precursor ion tolerance was set to 100 ppm and fragment ion tolerance to 1 Da. Protein identifications were further validated by Scaffold (Proteome Software Inc., Portland, OR, United States). Protein identifications were accepted if they could be established at greater than 90.0% probability and contained at least 2 identified peptides. Protein probabilities were assigned by the Protein Prophet algorithm (Nesvizhskii et al., 2003). Proteins that contained similar peptides and could not be differentiated based on MS/MS analysis alone were grouped to satisfy the principles of parsimony. Quantitative analysis was done in Scaffold using emPAIs as an input. Only emPAIs satisfying the probability settings were considered for the analysis (lower scoring matches and probabilities <5% were not included). t-test, fold change, and other calculations were performed on emPAIs using Scaffold (Searle, 2010). In order to avoid divide-by-zero errors caused by the absence of proteins in fold change calculations, we set missing values to 0.3, as previously described (Bible et al., 2015). The proteomics data are available at the ProteomeXchange repository with identifier PXD012232 and DOI 10.6019/PXD012232.

<sup>1</sup>RefSeq NC\_004567.2, NC\_006375.1, NC\_006376.1, NC\_006377.1.
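The emPAI-based fold change with the missing-value substitution described above can be sketched as follows. The emPAI definition follows Ishihama et al. (2005); the function names are hypothetical, and treating an emPAI of zero as "missing" is an assumption of this sketch. The actual calculations in this study were performed in Scaffold.

```python
import math

MISSING_VALUE = 0.3  # placeholder stated in the text for undetected proteins


def empai(n_observed, n_observable):
    """emPAI = 10^(observed peptides / observable peptides) - 1."""
    return 10 ** (n_observed / n_observable) - 1


def log2_fold_change(treated, control, missing=MISSING_VALUE):
    """Log2 ratio of emPAIs; None or 0 is replaced by the missing value."""
    t = treated if treated else missing
    c = control if control else missing
    return math.log2(t / c)


# Hypothetical protein detected only in the treated sample.
print(log2_fold_change(treated=1.2, control=None))
```

Substituting a small constant keeps proteins detected in only one condition in the comparison, at the cost of making their fold changes depend on the chosen constant.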

## RESULTS AND DISCUSSION

Cultures of L. plantarum WCFS1 were grown from single colonies in overnight cultures of nutrient rich MRS media at 37 °C with shaking and used the following day to inoculate 100 mL MRS media (1% v/v). The OD<sub>600</sub> was monitored throughout growth and when it reached 0.5 (approximately 3 h) cultures were split into 45 mL aliquots for treatments. Cultures continued to be incubated at 37 °C and aliquots were removed at 1, 4, and 7 h post-treatment for sample preparation. Transcriptomic data were acquired through replicate RNAseq protocols using Illumina MiSeq and resulted in over 47 million raw filtered reads covering the majority (∼93.6% of the annotated 3,174 genes) of the L. plantarum WCFS1 genome and its three plasmids (**Figure 1A**). Proteomic analysis by tandem mass spectrometry on the Orbitrap Fusion Lumos (Thermo) was able to corroborate up to 589 of the genes identified by RNAseq at certain time points, yielding the coverage illustrated in **Figure 1B**. The resulting datasets from the two techniques were evaluated both independently and as an integrated response in order to dissect the changes displayed by the organism.

#### Transcriptomic Response

The genomic coverage of RNAseq was relatively consistent throughout all time points (**Figure 2A**). There were 2,947 genes identified by RNAseq that were common to all three time points, and all genes identified at 1 and 4 h were accounted for at the other times. A large number of the identified genes were determined to be upregulated in comparison to controls (**Figure 2B**). Of these upregulated genes, 338 were unique to the 1 h samples, 264 were unique to the 4 h samples, and 484 were unique to the 7 h samples, implying that each time point might offer a snapshot of the overall response to 3OC<sub>12</sub>. In addition to genes identified that were unique to specific times, there were also 372 genes observed to be upregulated at all times. It is interesting that there are sets of genes upregulated consistently as well as those unique to specific timepoints, especially considering that our experiments consisted of a single addition of 3OC<sub>12</sub>. Based on the hypothesized longevity of AHLs (Yates et al., 2002; Ramos et al., 2010), such an addition might be characterized as an acute environmental change that would elicit an immediate response that then subsides. These initial results, however, implied that the response may have consisted of a cascade of genetic changes over the course of the entire experiment, where our sampling times were only able to capture small glimpses of the full response. Additionally, the consistent upregulation of one set of genes might indicate a long-term response relative to what might be expected for a purportedly transient stimulus.

FIGURE 4 | Fatty acid metabolism changes. (A) Changes observed by RNAseq within fatty acid metabolism subcategory at 1, 4, and 7 h plotted as Mean Log<sub>2</sub> fold change of 3 biological replicates. (B) RNAseq changes of individual genes involved in Malonyl-CoA synthesis, fatty acid synthesis, and the conversion of propionyl-CoA to Lipoamide E plotted as Mean Log<sub>2</sub> fold change of 3 biological replicates at 1, 4, and 7 h.

All genes identified throughout the experiments were grouped according to KEGG Pathway Categories (**Figure 3A**) in order to determine where the majority of the transcriptomic activity was occurring. A large portion of identified genes from all samples fell under the Genetic Information Processing group consisting of Transcription, Translation, Folding, Degradation, Replication and Repair. Nearly a quarter of identified genes were assigned to this category at 1 h, and that abundance decreased as the experiments progressed. Samples analyzed at 1 h after 3OC<sub>12</sub> addition showed no real changes in Genetic Information Processing, but there was a marked decline in treated samples at 4 h which could indicate the diversion of cellular resources elsewhere for the response. Examples of reductions at 4 h are seen with DNA repair genes such as exoA, recJ, tag2, and mutL, protein export genes such as yidC1, and RNA degradation genes such as recQ2 and rnj. The activity in Genetic Information Processing at 7 h, however, was the lowest of the whole experiment but equivalent in both treated and control samples, indicating a restoration to background levels. The response at 1 h showed a decrease in Cellular Growth genes as cells were initially reacting to the AHL stimulus, but by 4 h these genes had returned to the levels observed in the controls. Increases in Membrane Transport and Signal Transduction categories were also observed at 1 h, such as upregulation of the transport associated genes mtsABC, metN, and livB and two-component system genes rrp11, hpk1, aad, citCEFX, and pltKR. Increases in Cellular Community genes were also observed at 1 h based on the activity of the known quorum sensing genes lp\_0783 (oligopeptide transport) and oroP. Along the same lines of signaling capabilities, there were increases in the gene sip1 that is annotated to function in AIP maturation and export. This might be expected from a cell adjusting to a recently changed environment and initiating signal cascades as a response. Both categories of Membrane Transport and Signal Transduction were returned to control levels after 1 h. The most abundant category of genes noted throughout all time points was that of Global Metabolism, accounting for at least half of all identifications. The amount of RNA devoted to Global Metabolism increased at 4 h in treated samples, with increases noted for nagA, argCJ, pts9AB, pgk, tpiA, enoA, galU, pgm, galE1, luxS, adk, and multiple acc and fab genes, for example. The amount of total RNA accounted for by this category persisted at 7 h, but comparisons with control samples showed it to be of background level abundance.

Further investigation of the Global Metabolism response is outlined in **Figure 3B**. When partitioning the genes of Global Metabolism into 9 further subcategories, it was clear that the increase noted at 4 h was due to activity within Fatty Acid Metabolism. It can be noted here that all subcategories of Global Metabolism revert to a downregulated state at 7 h. **Figure 4A** shows the further subdivision of Fatty Acid Metabolism into 9 more categories, wherein the Biosynthesis of Unsaturated Fatty Acids and Fatty Acid Biosynthesis groups seem to be the main contributors to the changes noted in Global Metabolism. Both of these groups show upregulation in treated samples at 4 h and downregulation at 7 h. **Figure 4B** illustrates the components of these two groups and the individual changes noted from treated samples. The incorporated genes make up three groups that represent the conversion from Acetyl-CoA to Malonyl-CoA as a lipid biosynthesis precursor (acc genes), the actual synthesis of fatty acids via acyl-carrier protein-containing genes (fab genes), and the interconversion of Propionyl-CoA and Lipoamide E (pdhD, pflF, and pta). Each of the genes involved shows an increase in activity at 4 h and a decrease at 7 h, with the exception of the Propionyl-CoA involved genes pdhD and pflF. Previous investigators have noted that environmental stresses can impart changes on membrane hydrophobicity and adhesion capabilities of Lactobacillus spp., indicating that these organisms are prone to alter their fatty acid makeup in response to their environment (Haddaji et al., 2017); this comprehensive activation of Fatty Acid Synthesis genes therefore likely follows the same reasoning.

#### Proteomic Response

The proteomic response of L. plantarum WCFS1 cultures to 3OC<sub>12</sub> was assessed by LC-MS/MS of trypsinized samples on the Orbitrap Fusion Lumos, wherein the output spectra were assigned to known annotated proteins using Mascot (Perkins et al., 1999) and normalized with the Exponentially Modified Protein Abundance Index (emPAI) (Ishihama et al., 2005) to determine a relative abundance of protein per sample similar to strategies used in RNAseq methods. The resulting analysis identified 300 proteins common to samples at all times (**Figure 5A**). In addition, there were 20 proteins detected unique to the 1 h time point, 70 proteins unique to the 4 h time point, and 126 proteins unique to the 7 h time point. Furthermore, there were 160 proteins that were exclusively identified in the two later time points, and only 3 proteins that were registered at 1 and 7 h only. When comparing treated and control samples, a pattern emerged similar to that seen with RNAseq where each time point contained a unique set of proteins that seemed to increase in number as the experiment progressed (**Figure 5B**). There were 38 unique upregulated proteins identified at 1 h, 67 at 4 h, and 221 by the 7 h time point. Overall, there were 31 proteins that remained upregulated throughout the experiment. This sweeping trend of increasing translational activity similarly seemed to indicate a 3OC<sub>12</sub> response.
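The overlap counts reported for both RNAseq and proteomics (identifications unique to one time point versus shared by all) amount to a partition of per-time-point sets. The sketch below shows that bookkeeping with made-up identifiers; the function name and the toy data are hypothetical, and the published analysis used in-house R scripts with UpSetR.

```python
def partition_by_timepoint(up):
    """Split identifications into those unique to each time point and
    those shared by all.

    up: dict mapping a time-point label to a set of gene or protein IDs.
    """
    labels = list(up)
    result = {}
    for label in labels:
        # Union of every other time point, used to find exclusive members.
        others = set().union(*(up[o] for o in labels if o != label))
        result[f"only_{label}"] = up[label] - others
    result["all"] = set.intersection(*up.values())
    return result


# Toy data with placeholder identifiers, not genes from this study.
toy = {
    "1h": {"g1", "g2", "g5"},
    "4h": {"g1", "g3", "g5"},
    "7h": {"g1", "g4"},
}
for name, members in partition_by_timepoint(toy).items():
    print(name, sorted(members))
```

Identifiers shared by exactly two time points (such as `g5` here) fall into neither group, which is why the unique and common counts in the text do not sum to the total number of identifications.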

Further investigation into the nature of the identified proteins showed an interesting devotion of resources in both control and treated samples (**Figure 6A**). Treated samples showed a higher abundance of protein at 1 h contributing to Cellular Processes. In particular, genes relating to processes in the Cellular Community category were higher in treated samples including an increased abundance in Lon protease and the oligopeptide transporter lp\_0018, but Cell Growth genes were lower (ClpP, ClpX, FtsZ, and FtsA). At 4 h, however, the Cell Growth genes were comparable in both treated and control samples, and Cellular Community genes in treated samples had reduced to below the levels observed in controls. Detected at a lower abundance than the Cellular Process genes were those related to Environmental Information Processing. A clear increase in this category consisting of Signal Transduction and Membrane Transport was observed for treated samples at 1 h based on the abundance of proteins such as DltD, and greatly diminished thereafter. The second highest category of identified proteins was that of Genetic Information Processing. Treated and control samples had devoted roughly an equivalent amount of resources toward this group at 1 h, but at 4 h the translation in treated cells had diminished to nearly half of that observed in controls. This drop could be attributed to reductions in identifications of chaperones DnaK and Hsp3, as well as repair genes XseAB and MutS2, and large ribosomal proteins RplAMNOBU. Considering the values of top upregulated genes identified at all times (**Table 1**), however, the abundance of some ribosomal proteins (RplKQTVW, RpmABI, RplF, and RpsKNS) and chaperones (Tig) decreased from 1 h to 4 h but remained above the levels of the control samples, indicating a sort of turnover of proteins that are likely involved in the same process over time.

TABLE 1 | Upregulated proteins at all timepoints identified by proteomics. Descriptions of the 31 genes identified as upregulated by proteomics at all timepoints along with their corresponding RNAseq Log<sub>2</sub> fold changes. RNAseq values represent Log<sub>2</sub> fold change of gene abundance. Proteomic values represent Log<sub>2</sub> fold change of emPAI values.

The most abundant proteins at all times belonged to the Metabolism category, specifically Global Metabolism, accounting for at least half of the proteins identified, similar to what was seen with RNAseq data. While approximately equivalent in abundance at 1 h, proteins identified in this category from treated cells increased notably at 4 h before finally returning to levels similar to controls at 7 h. The activity noted in the Global Metabolism category appeared to come from genes functioning in Metabolic Pathways (**Figure 6B**), where at 4 h the abundance of proteins in treated samples was nearly double that identified from controls. Proteins encoded from genes such as xylH (putative tautomerase EC 5.3.2.6, lp\_1712), dltC1 (D-alanyl carrier protein 1, lp\_2017), iolE (inosose dehydratase, lp\_3607), fabI (enoyl-ACP reductase, lp\_1681), and dak2 (dihydroxyacetone phosphotransferase, lp\_0169) are examples of contributors to this increase in abundance. While FabI and Dak2 can both be attributed to different aspects of lipid metabolism, XylH is an enzyme participating in Xylene Degradation, and the lack of other enzymes that L. plantarum possesses in this pathway suggests it may be a participant of a community effort to degrade aromatic compounds. The presence of IolE is also curious considering its role in inositol phosphate metabolism, a similarly incomplete pathway within L. plantarum itself. DltC1, however, poses an interesting observation given its role in cationic antimicrobial peptide resistance. The expression of dltC1 is controlled by a two-component system along with other dlt genes all found in close proximity to each other on the chromosome, and showed a similar upregulation profile to its cohort dltD. Indeed, the scope of two-component systems is vast and diverse among studied bacteria, and these systems have been known to commonly participate in cross-talk (Procaccini et al., 2011), implying an activation of a two-component system by some facet of 3OC<sub>12</sub> treatment whose downstream cascade may resemble that of cationic antimicrobial peptide resistance.

The top identified proteins throughout the experiment outlined in **Table 1** are a more detailed reflection of the summarized categorical changes above. Very few of the proteins identified with the criteria of having an emPAI Log<sub>2</sub> fold change > 0.25 at one or more time points belong to an Energy Metabolism pathway. The only protein that meets these criteria is GapB, which seemed to steadily increase in abundance throughout the experiment. While there were a number of ribosomal proteins detected at all time points, they do not all maintain similar expression trends based on observed spectra. The most abundant group of proteins identified either belonged to Lipid Metabolism or were involved in cell membrane architecture. Previous investigators have noted L. plantarum to undergo changes in membrane composition as a response to environmental stressors (Haddaji et al., 2017), which was interesting based on the presence of LuxS and ClpC in this list of identified proteins. While the latter is a characterized member of the CtsR regulated stress response pathway (Fiocco et al., 2010), the former is a member of various pathways of interest. The LuxS enzyme is a well-characterized S-ribosylhomocysteine lyase that plays a part in various amino acid syntheses, but is also responsible for the synthesis of the bacterial interspecies signaling molecule autoinducer-2 (AI-2). Recently LuxS has been linked with the acid stress response in certain strains of L. plantarum as well as the increased adherence capabilities of cells (Jia et al., 2018) which has been characterized as a basic bacterial survival mechanism (Hunt et al., 2004). It is intriguing, however, to consider the potential that L. plantarum is initiating a quorum sensing event in response to sensing the pathogen-associated 3OC12. Based on recent studies, it is not farfetched to deduce that if LuxS is used in the face of environmental stress to increase the fitness of L. plantarum, and 3OC<sub>12</sub> is causing the upregulation of luxS in our cultures, then L. plantarum possesses some way of recognizing the AHL as an indication of environmental stress.

#### Integrated Transcriptomic and Proteomic Response

The integration of the proteomic and RNAseq datasets reinforced hypotheses generated from the independent analyses. Cross-referencing the two datasets allowed proteomic verification of 337 genes previously detected by RNAseq at 1 h, 544 at 4 h, and 589 at 7 h (**Figures 7A,C,E**). The threshold values considered in our analyses were |Log<sub>2</sub> fold change| > 0.25 for proteomic data and > 0.5 for transcriptomic data in order to gain a broader view of the response using both methods. Correlation plots at specific time points show the number of genes satisfying both threshold criteria (**Figures 7B,D,F** and **Table 1**); the biggest overlap of identifications from proteomics and transcriptomics occurs at 4 h (**Figure 7D**). When analyzing the data produced across all time points by both analyses, we noticed a trend in which genes that show upregulation via RNAseq later appear as protein identifications via proteomics at the next time point. We therefore additionally checked correlation plots of identifications that met our threshold criteria with a 3 h time offset (**Figure 8**). By analyzing the correlation of changes in RNAseq at 1 h and proteomics at 4 h (**Figure 8A**), we identified 13 genes satisfying our previously described criteria, for which all RNAseq results (except those for lp\_2433 and lp\_0091) met the threshold of p < 0.05 (Wald test, **Table 2**). Using a similar approach to analyze identifications by RNAseq at 4 h and proteomics at 7 h (**Figure 8B**), we noted 45 genes of interest, of which 71% similarly showed p < 0.05 (Wald test, **Table 2**).
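The time-offset comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `offset_matches`, the dictionary layout, and the example gene names and values are all hypothetical. Only the two thresholds and the 3 h offset come from the text.

```python
from typing import Dict, List, Tuple

# Thresholds from the text: |log2 FC| > 0.5 for RNAseq, > 0.25 for proteomics.
RNA_THRESHOLD = 0.5
PROTEIN_THRESHOLD = 0.25


def offset_matches(
    rna: Dict[float, Dict[str, float]],
    protein: Dict[float, Dict[str, float]],
    offset_h: float = 3.0,
) -> List[Tuple[str, float, float, float]]:
    """Pair RNAseq at time t with proteomics at t + offset_h and return
    (gene, rna_time, rna_lfc, protein_lfc) for genes passing both thresholds."""
    hits = []
    for t, rna_lfc in sorted(rna.items()):
        prot_lfc = protein.get(t + offset_h)
        if prot_lfc is None:
            continue  # no proteomic sample at the offset time point
        for gene, r in sorted(rna_lfc.items()):
            p = prot_lfc.get(gene)
            if p is None:
                continue  # gene was not identified by proteomics
            if abs(r) > RNA_THRESHOLD and abs(p) > PROTEIN_THRESHOLD:
                hits.append((gene, t, r, p))
    return hits


# Hypothetical values for illustration only; these are not data from the study.
rna_example = {1.0: {"geneA": 0.9, "geneB": 0.2}, 4.0: {"geneA": -0.7, "geneC": 1.1}}
protein_example = {4.0: {"geneA": 0.4, "geneB": 0.6}, 7.0: {"geneA": -0.1, "geneC": 0.3}}

for hit in offset_matches(rna_example, protein_example):
    print(hit)
```

The sketch applies the thresholds to absolute values and does not require the transcript and protein changes to share a sign. A stricter variant could add that check.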

The genes satisfying the Log<sub>2</sub> fold change thresholds for both RNAseq and proteomics with offset times are summarized in **Table 2**. Changes in RNAseq that were verified with proteomics consisted of upregulation of membrane-associated proteins encoded by lp\_3679 (CscB family cell surface protein) and mtsA (metal transporter), chaperones GroES and GroEL, and lp\_1173 (UDP-N-acetylglucosamine 2-epimerase), which could be involved in membrane architecture. The transcriptional changes at 4 h that were corroborated by a similar protein response at 7 h covered a variety of processes. There seemed to be an emphasis on ATP generation and conservation among the observed changes in abundance. Genes such as mtlD, involved in fructose/mannose metabolism, were downregulated, as were genes such as mdxE, responsible for maltooligosaccharide transport, and lp\_3211, involved in cystine transport. ABC transporters similar to these have been shown to be downregulated during the onset of stationary phase-related environmental stressors (Cohen et al., 2006), based on the assumption that they require ATP, which at that time is a precious commodity. When checking the rest of our data for this trend of ATP conservation, we found that of the 93 ABC transporters detected by RNAseq at 4 h, all but 16 were downregulated compared to controls, and 11 of the 14 detected by proteomics at 7 h were also downregulated. While L. plantarum cells remained in log phase of growth throughout these experiments, it is possible that a similar environmental pressure was exerted on the cells, causing them to reprioritize their use of ATP. The genes pgk, tpiA and pmg9 involved in glycolysis were upregulated, presumably to facilitate an influx of metabolites into this pathway to more readily produce ATP. Also noted was an increase in the abundance of adk. This gene encodes the adenylate kinase enzyme responsible for generating ADP and dADP as substrates for ndk to convert into ATP. 
Increases were also noted in the levels of hprT, which is responsible for the generation of GMP from guanine as a precursor to GTP synthesis. In addition to energy conservation, genes involved in nucleotide sugar metabolism such as nagA, galE1, and galU were all observed to be upregulated during this portion of the response as well. UDP-glucose 4-epimerase, encoded by galE1, is also a participant in the Leloir pathway and has been observed to be upregulated during late log phase of growth as cells alter their membrane structure to deal with the changing environment (Cohen et al., 2006). Like what was observed with the ABC transporters, cells did not enter stationary phase but it is possible they are adapting to environmental stressors by altering their membrane composition (Haddaji et al., 2017). This hypothesis is reinforced by the upregulation of a number of genes involved in the conversion of Acetyl-CoA to Malonyl-CoA (accA2, accB2, accC2, and accD2) and the subsequent conversion of Malonyl-CoA to different fatty acids (fabZ1, fabI, and lp\_3045).

FIGURE 7 | … 544 genes in common, and (E) 7 h with 589 genes in common. Correlation plots of genes identified by RNAseq and Proteomics using threshold cutoffs for genes with Log<sub>2</sub> fold change > 0.25 for proteomics and > 0.5 for RNAseq (blue dots) at (B) 1 h, (D) 4 h, and (F) 7 h.
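The ABC transporter counts quoted for this part of the response can be restated as proportions. The short sketch below does only that arithmetic. The helper name `downregulated_share` is hypothetical, and the inputs are the counts given in the text (93 detected by RNAseq at 4 h with all but 16 downregulated; 11 of 14 downregulated by proteomics at 7 h).

```python
def downregulated_share(downregulated: int, detected: int) -> float:
    """Fraction of detected transporters that were downregulated."""
    if detected <= 0:
        raise ValueError("detected must be a positive count")
    return downregulated / detected


# Counts as stated in the text.
rnaseq_4h = downregulated_share(93 - 16, 93)   # 77 of 93 by RNAseq at 4 h
proteomics_7h = downregulated_share(11, 14)    # 11 of 14 by proteomics at 7 h

print(f"RNAseq, 4 h:     {rnaseq_4h:.1%}")
print(f"Proteomics, 7 h: {proteomics_7h:.1%}")
```

Both proportions are close to four in five, which is the basis for describing the trend as shared between the two datasets.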

#### Stress Response

Stress responses involve a variety of differentially expressed genes and can originate from a number of different stimuli, as mentioned above. There have been a number of studies on Lactobacillus spp. responding to various stresses such as oxidative stress (Serrano et al., 2007), general acid stress (Heunis et al., 2014; Seme et al., 2015), phenolic acid stress (Gury et al., 2009), lactic acid stress (Pieterse et al., 2005), alkaline stress (Lee et al., 2011), metal stress (Tong et al., 2017), growth phase transition stress (Cohen et al., 2006), and the general Classes I and III stress responses (Van Bokhorst-van de Veen et al., 2013). Most of these studies indicate a large number of participating genes with no apparent timeframe for induction of the response or for restoration to the pre-stimulus state, presumably due to the nature of the individual stress tested. Although 3OC<sub>12</sub> is not traditionally considered a stressor, we observed similarities in gene and protein activity with stress responses characterized in similar organisms when we analyzed our cultures after exposure to this AHL.

One of the top identified genes in both RNAseq and proteomics was mrsB (**Table 2**). The enzyme peptide-methionine (R)-S-oxide reductase encoded by this gene is responsible for reducing methionine that has been oxidized by reactive oxygen species. Aside from the specific activity of MrsB, thioredoxin is a more general oxidative-stress response protein whose activity has been established in L. plantarum (Serrano et al., 2007). While those investigators emphasized trxB1 as the key player involved in the stress response, our data showed significant upregulation of the genes trxB, trxA2, and trxH at 1 and 4 h in transcriptomics. The collective upregulation of trxB, trxA2, trxH, and mrsB indicates that the cells are responding much as they would under oxidative stress conditions.

The universal stress protein (Usp) family involves nucleotide-binding proteins that function in various non-specific stress conditions. Research has been done in particular on their involvement with phenolic acid stress in L. plantarum (Gury et al., 2009) focusing on the activity of Usp1. There are 10 uncharacterized universal stress proteins in L. plantarum WCFS1, and all but 1 show upregulation to some extent at 1 h by RNAseq. Of those 9, 4 are upregulated still at 4 h while 3 are downregulated, and 5 are upregulated at 7 h while only 1 is downregulated. Without further elucidation of the roles of these individual universal stress proteins, no conclusions can be drawn about the nature of the stress response displayed. The only thing that can be concluded is that it does not follow the pattern observed from stress brought on by the presence of organic acids, as the Usp immediately downstream of the PadR repressor studied (Gury et al., 2009) displayed changes opposite to those that would be expected from such a response.

#### TABLE 2 | Genes identified by integrated time offset analysis between RNAseq and proteomics.

RNAseq values represent Log<sub>2</sub> fold change in transcript abundance. Proteomic values represent Log<sub>2</sub> fold change in emPAI values.

Descriptions of genes reaching threshold criteria of Log<sub>2</sub> fold change > 0.25 in proteomic analysis and > 0.5 in RNAseq analysis for both the 1 h RNAseq and 4 h proteomics set and the 4 h RNAseq and 7 h proteomics set.

Studies on the acid stress response in L. plantarum strains (Heunis et al., 2014) have shown increasing activity in a number of seemingly unrelated genes, characterizing a profile for the adaptation to the acidic environment that includes, for example, changes in energy metabolism. Of the 18 genes previously observed as upregulated in response to acid stress (Heunis et al., 2014), only 3 are upregulated in our RNAseq data at 1 h (pmi or lp\_2384, ldhL1 or lp\_0537, and pta or lp\_0807), 2 at 4 h (nagB or lp\_0226, and ldhL2 or lp\_1101), and 1 at 7 h (acdH or lp\_0329). The genes upregulated at 1 h and 4 h are involved in carbohydrate metabolism, while acdH encodes a putative acetaldehyde dehydrogenase. Furthermore, the ldhL gene observed as upregulated at multiple times encodes the enzyme lactate dehydrogenase, which reduces pyruvate to lactate as a means of regenerating NAD<sup>+</sup>, the hallmark of the lactic acid bacteria. Another acid-induced stress marker in L. plantarum has been shown to be LuxS (Jia et al., 2018). While it did not reach the threshold set above (Log<sub>2</sub> fold change > 0.5), RNAseq was able to identify a slight but significant luxS upregulation at 1 h (Log<sub>2</sub> fold change = 0.25, or a 19% increase; p = 2 × 10<sup>−5</sup>) and at 4 h (Log<sub>2</sub> fold change = 0.44, or a 37% increase; p = 0.002). Proteomic analysis was further able to confirm LuxS upregulation throughout all treated samples. While previous studies have shown LuxS to be involved in the acid response, we did not observe any other significant markers for such stress. Based on the function of LuxS, however, it is highly probable that other factors could be responsible for its induction.
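The percent increases quoted alongside the luxS Log2 fold changes follow from the standard relation between a log2 fold change and a ratio: percent change = (2^LFC − 1) × 100. A minimal sketch of that conversion is given below. The function name is hypothetical, and the inputs are the rounded fold changes quoted in the text, so the output may differ slightly from percentages derived from unrounded values.

```python
def log2fc_to_percent(log2_fc: float) -> float:
    """Convert a log2 fold change into a percent change relative to control."""
    return (2.0 ** log2_fc - 1.0) * 100.0


# Rounded luxS values from the text, plus the RNAseq threshold for comparison.
for lfc in (0.25, 0.44, 0.5):
    print(f"log2 FC = {lfc:.2f} -> {log2fc_to_percent(lfc):+.1f}%")
```

Seen this way, the RNAseq threshold of 0.5 corresponds to a change of roughly 41%, which is why a significant but modest luxS increase falls below it.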

Studies on transcriptomic activity during growth phase transitions in L. plantarum provide a long list of differentially expressed genes (Cohen et al., 2006). While cultures in our experiments never entered stationary phase of growth, some of the changes in gene expression observed in our experiments were similar to those noted in late-log phase and stationary phase. Examples include dacA1, hpk11, and rrp11, all of which were noted to increase at 1 and 4 h in RNAseq. The Rrp11 protein functions as a two-component system response regulator and was also identified by proteomics at 7 h, further reinforcing its upregulation over time. These three genes have been hypothesized to belong to a two-component system involved in cell wall maintenance, adding to the list of similarly tasked genes discussed above. Cohen et al. (2006) also noted the high abundance of Plantaricin genes throughout growth that decreased with the onset of stationary phase. Our RNAseq data showed that of the 21 annotated pln genes, all were downregulated at 1 h, 5 were upregulated at 4 h, and by 7 h there were 13 that were upregulated. As all time points in our experiments represent log phase of growth, the increasing expression of pln genes over time is consistent with previous observations of their phase-specific expression patterns. Their upregulation in comparison to controls, however, makes the activity of these Plantaricin genes interesting. Considering previous instances in which L. plantarum has employed Plantaricins, such as when grown in co-culture with other bacteria (Maldonado et al., 2003, 2004a,b), the idea of 3OC<sub>12</sub> stimulating a Plantaricin response is an intriguing example of a stress-induced defense mechanism. Cohen et al. (2006) also noted a small group of stress-related proteins that peak during stationary phase, specifically dps1, grpE, clpP, and kat, all of which are upregulated at 1 h in our experiments. With the exception of clpP, all of these genes are also upregulated at 4 h.

General stress responses can be grouped into classes based on their regulators. The causes for such responses are traditionally attributed to heat shock or general stress conditions (Derre et al., 1999). The Class I and III stress response regulons have recently been characterized in L. plantarum by a transcriptomic analysis of single- and double-mutants of their respective regulators HrcA and CtsR (Van Bokhorst-van de Veen et al., 2013). The Class I response regulon governed by the HrcA repressor showed the involvement of genes such as the chaperones groS and groL, the hsp1 small heat shock protein, and three putative genes annotated as an integrase/recombinase (lp\_1268), a CAAX family membrane-bound protease (lp\_0726), and an uncharacterized protein (lp\_1880). Each of these Class I regulon members, with the exception of the protease and the recombinase, was observed to be upregulated via RNAseq at 1 and 4 h. Proteomics further confirmed increased abundance of GroES and GroEL at 4 h and of Hsp1 at 7 h. The Class III response regulon that is controlled by the repressor CtsR has been studied several times in L. plantarum (Fiocco et al., 2009; Fiocco et al., 2010; Van Bokhorst-van de Veen et al., 2013), and consists mainly of the Clp proteases and Clp ATPases. The existing clp genes in the L. plantarum WCFS1 genome (clpPCEBXL) were all determined to be upregulated by RNAseq at 1 h (with the exception of clpL) and at 4 h (with the exception of clpE), and all were downregulated compared to controls at 7 h in our experiments. Proteomic analysis was further able to confirm the upregulation of ClpP at 7 h. Aside from clp genes, CtsR has also been shown in L. plantarum to control the small heat shock protein encoded by hsp1, the protease subunits encoded by hslU and hslV, a tyrosine recombinase xerC, and an annotated aldose 1-epimerase (lp\_1843) (Van Bokhorst-van de Veen et al., 2013). The putative aldose 1-epimerase (lp\_1843), the two protease subunits (lp\_1845 and lp\_1846) and the tyrosine recombinase (lp\_1847) all fall within the same region of the chromosome and are likely regulated by a single promoter. Consistent with this, our RNAseq data show that these four genes are upregulated at 1 h, but only the distal lp\_1843 remained upregulated by 4 h. Proteomics, however, was only able to confirm the upregulation of the protease subunits HslU and HslV.

Despite the abundance of genes described as part of various stress responses, the responses we observed from our cultures did not perfectly align with any established stress response. Although 3OC<sub>12</sub> is not traditionally considered a stressor, and the genes upregulated in response to its addition to cultures of L. plantarum do not follow any single previously defined stress response pathway, we consider this AHL sensing to be a form of stress response based on our observations of downregulated cellular growth genes in addition to upregulation of previously characterized stress response genes compared to untreated controls. In particular, this AHL response shares similarities with a number of different characterized stress responses, namely oxidative stress responses, general Classes I and III stress responses, and late growth stage transition responses. Based on the nature of our experiments we can conclude that our cultures were free of any traditional initiator of these responses, and therefore these findings represent a unique response that may borrow facets of other commonly used response pathways to achieve the most desirable survival phenotype.

#### CONCLUSION

Here we have shown that L. plantarum WCFS1 is capable of sensing the Gram-negative quorum sensing molecule 3OC<sub>12</sub> from P. aeruginosa. Transcriptomic and proteomic analyses indicate a number of genes specifically upregulated as a result of 3OC<sub>12</sub> titration into pure cultures of L. plantarum. The majority of genes identified by both methods fall within the category of Global Metabolism with an emphasis on Fatty Acid Synthesis, although a number of identified genes also hint at the organism's alteration of energy metabolism in order to conserve ATP, similar to what might occur in a transition to stationary phase of growth. Further changes registered by both methods include genes consistent with both Class I and Class III stress responses traditionally caused by general environmental stress. While no stress response profiles could be perfectly matched to previous omic investigations of stress responses, our data showed similarities with cellular responses to oxidative stress and those associated with growth phase transitions, among others, such as thioredoxin activity, fatty acid synthesis, and membrane maintenance. Based on the genes identified we can conclude that the cell is responding to environmental stress in a manner unlike other previously established responses. The upregulation of the AI-2 synthesizing enzyme LuxS is an intriguing occurrence that suggests an attempt by L. plantarum to externalize a cell signaling event as a response to AHL addition. The upregulation of the Plantaricin genes is a similar event that, when taken together with the luxS activity, paints the picture of a probiotic organism initiating a defense mechanism in response to a pathogen-associated small molecule. We have therefore provided evidence that L. plantarum responds to the presence of 3OC<sub>12</sub> by initiating multiple quorum sensing systems of its own, given the luxS activity and the upregulation of Plantaricin that results from AIP signaling. In addition, the induction of two-component systems and of multiple putative transcription factors further hints at the complex cell signaling cascade initiated by this AHL.

#### AUTHOR CONTRIBUTIONS

JS, SD, DL, and SW contributed to the experimental work described in this manuscript and the preparation of the manuscript. JS and SD cultured bacteria and prepared samples for transcriptomics and proteomics. JS performed library preparation and transcriptomics studies. DL was responsible for all proteomics work. SD performed data compiling and preliminary analysis. SW directed experimental design and analysis.

#### FUNDING

The authors were supported by funds of the Naval Research Laboratory (MA041-06-41) and funds associated with the Office of the Secretary of Defense Applied Research for the Advancement of S&T Priorities (ARAP) Synthetic Biology for Military-Relevant Environments (SBME) program.

### REFERENCES



systems in Gram-positive bacteria. Mol. Microbiol. 24, 895–904. doi: 10.1046/j.1365-2958.1997.4251782.x


aeruginosa binds to the lasI promoter. J. Bacteriol. 188, 815–819. doi: 10.1128/JB.188.2.815-819.2006


epilithic colonial cyanobacterium Gloeothece PCC6909. ISME J. 2, 1171–1182. doi: 10.1038/ismej.2008.68



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Spangler, Dean, Leary and Walper. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Systems Biology – A Guide for Understanding and Developing Improved Strains of Lactic Acid Bacteria

Jianming Liu<sup>1,2†</sup>, Siu Hung Joshua Chan<sup>3†</sup>, Jun Chen<sup>1†</sup>, Christian Solem<sup>1</sup>\* and Peter Ruhdal Jensen<sup>1</sup>\*

<sup>1</sup> National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark, <sup>2</sup> Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Champaign, IL, United States, <sup>3</sup> Department of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO, United States

#### Edited by:

Jan Kok, University of Groningen, Netherlands

#### Reviewed by:

Fernanda Mozzi, CONICET Centro de Referencia para Lactobacilos (CERELA), Argentina Gerald Fitzgerald, University College Cork, Ireland

#### \*Correspondence:

Christian Solem chso@food.dtu.dk Peter Ruhdal Jensen perj@food.dtu.dk

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Received: 20 December 2018 Accepted: 04 April 2019 Published: 30 April 2019

#### Citation:

Liu J, Chan SHJ, Chen J, Solem C and Jensen PR (2019) Systems Biology – A Guide for Understanding and Developing Improved Strains of Lactic Acid Bacteria. Front. Microbiol. 10:876. doi: 10.3389/fmicb.2019.00876

Lactic Acid Bacteria (LAB) are extensively employed in the production of various fermented foods, due to their safe status, ability to affect texture and flavor and finally due to the beneficial effect they have on shelf-life. More recently, LAB have also gained interest as production hosts for various useful compounds, particularly compounds with sensitive applications, such as food ingredients and therapeutics. As for all industrial microorganisms, it is important to have a good understanding of the physiology and metabolism of LAB in order to fully exploit their potential, and for this purpose, many systems biology approaches are available. Systems metabolic engineering, an approach that combines optimization of metabolic enzymes/pathways at the systems level, synthetic biology as well as in silico model simulation, has been used to build microbial cell factories for production of biofuels, food ingredients and biochemicals. When developing LAB for use in foods, genetic engineering is in general not an accepted approach. An alternative is to screen mutant libraries for candidates with desirable traits using high-throughput screening technologies or to use adaptive laboratory evolution to select for mutants with special properties. In both cases, by using omics data and data-driven technologies to scrutinize these, it is possible to find the underlying cause for the desired attributes of such mutants. This review aims to describe how systems biology tools can be used for obtaining both engineered as well as non-engineered LAB with novel and desired properties.

Keywords: food fermentation, metabolic engineering, strain development, control analysis, screening and selection

**Abbreviations:** γ-GluH, gamma-glutamyl hydrolase; ACK, acetate kinase; ADHE, alcohol dehydrogenase; ALD, α-acetolactate decarboxylase; ALE, adaptive laboratory evolution; ALS (ILVB), α-acetolactate synthase; ATPase, (F1F0)-ATPase; ButBA, diacetyl reductase and butanediol dehydrogenase; FBA, flux balance analysis; FolA, dihydrofolate reductase; FolB, dihydroneopterin aldolase; FolC, dihydrofolate synthase; FolE, GTP cyclohydrolase I; FolK, hydroxymethyldihydropterin pyrophosphokinase; FolP, dihydropteroate synthase; FolQ, dihydroneopterin triphosphate pyrophosphohydrolase; Glu, glutamate; GMO, genetically modified organism; LAB, lactic acid bacteria; LDH, lactate dehydrogenase; Nox, NADH oxidase; pABA, para-aminobenzoic acid; PabAB, chorismate synthetase component I and II; PabC, 4-amino-4-deoxychorismate lyase; PDHc, pyruvate dehydrogenase complex; PFL, pyruvate formate lyase; PTA, phosphotransacetylase; PTS, phosphotransferase system; RibA, GTP cyclohydrolase II/3,4-dihydroxy-2-butanone-4-phosphate synthase; RibB, riboflavin synthase (a subunit); RibG, riboflavin-specific deaminase/reductase; RibH, riboflavin synthase (b subunit); SNV, single nucleotide variation.

## INTRODUCTION

fmicb-10-00876 April 29, 2019 Time: 17:54 # 2

Lactic acid bacteria (LAB) are traditionally used as starters in fermented food production. LAB produce lactic acid, the presence of which reduces growth of pathogens and other undesirable microbes in the products, while contributing to an appealing acidic taste. LAB are also capable of producing various aromatic compounds, the formation of which is initiated by lipolysis and proteolysis, and these also contribute to the pleasant organoleptic characteristics of different fermented products such as yogurt, cheese, and butter (Song et al., 2017). LAB are thus highly industrially relevant microorganisms and represent a multi-billion dollar business worldwide (Johansen, 2017).

Several phenotypic characteristics of LAB, such as acidification rate (Martinussen et al., 2013), robustness to environmental stresses and phage resistance (Garneau and Moineau, 2011; Papadimitriou et al., 2016), contribution to flavor and texture formation (Broadbent et al., 2005; Chen J. et al., 2017), bio-protection activity (Siedler et al., 2019) and probiotic function (Ljungh and Wadström, 2006), are important for real-life applications and are continuously being investigated by both the academic and industrial communities. Traditional non-GMO (genetically modified organism) strain modification approaches, such as random mutagenesis and screening, selection using toxic analogs, and adaptive evolution, have a long history of successful use for improving properties of LAB. The use of mutagenesis for improving LAB is only constrained by the genetic repertoire of the bacteria, but a large screening input is usually required to identify desired mutants. The latter two methods are specific and very selective, but their implementation is largely condition-dependent. For analog selection, the target pathway should be tightly associated with anabolism, and an effective metabolite analog should also be easily accessible. For adaptive evolution, the selection is dependent on fitness improvements under stressful conditions. Rational strain modification methods, such as metabolic engineering and synthetic biology, have demonstrated their effectiveness in endowing LAB with new or improved characteristics useful for food applications beyond what traditional approaches are capable of delivering. However, since LAB are often present in the final product destined for human consumption, there are still hurdles that have to be overcome before recombinant DNA technologies can be widely implemented, e.g., regulatory issues and consumer skepticism.

The understanding of LAB physiology has fundamentally changed since the emergence of high-throughput genome sequencing technologies. By using whole-genome sequencing to compare the genome sequences of model LAB strains and their derivatives generated using traditional strain improvement procedures, researchers have gained great insight into how phenotypes are affected by genetic variations. Modern sequencing technologies allow for entire genome landscapes of multiple bacteria to be generated simultaneously within less than 24 h. These genome sequence data can be used as references, when doing physiological characterization or when using various systems biology approaches to address different fundamental and practical questions.

Due to their Generally Recognized as Safe (GRAS) status, simple metabolism, and very high glycolytic flux (Koebmann et al., 2002a), LAB have recently emerged as promising cell factories for production of high-value biochemicals, including food ingredients and pharmaceutical precursors. This has been aided by genome-scale metabolic models, where researchers have used in silico simulations to find the best way to reroute LAB metabolic networks to optimize production of various compounds. In this review we will discuss four main approaches used for improving LAB: (1) adaptive laboratory evolution (ALE) in combination with meta-omics analysis for characterization of mutants; (2) systems biology tools for elucidating microbial interactions and metabolic capacities; (3) a detailed analysis of metabolic flux regulation in the LAB model strain Lactococcus lactis; (4) metabolic engineering of L. lactis as a novel microbial cell factory.

## LAB AND ADAPTIVE LABORATORY EVOLUTION

#### Bacterial Evolution Is a Consequence of Inherent Genetic Variations and Natural Selection

In contrast to the comparative genomics approach for studying the inter/intraspecies relationship of LAB, ALE studies focus on genomic adaptation, including single nucleotide variations (SNV), deletions and insertions, to specific environmental stresses or metabolic perturbations, and its correlation to phenotypical changes, where the latter are typically studied using a combination of omics analysis, e.g., transcriptomics, proteomics, metabolomics, and physiological characterization. In an ALE setup, the number of accumulated mutations is normally controlled by limiting the number of generations of propagation in a subculturing or chemostat system. In contrast to natural evolution, the type and dose of selection stress, which serve as an input variable, can easily be defined in a laboratory environment. This approach puts a limit on the complexity of the subsequent comparative systems biology study, by limiting the number of output variations. The eventual multi-omics analysis and data integration with systems biology modeling are expected to provide the foundation for a better understanding of the evolutionary driving force of LAB, and to guide the rational design of ALE conditions to improve the performance of industrial LAB (Bachmann et al., 2017).

Adaptive laboratory evolution has been extensively used in stress physiological studies of LAB. As industrially important workhorses, LAB often undergo a variety of environmental stresses, e.g., heat, salt and acid stress, in different manufacturing conditions. Heat is one of the common stresses that LAB need to handle well in industrial processes. For instance, the mesophilic L. lactis, the main constituent in starters used for making semi-hard cheeses, such as cheddar and Gouda, is exposed to high temperatures during the syneresis step where the moisture content in the cheese grains is reduced. Such suboptimal temperatures are not lethal to L. lactis, but the physiological properties of L. lactis are significantly altered to manage the harsh conditions. LAB normally display a multi-level cellular response to environmental stresses, and multi-omics analysis is a well-suited tool for understanding the mechanism. To study the high-temperature physiology and its molecular basis, Chen et al. (2015a) applied ALE to the model L. lactis strain MG1363 with gradually increased incubation temperature. After an 800-generation session of experimental adaptation at high temperatures using a serial-transfer regime, a thermal-tolerant variant was isolated. The authors characterized the mutant with comprehensive systems biology analysis. On the metabolic level, the mutant had expanded its glycolytic capacity at high temperatures, where the overall glycolytic flux in the mutant was 13% higher compared to that of the wild-type strain at 38°C. On the transcriptomic level, a variety of solute transport activities had been enhanced in the mutant, including ones for carbohydrates, amino acids and ions. Alterations in cell-membrane fatty acid composition were also observed: the thermally tolerant mutant had an increased amount of saturated fatty acids (C14:0 and C16:0) in its cell membrane. Increased amounts of saturated fatty acids help maintain the fluidity of cell membranes at high temperatures and thereby stabilize the cross-membrane transport activities (Los and Murata, 2000). On the genomic level, the authors identified 13 mutations through whole genome resequencing, and subsequently used reverse engineering and physiological characterization to determine the contribution of individual mutations to the overall phenotype. It was found that SNVs in groESL (encoding chaperone proteins) and rpoC (encoding RNA polymerase subunit) contributed significantly to thermal tolerance of the mutant. 
The SNVs in groESL led to overexpression of molecular chaperones, and the SNVs in rpoC caused the alteration in global gene expression, and as a consequence of this, the saturated fatty acid synthesis pathway was overexpressed. The integrated multifaceted analysis indicated that a stable cell membrane structure in combination with other events, such as overexpression of molecular chaperones, is important for the cells to support a high-energy turnover rate under heat stress conditions. The study demonstrates how multi-omics analysis can help understand the stress physiology of LAB on the molecular level.

Besides the environmental stresses, LAB such as L. lactis often encounter nutrient starvation conditions during cheese ripening. In the early stage of ripening, acidification continues with limited lactose (3–5% lactose of milk depending on moisture) in the curd after whey drainage (Ardö and Nielsen, 2007). After the carbohydrate is completely exhausted, the cells stop growing or eventually lyse (Ganesan et al., 2007). The global catabolite regulator CcpA plays an important role in the regulation of the metabolic shift under these carbohydrate limitation/starvation conditions, but how LAB reshape the cellular metabolism has still not been elucidated. In contrast to traditional ALE with the serial-passage regime, the growth rate can be controlled by limiting specific nutrients in a chemostat ALE setup, which is used to mimic the slow growth and metabolic activity of LAB during cheese ripening (Van Mastrigt et al., 2018). Price et al. (2017) performed a glucose-limited ALE for L. lactis in a chemostat setup to study the response to carbohydrate starvation. After a few generations of adaptation, fast-growing L. lactis mutants emerged and finally dominated the population. Whole genome resequencing of the isolates obtained from parallel evolution lineages indicated that SNVs causing a site-specific change in the amino acid sequence of CcpA (Met-19) were responsible for the adaptation. To study the global regulatory role of the mutated CcpA, they performed a transcriptomics analysis, and found that glucose transport in the mutant had changed. The mutation in CcpA led to an overexpression of PTS<sup>Man</sup> but a downregulation of PTS<sup>Cel</sup>. Both PTS<sup>Man</sup> and PTS<sup>Cel</sup> participate in glucose transport in L. lactis, but PTS<sup>Man</sup> has a higher substrate affinity for glucose than PTS<sup>Cel</sup> (Castro et al., 2009). The CcpA-mediated transcriptional change caused a 3-fold increase in the glucose uptake rate of the mutant, which was further corroborated by flux analysis with <sup>14</sup>C-glucose.
These findings suggest that there is an evolutionary advantage associated with altering global regulators when compared to altering expression of individual genes, at least in response to carbohydrate starvation (Goel et al., 2015).

### LAB and Genome-Scale Metabolic Modeling

Since the first LAB genome sequence (L. lactis subsp. lactis IL1403) was announced in 2001 (Bolotin et al., 2001), whole genome sequencing has significantly accelerated systems biology research on LAB. Based on the genome annotation, genome-scale metabolic models (GEMs) with Constraint-Based Reconstruction and Analysis (COBRA) have been used to predict the nutrient requirements for growth, and metabolic patterns for LAB under different conditions (Wu et al., 2017). This information is paramount both for starter culture producers and for the food industries relying on LAB. The accuracy of metabolic models depends on correct genome annotations, and subsequent manual curation with biochemical, genetic and cell physiological data (Kitano, 2002a). Several studies have suggested that the NADH/NAD<sup>+</sup> ratio, which depends on the sugar uptake rate, has a role in the shift between the homolactic and mixed-acid fermentation in LAB (Garrigues et al., 1997), and that LAB with a low glycolytic flux need more energy to produce biomass. During mixed-acid fermentation, one extra ATP is generated when acetate is formed from acetyl-CoA. The experimental observation that biomass yield is maximized under starvation conditions has also been predicted using GEMs (Oliveira et al., 2005). In another case, GEMs and COBRA were used to predict the outcome of an ALE, where Lactobacillus plantarum was grown on the non-conventional substrate glycerol. The question addressed was how the metabolic network could be reshaped to support fast growth. L. plantarum grows slowly on glycerol, but only under aerobic conditions. Teusink et al. (2009) adapted L. plantarum on glycerol using a complex medium. After approximately 800 generations, fast-growing mutants could be isolated. Metabolic flux analysis showed that lactate rather than acetate was the major product from glycerol dissimilation. 
It was assumed that a mixed-acid fermentation mode was needed to support a high-energy yield on slowly fermentable sugars. In this case, the specific growth rate of the mutant (0.26 h<sup>−1</sup>) on glycerol was still substantially lower than the growth rate on other fast-fermentable sugars. To explain the observations, the authors applied flux balance analysis (FBA) using the uptake rates of glycerol, citrate (already present in the medium) and oxygen as constraints. As the flux model was set to maximize the biomass yield under aerobic conditions, the oxygen consumption played an important role in FBA. Under aerobic conditions, the model predicted that more ATP could be generated when lactate was formed, and that lactate production would support faster growth. The good correlation between the model prediction and experimental outcome demonstrates that COBRA is a valuable tool for predicting how the metabolic network of LAB adjusts over the course of an evolution experiment.
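For readers unfamiliar with FBA, the optimization problem behind such a prediction can be written in its generic textbook form. This is a sketch of the standard formulation, not the specific model used in the cited study; the symbols $S$ (stoichiometric matrix), $v$ (vector of reaction fluxes), $c$ (objective weights selecting the biomass reaction) and the bounds $v^{\min}$, $v^{\max}$ are notation assumed here:

$$
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& v_j^{\min} \le v_j \le v_j^{\max} \quad \text{for every reaction } j .
\end{aligned}
$$

The steady-state condition $S\,v = 0$ states that no internal metabolite accumulates. Measured uptake rates, such as those for glycerol, citrate and oxygen, enter as bounds on the corresponding exchange fluxes.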

#### ALE and Genomics Analysis

Adaptive laboratory evolution in combination with genomic analysis can also help disclose fitness-associated gene functions in LAB. LAB, such as Lactobacillus and Lactococcus species, are facultative anaerobes. Despite the lack of respiration, the presence of oxygen usually does not affect normal growth (Higuchi et al., 2000). The toxicity of oxygen toward LAB is believed to be due to generation of reactive oxygen species (ROSs) during oxygen metabolism (Chen et al., 2013). Due to their catalase-negative nature, LAB are more vulnerable to attack from H<sub>2</sub>O<sub>2</sub>-derived ROSs. Oxygen not only affects the cell fitness, but also tends to alter the acidification profile of LAB cultures (Larsen et al., 2015). Therefore, the study of oxygen-related gene functions in LAB has high industrial relevance. The thioredoxin-thioredoxin reductase system is the major system for maintaining the cellular redox homeostasis of LAB when oxidative stress is imposed (Vido et al., 2005; Serrano et al., 2007; Serata et al., 2012). Two thioredoxin reductase genes are annotated on the chromosome of L. lactis, namely trxB1 and trxB2. The importance of trxB1 for regulating redox homeostasis has been demonstrated by single gene knockout experiments in L. lactis MG1363. The loss of functional TrxB1, however, is not lethal for L. lactis in the presence of oxygen (Vido et al., 2005). Chen et al. (2015b) introduced a deletion in trxB2, and noticed that aerobic growth of the mutant was substantially retarded. TrxB2 shares a high amino acid sequence similarity with TrxB1, but it lacks two conserved cysteine residues, which are necessary to function as a thioredoxin reductase. To elucidate the biological function of TrxB2 under aerobic conditions, the authors designed an ALE setup, where the mutant was subcultured under aerobic conditions for a prolonged period of time. 
Genome resequencing of adapted mutants revealed the occurrence of suppressor mutations in the ribonucleotide reduction pathways in mutants isolated from independent lineages. The research suggested an important role of TrxB2 as a flavodoxin reductase in the aerobic ribonucleotide reduction, which was supported by subsequent physiological characterization (Chen et al., 2015b). Often, the physiology studies of LAB under oxidative stress conditions focus on ROS, as ROS has detrimental effects on cell fitness (Guchte et al., 2002). However, oxygen-dependent anabolism should not be underestimated in terms of its importance for aerobic growth of LAB.

### ALE and Multiomics Analysis

Adaptive laboratory evolution and multiomics analysis have been demonstrated to be useful tools for addressing fundamental questions in the natural evolution of LAB. The comparative genomics study of different L. lactis isolates suggests that the dairy-associated L. lactis diverged from the plant isolates through a long history of natural evolution (Siezen et al., 2008). The phenotypic properties are quite different between the non-dairy (plant, meat and bovine rumen) isolates and dairy-associated isolates in terms of nutrient requirements and stress tolerance in different environments. Some unique traits of the plant-associated L. lactis strains, e.g., their excellent capacity for flavor formation from amino acid catabolism, have also been used for enhancing flavor formation in dairy fermentation (Tanous et al., 2005). However, the important question of gene gain and loss during the natural evolution of the plant isolates adapting to the dairy niche has not been fully addressed. Bachmann et al. (2012) adapted the plant-derived L. lactis strain KF147 in milk for nearly 1,000 generations. Isolates obtained from three independent evolutionary lineages had adapted to growing well in milk. The integrated comparative genomics and transcriptomics analysis revealed that the adapted strains had variations that resulted in improved utilization of milk proteins and loss or downregulation of pathways important for using plant materials. Such gene gain and loss is a typical feature found in the natural evolution of LAB that have adapted to a dairy niche (Makarova and Koonin, 2007; Hanemaaijer et al., 2015).

### FROM SINGLE-STRAIN TO COMMUNITY-BASED SYSTEMS BIOLOGY

#### Metagenomics

It is a common practice that mixed starter cultures (the combination of different strains) are used for food fermentations, e.g., for production of cheese and yogurt. One consideration is that the process with a single strain is more vulnerable to bacteriophage predation, where strain-specific phages can result in fermentation failure. This problem is alleviated when mixed strains are used, as strains unaffected by these phages will ensure that the fermentation does not fail (Smid et al., 2014).

The use of mixed strains also gives the culture a broader metabolic capacity, which is important for achieving the desired flavor and organoleptic properties of the final product. This is particularly relevant for cheeses made using mesophilic starter cultures. The mesophilic starter cultures are typically composed of L. lactis subsp. cremoris, L. lactis subsp. lactis and its citrate-positive variant (L. lactis subsp. lactis biovar diacetylactis). During milk fermentation, both L. lactis subsp. cremoris and L. lactis subsp. lactis contribute to acidification. During the subsequent cheese ripening stage, the roles of different strains for flavor formation become distinct (Ardö and Nielsen, 2007).

Mixed strains can be rationally designed by blending two or more well-characterized industrial strains on the condition that their growth dependency and product profiles meet the quality requirement, as is done for the mesophilic/thermophilic cultures used in cheese/yogurt manufacturing (Sieuwerts, 2016). On the other hand, mesophilic undefined starter cultures are commonly used for the manufacture of the European continental semi-hard cheeses such as Cheddar, Gouda, and Danbo. Such cultures derive from back-slopping cultures, which were often collected from the batches yielding good quality cheese from artisanal cheese operations, and saved at −80 °C to minimize changes in LAB composition during storage (Smid et al., 2014). Such undefined starters have a long history of successful use for manufacturing cheeses with a rich flavor, and are highly phage resistant, most likely due to the coevolution of the strains and exchange of phage resistance mechanisms.

To study the undefined starter culture, it is necessary first to characterize starter composition, i.e., determine the strains it contains and the corresponding amounts, before characterizing single-strain physiology and genetic background (Temmerman et al., 2004). Culture-dependent approaches are commonly used; however, they are tedious, and the empirical choice of medium for enumeration does not ensure the recovery of the entire community of strains (Erkus et al., 2013; Frantzen et al., 2016). Culture-independent approaches for characterization scarcely provide detailed information about community dynamics and interaction in practice. It is desirable to transfer systems biology strategies used for characterizing single strains to mixed strains, but an understanding of community-level genomics is necessary before systems biology approaches can be applied to study microbial consortia.

Metagenomics analysis provides insights into the species composition and its dynamics in a culture-independent manner. The holistic decoding of the genomic material in an undefined starter community via metagenomics studies benefits research on starter LAB, e.g., through high-resolution surveillance of the microbial community dynamics or construction of community-based GEMs. Such knowledge will facilitate the general understanding of growth, metabolism and physiology during mixed culture preparation, cheese manufacturing and ripening (Jonnala et al., 2018). 16S amplicon sequencing and shotgun metagenomics are typically used in metagenomics studies of microbial consortia. Sequencing amplicons of conserved regions in 16S rRNA using Next-Generation Sequencing (NGS) technologies provides information about species-level community composition and on how the composition changes in response to different abiotic factors (De Filippis et al., 2017). Two limitations should be considered when using 16S amplicon sequencing to determine composition. First, the variation in 16S rRNA allele numbers in LAB prevents a precise prediction of abundance if the copy number is not known for individual strains. Second, 16S amplicon sequencing only enables species-level differentiation. Especially for undefined mesophilic starter cultures, the resolution of 16S amplicon sequencing is low, as these cultures are mainly composed of the L. lactis species, which has a low 16S rRNA sequence variation. For example, Porcellato and Skeie (2016) assessed the impact of elevated cooking temperature on cheese microbiota composition. The authors used 16S amplicon sequencing and found that a high temperature led to a reduced number of live L. lactis cells during ripening. L. lactis subsp. lactis is generally more stress tolerant compared to L. lactis subsp. cremoris, but the 16S amplicon sequencing could not deliver the subspecies information of the inactivated L. lactis in this particular study. 
Furthermore, 16S amplicon sequencing only provides information about rRNA sequences, but hardly provides deep insight into the metagenome.
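The first limitation above can be made concrete with a small calculation. The following is a minimal sketch, assuming the 16S copy number per genome is known for every taxon; the function name, taxon labels, read counts and copy numbers are all hypothetical and chosen only for illustration.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Convert 16S read counts into relative cell abundances.

    read_counts:  dict taxon -> number of 16S amplicon reads
    copy_numbers: dict taxon -> 16S rRNA gene copies per genome
    """
    missing = set(read_counts) - set(copy_numbers)
    if missing:
        raise KeyError(f"no 16S copy number for: {sorted(missing)}")
    # Reads per gene copy approximate the number of genomes (cells) sampled.
    genome_equivalents = {
        taxon: read_counts[taxon] / copy_numbers[taxon] for taxon in read_counts
    }
    total = sum(genome_equivalents.values())
    return {taxon: value / total for taxon, value in genome_equivalents.items()}


# Hypothetical example: both taxa yield the same number of reads,
# but taxon_B carries twice as many 16S copies per genome.
reads = {"taxon_A": 600, "taxon_B": 600}
copies = {"taxon_A": 3, "taxon_B": 6}
abundance = copy_number_corrected_abundance(reads, copies)
print(abundance)  # taxon_A is about 2/3 of the cells, taxon_B about 1/3
```

Without the correction, the two taxa would appear equally abundant. The sketch ignores other sources of bias, such as primer mismatch or DNA extraction efficiency.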

With the improved quality and decreased price of second-generation shotgun sequencing, it is now possible to do deep-sequencing metagenomics and determine the entire genomic content (core/pan-genome) of a microbial ecosystem, including the low-abundance strains. Using this approach, genome decay (plasmid loss) and the species-level dynamic shift of the community due to biotic/abiotic factors can be revealed by comparing the reads to marker databases for abundance calculation (De Filippis et al., 2017). Nevertheless, there are only a few published studies where metagenomics analysis has been used to study starter culture composition at the strain level. High-coverage sequencing eases the identification of metabolic genes in the strain community, but the challenge remains as to which strains the genes belong to. One technical shortcoming of second-generation sequencing platforms is that only short reads (<800 bp) are generated, which prevents strain-level differentiation with de novo assembly-based metagenomics analysis. Especially in the mesophilic undefined starters, the dominating L. lactis species exhibits a high intrasubspecies genome similarity (**Figure 1**). Therefore, the typical contig binning steps by GC content, coverage or metabolic networks are difficult to apply to strain-level assembly (Albertsen et al., 2013; Biggs and Papin, 2015).

Recently, several new data analysis tools based on reference-mapping approaches have been developed and validated, which can be used for strain-level metagenomics data analysis (Scholz et al., 2016; Albanese and Donati, 2017; Truong et al., 2017; Zolfo et al., 2017). The emergence of these new tools has attracted attention from the LAB scientific community, as it is now possible to achieve a more profound understanding of the starter culture community (structure and function) and its contribution to the quality of fermented products (Ercolini, 2017). The development of third-generation long-read sequencing technology is booming. Researchers have demonstrated single-read sequencing up to 0.9 Mbp length using the MinION system (Jain et al., 2018). Considering the comparably small genome size of LAB (1.2–4.9 Mbp) (Douillard and de Vos, 2014), in the future, with further improvements of throughput and accuracy, the third-generation genome sequencing platforms with long-read sequencing ability will definitely facilitate de novo assembly of single-strain genomes in undefined starter cultures, and eventually facilitate system-level understanding of the community (Turaev and Rattei, 2016). Another promising approach for looking at single-cell genomics in microbial communities is to use single-cell sequencing technologies. The concept is to first sort and compartmentalize single cells in water-in-oil droplets, where the DNA is isolated and barcoded in an isolated environment prior to whole genome amplification of single chromosomes followed by sequencing (Gawad et al., 2016). The main challenge for this technology is the resolution. If the sequencing accuracy can be improved in the future, it can largely facilitate the strain-level genomics study of microbial consortia.

#### Postgenomic Studies of LAB

Genomics data are the basis of systems biology (Kitano, 2002b). Postgenomic analysis tools such as transcriptomics, proteomics and metabolomics have also been widely used to understand the physiology of LAB at a systems level. A large number of these kinds of analyses have been applied to single strains and they provide great insights into how cells respond to different abiotic/biotic perturbations. To thoroughly understand industrially relevant properties of LAB, useful omics data should be generated from the LAB consortia in action, i.e., in dairy fermentations that are similar to those taking place in dairy plants, and there are only a few such studies. The yogurt culture is a well-described mutualistic system, in which defined Streptococcus thermophilus (S. thermophilus) and Lactobacillus delbrueckii subsp. bulgaricus (L. bulgaricus) strains are mixed for co-culture fermentation of bovine milk. In the mutualistic life of the yogurt culture, the growth profiles of S. thermophilus and L. bulgaricus show a clear protocooperation relationship (Sieuwerts, 2016). As the LAB blended in the yogurt starter culture are normally well documented, the culture serves as a simplified model system for implementing systems biology studies of microbial communities. The balanced growth of these two bacteria is important for fast acidification and desirable texture and flavor formation of yogurt (Chen C. et al., 2017; Yamauchi et al., 2019). In yogurt fermentations, both the growth and acidification are stimulated when cocultures of the two LAB are used. It appears that the protocooperation occurs in the manner of nutrient exchange, where production of formic acid and carbon dioxide from S. thermophilus stimulates the growth of L. bulgaricus, and the proteolytic activity of L. bulgaricus provides essential peptides and amino acids to S. thermophilus. The integrated transcriptomics and proteomics analysis of S. thermophilus in coculture with L. bulgaricus confirmed the mutualistic effects, where upregulation of anabolic pathways in both S. thermophilus and L. bulgaricus compared to monoculture cultivation was observed (Herve-Jimenez et al., 2009; Sieuwerts et al., 2010). Meta-transcriptomics and meta-proteomics analyses revealed a dynamic response to both biotic and abiotic factors at different stages of the mutualistic growth, results that largely confirm the previously observed cooperative behavior of these two bacteria. Interestingly, ion homeostasis was affected in S. thermophilus in the coculture with L. bulgaricus (Herve-Jimenez et al., 2009; Sieuwerts et al., 2010). Reduced ion transport and increased ion-chelating activity in S. thermophilus were shown to be correlated with H<sub>2</sub>O<sub>2</sub> production by L. bulgaricus. The major oxidative stress response system, however, was not induced in S. thermophilus. One reason for this was that L. bulgaricus produces minor amounts of H<sub>2</sub>O<sub>2</sub> during growth in milk. The adaptation of the catalase-negative S. thermophilus to H<sub>2</sub>O<sub>2</sub> by control of the intracellular ion homeostasis gives an indirect way to circumvent the oxidative stress.

The cheese environment is a more complex ecosystem than yogurt, and the cultures used are also more complex. Typically, different kinds of LAB, e.g., lactococci and streptococci, are involved in acidification and cheese ripening. Cheese ripening is one of the most important processes in cheese manufacturing in terms of flavor formation. However, ripening is a slow process. Not only starter LAB but also non-starter LAB (NSLAB) usually make an important contribution to flavor formation during cheese ripening. To accelerate cheese maturation, one option is to elevate the ripening temperature to increase the rate of the biochemical transformations taking place. Meta-transcriptomics (RNA-seq) data show that expression of genes involved in proteolysis, lipolysis and amino acid/fatty acid catabolism is promoted in NSLAB at higher ripening temperatures, and in turn, an accelerated maturation has been noticed (De Filippis et al., 2016).

Besides meta-transcriptomics, meta-proteomics and metabolomics are also important tools for studying LAB. Currently, only very few studies have used these two technologies to study LAB communities. Possible reasons are the challenges in sample preparation and in protein and metabolite identification, compared with the sequence-based metagenomics and meta-transcriptomics approaches used in community studies (Blackburn and Martens, 2016; Smirnov et al., 2016).

The aforementioned ALE and metabolomics tools have been harnessed to obtain LAB that exhibit improved properties. We have also discussed the application of these systems biology tools to elucidate interactions taking place in LAB communities and to determine metabolic capacities. In the following part, we will focus on the model strain of LAB – L. lactis to elaborate its metabolic flux regulation in more detail and illustrate its metabolic engineering potential for valuable biochemicals production as a novel microbial cell factory.

### METABOLIC FLUX REGULATION OF L. lactis WITH A FOCUS ON GLYCOLYSIS

The glycolysis of L. lactis comprises the typical EMP pathway with different carbohydrates entering at different points. Sugars enter cells of L. lactis by either the PEP-dependent PTSs or sugar-specific permeases. For glucose uptake, there are two distinct PTSs, PTS<sup>Man</sup> and PTS<sup>Cel</sup>, and a proton-motive force dependent permease. The kinetic properties of these transport systems in MG1363 have been characterized (Castro et al., 2009). Glycolytic enzymes are in general highly expressed to sustain the large glycolytic flux in L. lactis, accounting for about 20% of the total soluble protein (Puri, 2014). Their genes are in general located close to the origin of replication for higher expression (Cocaign-Bousquet et al., 2002). Transcriptional regulation of the glycolytic genes in L. lactis is primarily through carbon catabolite repression (CCR), which has been verified experimentally in L. lactis (Deutscher et al., 2006). The HPr protein, at high levels of fructose 1,6-bisphosphate (FBP) and ATP, is phosphorylated and then forms a complex with CcpA, which acts as a global regulator binding to the cre sites on the chromosome. The well-conserved consensus in Gram-positive bacteria for the cre site is TGNNANCGNTNNCA, which is roughly palindromic with the central CG base always present. A study on the CcpA regulon in MG1363 proposed the consensus WGWAARCGYTWWMA, which is specific for L. lactis (Zomer et al., 2007). The binding of CcpA to cre sites can lead to either activation or repression, depending on the position relative to promoters, as also seen for Bacillus (Henkin, 1996). When a cre site is upstream of, inside, or downstream of the promoter of a gene, transcription is activated, repressed, or aborted accordingly (Cocaign-Bousquet et al., 2002). 
Results also have indicated that the interaction between CcpA and the transcription machinery may be dependent on the helix side of CcpA binding, because the strongest repression was observed for cre sites that were consecutively separated by around 10.5 bp, equal to a full helical turn of double-stranded DNA (Zomer et al., 2007). The expression of a number of glycolytic enzymes was found to be up-regulated by CcpA, including phosphofructokinase (PFK), pyruvate kinase (PYK) and lactate dehydrogenase (LDH) (the members of the las operon) (Luesink et al., 1998; Kok et al., 2017), phosphoglucose isomerase (PGI), glyceraldehyde 3-phosphate dehydrogenase (GAPDH), and enolase (Guédon et al., 2002).
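The two cre consensus sequences quoted above are written in IUPAC ambiguity codes (W = A or T, R = A or G, Y = C or T, M = A or C, N = any base). As a minimal sketch of how such a consensus can be used to scan a sequence for candidate sites, the code below translates it into a regular expression. The helper names and both test sequences are hypothetical; only the given strand is scanned, and a pattern match alone says nothing about whether CcpA binds the site in vivo.

```python
import re

# IUPAC nucleotide codes that occur in the two cre consensus sequences.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "R": "[AG]", "Y": "[CT]", "M": "[AC]", "N": "[ACGT]",
}

GENERAL_CRE = "TGNNANCGNTNNCA"   # Gram-positive consensus quoted in the text
LACTIS_CRE = "WGWAARCGYTWWMA"    # L. lactis consensus (Zomer et al., 2007)


def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regular-expression string."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(sequence, consensus):
    """Return (0-based start, matched site) pairs, including overlapping hits."""
    # The zero-width lookahead lets the scan report overlapping matches.
    scanner = re.compile("(?=(" + consensus_to_pattern(consensus) + "))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence.upper())]


# Hypothetical sequence containing one site that fits both consensus patterns.
seq = "CCATATGAAAACGTTTTCATTGGC"
print(find_sites(seq, LACTIS_CRE))   # [(5, 'TGAAAACGTTTTCA')]
print(find_sites(seq, GENERAL_CRE))  # [(5, 'TGAAAACGTTTTCA')]

# Hypothetical site that fits the general consensus but not the L. lactis one.
other = "TGCCAGCGCTGGCA"
print(find_sites(other, GENERAL_CRE))  # [(0, 'TGCCAGCGCTGGCA')]
print(find_sites(other, LACTIS_CRE))   # []
```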

In the intracellular environment, assuming constant environmental factors, such as pH, temperature, viscosity etc., the reaction rate of an enzyme depends on concentrations of the enzyme, substrates, products and other effectors, such as allosteric regulators, that can modulate the enzymatic activity. Reaction rates would in turn change metabolite concentrations, forming a dynamical system. The metabolic flux of a pathway is the overall conversion rate of metabolites by the pathway resulting from the dynamical interactions between involved enzymes and metabolites. The factors determining the flux can be divided into two levels. The first is the enzyme level, which accounts for changes in fluxes caused by changes in gene expression level. The second is the metabolic level referring to changes in fluxes which are not caused by altered gene expression but by changes in metabolite concentrations and the inherent kinetic properties of enzymes such as maximum velocity and substrate affinity. The statement that an enzyme has "control" on a flux should refer to the phenomenon that a change in the enzyme level leads to change in the flux but not the direct regulatory mechanism. "Regulation" should refer to the exact mechanism causing the change (Chan, 2014).

### Control of the Glycolytic Flux by Individual Enzyme Levels

Control of fluxes by individual enzymes can be quantified by Flux Control Coefficients (FCCs) in the theory of Metabolic Control Analysis (MCA). An FCC is defined as the fractional change of the steady-state flux relative to the fractional change of the enzyme activity. Finding the "rate-limiting" step or an enzyme with a high FCC on the glycolytic flux in L. lactis can have direct industrial relevance, e.g., speeding up production and increasing productivity. An earlier study, which inhibited the activity of GAPDH with the specific inhibitor iodoacetate, indicated that GAPDH had a high FCC of about 0.9 on glycolytic flux in non-growing cells of L. lactis subsp. cremoris Wg2 (Poolman et al., 1987). Similar results were obtained in another strain, NCDO2118, with GAPDH having an FCC equal to 0.7 (Even et al., 1999). Later, the control of glycolytic flux by glycolytic enzymes was extensively studied by the Jensen group by experimental estimation of FCCs in the laboratory strains L. lactis IL1403 and MG1363. The results are summarized in **Table 1**.
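In the usual MCA notation, which is assumed here since the text gives the definition only in words, the FCC of enzyme $i$ on the steady-state flux $J$ is the log-log sensitivity of the flux to the enzyme activity $E_i$, with all other enzyme activities held constant:

$$
C^{J}_{E_i} \;=\; \frac{\partial J}{\partial E_i}\cdot\frac{E_i}{J} \;=\; \frac{\partial \ln J}{\partial \ln E_i}
$$

An FCC of 0.9, as reported for GAPDH in Wg2, therefore means that a 1% change in enzyme activity changes the flux by about 0.9%. The summation theorem of MCA further states that the FCCs of all enzymes acting on a given flux add up to one, $\sum_i C^{J}_{E_i} = 1$, so control can be shared among several steps rather than residing in a single "rate-limiting" enzyme.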

Interestingly, each individual enzyme usually appears to have no control on growth rate and glycolytic flux at the wild-type enzyme level, including LDH (Andersen et al., 2001a), GAPDH (Solem et al., 2003), PFK, PYK (Koebmann et al., 2005) and triosephosphate isomerase (TPI) (Solem et al., 2008) for MG1363; TPI, enolase (Koebmann et al., 2006) and phosphoglycerate mutase (PGM) (Solem et al., 2010) for IL1403. Among these enzymes, some are present in the wild type in significant excess for attaining maximum glycolytic flux, such as LDH, TPI and GAPDH, whereas some enzymes appear to be optimally expressed in the wild type for maximum glycolytic flux, such as PFK, PYK, PGM and enolase (ENO). For the latter set of enzymes, when the expression level is increased or decreased slightly, the growth rate and glycolytic flux decrease. This property of maximum growth rate and glycolytic flux in the wild type also leads to a zero FCC at the wild-type level. **Figure 2** illustrates the two different scenarios leading to zero FCCs in the wild type encountered in the experimental studies of glycolytic flux control by glycolytic enzymes. In the literature, nonetheless, the possible consequences and interpretation of these observations have not been discussed thoroughly.
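The two scenarios can be illustrated numerically. The sketch below estimates an FCC as the log-log slope of flux against enzyme level, using a central finite difference. The two flux curves are hypothetical functional forms chosen only to mimic "enzyme in excess" and "optimal expression at the wild-type level"; they are not fitted to any of the cited data. Enzyme level 1.0 denotes the wild type.

```python
import math


def flux_control_coefficient(flux, enzyme_level, rel_step=1e-4):
    """Central-difference estimate of d ln J / d ln E at the given enzyme level."""
    up = enzyme_level * (1 + rel_step)
    down = enzyme_level * (1 - rel_step)
    return (math.log(flux(up)) - math.log(flux(down))) / (math.log(up) - math.log(down))


def flux_enzyme_in_excess(e, saturation_level=0.5):
    """Hypothetical curve: flux rises with enzyme level, then reaches a plateau."""
    return min(e, saturation_level) / saturation_level


def flux_optimal_expression(e, penalty=2.0):
    """Hypothetical curve: flux peaks at e = 1 and falls on either side."""
    return math.exp(-penalty * math.log(e) ** 2)


# Both scenarios give an FCC of (approximately) zero at the wild-type level.
fcc_excess = flux_control_coefficient(flux_enzyme_in_excess, 1.0)
fcc_optimum = flux_control_coefficient(flux_optimal_expression, 1.0)
print(fcc_excess, fcc_optimum)

# Control only becomes visible once the enzyme in excess is strongly depleted.
fcc_depleted = flux_control_coefficient(flux_enzyme_in_excess, 0.4)
print(fcc_depleted)  # close to 1
```

A zero FCC at the wild-type level therefore does not distinguish between the two scenarios; the enzyme level has to be varied in both directions, as was done in the studies summarized above.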

The reason for the zero flux control for these important enzymes also remains elusive. One possible explanation is that glycolysis is already running at its maximum possible rate or the control is distributed over many enzymes (Koebmann et al., 2002a). Another conjecture is that glycolysis is so optimized throughout evolution that the true FCCs cannot be measured due to optimal regulation of protein expressions which somehow counteracts the effect of modulating an enzyme by reallocating the protein expression profile (Teusink et al., 2011). If this is true, the calculated rate is not the defined partial derivative because the concentrations of other enzymes are also functions of the concentration of the perturbed enzyme. The conflicting results on the role of GAPDH from different studies also highlight the difficulty of studying flux control. In the earlier study, nearly full control of glycolytic flux by GAPDH in L. lactis Wg2 and NCDO2118 was found (Poolman et al., 1987; Even et al., 1999) but zero control was found in L. lactis MG1363 (Solem et al., 2003). One possible explanation is the intrinsic difference between the two strains as the GAPDH level was found to be two-fold higher in MG1363 compared to Wg2. Another possibility is the difference in experimental methods. GAPDH's activity in Wg2 was only inhibited but not increased whereas both under- and over-expression of GAPDH were included in the study of MG1363. This can lead to contradictory estimation of FCCs.

### Hierarchical Regulation Under Different Growth Conditions

Another approach used to distinguish between hierarchical regulation and metabolic regulation was proposed by Kuile and Westerhoff (2001) and estimates coefficients for the two types of regulation. To a certain extent, this approach works in the reverse sense to the aforementioned approach, which estimates the FCC of an enzyme by growing strains with different activities of the enzyme under the same condition and then measuring the changes in activities and fluxes. In contrast, the same strain is cultured under different growth conditions, for example, in a chemostat at different dilution rates, or under different starvation conditions (Rossell et al., 2006). Then the relative change in enzyme level, measured by enzyme assay, is divided by the relative change in fluxes to define the "hierarchical regulation coefficient" (HRC). It is equal to one in the ideal case of pure hierarchical regulation. The "metabolic regulation coefficient" (MRC) can then be computed as 1 – HRC to account for the change in fluxes not accountable by hierarchical regulation.
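The calculation can be sketched in a few lines. The code below assumes that "relative change" is measured as the change in the natural logarithm between two conditions, which is one common way to express it; the function name and all numbers are hypothetical.

```python
import math


def regulation_coefficients(enzyme_a, enzyme_b, flux_a, flux_b):
    """Return (HRC, MRC) for one enzyme compared between conditions a and b.

    enzyme_a, enzyme_b: enzyme level (e.g., assayed activity) in each condition
    flux_a, flux_b:     flux through the enzyme in each condition
    """
    d_ln_flux = math.log(flux_b / flux_a)
    if d_ln_flux == 0:
        raise ValueError("flux did not change; the coefficients are undefined")
    hrc = math.log(enzyme_b / enzyme_a) / d_ln_flux
    return hrc, 1.0 - hrc


# Flux doubles while the enzyme level stays constant: purely metabolic regulation.
print(regulation_coefficients(1.0, 1.0, 10.0, 20.0))  # (0.0, 1.0)

# Flux and enzyme level both double: purely hierarchical regulation.
print(regulation_coefficients(1.0, 2.0, 10.0, 20.0))  # (1.0, 0.0)
```

Values between these two extremes indicate a mixture of both types of regulation.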

In L. lactis, several studies have been conducted using this approach. For instance, MG1363 has been grown in a chemostat at a dilution rate of 0.1 h<sup>−1</sup> at different pH values, from 4.7 to 6.6 (Even et al., 2003). When the inhibitory effect of pH on enzyme activities was taken into account, metabolic regulation was found to be the dominant force controlling glycolytic flux. Within hierarchical regulation, post-transcriptional regulation of gene expression was found to be more prominent than transcriptional regulation, as judged by comparing changes in mRNA transcript levels with changes in enzyme activity. In another study, MG1363 was grown in a chemostat at dilution rates from 0.15 to 0.6 h<sup>−1</sup>, and transcriptomes, proteomes and enzyme activities were quantified simultaneously (Puri, 2014). Similar conclusions were reached. For dilution rates between 0.15 and 0.5 h<sup>−1</sup>, the changes in flux through most enzymes were predominantly caused by metabolic regulation rather than hierarchical regulation. The exceptions were alcohol dehydrogenase (ADHE) and possibly pyruvate formate lyase (PFL), whose concentrations decreased as the dilution rate increased and the flux through the mixed-acid fermentation pathway decreased. These two enzymes thus probably controlled the switch between fermentation modes but not the glycolytic flux. Significant hierarchical regulation occurred only during the transition from 0.5 to 0.6 h<sup>−1</sup>, in which the expression of several enzymes was found to change, probably due to CCR mediated by the regulatory protein CcpA. Indeed, a similar lack of significant change in the expression of glycolytic enzymes has been observed in an accelerostat study on IL1403, in which the dilution rate was increased very slowly from 0.1 to 0.6 h<sup>−1</sup> to obtain different steady states (Lahtvee et al., 2011). Only PGM was found to show changes in expression.

TABLE 1 | Summary of experimentally determined FCCs in Lactococcus lactis strains.

∗ 'Definite expression level for optimality' refers to a unique maximum of growth rate and glycolytic flux at the wild-type enzyme level. LDH, lactate dehydrogenase; TPI, triosephosphate isomerase; GAPDH, glyceraldehyde 3-phosphate dehydrogenase; PFK, phosphofructokinase; PYK, pyruvate kinase; PGM, phosphoglycerate mutase; ENO, enolase.

#### Metabolic Regulation

Metabolic regulation is not easy to discover because it usually involves interactions between an enzyme and metabolites other than the substrates and products of that enzyme. Extensive in vitro enzyme characterization is required to identify possible effector metabolites, and experiments to confirm in vivo regulatory roles can be even more difficult to design. As mentioned above, some studies indicate that metabolic regulation is the main driving force for flux regulation, and many pieces of knowledge on particular regulatory relationships are available. Nonetheless, a clear and integrative picture of how different types of metabolic regulation work together to explain most of the known experimental results remains elusive.

#### Negative Feedback on PTS by FBP and Inorganic Phosphate

One example of the metabolic regulation of glycolysis is the regulation of the phosphorylation of HPr protein (HPr/HPr-Ser-P) by FBP, ATP and inorganic phosphate (Pi) mentioned previously. Since HPr helps sugar uptake through PTS but HPr-Ser-P does not, a high level of FBP due to a high rate of sugar uptake causes more HPr to be phosphorylated into HPr-Ser-P, which eventually slows down the sugar uptake thus forming a negative feedback loop. This loop may help to stabilize the glycolytic flux, especially against sudden changes in sugar availability (Teusink et al., 2011). The question of whether this negative feedback loop poses a bottleneck on maximum glycolytic flux, nevertheless, remains unanswered.

#### Feed-Forward on PYK by FBP

Besides its role in PTS, FBP is also known to be an activator of PYK (Thompson, 1987). A kinetic study of glycolytic intermediates in glucose-pulse experiments using NMR found that the FBP level rose to a peak during glucose uptake and started to drop after glucose was exhausted; once FBP had fallen to a certain low level, PEP started to accumulate and remained at a high level during glucose starvation (Voit et al., 2006). The authors proposed that a low FBP level, reflecting a low supply of glucose, could serve to preserve a high PEP pool during sugar starvation by reducing the activity of PYK, which consumes PEP, so that sugar can later be taken up rapidly through PTS. Others, nonetheless, observed that such an activation relationship was also conserved in other organisms, including those without PTS, and remained cautious about the role of this FBP-PYK relation in glycolysis (Teusink et al., 2011). They suggested another possible role in which a high FBP level could serve as a signal for PYK to remove the phosphoglycerate compounds in favor of a high flux through GAPDH, which operates close to thermodynamic equilibrium and is thus sensitive to mass action.

#### Global Cofactors: NADH/NAD<sup>+</sup> Ratio

Another interesting example is how the glycolytic flux responds to cofactor levels, e.g., NADH/NAD<sup>+</sup> and ATP/ADP. The NADH/NAD<sup>+</sup> ratio was first proposed by Garrigues et al. (1997) to be an important factor for regulating the glycolytic flux in L. lactis NCDO2118, based on findings in a study where the strain was growing exponentially on three sugars, glucose, galactose and lactose, with decreasing glycolytic fluxes. The following observations were made: (i) the in vitro activity of GAPDH was almost completely inhibited by a NADH/NAD<sup>+</sup> ratio higher than 0.05; (ii) the NADH/NAD<sup>+</sup> ratio correlated positively with the glycolytic flux and was as high as 0.08 on glucose (severe inhibition of GAPDH expected); (iii) high pools of metabolites upstream of GAPDH were found, including FBP, GAP and dihydroxyacetone phosphate (DHAP) (suggesting insufficient GAPDH activity to metabolize GAP). A later study by the same group on MG1363 found the same correlation between the NADH/NAD<sup>+</sup> ratio and glycolytic flux, but the factors determining the ratio remained unknown (Garrigues et al., 2001).

#### Global Cofactors: ATP/ADP Ratio

Glycolytic kinetics in non-growing cells of L. lactis have been studied using in vivo NMR by Neves et al. (1999). The kinetic model built in that study and fitted to experimental data predicted that the conversion of PEP into pyruvate by PYK is inhibited by a high ATP surplus, i.e., a high ATP/ADP ratio, which in turn inhibits NAD<sup>+</sup> regeneration by LDH and in this way restricts the glycolytic flux. This offered an explanation for the positive correlation between the NADH/NAD<sup>+</sup> ratio and glycolytic flux observed in other studies (Garrigues et al., 1997). A later NMR study by the same group, focusing on the role of NADH and NAD<sup>+</sup>, found that in an LDH-knockout strain GAPDH was able to sustain a flux as high as in the wild-type MG1363, even though the NADH concentration was 1.5 mM while the inhibitory constant of NADH for GAPDH was found to be 0.4 mM (Neves et al., 2002). The authors, to a certain extent, dismissed the control by GAPDH and NADH, and proposed ATP, ADP and Pi as important regulatory metabolites in glycolysis. When interpreting these results, however, one should bear in mind that in vivo NMR studies often deal with non-growing L. lactis, so the results obtained may differ from those derived using exponentially growing cells.

To test the in vivo role of the ATP/ADP ratio in growing L. lactis, Koebmann et al. (2002b) decreased the intracellular ATP/ADP ratio in MG1363 by expressing ATPase using a synthetic promoter library. Surprisingly, for these strains growing exponentially on glucose, the glycolytic flux showed no significant change over a large range of ATP/ADP ratios, from around 9 in the wild type to around 5 in the strain with the highest expression of ATPase. When these strains were in a non-growing state (achieved by resuspending cells in media without amino acids and vitamins), however, the glycolytic flux increased with ATPase activity up to a level close to that observed for growing wild-type cells. These observations suggest the possibility that glycolysis already operates at its maximal rate in the wild type. Another possibility suggested by the authors is that, although lowering the ATP/ADP ratio might stimulate glycolysis (e.g., by increasing the activities of kinases in the pay-off phase), it might eventually reduce the activity of PFK, which then becomes a bottleneck that counters the effect. This is known as the risk of a "turbo design", in which part of the desired product is invested up front as input (Teusink et al., 1998).

#### Other Factors

Other sources of metabolic regulation considered to be important for regulating glycolytic flux include the inhibition of PYK by Pi (Ramos et al., 2004), inhibition of PFK by PEP (Papagianni and Avramidis, 2012) and inhibition of GAPDH by NADH (Garrigues et al., 1997, 2001). Hoefnagel et al. (2002) integrated these three inhibitory relations and the activation of PYK by FBP in a single kinetic model with all rate equations and parameters adopted from the literature without fitting. The model succeeded in simulating the observed kinetic behavior during glucose run-out experiments, including the rapid increase in PEP and Pi, the decrease in ATP and the slow depletion of FBP. The authors reasoned that glucose depletion triggers the following sequence of kinetic responses: (1) less Pi is retained in G6P, F6P and FBP; (2) Pi increases and inhibits PYK; (3) PEP increases because PYK is inhibited and no glucose is available for PTS, so less pyruvate is available; (4) LDH has less substrate and thus NADH accumulates; (5) GAPDH is inhibited by NADH.

### REDIRECTION OF THE GLYCOLYTIC FLUX TO DIFFERENT BIOCHEMICALS

Due to its high glycolytic flux and its safe background, L. lactis has been engineered into a cell factory for production of a broad range of interesting biochemicals, from biofuels and food ingredients to vitamins and pharmaceutical precursors (Liu, 2017). Most of these biochemicals are derived from glycolytic precursors. Early efforts on metabolic engineering of L. lactis were primarily directed at the manipulation of the pyruvate node. The native metabolic pathways can drive the pyruvate flux to lactate via LDH, or to formate and acetyl-CoA by PFL. Then acetyl-CoA can either be converted into ethanol via the bi-functional ADHE or to acetate via phosphotransacetylase (PTA) and acetate kinase (ACK), subsequently (**Figure 3**).

### Production of Lactate and Biofuels

#### Lactate

The natural main product of L. lactis is L-lactate. L. lactis ATCC 19435 has been reported to be able to produce around 100 g/L lactate when using a medium containing whole wheat flour hydrolysate, with a high productivity of 3.0 g/L/h (Hofvendahl and Hahn-Hägerdal, 1997). One of the limiting factors that have prevented the widespread use of L. lactis as a cell factory for lactate has been its fastidious nature and a relatively low pH tolerance when compared with various Lactobacillus species or yeast (van Hylckama Vlieg et al., 2006; Mazzoli et al., 2014).

Cheaper substrates, e.g., lignocellulose hydrolysates, could serve as a potential feedstock for some L. lactis strains that are able to metabolize both the pentoses and hexoses they contain, and one research group has investigated the non-dairy strain L. lactis IO-1, which is capable of metabolizing xylose (Tanaka et al., 2002). It was demonstrated that xylose could be metabolized into lactate via both the phosphoketolase pathway and the pentose phosphate pathway. The flux distribution between the two pathways was affected by the xylose concentration, and a more homolactic profile was attained at high xylose concentrations. Sirisansaneeyakul et al. (2007) used immobilized L. lactis IO-1 for producing lactate from glucose and achieved a high volumetric productivity of 4.5 g/L/h using a packed bed of encapsulated cells in which the broth was recycled. Another interesting study demonstrated that expressing the Escherichia coli chaperone DnaK in L. lactis NZ9000 improved multiple-stress tolerance (high temperature, salts, lactate and alcohol) and increased lactate production (Abdullah-Al-Mahin et al., 2010).

#### Ethanol

The demand for liquid fuels is growing, and microbial production of biofuels could be an attractive way to increase the supply and decrease the cost. There have been attempts to use L. lactis for producing ethanol, but most of them had only limited success. A possible reason is a too low expression level of heterologous pyruvate decarboxylase (PDC) and alcohol dehydrogenase (AdhB) from Zymomonas mobilis. Liu et al. (2005) introduced PDC from Zymobacter palmae in L. lactis and only observed the accumulation of acetaldehyde. Similarly, Nichols et al. (2003) expressed PDC and AdhB from Z. mobilis in L. lactis and still failed to divert the flux away from lactate fermentation. However, our group several years ago successfully constructed a recombinant L. lactis that was able to produce ethanol as the sole fermentation product (87% of the carbon flux was directed to ethanol production) (Solem et al., 2013). This homo-ethanol producer combined knock-outs of the genes encoding the three LDH homologs (ldh, ldhB, and ldhX), PTA and the native ADHE with the introduction of codon-optimized PDC/AdhB from Z. mobilis expressed from a synthetic promoter library. The strain was further modified by introducing a functional lactose metabolism. By developing a cheap growth medium containing whey waste and processed corn steep liquor (Liu et al., 2016b), efficient and cheap production of ethanol was accomplished with a titer of 41 g/L ethanol. We also explored the possibility of using the non-dairy L. lactis strain KF147 to produce ethanol as the only dominant product from xylose, although the final titer was low. This work demonstrated the great potential of using the more metabolically diverse non-dairy L. lactis strains for bio-production from xylose-containing feedstocks (Petersen et al., 2017).

#### Butanol Isomers

Butanol has a higher energy density and lower hygroscopicity than ethanol (Peralta-Yahya et al., 2012; Generoso et al., 2015). There have been attempts at engineering L. lactis to produce butanol isomers, as they are excellent fuel additives. One advantage of using L. lactis as a butanol-producing host is its high butanol tolerance (L. lactis can tolerate more than 2% while E. coli can only tolerate 1%) (Hviid et al., 2017). Liu et al. (2010) introduced the Clostridium beijerinckii P260 thiolase, a key enzyme for re-directing acetyl-CoA to butanol, into L. lactis and enabled production of 28 mg/L butanol, which demonstrates that it is feasible to use L. lactis as a production platform. L. lactis was also reported to be able to produce isobutanol via the valine degradation pathway (Priyadharshini et al., 2015). In both of these studies the performance of the producing strains was clearly insufficient for commercial production, and one reason for the low titer could be a too low availability of acetyl-CoA, the precursor for butanol/isobutanol. In L. lactis, acetyl-CoA can be formed either by PFL or by the pyruvate dehydrogenase complex (PDHc) (**Figure 3**), where PFL is only active in the absence of oxygen and PDHc is quite sensitive to the NADH/NAD<sup>+</sup> ratio (Snoep et al., 1993). It thus appears that one way to ensure an adequate supply of acetyl-CoA is to introduce a robust PDHc. One of the native enzymes in the isobutanol pathway of L. lactis is KDC (α-ketoisovalerate decarboxylase), and this enzyme has been widely used in different strain platforms for efficient isobutanol production (Atsumi et al., 2008).

#### 2,3-Butanediol Isomers

2,3-butanediol (2,3-BDO) is also considered to be an excellent biofuel as well as a good platform chemical with many applications, e.g., for making plastics, perfumes and pharmaceuticals (Ji et al., 2011). Crow (1990) characterized the two native 2,3-butanediol dehydrogenases from L. lactis and found that one generates meso-2,3-BDO from acetoin whereas the other forms an optical isomer generated from diacetyl. Gaspar et al. (2011) overexpressed the native α-acetolactate synthase (Als) and acetoin reductase (ButA) in an LDH-deficient strain, and demonstrated that 67% of the glucose flux could be redirected to 2,3-BDO concurrently with the production of formate (0.65 mol/mol glucose) and ethanol (0.59 mol/mol glucose) under anaerobic conditions. Recently we constructed several recombinant L. lactis strains for high-titer and high-yield production of 2,3-BDO isomers: meso-2,3-BDO, (R,R)-2,3-BDO and (S,S)-2,3-BDO (Kandasamy et al., 2016; Liu et al., 2016a). We introduced EcBDH (butanediol dehydrogenase from Enterobacter cloacae) into our platform strain of L. lactis, in which deletions had been introduced into the genes encoding LDH, PTA, ADHE and ButBA, and then further introduced the capacity to metabolize lactose, which enabled high-titer (51 g/L) and high-yield (0.47 g/g lactose) production of meso-2,3-BDO from whey permeate (a dairy byproduct). Similarly, we demonstrated that the alcohol dehydrogenase SadB from Achromobacter xylosoxidans was able to produce (R,R)-2,3-BDO with a titer of 32 g/L and a yield of 0.40 g/g lactose. Since the production pathway of meso-2,3-BDO and (R,R)-2,3-BDO requires one NADH per mole glucose, the excess NADH is consumed by controlling the Nox activity through limiting oxygen levels. Controlling oxygen levels at large scale is very difficult, if feasible at all, and for this reason we developed a robust strategy to facilitate chemical production by fine-tuning the respiration capacity (Liu et al., 2017). The efficient production of meso-2,3-BDO and (R,R)-2,3-BDO demonstrated the great potential of L. lactis to become an efficient cell factory for the synthesis of biochemicals. For the synthesis of (S,S)-2,3-BDO, another optical isomer of 2,3-BDO, we developed a special strategy that combines enzymatic reactions (glycolysis and the conversion of diacetyl to (S,S)-2,3-BDO) with a non-enzymatic reaction (from α-acetolactate to diacetyl). Using a metabolically engineered L. lactis strain, a titer of 6.7 g/L (S,S)-2,3-BDO was achieved from glucose (Liu et al., 2016a). To the best of our knowledge, this is the first time this chemical has been produced by direct microbial fermentation. The key to the success was the combination of a biocompatible catalyst, which speeds up the conversion of α-acetolactate into diacetyl, and a robust diacetyl reductase from E. cloacae.
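
To put the reported mass yields on lactose into perspective, they can be compared with a stoichiometric maximum. The sketch below is an illustration, not a calculation taken from the cited studies: it assumes that both hexose moieties of lactose are fermented, that each hexose gives two pyruvate and hence one 2,3-BDO, and that no carbon goes to biomass or by-products. The function names are hypothetical.

```python
# Molar masses in g/mol (standard values).
M_LACTOSE = 342.30  # C12H22O11, anhydrous
M_BDO = 90.12       # 2,3-butanediol, C4H10O2

# Assumed stoichiometry: lactose -> 2 hexoses -> 4 pyruvate -> 2 (2,3-BDO) + 4 CO2.
BDO_PER_LACTOSE = 2


def theoretical_yield_g_per_g():
    """Maximum mass yield of 2,3-BDO on lactose under the assumed stoichiometry."""
    return BDO_PER_LACTOSE * M_BDO / M_LACTOSE


def fraction_of_theoretical(observed_yield_g_per_g):
    """Observed mass yield expressed as a fraction of the assumed maximum."""
    return observed_yield_g_per_g / theoretical_yield_g_per_g()


# Yields reported in the text, in g 2,3-BDO per g lactose.
for name, observed in [("meso-2,3-BDO", 0.47), ("(R,R)-2,3-BDO", 0.40)]:
    share = 100 * fraction_of_theoretical(observed)
    print(f"{name}: {share:.0f}% of the assumed theoretical maximum")
```

Under these assumptions the maximum is about 0.53 g/g lactose, so the reported yields correspond to roughly 89% and 76% of that value.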

### Production of Food Ingredients and Vitamins

#### Diacetyl

The long record of safe use of L. lactis in food fermentations makes it an excellent choice as a host for producing food ingredients. There have been many attempts to use this organism for producing diacetyl, a potent flavor compound that contributes to the buttery aroma of many fermented foods, such as cheese, butter and buttermilk. L. lactis subsp. lactis bv. diacetylactis is a native producer of diacetyl and is widely used in the food industry. Normally the diacetyl-forming ability of this strain is associated with citrate metabolism, where citrate is converted into pyruvate, which boosts the pyruvate pool and enables α-acetolactate formation via α-acetolactate synthase (Hugenholtz and Starrenburg, 1992). Since the availability of citrate in food raw materials is very low, there have been efforts to change the native metabolism of L. lactis, either by random mutagenesis or by rational design. Monnet et al. (2000) selected mutants of L. lactis subsp. lactis bv. diacetylactis that were deficient in α-acetolactate decarboxylase (ALD) and had an overall low LDH activity, and demonstrated formation of 6 mM diacetyl, 30 mM acetoin and 12 mM α-acetolactate when the strains were grown in milk supplemented with catalase under aerobic conditions. Hugenholtz et al. (2000) combined NADH-oxidase (NoxE) overexpression and ALD inactivation, and achieved 1.6 mM of diacetyl using resting cells under aerobic conditions, which corresponded to a conversion efficiency of 16% (57% α-acetolactate, 21% acetate). Guo et al. (2012) constructed a promoter library for driving the expression of NoxE and achieved a higher diacetyl concentration of 4.16 mM. Recently we developed a novel strategy in L. lactis, relying on a combination of metabolic engineering, respiration technology and metal-ion catalysis, which turned out to be successful. The homolactic L. lactis was converted into a homo-diacetyl producer with a very high titer (95 mM) and a high yield (87% of the theoretical maximum) (Liu et al., 2016a).

#### Acetoin Isomers

Another important flavor compound, acetoin ((3R)-acetoin), can be produced from pyruvate through the native pathway in L. lactis by the successive action of Als and ALD. It was reported that L. lactis subsp. lactis bv. diacetylactis could produce 5.4 mM acetoin under fully aerated conditions (100% oxygen saturation) after the citrate had been completely consumed (Bassit et al., 1993). We recently achieved high-level production of acetoin (306 mM, 27 g/L) from whey permeate using metabolically engineered L. lactis (Kandasamy et al., 2016; Liu et al., 2016c). The high titer is mainly due to the completely rerouted metabolism.
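
Titers in this section are quoted both in mM and in g/L, and the two are related through the molar mass of the product. The short sketch below, with a hypothetical helper function, shows the conversion for the two acetoin titers mentioned above; the molar mass is the standard value for C4H8O2.

```python
M_ACETOIN = 88.11  # g/mol, C4H8O2


def millimolar_to_g_per_l(conc_mM, molar_mass_g_per_mol):
    """Convert a concentration in mM to g/L for a compound of given molar mass."""
    return conc_mM * molar_mass_g_per_mol / 1000.0


# Titers quoted in the text, in mM.
for label, conc_mM in [("native producer", 5.4), ("engineered strain", 306)]:
    grams = millimolar_to_g_per_l(conc_mM, M_ACETOIN)
    print(f"{label}: {conc_mM} mM acetoin = {grams:.2f} g/L")
```

The engineered-strain titer of 306 mM converts to about 27 g/L, in agreement with the value given in the text.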

We also managed to produce another isomer of acetoin, (3S)-acetoin. It can also serve as a flavor compound, but has other applications as well, e.g., in the synthesis of novel optically active α-hydroxyketone derivatives, pharmaceutical precursors and liquid crystal composites (Xiao and Lu, 2014). (3S)-acetoin can be produced from diacetyl with the aid of diacetyl reductase (DAR). By manipulating cofactor availability and using metal-ion catalysis to speed up the non-enzymatic oxidative decarboxylation of α-acetolactate, (3S)-acetoin was produced at 66 mM (71% of the theoretical maximum) (Liu et al., 2016d). To the best of our knowledge this is the first time this isomer has been made by microbial fermentation.

#### Alanine

Hols et al. (1999) successfully rerouted the carbon flux of L. lactis toward the production of alanine, which is a natural sweetener used as a food ingredient and as a pharmaceutical precursor. The engineered strain was deficient in LDH activity and equipped with alanine dehydrogenase (AlaDH) from Bacillus sphaericus. Using resting cells as biocatalyst, alanine (around 200 mM) was the only product formed after optimizing pH and ammonium concentration. Another AlaDH from Bacillus subtilis (natto) was demonstrated to have a potential for improving alanine levels in L. lactis NZ9000 fermentation broth (Ye et al., 2010).

#### Acetaldehyde

Acetaldehyde is considered the most important aroma compound in yogurt. In order to increase the formation of acetaldehyde in dairy products, Bongers et al. (2005) overexpressed PDC from Z. mobilis and NoxE in wild-type L. lactis. They found that it was possible to redirect 50% of the carbon flux in resting cells to acetaldehyde, which accumulated to 9.5 mM. From this work it was clear that PDC, with its low Km for pyruvate (0.3 mM), could efficiently drain the pyruvate pool, whereas this is more difficult to achieve using Als, which has a very high Km for pyruvate (50 mM) (Zullian et al., 2014).
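
The effect of the two Km values can be illustrated with the Michaelis-Menten expression v/Vmax = S/(Km + S). The sketch below is only a rough illustration: it assumes that both enzymes follow simple Michaelis-Menten kinetics with respect to pyruvate, which may not hold in vivo, and the pyruvate concentrations are hypothetical values chosen for the example. Only the two Km values come from the text.

```python
def mm_saturation(substrate_mM, km_mM):
    """Fraction of Vmax reached at a given substrate level (Michaelis-Menten)."""
    return substrate_mM / (km_mM + substrate_mM)


KM_PDC = 0.3   # mM, Km of PDC for pyruvate (from the text)
KM_ALS = 50.0  # mM, Km of Als for pyruvate (from the text)

# Hypothetical intracellular pyruvate concentrations in mM.
for pyruvate in (0.5, 1.0, 5.0):
    pdc = mm_saturation(pyruvate, KM_PDC)
    als = mm_saturation(pyruvate, KM_ALS)
    print(f"pyruvate {pyruvate} mM: PDC at {pdc:.0%} of Vmax, Als at {als:.0%} of Vmax")
```

At 1 mM pyruvate, for instance, this simple model puts PDC at about 77% of its maximal rate and Als at about 2%, which is consistent with the observation that PDC drains the pyruvate pool far more readily than Als.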

#### Vitamins – Riboflavin

Vitamins are vital nutrients needed for the normal functioning of living organisms. Vitamin deficiencies occur commonly and are often associated with certain health problems; however, deficiencies can be overcome by supplementation, e.g., by fortifying foods with vitamins. Several studies have focused on engineering L. lactis to synthesize B vitamins, such as riboflavin (vitamin B2) and folate (vitamin B11) (Thakur et al., 2016). Burgess et al. (2004) overexpressed the rib operon (ribGBAH) (**Figure 4**) and achieved high-level production of riboflavin (24 mg/L). Furthermore, they developed a strategy for selecting and isolating spontaneous riboflavin-overproducing L. lactis, based on the toxic riboflavin analog roseoflavin. Several mutants exhibited a significantly higher level of riboflavin (around 900 mg/L) (Burgess et al., 2004). Recently our group selected a riboflavin overproducer through a combination of roseoflavin resistance, random mutagenesis and microfluidic screening. The mutant could increase the riboflavin content in milk to 2.81 mg/L, whereas the wild type reduced the riboflavin content of milk to 0.66 mg/L (Chen J. et al., 2017). These results indicate a great potential for L. lactis to become an efficient producer of riboflavin, using either GMO or non-GMO approaches. LeBlanc et al. (2005) carried out an animal test and found that feeding rats with live L. lactis cells overproducing riboflavin could stimulate their growth.

FIGURE 4 | Metabolic pathways involved in the synthesis of riboflavin and folate in LAB. FolE, GTP cyclohydrolase I; FolQ, dihydroneopterin triphosphate pyrophosphohydrolase; FolB, dihydroneopterin aldolase; FolK, hydroxymethyldihydropterin pyrophosphokinase; FolP, dihydropteroate synthase; FolC, dihydrofolate synthase; FolA, dihydrofolate reductase; PabAB, chorismate synthetase component I and II; PabC, 4-amino-4-deoxychorismate lyase; Glu, glutamate; pABA, para-aminobenzoic acid; γ-GluH, gamma-glutamyl hydrolase; RibA, GTP cyclohydrolase II/3,4-dihydroxy-2-butanone-4-phosphate synthase; RibG, riboflavin-specific deaminase/reductase; RibH, riboflavin synthase (beta subunit); and RibB, riboflavin synthase (alpha subunit). Dotted arrows represent multiple consecutive steps in the pathways.

#### Vitamins – Folate

Lactococcus lactis is a good producer of folate, predominantly in the polyglutamyl form. The fol operon, which includes folA, folB, folKE, folP and folC, was found to be involved in folate biosynthesis (**Figure 4**). The overexpression of folKE in L. lactis resulted in an 8-fold increase in extracellular folate production (from 10 to 80 ng/ml), while the total folate production increased nearly 2-fold (from 100 to 180 ng/ml) (Sybesma et al., 2003). Wegkamp et al. (2007) further documented that the overexpression of the folate operon and the para-aminobenzoic acid (pABA) gene clusters, where pABA is an important precursor for folate synthesis, resulted in production of 2.7 mg/L folate per optical density unit at 600 nm, which is 80 times higher than what the wild type is capable of.

### Production of Polysaccharides and Plant Metabolites

#### Exopolysaccharides and Hyaluronic Acid

Exopolysaccharides (EPSs) are long-chain polysaccharides that are secreted into the surroundings during bacterial growth. EPSs contribute to the texture of many dairy products (Laws et al., 2001). It has also been reported that they can function as prebiotics, cholesterol-lowering nutraceuticals or immunomodulants (Nwodo et al., 2012). The biosynthesis of EPSs is complex and normally involves a large number of gene products. In L. lactis, the genes responsible for EPS synthesis are located in a large operon containing 14 genes, epsRXABCDEFGHIJKL, which is found on a plasmid (van Kranenburg et al., 1997). van Kranenburg et al. (1999) demonstrated that over-expressing epsD, encoding the priming glycosyltransferase, could increase EPS production from 113 to 133 mg/L. By increasing the expression level of the entire eps gene cluster, EPS production was increased further to 343 mg/L (Boels et al., 2003a,b). However, an increase in the concentration of sugar-phosphates (Glu-6P, Glu-1P) or sugar nucleotides (UDP-glucose, UDP-galactose) did not affect the production of EPSs significantly. Recently the EPS biosynthetic pathways were re-engineered to enable production of hyaluronic acid (HA), another kind of polysaccharide with various applications in pharmaceuticals and foods (Shah et al., 2013). HA is synthesized by HA synthase through the polymerization of UDP-glucuronic acid and UDP-N-acetylglucosamine, two precursors of cell wall components. The recombinant expression of HA synthase from Streptococcus equi subsp. zooepidemicus in L. lactis resulted in 0.08 g/L HA, and the co-expression of HA synthase and uridine diphosphate-glucose dehydrogenase (UDP-GlcDH) significantly enhanced HA production to 0.65 g/L (Chien and Lee, 2007). Prasad et al. (2010) co-expressed UDP-glucose pyrophosphorylase together with HA synthase and UDP-GlcDH and achieved 1.8 g/L HA in a bioreactor experiment with controlled pH and aeration. These promising results demonstrate the great potential of L. lactis as a platform for the production of functional polysaccharides.

#### Plant Metabolites

In the last 10 years, L. lactis has been studied as a host for the expression of plant-derived genes to produce plant metabolites. Song et al. (2012) successfully expressed β-sesquiphellandrene synthase from Persicaria minor in L. lactis and confirmed the production of β-sesquiphellandrene, a valuable sesquiterpene with excellent antimicrobial and antioxidative properties. The co-expression of 3-hydroxy-3-methylglutaryl coenzyme A reductase, which is the limiting enzyme in the mevalonate pathway, increased the β-sesquiphellandrene level to 109 nM (Song et al., 2012). Hernández et al. (2007) cloned two genes from strawberry, encoding an alcohol acyltransferase and a linalool/nerolidol synthase, and expressed them in L. lactis, which resulted in the production of octyl acetate (1.9 mM) and linalool (85 nM). The production of plant metabolites in L. lactis is still at an early stage, but these results demonstrate a great potential.

### CONCLUDING REMARKS AND FUTURE PROSPECTS

In many single-strain omics and systems biology studies, cells are scrutinized under steady-state conditions, which is useful for exploring and understanding the regulation of cellular metabolism in LAB, e.g., glycolysis. A large number of studies have also demonstrated how helpful this knowledge is for the design and engineering of robust LAB cell factories for producing different high-value chemicals. Systems biology methods, including omics analysis, have also allowed us to understand how LAB respond to environmental changes. In the future, other variables should be considered in the study of LAB communities, e.g., temporal and spatial changes in community composition due to biotic and abiotic factors in the production process, which significantly increase the complexity of the research. Such changes have been shown to greatly influence the cellular states of the microbial community. Improved resolution of meta-omics analysis and the use of properly simplified models will therefore be among the keys to systems biology studies of LAB communities, which will benefit our understanding of these organisms in real-life applications.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

## FUNDING

This work was supported by Innovationsfonden (Grants 6150-00036B and 4150-00037B), Novo Nordisk Fonden (Grant NNF12OC0000818), and Statens Naturvidenskabelige Forskningsfond (FNU10-085115).

### ACKNOWLEDGMENTS

This review article includes some contents which first appeared in SC's thesis (Chan, 2014) and JL's thesis (Liu, 2017) for their doctoral degree (the only medium), individually, from Technical University of Denmark.

### REFERENCES

fmicb-10-00876 April 29, 2019 Time: 17:54 # 16


GlcU permease. Mol. Microbiol. 71, 795–806. doi: 10.1111/j.1365-2958.2008.06564.x


Lactococcus lactis: predominant role of the NADH/NAD+ ratio. J. Bacteriol. 179, 5282–5287.




pyruvate dehydrogenase complexes of Enterococcus faecalis, Lactococcus lactis, Azotobacter vinelandii and Escherichia coli: implications for their activity in vivo. FEMS Microbiol. Lett. 114, 279–283.


Lactococcus lactis. Curr. Opin. Biotechnol. 17, 183–190. doi: 10.1016/j.copbio.2006.02.007


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Liu, Chan, Chen, Solem and Jensen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# A Systematic Approach to Identify and Characterize the Effectiveness and Safety of Novel Probiotic Strains to Control Foodborne Pathogens

Diana I. Ayala<sup>1</sup>, Peter W. Cook<sup>1,2</sup>, Jorge G. Franco<sup>1</sup>, Marie Bugarel<sup>1</sup>, Kameswara R. Kottapalli<sup>3</sup>, Guy H. Loneragan<sup>1†</sup>, Mindy M. Brashears<sup>1</sup> and Kendra K. Nightingale<sup>1</sup>\*

<sup>1</sup> International Center for Food Industry Excellence, Department of Animal and Food Sciences, Texas Tech University, Lubbock, TX, United States, <sup>2</sup> Influenza Division, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, United States, <sup>3</sup> Center for Biotechnology and Genomics, Texas Tech University, Lubbock, TX, United States

#### Edited by:

Jan Kok, University of Groningen, Netherlands

#### Reviewed by:

Jasna Novak, University of Zagreb, Croatia

Robin Anderson, United States Department of Agriculture, United States

\*Correspondence: Kendra K. Nightingale kendra.nightingale@ttu.edu

#### †Present address:

Guy H. Loneragan, School of Veterinary Medicine, Texas Tech University, Lubbock, TX, United States

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 22 January 2019 Accepted: 01 May 2019 Published: 17 May 2019

#### Citation:

Ayala DI, Cook PW, Franco JG, Bugarel M, Kottapalli KR, Loneragan GH, Brashears MM and Nightingale KK (2019) A Systematic Approach to Identify and Characterize the Effectiveness and Safety of Novel Probiotic Strains to Control Foodborne Pathogens. Front. Microbiol. 10:1108. doi: 10.3389/fmicb.2019.01108

A total of 44 lactic acid bacteria (LAB) strains originally isolated from cattle feces and different food sources were screened for their potential probiotic features. The antimicrobial activity of all isolates was tested by well-diffusion assay and competitive exclusion in broth against Salmonella Montevideo, Escherichia coli O157:H7 and Listeria monocytogenes strain N1-002. Thirty-eight LAB strains showed an antagonistic effect against at least one of the pathogens tested in this study. The strongest inhibitory effect was observed against L. monocytogenes, with zones of inhibition up to 24 mm when LAB overnight cultures were used and up to 21 mm when cell-free filtrates were used. For E. coli O157:H7 and Salmonella, maximum inhibition zones of 12 and 11.5 mm were observed, respectively. In broth, 43 strains reduced L. monocytogenes by up to 9.06 log<sub>10</sub> CFU/ml, 41 reduced E. coli O157:H7 by up to 0.84 log<sub>10</sub> CFU/ml, and 32 reduced Salmonella by up to 0.94 log<sub>10</sub> CFU/ml 24 h after co-inoculation. Twenty-eight LAB isolates that exhibited the highest inhibitory effect across pathogens were further analyzed to determine their antimicrobial resistance profile, adhesion potential, and cytotoxicity to Caco-2 cells. All LAB strains tested were susceptible to ampicillin, linezolid, and penicillin. Twenty-six were able to adhere to Caco-2 cells, of which five were classified as highly adhesive with > 40 bacterial cells/Caco-2 cell. Low cytotoxicity percentages were observed for the candidate LAB strains, with values ranging from −5 to 8%. Genotypic identification by whole genome sequencing confirmed all as members of the LAB group; Enterococcus was the genus most frequently isolated with 21 isolates, followed by Pediococcus with 4 and Lactobacillus with 3. In this study, a systematic approach was used for the improved identification of novel LAB strains able to exert an antagonistic effect against important foodborne pathogens. Our findings suggest that the selected panel of LAB probiotic strains can be used as biocontrol cultures to inhibit and/or reduce the growth of L. monocytogenes, Salmonella, and E. coli O157:H7 in different matrices and environments.

Keywords: lactic acid bacteria, probiotics, characterization, safety, pathogens

## INTRODUCTION

fmicb-10-01108 May 16, 2019 Time: 14:41 # 2

In the United States (U.S.), an estimated 9.4 million cases of foodborne illness, 55,961 hospitalizations, and 1,351 deaths per annum are attributed to 31 identifiable foodborne pathogens (Scallan et al., 2011). Ninety-five percent of the total illnesses, hospitalizations, and deaths were estimated to be caused by only 15 pathogens, including Listeria monocytogenes, non-typhoidal Salmonella and Escherichia coli O157:H7 (Hoffmann et al., 2015). Non-typhoidal salmonellosis is a leading cause of bacterial gastroenteritis in the U.S. and worldwide and foodborne illnesses caused by L. monocytogenes and E. coli O157:H7 are associated with exceptionally high morbidity and mortality rates (Scallan et al., 2011). The growing concern of antimicrobial resistance (AMR) coupled with the increased demand for a safe food supply by consumers has prompted an increased interest in the use of probiotics as a natural biocontrol strategy to reduce foodborne pathogens along the food continuum.

Probiotics are live, naturally occurring microorganisms that in adequate amounts confer benefits to the host (Fuller, 1992). Probiotics have also emerged as a natural alternative to antimicrobials in animal feed to promote animal health [also referred to as direct fed microbials (DFMs) in animal feed] and to chemical interventions to control foodborne pathogens in human and pet food. Modes of action used by probiotics include production of antimicrobial compounds (e.g., bacteriocins and organic acids) and competitive exclusion. Probiotic strains compete with pathogens for nutrients and minerals as well as for receptors or adhesion sites in the host intestinal tract, thereby displacing pathogens from host intestinal epithelial cells. Probiotics also improve host intestinal barrier function and activate mucosal immunity (McAllister et al., 2011). Together, these modes of probiotic action and the stimulation of the host immune system interfere with the pathogens' essential cell functions, causing leakage of cytoplasmic components and cytotoxicity, and thus leading to pathogen cell death (Yirga, 2015).

Due to their demonstrated antagonistic effects against foodborne and spoilage bacteria, the probiotic strains most commonly used to promote host health and control foodborne pathogens are lactic acid bacteria (LAB) from the genera Lactobacillus and Enterococcus (Imperial and Ibana, 2016). LAB are an order of gram-positive, non-spore-forming cocci, bacilli or rods that are generally non-respiratory and lack catalase; they ferment glucose to produce either lactic acid alone or lactic acid together with CO<sub>2</sub> and ethanol. Most LAB are beneficial to the host; however, some LAB are pathogens or opportunistic pathogens of animals and humans (e.g., some Streptococcus and Enterococcus spp.), and careful selection criteria should be applied when choosing probiotic strains to be included as DFMs in animal feed and as probiotics in human and pet food (Yirga, 2015). LAB are ubiquitous in nature and can be routinely isolated from vegetation and a wide range of raw foods including milk and milk products, meat, and produce (Mohania et al., 2008; Quinto et al., 2014). Additionally, LAB are natural commensals of the gastrointestinal tract (GIT) of mammals, where they constitute the dominant indigenous lactic microbiota; this enables LAB to beneficially affect the host by improving the microbial profile in the gut (Fuller, 1997; Brashears et al., 2003).

The criteria and safety assessment used to select new probiotic strains for biocontrol interventions include the identification and characterization of non-pathogenic strains with antagonistic features against pathogens in a host or in other systems where pathogen control is needed. Desirable features of a potential probiotic strain include the ability to (i) attach to and colonize intestinal epithelial cells, (ii) exhibit susceptibility to antibiotics, (iii) stably survive and remain metabolically active in the small intestine, and (iv) remain viable during delivery (Krehbiel et al., 2003; Gaggìa et al., 2010; Sanders et al., 2010; Seo et al., 2010). Benefits of supplementing animal feed and pet food with probiotics include (i) improved resistance to disease through a beneficial shift in the microbial community, (ii) reduced pathogen colonization, (iii) stimulation of host immunity, and (iv) overall improved host health (Beauchemin et al., 2003; Quigley, 2011). LAB include at least 13 genera, within which thousands of genetically diverse strains differ in their ability to benefit the host by controlling pathogens and improving overall host health (Liu et al., 2014).

Previous findings have shown that the effectiveness of probiotics in controlling enteric pathogens, and the health benefits they confer, are strain-specific, with different results for adhesion, autoaggregation, and immunomodulatory effect depending on the strain used (Santosa et al., 2006; Angelakis et al., 2011). The overall aim of this study was to use a combination of genotypic and phenotypic assays to characterize a set of novel LAB strains for their ability to control Salmonella, E. coli O157:H7 and L. monocytogenes throughout the food continuum, including pre-harvest applications in animal feed and post-harvest applications in pet and human food, along with environments associated with pet and human food processing and handling. A collection of 44 LAB strains from cattle feces and human food was characterized by agar well-diffusion and competitive broth exclusion assays (to identify LAB strains with antagonistic effects against L. monocytogenes, Salmonella, and E. coli O157:H7), whole genome sequencing (for taxonomic identification and to predict bacteriocin and virulence gene carriage), antimicrobial susceptibility profiling, and cell culture assays (to determine adhesion to and cytotoxicity against intestinal epithelial cells).

#### MATERIALS AND METHODS

### Bacterial Strains and Growth Conditions

A total of 53 LAB isolates (out of an initial set of > 200 novel strains) from cattle feces and different food sources including meat, fruits, and vegetables that showed initial antagonistic activity against L. monocytogenes, Salmonella, and E. coli O157:H7 were obtained from the stock culture collection of the International Center for Food Industry Excellence at Texas Tech University (ICFIE: TTU). Isolates were streaked for isolation onto de Man Rogosa and Sharpe (MRS) agar plates (Merck, Darmstadt, Germany), and incubated aerobically at 37°C for 48 h to obtain well-isolated colonies. A single colony was selected and grown in 9 ml of fresh MRS broth (Criterion, Hardy Diagnostics, CA, United States) at 37°C for 12–18 h. Nine of the 53 LAB strains did not grow further in MRS broth and were removed from the study. Pure cultures of the 44 remaining LAB strains were grown in MRS broth as described above, preserved on cryobeads (Key Scientific, Stamford, TX, United States) and stored at −80°C until further use. Foodborne pathogen isolates including L. monocytogenes, Salmonella and E. coli O157:H7 were streaked onto Tryptic Soy Agar (TSA, Becton, Dickinson, Le Pont de Chaix, France) and incubated at 37°C for 18–24 h. A single colony of each pathogen was selected and grown individually in 9 ml of Brain Heart Infusion (BHI) broth (Merck, Darmstadt, Germany). Overnight cultures of LAB and pathogenic strains were used to perform agar-well diffusion and competitive exclusion assays as detailed below to determine the antagonistic activity of all 44 LAB strains against each pathogen.

#### Agar-Well Diffusion Assay

This set of experiments was performed based on the method described previously by Vinderola et al. (2008) with slight modifications. Overnight cultures (incubated in nutrient-rich media for 12–18 h at 37°C) of foodborne pathogens were diluted to achieve a concentration of 10<sup>5</sup> CFU/mL, swabbed onto BHI plates and incubated at 37°C for 24 h to create a lawn. To assay the antagonistic activity of each LAB strain against each foodborne pathogen, five 6-mm wide wells cut into a BHI agar plate were each filled with a 100 µl aliquot of one of the following: (i) 10<sup>8</sup> CFU/mL of each LAB strain overnight culture in duplicate wells, (ii) cell-free filtrate (CFF) of LAB overnight cultures, passed through a sterile 0.45 µm filter (Sigma-Aldrich, St. Louis, MO, United States), in duplicate wells, and (iii) MRS broth (control) in a single well. Plates were first incubated at 4°C for 2 h to allow suspensions to diffuse into the agar, followed by incubation for 24 h at 37°C (Zavisic et al., 2012). Antagonistic activity was determined by the development of clear zones of inhibition around each well, which were measured using a caliper. Overall scores of inhibition were calculated by summing the values observed for overnight cultures and CFF across all three pathogens, and potential LAB probiotic strains were ranked for antagonistic activity against all three foodborne pathogens.
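The scoring and ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the strain names and zone values are invented, and it assumes that each strain has one representative zone diameter per pathogen for the overnight culture and one for the CFF (for example, the mean of the duplicate wells).

```python
from typing import Dict, List, Tuple

# Hypothetical data layout: pathogen -> (overnight culture zone, CFF zone), in mm.
Zones = Dict[str, Tuple[float, float]]


def overall_score(zones: Zones) -> float:
    """Sum overnight-culture and CFF zones of inhibition across all pathogens."""
    return sum(culture + cff for culture, cff in zones.values())


def rank_strains(data: Dict[str, Zones]) -> List[Tuple[str, float]]:
    """Return (strain, overall score) pairs sorted from highest to lowest score."""
    scores = {strain: overall_score(zones) for strain, zones in data.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative values only; these are not taken from Table 1.
example = {
    "strain_A": {
        "L. monocytogenes": (24.0, 21.0),
        "Salmonella": (0.0, 0.0),
        "E. coli O157:H7": (0.0, 0.0),
    },
    "strain_B": {
        "L. monocytogenes": (15.0, 10.0),
        "Salmonella": (9.0, 0.0),
        "E. coli O157:H7": (8.5, 0.0),
    },
}
ranking = rank_strains(example)
```

A strain that strongly inhibits a single pathogen can outrank a strain with moderate activity against all three, which is why the summed score is best read alongside the per-pathogen values.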

#### Competitive Exclusion Broth Culture Assay

Overnight cultures of each pathogen and LAB strain were prepared as described above for agar well diffusion assays and co-inoculated at 10<sup>5</sup> and 10<sup>6</sup> CFU/mL, respectively, in Tryptic Soy Broth (TSB; Oxoid Ltd., Basingstoke, United Kingdom) supplemented with 1 g l<sup>−1</sup> Tween 80 (Acros Organics, NJ, United States) and incubated at 37°C with slight agitation (130 rpm). A previous study demonstrated that addition of Tween to TSB allows growth of both LAB and gram-negative pathogens, including E. coli O157:H7 and Salmonella (Cálix-Lara et al., 2012). Samples were diluted and plated onto Modified Oxford Agar (MOX; Becton, Dickinson and Company), Xylose Lysine Tergitol 4 Agar (XLT4; Becton, Dickinson and Company), MacConkey agar with sorbitol (SMAC; Criterion, Hardy Diagnostics, CA, United States), and MRS agar plates to enumerate L. monocytogenes, Salmonella, E. coli O157:H7, and LAB strains, respectively, at 0, 6, 12, and 24 h of co-inoculation. MOX, XLT4, and SMAC plates were incubated at 37°C for 24 h, and MRS plates were incubated at 37°C for 48 h. Antagonistic activity of each LAB strain was determined as the pathogen reduction relative to control samples (pathogen cultures without LAB) at each time point. As for agar well diffusion assays, pathogen reductions were summed across all time points and across all three pathogens, and potential LAB probiotic strains were ranked based on their antagonistic effect. The top 20 ranking LAB strains from each of the agar well diffusion and competitive exclusion broth assays (n = 28 LAB strains in total) were selected for further phenotypic and genotypic evaluation by the assays below.
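The reduction calculation described above can be illustrated with a short sketch. It assumes that reductions are expressed as the difference in log<sub>10</sub> CFU/mL between the LAB-free control and the co-culture at the same time point; the counts are invented, and the floor used when no colonies are recovered (`DETECTION_FLOOR`) is a hypothetical choice that the source does not specify.

```python
import math
from typing import Dict

# Assumption: counts below this floor (CFU/mL) are set to the floor so that
# log10 is defined when no colonies are recovered. The value is hypothetical.
DETECTION_FLOOR = 1.0


def log_reduction(control_cfu: float, treated_cfu: float) -> float:
    """log10 reduction of a pathogen relative to the LAB-free control."""
    control = max(control_cfu, DETECTION_FLOOR)
    treated = max(treated_cfu, DETECTION_FLOOR)
    return math.log10(control) - math.log10(treated)


def summed_reduction(control: Dict[int, float], treated: Dict[int, float]) -> float:
    """Sum log10 reductions over the sampling times (h) present in both series."""
    shared_times = sorted(set(control) & set(treated))
    return sum(log_reduction(control[t], treated[t]) for t in shared_times)


# Illustrative counts (CFU/mL) keyed by hours after co-inoculation.
control_counts = {6: 1e6, 12: 1e8, 24: 1e9}
treated_counts = {6: 1e5, 12: 1e6, 24: 0.0}
total = summed_reduction(control_counts, treated_counts)
```

Repeating the sum for each pathogen and adding the three totals gives a per-strain score that can be ranked in the same way as the agar-well diffusion scores.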

### Caco-2 Cell Attachment Assay

The human intestinal epithelial cell line Caco-2 was used to evaluate the in vitro ability of the top ranking LAB strains to adhere to the GIT. Caco-2 cells (ATCC HTB-37™) were maintained in Eagle's Minimum Essential Medium (EMEM, ATCC 30-2003™) supplemented with 20% fetal bovine serum (FBS, ATCC 30-2020™) and 1% penicillin/streptomycin solution (10,000 U/mL of penicillin and 10,000 µg/mL streptomycin) (PenStrep) (Gibco, Thermo Fisher Scientific) at 37°C in a water-jacketed incubator with 5% CO<sub>2</sub>. Three days before attachment assays were performed, Caco-2 cells were seeded into 24-well tissue culture plates at a target density of 5 × 10<sup>4</sup> cells/well to achieve a confluent density of 1 × 10<sup>5</sup> cells/well in EMEM supplemented with 20% FBS and 1% PenStrep. For each attachment assay, cell culture media (1 mL/well) was removed and replaced with antibiotic-free EMEM medium. Duplicate confluent Caco-2 cell monolayers were then inoculated with an average of 1.6 × 10<sup>7</sup> CFU/ml (95% CI: 1.1 × 10<sup>7</sup>–2.0 × 10<sup>7</sup>) of each LAB strain to be analyzed, resulting in a multiplicity of infection (MOI) of 52.4, as determined by plating the LAB inoculum on MRS plates and counting the Caco-2 cells in a Neubauer chamber. Inoculated Caco-2 cell plates were returned to the water-jacketed incubator for 30 min to allow for LAB attachment. Caco-2 cell monolayers were washed once with 1 mL sterile phosphate-buffered saline (PBS, Thermo Scientific, Rockford, IL, United States) to remove non-adherent or loosely adherent bacteria. Caco-2 cells were then lysed by addition of 1 mL of ice-cold water to release adherent LAB. Appropriate serial dilutions were plated onto MRS plates and incubated, and LAB were enumerated as detailed above. The attachment efficiency of each LAB strain was assayed in two biologically independent experiments, each with two technical replicates. Attachment efficiency was calculated by dividing the average number of adherent bacterial cells (across biological and technical replicates) by the number of Caco-2 cells in each well.
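As a worked illustration of the attachment efficiency calculation, the sketch below averages adherent counts over the four wells (two biological experiments with two technical replicates each) and divides by the Caco-2 cell count per well. The counts are invented, and the use of a single Caco-2 cell count for all wells is a simplifying assumption.

```python
from statistics import mean
from typing import Sequence


def attachment_efficiency(
    adherent_cfu_per_well: Sequence[float], caco2_cells_per_well: float
) -> float:
    """Average adherent bacteria per well divided by Caco-2 cells per well."""
    if caco2_cells_per_well <= 0:
        raise ValueError("Caco-2 cell count must be positive")
    return mean(adherent_cfu_per_well) / caco2_cells_per_well


# Illustrative adherent counts (CFU/well) for 2 biological x 2 technical replicates.
counts = [4.0e6, 4.4e6, 3.8e6, 4.2e6]
efficiency = attachment_efficiency(counts, 1.0e5)  # bacterial cells per Caco-2 cell
```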

### Caco-2 Cell Cytotoxicity Assay

The in vitro cytotoxicity of the top ranking LAB strains against Caco-2 cells was evaluated by using a CytoTox 96 non-radioactive cytotoxicity kit (Promega, Madison, WI, United States), which measures release of lactate dehydrogenase (LDH) upon Caco-2 cell lysis, following manufacturer's recommendations. Briefly, Caco-2 cells were seeded on a 24-well plate and grown to confluence as described above. Caco-2 monolayers were inoculated with each LAB strain at a level of 1.6 × 10<sup>7</sup> CFU/ml (95% CI: 1.1 × 10<sup>7</sup>–2.0 × 10<sup>7</sup>) and cytotoxicity was evaluated after 24 h by measuring absorbance (release of LDH) in a microplate reader (Biotek Instruments, Winooski, VT, United States) at 490 nm. The cytotoxicity of each LAB strain was evaluated in two biologically independent experiments each containing two technical replicates. In each independent experiment, two un-inoculated Caco-2 monolayers were included and used as maximum lysis controls (according to the kit instructions) and two un-inoculated Caco-2 monolayers were included to determine the background absorbance associated with the medium. Background absorbance was averaged and subtracted from all individual observed values for Caco-2 wells inoculated with a LAB strain within each independent experiment. Percent cytotoxicity was expressed as the adjusted average absorbance value (subtracting background absorbance) for each LAB strain divided by the absorbance for the maximum lysis control.
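The percent cytotoxicity calculation can be sketched as below. The absorbance readings are invented, and the sketch follows the text literally in leaving the maximum lysis control uncorrected for background; whether the authors also background-corrected that control is not stated, so this is an assumption.

```python
from statistics import mean
from typing import Sequence


def percent_cytotoxicity(
    sample_abs: Sequence[float],
    background_abs: Sequence[float],
    max_lysis_abs: Sequence[float],
) -> float:
    """Background-adjusted mean sample absorbance as a percentage of maximum lysis."""
    background = mean(background_abs)
    adjusted = mean([a - background for a in sample_abs])
    return 100.0 * adjusted / mean(max_lysis_abs)


# Illustrative 490 nm readings from one independent experiment.
value = percent_cytotoxicity(
    sample_abs=[0.30, 0.34],
    background_abs=[0.25, 0.27],
    max_lysis_abs=[1.5, 1.7],
)
```

Because the background is subtracted from each sample reading, a strain whose wells read below the background average yields a negative percentage, which is consistent with a reported range that extends below zero.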

#### Antimicrobial Susceptibility Profiling

Antimicrobial susceptibility profiling was performed following the National Antimicrobial Resistance Monitoring System (NARMS) protocol (FDA, 2016). LAB strains were streaked onto TSA plates and incubated at 37°C for 24 h. A single colony of each LAB strain was selected and sub-streaked onto TSA plates containing 5% defibrinated sheep blood (HemoStat Laboratories, Dixon, CA, United States) and incubated at 37°C for 24–48 h. Antibiotic susceptibility was evaluated using the Sensititre™ Gram-positive MIC plate assayed on the Sensititre™ automated antimicrobial susceptibility system (Trek Diagnostic Systems, Westlake, Ohio) following manufacturer's instructions. Enterococcus faecalis ATCC 29212 was used as a quality control organism. Thirteen antimicrobial agents included in the MIC plate were evaluated in this study: ampicillin (AMP), clindamycin (CLI), daptomycin (DAP), erythromycin (ERY), gentamicin (GEN), levofloxacin (LVX), linezolid (LZD), penicillin (PEN), rifampicin (RIF), synercid (SYN), tetracycline (TET), trimethoprim/sulfamethoxazole (SXT), and vancomycin (VAN). The Minimum Inhibitory Concentration (MIC) breakpoints for the antimicrobials tested were interpreted based on Clinical and Laboratory Standards Institute (CLSI) guidelines (Clinical and Laboratory Standards Institute, 2017).

#### Whole Genome Sequencing and Bioinformatics Analyses

A total of 28 novel LAB strains with antagonistic characteristics toward L. monocytogenes, E. coli O157:H7 and Salmonella were selected for further genotypic characterization. LAB strains were cultivated in MRS broth as detailed above, and genomic DNA (gDNA) was isolated and purified using the Invitrogen Purelink DNA Extraction kit (ThermoFisher Scientific, Waltham, MA, United States). Pure gDNA was quantified using a Qubit® 2.0 Fluorometer (Life Technologies, CA, United States), and used for library preparation with the Nextera XT v2.0 kit (Illumina, San Diego, CA, United States) as per manufacturer's recommendations. DNA libraries were subjected to paired-end sequencing using the 2 × 250 basepair (bp) V2 sequencing kit on an Illumina MiSeq platform (Illumina Inc., United States). Raw reads were preprocessed and filtered using Trimmomatic version 0.36 (Bolger et al., 2014), followed by de novo assembly using SPAdes version 11 (Bankevich et al., 2012). Resultant scaffolds were annotated using Prokka v1.13 (Seemann, 2014) and 31 conserved amino acid coding sequences were identified through the AMPHORA2 pipeline (Wu and Scott, 2012). The 31 conserved amino acid coding sequences were aligned and a concatenated alignment was created to compare our novel LAB strains to a large background of other strains representing genera in the LAB order. Taxonomic identification for each of our LAB strains was determined based on the highest confidence gene set.

Bacteriocins, virulence factors, and potential AMR genes were identified by comparing the untranslated gene sequences identified during genome annotation to the BAGEL3, Virulence Finder, and ResFinder and PlasmidFinder databases, respectively (Zankari et al., 2012; van Heel et al., 2013; Joensen et al., 2014). Un-gapped alignments with higher than 95% identity and 95% query coverage were considered positive and were used to confirm the presence of virulence factor and AMR genes. A phylogenetic tree was generated using the concatenated alignment in RAxML (Randomized Axelerated Maximum Likelihood) (Stamatakis, 2014) on the CIPRES science gateway (Miller et al., 2010).
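The acceptance rule for database hits (un-gapped, more than 95% identity, more than 95% query coverage) can be expressed as a simple filter. The sketch below is illustrative only: the `Hit` record and its field names are hypothetical rather than the output format of any of the tools named above, and the example values are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """Hypothetical summary of one alignment between a query gene and a database entry."""
    gene: str
    percent_identity: float
    query_coverage: float
    gaps: int


def confirmed_hits(
    hits: List[Hit], min_identity: float = 95.0, min_coverage: float = 95.0
) -> List[Hit]:
    """Keep un-gapped hits that exceed both the identity and coverage thresholds."""
    return [
        h
        for h in hits
        if h.gaps == 0
        and h.percent_identity > min_identity
        and h.query_coverage > min_coverage
    ]


# Illustrative hits; gene labels and values are placeholders.
hits = [
    Hit("gene_A", 99.8, 100.0, 0),  # passes all three criteria
    Hit("gene_B", 96.0, 90.0, 0),   # fails query coverage
    Hit("gene_C", 97.5, 98.0, 2),   # fails the un-gapped requirement
    Hit("gene_D", 95.0, 100.0, 0),  # fails: identity is not higher than 95%
]
kept = [h.gene for h in confirmed_hits(hits)]
```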

Antimicrobial resistance-encoding genes identified by using the ResFinder v3.0 (Zankari et al., 2012) and PlasmidFinder v2.0 (Carattoli et al., 2014) pipelines from the Center for Genomic Epidemiology website<sup>1</sup> were compared using BLASTn against the GenBank nucleotide database using default settings for sequence identification. A multiple genome alignment was created using Mauve software (v2.4.0) to compare plasmid sequences identified in strains L22, L24-A, and L25 against plasmid sequence data from Enterococcus faecium (accession number KJ645709).

### Confidence Interval Estimation and Statistical Analyses

Confidence intervals (95%) for attachment and cytotoxicity assay data were estimated using the mean and standard deviation of the ratio of bacterial cells to Caco-2 cells (attachment efficiency) and percent cytotoxicity (calculated by dividing adjusted average absorbance values by the maximum lysis control absorbance value in each experiment), respectively. Attachment efficiency and percent cytotoxicity values were analyzed using an ANOVA followed by the Bonferroni familywise error correction for multiple comparisons. Strain-to-strain comparisons were made to identify statistically significant differences using the R programming language (R Core Team, 2017). The R packages ggplot, phangorn, tidytree, ggtree, phylotools, and ape (Wickham, 2009; Schliep, 2011; Popescu et al., 2012; Revell, 2012; Yu et al., 2017) were used to describe the relationship between attachment efficiency, percent cytotoxicity and number of predicted bacteriocin and virulence genes.

<sup>1</sup>https://cge.cbs.dtu.dk/services/
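The two statistical steps named in this section can be illustrated in a few lines. The analysis in the study was done in R; the sketch below uses Python only for illustration and makes two assumptions that the source does not state: the 95% confidence interval is a Student's t interval computed from the mean and sample standard deviation, and the Bonferroni correction is applied by multiplying each raw p-value by the number of comparisons. All data values are invented.

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple

# Two-sided 95% critical values of Student's t for small degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def ci95(values: Sequence[float]) -> Tuple[float, float]:
    """95% t-based confidence interval for the mean (2 to 6 observations)."""
    n = len(values)
    half_width = T_CRIT_95[n - 1] * stdev(values) / math.sqrt(n)
    center = mean(values)
    return center - half_width, center + half_width


def bonferroni(p_values: Dict[Tuple[str, str], float]) -> Dict[Tuple[str, str], float]:
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


# Illustrative attachment values (bacterial cells/Caco-2 cell) for one strain.
low, high = ci95([40.0, 42.0, 44.0, 42.0])

# Illustrative raw p-values for three strain-to-strain comparisons.
adjusted = bonferroni({("A", "B"): 0.01, ("A", "C"): 0.04, ("B", "C"): 0.5})

# All pairwise comparisons among 28 strains.
n_comparisons = math.comb(28, 2)
```

With 28 strains there are 378 possible pairwise comparisons, so a Bonferroni correction over all of them is stringent, and only large differences between strains remain significant.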

### RESULTS

### Agar-Well Diffusion Assay


Thirty-eight of the LAB strains showed an antagonistic effect against at least one of the pathogens tested (Salmonella, E. coli O157:H7 and L. monocytogenes); the remaining six did not inhibit or reduce the growth of the pathogens analyzed in this study. Thirty-seven of the LAB strains showed antimicrobial activity against L. monocytogenes, with clear zones of inhibition ranging from 8.5 to 24 mm when overnight LAB cultures were used and from 6.5 to 21 mm when CFF was used. Twenty strains were antagonistic against E. coli O157:H7, with a maximum inhibition zone of 12 mm produced by strain L15, and 18 were antagonistic against Salmonella, with a maximum zone of inhibition of 11.5 mm produced by strain L24-B. No zones of inhibition were observed for either E. coli O157:H7 or Salmonella when only CFF was used. Overall scores of inhibition summed across overnight culture and CFF for all three pathogens ranged from 33.5 to 45 mm for the top 20 LAB strains. Strain L20-B had the highest overall inhibition score but was not inhibitory against Salmonella or E. coli O157:H7. Strain L28 was the highest ranking antagonistic strain that was effective against all three pathogens. Twelve of the top 20 LAB strains were originally isolated from a bovine source (i.e., cattle feces or raw meat), while the eight remaining isolates were isolated from fruits (n = 6) or vegetables (n = 2). Overall, greater antimicrobial activity was found when overnight cultures were used compared to CFF (**Table 1**).

### Competitive Exclusion in Broth Culture

The antimicrobial activity of all LAB strains evaluated in this study is shown in **Supplementary Table 1**. Pathogen reductions were analyzed at 6, 12, and 24 h after co-inoculation relative to control cultures (pathogen without LAB); for E. coli O157:H7 and Salmonella, the highest reductions were observed 6 h after co-inoculation (2.03 and 2.53 log<sub>10</sub> CFU/ml, respectively). At 12 h, the highest reductions were 1.12 and 1.36 log<sub>10</sub> CFU/ml for E. coli O157:H7 and Salmonella, respectively. Twenty-four hours after co-inoculation, 43 of the strains reduced L. monocytogenes by 0.12–9.06 log<sub>10</sub> CFU/ml. Strains L28 and L20-B had the highest antimicrobial activity against L. monocytogenes, with reductions of 9.06 and 6.96 log<sub>10</sub> CFU/ml, respectively. Forty-one of the strains reduced E. coli O157:H7 by 0.02–0.84 log<sub>10</sub> CFU/ml, with L4-B and L20-B achieving the highest reductions of 0.84 and 0.82 log<sub>10</sub> CFU/ml, respectively. Thirty-two of the LAB strains reduced Salmonella by 0.05–0.94 log<sub>10</sub> CFU/ml, with L24-B and L20-B as the most antagonistic with reductions of 0.94 and 0.84 log<sub>10</sub> CFU/ml, respectively. Greater antimicrobial effects were observed against L. monocytogenes than against the other pathogens evaluated; notably, strain L28 completely eliminated L. monocytogenes 24 h after co-inoculation (**Supplementary Table 1**).

Pathogen reductions were summed across time points and all three pathogens, and novel LAB strains were ranked by their overall antimicrobial effect; strain L28 had the highest reduction across all pathogens tested in this study (**Supplementary Table 1**). Overall scores for agar-well diffusion and competitive exclusion assays were summed to determine the top 20 LAB strains (within each assay); L20-B, L28, and J7 ranked as the LAB strains with the greatest antagonistic activity. The top 20 strains from each assay (n = 28 strains collectively) were further characterized by cell culture assays, antimicrobial susceptibility profiling and whole genome sequencing.
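Taking the top 20 strains from each of two assays and obtaining 28 strains in total implies that 12 strains ranked in the top 20 of both assays (20 + 20 − 12 = 28). The selection amounts to a set union of the two top-ranked lists, sketched below with invented strain names and scores and a smaller cut-off so the example stays readable.

```python
from typing import Dict, Set


def top_n(scores: Dict[str, float], n: int) -> Set[str]:
    """Names of the n highest-scoring strains."""
    ranked = sorted(scores, key=lambda strain: scores[strain], reverse=True)
    return set(ranked[:n])


def select_candidates(
    diffusion: Dict[str, float], broth: Dict[str, float], n: int = 20
) -> Set[str]:
    """Union of the top-n strains from the two assays."""
    return top_n(diffusion, n) | top_n(broth, n)


# Illustrative overall scores per assay; names and values are placeholders.
diffusion_scores = {"s1": 45.0, "s2": 40.0, "s3": 33.5, "s4": 10.0}
broth_scores = {"s1": 12.0, "s2": 1.0, "s3": 0.5, "s4": 9.0}
candidates = select_candidates(diffusion_scores, broth_scores, n=2)
```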

## Caco-2 Cell Attachment and Cytotoxicity Assays

The ability of LAB strains to adhere to Caco-2 cells after 30 min was evaluated in this study. LAB attachment ranged from 4 to 84 bacterial cells/Caco-2 cell (**Table 2**). According to the classification of microorganisms by adhesive properties proposed by Candela et al. (2005), five strains (L4-B, L8-A, L3-A, L15, and L2A) were classified as highly adhesive, with > 40 bacterial cells/Caco-2 cell; 21 of the strains were classified as adhesive, with 5–40 bacterial cells/Caco-2 cell; and two were classified as non-adhesive, with < 5 adherent bacterial cells/Caco-2 cell. The cytotoxic activity of our LAB strain panel was determined based on the release of the stable cytosolic enzyme LDH into the culture medium 24 h after bacterial inoculation. Percent cytotoxicity for the panel ranged from −5 to 8%, with L6B being the least cytotoxic and J16 the most cytotoxic (**Table 2** and **Figure 1**).
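The adhesion classes used above reduce to two thresholds on the number of bacterial cells per Caco-2 cell. A minimal sketch, with the thresholds taken from the text and the function name chosen here for illustration:

```python
def classify_adhesion(bacteria_per_caco2_cell: float) -> str:
    """Adhesion class using the thresholds quoted in the text (after Candela et al., 2005)."""
    if bacteria_per_caco2_cell > 40:
        return "highly adhesive"
    if bacteria_per_caco2_cell >= 5:
        return "adhesive"
    return "non-adhesive"


# The extremes of the reported range (4 and 84 cells/Caco-2 cell) and the two boundaries.
classes = [classify_adhesion(x) for x in (4, 5, 40, 84)]
```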

### Antimicrobial Susceptibility Profiling

All LAB isolates evaluated were susceptible to AMP, PEN, and LZD. Resistance to LVX was the most common in our study (n = 17), followed by CLI (n = 9), VAN and SYN (each with n = 7), TET and SXT (each with n = 6), and RIF and GEN (each with n = 1). Intermediate resistance was most commonly observed for ERY and DAP, with 20 and 13 LAB isolates, respectively. Additionally, 7, 4, 3, and 2 LAB isolates exhibited intermediate resistance to GEN, TET, RIF, and SYN, respectively (**Table 3**). Three of the LAB isolates exhibited resistance to one antimicrobial agent (LVX); 16 were resistant to two antimicrobials, with LVX-DAP being the most common AMR profile in this study (9 isolates). Additionally, nine of the LAB isolates in this study exhibited multidrug resistance (MDR), with LVX-VAN-SXT as the most common MDR profile (4 isolates), followed by DAP-TET-CLI-SYN (3 isolates) (**Table 4**).

### Whole Genome Sequencing and Bioinformatics

Genotypic identification by WGS confirmed all strains as members of the LAB group; Enterococcus was the genus most frequently isolated, accounting for 21 of the total LAB isolates. The seven remaining strains were identified as members of the genera Pediococcus and Lactobacillus, with 4 and 3 isolates, respectively. E. faecium was the most common species identified in this study with 11 isolates, followed by E. hirae and P. acidilactici with 5 and 4 LAB isolates, respectively (**Table 5** and **Figure 2**).

In this study, between one and six putative bacteriocins were identified in 24 of the top LAB strains analyzed by genome comparison against the BActeriocin GEnome mining tool database. Putative bacteriocins identified included enterolysin A, enterocin, lactacin F, sactipeptides, pediocin, closticin, lasso peptide, lanthipeptide, salivaricin, colicin V, and carnocinCP52. Enterolysin A was the most common, being found in 15 of the isolates, followed by enterocins and lactacin F, found in 9 and 7 of the LAB strains, respectively (**Table 6**). No virulence factors were identified for 5 of the isolates, and virulence factors associated with cell adhesion were identified for 13 of the LAB isolates (**Table 5**).

TABLE 1 | Antimicrobial activity of novel lactic acid bacteria strains against Listeria monocytogenes, Salmonella, and Escherichia coli O157:H7 sorted by rank of overall antagonistic activity across all pathogens.

<sup>a</sup>"CFF" indicates cell-free filtrate prepared by passing overnight lactic acid bacteria cultures through a 0.45 µm filter.

<sup>b</sup>Overall score summed from zones of inhibition observed from overnight lactic acid bacteria culture and CFF across all three pathogens (i.e., Salmonella, E. coli O157:H7 and L. monocytogenes).

<sup>a</sup>Average (Av.) lactic acid bacterial cells/Caco-2 cell values were calculated from two biologically independent experiments each with technical replicates. <sup>b</sup>Average (Av.) cytotoxicity values were calculated from two biologically independent experiments each with technical replicates.

### Molecular Characterization of Antimicrobial Resistance Genes

A total of 23 of the investigated strains carried from one to four acquired genes associated with antimicrobial resistance. The most common AMR-encoding gene identified among the LAB strain set was the msr(C) gene (n = 11), which encodes an ABC transporter associated with resistance to erythromycin, other macrolides, or streptogramin B antibiotics. The msr(C) gene was always found alone and was associated with the phenotypes showing resistance to fewer antimicrobials (LVX and LVX DAP profiles). A total of four strains exhibited MDR phenotypic profiles (DAP TET CLI SYN, and DAP TET CLI SYN GEN); three of these strains shared the same genotype: erm(B), aac(6′)-Iid, ant(6)-Ia and tet(M), encoding an rRNA adenine N-6-methyltransferase, an aminoglycoside 6′-N-acetyltransferase, an aminoglycoside nucleotidyltransferase and a tetracycline resistance protein, TetM, involved in antibiotic target modification, respectively. Additionally, six different plasmid incompatibility types (rep1, rep2, repUS15, rep6, rep9, and repUS1) were characterized to determine if AMR-encoding genes were carried on the chromosome or plasmid. Seventeen strains harbored from one to three plasmids, suggesting that the majority of AMR genes were likely carried on plasmids. Further investigation elucidated that only plasmids belonging to the incompatibility type repUS1 were carrying antimicrobial resistance genes (**Figure 3**). LAB strains L22, L24-A, and L25 carried such a plasmid harboring the erm(B) and ant(6)-Ia genes, involved in resistance to macrolides and aminoglycosides, respectively. BLASTn comparison showed that the repUS1 plasmids had 99% similarity out of 47, 48, and 68 of the total plasmid sequence registered under the accession number KJ645709, for the L22, L24-A, and L25 LAB strains, respectively.

FIGURE 1 | Attachment, cytotoxicity, virulence gene count, and bacteriocins. Labeled points of strain-specific values for cytotoxicity and attachment are mapped to the x- and y-axis by bacterial cells/Caco-2 cell and percent difference, respectively. Gray lines represent the confidence intervals for the mean estimate on either x- or y-axis. Points are colored and connected via lines of the same color based on each strain genus and species. Each labeled point is also represented by both a diamond and circle; the size of the diamond represents the number of bacteriocin genes, and the size of the circle represents the number of virulence-associated genes identified in the sequencing data. Black lines differentiate the adhesive potential as defined by Candela et al. (2005).

TABLE 3 | Summary of antimicrobial susceptibility pattern observed for novel lactic acid bacteria (LAB) strains.


<sup>a</sup>Minimum Inhibitory Concentration (MIC) breakpoints were interpreted based on the Clinical and Laboratory Standards Institute (Clinical and Laboratory Standards Institute, 2017), where isolates were classified into categories of (i) susceptible or (ii) non-susceptible, which includes both resistant and intermediately resistant.


<sup>a</sup>DAP, daptomycin; LVX, levofloxacin; TET, tetracycline; VAN, vancomycin; STX, trimethoprim-sulphamethoxazole; CLI, clindamycin; SYN, synercid; RIF, rifampicin; GEN, gentamicin.

#### DISCUSSION

The safety of potential probiotic strains must be assessed prior to their use as feed additives. This assessment should include antimicrobial susceptibility to antibiotics of human and veterinary importance, attachment and cytotoxicity to intestinal epithelial cells, and determination of the presence of virulence factors and transmissible AMR genes. The aim of this study was to use a systematic approach to identify safe and effective novel LAB probiotic strains able to collectively control three important foodborne pathogens, including Salmonella, E. coli O157:H7 and L. monocytogenes, along the food continuum through a combined phenotypic and genotypic characterization strategy.

Antagonistic activity was determined based on overall reductions or inhibitions across three important foodborne pathogens (i.e., Salmonella, E. coli O157:H7 and L. monocytogenes). Results showed that all candidate LAB strains evaluated were antagonistic to at least one of the pathogens tested. In agreement with other studies (Das et al., 2016; Abushelaibi et al., 2017), the antimicrobial effect was shown to be species- and strain-dependent, with E. faecium L20-B as the top strain across pathogens and in both antagonistic assays; however, in agar well diffusion the strain was only effective against L. monocytogenes. Importantly, eight of our candidate LAB strains had an antagonistic effect against all the pathogens evaluated in this study. L. salivarius L28 was the top-ranking LAB strain that was effective against all three pathogens in both the agar well diffusion and competitive exclusion broth culture assays. Reductions found in this study were higher than those observed by other authors (Osuntoki et al., 2008; Angmo et al., 2016; Abushelaibi et al., 2017). Abushelaibi et al. (2017) observed zones of inhibition ranging from 0.1 to 2 mm for Salmonella and E. coli O157:H7 and >2.1 mm of inhibition for L. monocytogenes, compared to the largest reductions of 12 mm for E. coli O157:H7, 11.5 mm for Salmonella, and 24 mm for L. monocytogenes observed in this study. Similar to our results, their LAB isolates exhibited greater activity against L. monocytogenes compared to the other pathogens analyzed (Abushelaibi et al., 2017). Angmo et al. (2016), who characterized LAB probiotic strains isolated from fermented foods, also observed larger antagonistic activity of LAB isolates against Gram-positive bacteria including L. monocytogenes and weaker to medium activity against Gram-negative bacteria such as E. coli.

In this study, no zones of inhibition were observed for Salmonella and E. coli O157:H7 when using LAB CFF. L. monocytogenes was inhibited by overnight cultures of candidate LAB probiotic strains and by their CFF. E. faecium L20-B caused the highest inhibition. E. faecium strains are known for their anti-listerial activity; this antagonistic action is associated with the production of bacteriocins, such as enterocins (Corr et al., 2007; Kasra-Kermanshahi and Mobarak-Qamsari, 2015).

In the competitive exclusion assay, greater inhibition was also observed for L. monocytogenes. Notably, Lactobacillus salivarius L28 inhibited L. monocytogenes completely after 24 h of co-culture. Inhibition was highly dependent on the probiotic strain. For Salmonella and E. coli O157:H7, higher antagonistic effects were observed after 6 h of co-culture. As time increased, the antagonistic effect decreased, possibly because the pathogens adapted to the decrease in pH and/or the presence of antimicrobial compounds in the media. For L. monocytogenes the effect was different: as time increased, improved reductions were observed. It is important to highlight that all LAB strains tested remained at high concentrations (10<sup>9</sup> CFU/ml) in the co-culture assay throughout the 24 h incubation period.

TABLE 6 | Description and summary of bacteriocins produced by novel lactic acid bacteria (LAB) strains.

Enterococcus was the most prevalent genus identified among our strain collection. Enterococcus strains from bovine and produce sources ranked among the top ten LAB strains, and were antagonistic mainly against L. monocytogenes. Enterococcus strains are ubiquitous in nature and are known for their bacteriocinogenic effect (Marekova et al., 2003; Sabia et al., 2004). Bacteriocins are ribosomally synthesized antimicrobial peptides able to inhibit closely related or non-related bacterial strains (Yang et al., 2014; Alvarez-Sieiro et al., 2016). Their bactericidal mechanism primarily involves binding to receptors located on the bacterial surface and causing cell membrane permeabilization (Yang et al., 2014). The anti-listerial activity of our LAB strains could have been associated with the production of bacteriocins. In fact, 24 of the strains had between one and six putative bacteriocins. Class II (unmodified peptides with 30–60 amino acids and a size < 10 kDa) and class III (heat-unstable large proteins with a molecular weight > 30 kDa) bacteriocins were the most commonly identified in E. faecium strains, with enterocins and enterolysins being the most predominant. Bacteriocin-producing bacteria target the cytoplasmic membrane; in Gram-negative bacteria, the presence of a lipopolysaccharide (LPS) layer confers protection against the bactericidal effects of bacteriocins (Alvarez-Sieiro et al., 2016). However, Gram-negative bacteria with outer membranes that have been compromised due to sub-lethal stresses (i.e., heating, freezing) could be killed by membrane permeabilization (Bromberg et al., 2004). In Gram-positive bacteria, the lack of this protective layer makes them more sensitive to these antimicrobial compounds. Salmonella and E. coli were reduced by our candidate LAB probiotic panel by 0.1–0.9 and 0.02–0.8 log<sub>10</sub> CFU/ml, respectively. It has been suggested that the antagonistic effect of some probiotic strains, including Lactobacillus, against Salmonella and E. coli O157:H7 is primarily due to the production of organic acids, mainly lactic and acetic acids (De Keersmaecker et al., 2006; Makras et al., 2006). Organic acids act by permeabilizing the outer membrane, allowing antimicrobial compounds to pass through and exert an antagonistic effect (Alakomi et al., 2000).
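To put the log<sub>10</sub> reductions quoted above in perspective, the following minimal sketch converts a log<sub>10</sub> reduction into the fraction of cells removed. It assumes the common convention that a reduction is the difference between the log<sub>10</sub> counts of a pathogen-only control and of the co-culture; the function names and the example counts are illustrative and are not taken from the study.

```python
import math


def log10_reduction(control_cfu_per_ml: float, treated_cfu_per_ml: float) -> float:
    """Difference between log10 counts of a control and a treated sample.

    Assumed convention: positive values mean fewer pathogen cells in the
    treated (co-cultured) sample than in the control.
    """
    if control_cfu_per_ml <= 0 or treated_cfu_per_ml <= 0:
        raise ValueError("CFU/ml counts must be positive")
    return math.log10(control_cfu_per_ml) - math.log10(treated_cfu_per_ml)


def percent_reduction(log_reduction: float) -> float:
    """Convert a log10 reduction into the percentage of cells removed."""
    return (1.0 - 10.0 ** (-log_reduction)) * 100.0


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    reduction = log10_reduction(1e7, 1e6)
    print(f"{reduction:.2f} log10 reduction = {percent_reduction(reduction):.1f}% fewer cells")
```

Under this convention, a 0.9 log<sub>10</sub> reduction corresponds to roughly 87% fewer cells, whereas a 0.1 log<sub>10</sub> reduction corresponds to roughly 21%.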

The use of probiotic strains is a natural alternative for reducing the use of antibiotics for growth promotion in animal agriculture and in human medicine and, possibly, the rapid emergence of AMR pathogens (Imperial and Ibana, 2016). For this reason, it is imperative to evaluate the AMR profile of candidate probiotic strains, where special consideration should be taken to separate intrinsic (i.e., from point mutations) from acquired resistance (i.e., from transfer of AMR genes and plasmids) (Sanders et al., 2010; Imperial and Ibana, 2016). A great variability in antimicrobial susceptibility was observed among the strains tested here. All LAB strains were susceptible to the penicillin (ampicillin and penicillin) and oxazolidinone (linezolid) antimicrobial classes. These results correlate with the fact that no genes for resistance against these antimicrobials were identified here. Our observations are in agreement with those of Abushelaibi et al. (2017), with most of their probiotic isolates being susceptible to ampicillin and penicillin.


FIGURE 3 | Comparison of antimicrobial resistance (AMR) phenotype and AMR genes identified and description of AMR gene. The most common AMR-encoding gene identified among the LAB strain set was the msr(C) gene, encoding an ABC transporter associated with resistance to erythromycin, other macrolides, or streptogramin B antibiotics (MS phenotype). The msr(C) gene was associated with the phenotypes showing resistance to fewer antimicrobials (LVX and LVX DAP profiles). <sup>a</sup>AMR phenotypes were determined using a Sensititre Gram-Positive MIC plate and using CLSI breakpoints. LVX, levofloxacin; DAP, daptomycin; TET, tetracycline; VAN, vancomycin; SXT, trimethoprim/sulfamethoxazole; CLI, clindamycin; SYN, synercid; RIF, rifampicin; GEN, gentamicin. Acquired antimicrobial resistance genes were determined using ResFinder v.3.1, and the plasmid incompatibility types were determined using the PlasmidFinder v.2.0 pipeline. Percentages indicate the similarity of the genes defining the incompatibility type with the reference sequences of the pipeline database. AMR genes indicated in bold were detected on the same contig as the incompatibility type genes and are therefore most likely located on the plasmid.

It is well known that LAB strains have intrinsic resistance to vancomycin and beta-lactams (Tejero-Sariñena et al., 2012; Varankovich et al., 2015). Resistance to levofloxacin was the most common AMR phenotype observed here. No acquired genes associated with levofloxacin resistance were identified. Levofloxacin is a second-generation fluoroquinolone, where resistance is related to a point mutation(s) in one or more genes encoding the type II topoisomerases (gyrA, gyrB, parC, and parE) present in a chromosomal region known as the quinolone resistance-determining region (Redgrave et al., 2014). In Enterococcus species, resistance to levofloxacin has been associated with the presence of the emeA gene (Jia et al., 2014), which was not identified in the Enterococcus isolates studied here. Resistance to lincosamides (clindamycin) was the second most common AMR phenotype, with most of the resistant isolates (eight strains) belonging to the Enterococcus genus. Resistance to lincosamides is conferred by a species-specific chromosomal gene, lsa(A), which encodes an ABC transporter (Singh et al., 2002). Bacteria harboring the lsa(A) gene express the LS<sup>A</sup> AMR phenotype with cross-resistance to lincosamides and streptogramins (Cattoir and Leclercq, 2017), as observed in this study.

A high percentage of LAB strains demonstrated intermediate resistance to erythromycin, which correlated with the presence of the msr(C) gene, a species-specific chromosomal gene of E. faecium that encodes an ABC transporter and efflux pump (Cattoir and Leclercq, 2017). The msr(C) gene also confers the MS antimicrobial phenotype (resistance to erythromycin and type B streptogramins). Its inactivation has resulted in increased susceptibility of E. faecium to MSB antimicrobials (Reynolds and Cove, 2005). Nine of the LAB strains exhibited multi-drug resistance; to further probe the presence of transferable AMR-associated genes, analysis of all candidate LAB strain genomes was performed. Only three of the candidate LAB strains carried resistance-associated genes in a plasmidic region. The presence of AMR-associated genes in plasmidic regions or mobile elements is of concern due to their potential to be horizontally transferred (Broaders et al., 2013; Imperial and Ibana, 2016). These plasmid-containing strains were not among the top strains that showed antagonistic activity against foodborne pathogens and should not be considered as potential probiotic strains. These results support the safety of our top twenty selected LAB strains.

To be effective, bacteria in probiotic preparations should be able to adhere to the intestinal epithelium without causing cytotoxicity, to ensure longer permanence in the GIT (Piątek et al., 2012; García-Hernández et al., 2016). The ability of probiotic strains to adhere to epithelial cells improves their antagonistic action by allowing them to outcompete pathogens for receptors on epithelial cells (Corr et al., 2009). All LAB strains evaluated were able to attach to Caco-2 cells, with adhesion efficiencies varying among strains. Adhesion to epithelial cells is dose-, matrix-, and strain-dependent (Jensen et al., 2012). Candela et al. (2005) classified microorganisms based on their adhesive properties into three categories: (i) non-adhesive strains, when fewer than 5 cells adhere to one Caco-2 cell, (ii) adhesive strains, when 5–40 cells adhere to one Caco-2 cell, and (iii) highly adhesive strains, when the level of adhesion exceeds 40 cells per epithelial cell. Based on this classification, 5 of our isolates were highly adhesive, 21 were adhesive, and 2 were non-adhesive. One important difference between the Candela et al. (2005) study and the current study was the amount of time allowed for interaction between bacterial and Caco-2 cells. We chose 30 min of incubation as this is sufficient for initial attachment; the 18 h incubation time used by Candela et al. (2005) could have allowed for subsequent bacterial growth. Attachment efficiencies of our LAB strains were higher than those observed by Jankowska et al. (2008); their LAB isolates, including Lactobacillus and Lactococcus strains, were able to adhere to Caco-2 cells in a range between 0.5 and 5 bacterial cells per Caco-2 cell after 4 h of incubation.
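The three adhesion categories of Candela et al. (2005) described above amount to a simple threshold rule. The sketch below is one way to express it, assuming that the boundary values of 5 and 40 cells fall in the "adhesive" category; the function name and the example values are hypothetical and do not correspond to measurements from this study.

```python
def classify_adhesion(bacteria_per_caco2_cell: float) -> str:
    """Assign an adhesion category from the mean number of bacterial
    cells attached per Caco-2 cell (thresholds after Candela et al., 2005)."""
    if bacteria_per_caco2_cell < 0:
        raise ValueError("cell counts cannot be negative")
    if bacteria_per_caco2_cell < 5:
        return "non-adhesive"
    if bacteria_per_caco2_cell <= 40:
        return "adhesive"
    return "highly adhesive"


if __name__ == "__main__":
    # Hypothetical attachment values, for illustration only.
    for value in (2.0, 18.5, 63.0):
        print(f"{value:5.1f} cells/Caco-2 cell -> {classify_adhesion(value)}")
```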

Caco-2 cytotoxicity based on the amount of LDH released into the medium was low, ranging from −4.69 to 8.42, after 24 h of inoculation with LAB strains. These values were lower than those observed by Awaisheh and Ibrahim (2009), who analyzed two probiotic isolates, L. acidophilus LA102 and L. casei LC232. Our LAB strains might be used as feed additives without causing cytotoxicity of epithelial cells, but this is clearly something that needs further investigation.

Four groups of virulence-associated genes were detected: two groups in E. faecium and two in E. faecalis. The first group of E. faecium strains carried a single virulence-associated gene, efaAfm, encoding EfaA, an important component of cell adhesion and biofilm formation homologous to PsaA in S. pneumoniae (Lowe et al., 1995). The second group carried two virulence-associated genes, acm and efaAfm. These genes are both involved in cell adhesion; acm has also been highly associated with clinical isolates from humans (Nallapareddy et al., 2008). The third and fourth groups of virulence-associated genes are contained in three strains of E. faecalis; these groups comprised seven unique virulence-associated genes including a hyaluronidase gene, hylA. Hyaluronidases are normally associated with cell lysis and degradation (Kayaoglu and Ørstavik, 2004). Of interest, the three hylA-containing strains, L9, L14-A, and L10, did not show higher levels of cytotoxicity than other LAB strains tested here (Bonferroni corrected p-value < 0.05). Additional genes found in these two groups included sex pheromone-associated genes (camE and cOB1), biofilm and cell wall adhesion genes (ebpA, efaA, and ace), and genes involved in macrophage persistence (elrA) (Nakayama et al., 1995; Brinster et al., 2007; Fisher and Phillips, 2009; Woods et al., 2017).

Overall, the results demonstrate the ability of our selected LAB probiotic strains to inhibit L. monocytogenes, Salmonella, and E. coli O157:H7. The strains exhibit important features that could enhance their antagonistic action (no AMR-encoding genes in mobile elements, production of bacteriocins, ability to adhere to epithelial cells, low cytotoxicity percentages). L. salivarius L28 was the top-ranking strain that was effective against all three pathogens in both the agar well diffusion and competitive exclusion broth assays. L. salivarius L28 not only demonstrated adhesion to and low cytotoxicity against Caco-2 cells but also carried a low number of virulence and AMR genes, making this strain a particularly good candidate for further evaluation to control foodborne pathogens in pre- and post-harvest applications.

#### AUTHOR CONTRIBUTIONS

DA, PC, JF, and MB performed all experiments in the study. DA, PC, and KK contributed to the bioinformatics analyses. GL, MMB, and KN conceived the study. DA, PC, and KN contributed to writing and editing the final version of the manuscript.

### FUNDING

This work was supported by the International Center for Food Industry Excellence at Texas Tech University.

### ACKNOWLEDGMENTS

The authors would like to thank Dr. Lily Peterson and David Campos for initial isolation and characterization of the strain set described in this study at Texas Tech University.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.01108/full#supplementary-material




**Conflict of Interest Statement:** MMB, GL, and KN have ownership in NexGen Innovations, LLC. NexGen Innovations, LLC, has licensed some of the strains described in this study through Texas Tech University for commercial development and application.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ayala, Cook, Franco, Bugarel, Kottapalli, Loneragan, Brashears and Nightingale. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Finding Functional Differences Between Species in a Microbial Community: Case Studies in Wine Fermentation and Kefir Culture

Chrats Melkonian<sup>1†</sup>, Willi Gottstein<sup>1†‡</sup>, Sonja Blasche<sup>2</sup>, Yongkyu Kim<sup>2</sup>, Martin Abel-Kistrup<sup>3</sup>, Hentie Swiegers<sup>3‡</sup>, Sofie Saerens<sup>3</sup>, Nathalia Edwards<sup>3</sup>, Kiran R. Patil<sup>2</sup>, Bas Teusink<sup>1</sup> and Douwe Molenaar<sup>1\*</sup>

#### *Edited by:*

George Tsiamis, University of Patras, Greece

#### *Reviewed by:*

Gilberto Pereira, Federal University of Paraná, Brazil Chrysoula C. Tassou, Hellenic Agricultural Organization-ELGO, Greece

> *\*Correspondence:* Douwe Molenaar d.molenaar@vu.nl

†These authors have contributed equally to this work

#### *‡Present Address:*

Willi Gottstein, DSM Delft BV, Delft, Netherlands Hentie Swiegers, Carlsberg Breweries A/S, Copenhagen, Denmark

#### *Specialty section:*

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

*Received:* 17 January 2019 *Accepted:* 31 May 2019 *Published:* 25 June 2019

#### *Citation:*

Melkonian C, Gottstein W, Blasche S, Kim Y, Abel-Kistrup M, Swiegers H, Saerens S, Edwards N, Patil KR, Teusink B and Molenaar D (2019) Finding Functional Differences Between Species in a Microbial Community: Case Studies in Wine Fermentation and Kefir Culture. Front. Microbiol. 10:1347. doi: 10.3389/fmicb.2019.01347

<sup>1</sup> Systems Bioinformatics, VU University Amsterdam, Amsterdam, Netherlands, <sup>2</sup> European Molecular Biology Laboratory, Heidelberg, Germany, <sup>3</sup> Christian Hansen A/S, Hørsholm, Denmark

Microbial life usually takes place in a community where individuals interact through competition for nutrients, cross-feeding, and inhibition by end-products, but also through their spatial distribution. Lactic acid bacteria are prominent members of microbial communities responsible for food fermentations. Their niche in a community depends on their own properties as well as those of the other species. Here, we apply a computational approach, which uses only genomic and metagenomic information and functional annotation of genes, to find properties that distinguish a species from others in the community, as well as to follow individual species in a community. We analyzed isolated and sequenced strains from a kefir community, and metagenomes from wine fermentations. We demonstrate how the distinguishing properties of an organism lead to experimentally testable hypotheses concerning the niche and the interactions with other species. We observe, for example, that L. kefiranofaciens, a dominant organism in kefir, stands out among the Lactobacilli because it potentially has more amino acid auxotrophies. Using metagenomic analysis of industrial wine fermentations, we investigate the role of an inoculated L. plantarum in malolactic fermentation. We observed that L. plantarum thrives better in white than in red wine fermentations and has the largest number of phosphotransferase systems among the bacteria observed in the wine communities. Also, L. plantarum, together with the Pantoea, Erwinia, Asaia, Gluconobacter, and Komagataeibacter genera, had the highest number of genes involved in biosynthesis of amino acids.

Keywords: microbial communities, lactic acid bacteria, genomes, metagenomics, computational biology, wine, kefir

### 1. INTRODUCTION

Lactic acid bacteria (LAB) are a group of microorganisms widely used for production of fermented food. They play a key role as natural fermentors or are used as starting cultures for a large variety of foods (Teusink and Molenaar, 2017), such as dairy products, kefir and yogurt (Prado et al., 2015). LAB are also used in alcoholic beverage production with a prominent role in winemaking, due to their capacity to perform malolactic fermentation (MLF) (Lonvaud-Funel, 1999, 2002). In none of these environments do they live in isolation but rather in communities of microscopic and macroscopic scale, for example on the skin and in biofilms. Therefore, LAB should be studied not only in isolation but also as a part of communities. Consequently, there is a strong desire to understand their roles in microbial communities, for example in the stability of communities. A deep understanding of these roles would enable alterations or even design of communities that serve a certain purpose. Results in this direction have already been achieved for small consortia, usually consisting of two species (Song et al., 2014; Biggs et al., 2015; Zomorrodi and Segrè, 2016). However, interactions in natural communities consisting of dozens to thousands of species are hard to analyze.

For complex communities, dynamic abundance data has been used to infer interactions between species within a community (Faust and Raes, 2012). While this can indeed lead to testable predictions, these results can also be very hard to interpret as they do not provide any detail of their underlying mechanism. For example, a positive correlation between two species can be caused by niche-overlap, cross-feeding or because these two species are both affected by a third one (Faust and Raes, 2012). To distinguish these options, the metabolic potential of the individual species should be taken into account as many of the interactions will probably take place at the level of exchange of metabolic products. These analyses currently typically require large-scale metabolic models (Freilich et al., 2010, 2011; Harcombe et al., 2014; Zomorrodi and Segrè, 2016). The reconstruction of such models is a time-consuming process as it usually requires manual curation, experimental validation, gap-filling, and an organism-specific biomass composition. As typically only a small percentage of species within a community can be cultured individually, the generation of high quality models for all members of a community is close to infeasible.

Attempts to do so (Magnusdottir et al., 2017) suffer from a lack of detailed validation of the predictions. Therefore, approaches that rely on genome-scale stoichiometric models are currently mostly applicable to small well-described (synthetic) communities (Mahadevan and Henson, 2012; Harcombe et al., 2014; Song et al., 2014; Biggs et al., 2015; Tan et al., 2015; Zomorrodi and Segrè, 2016) but even there one encounters many technical and biological challenges (Gottstein et al., 2016).

In this paper, we use a purely data-driven approach, with genomic information as primary input, that allows the creation of hypotheses about metabolic and other physiological properties of species in communities without the need to reconstruct detailed genome-scale metabolic models. The starting point of this analysis is gene annotation; we use the KEGG Orthology (KO) Database (Kanehisa and Goto, 2000; Kanehisa et al., 2012), whereby each KO represents a group of gene orthologs from different organisms associated with a molecular function. As KOs alone can be hard to interpret, we also map these KOs onto KEGG pathways. This higher-level mapping reveals discriminating features between organisms and leads to testable hypotheses about their metabolic and physiological characteristics. Although we used the KEGG annotation tool and database, alternative resources such as Gene Ontology (GO), SEED and MetaCyc (Ashburner et al., 2000; Overbeek et al., 2005; Caspi et al., 2016) could be used and yield comparable results (Mitra et al., 2011; Altman et al., 2013).
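The idea of mapping KOs onto pathways to find discriminating features can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the pipeline used in this paper: each organism is reduced to a set of KO identifiers, each pathway to a set of member KOs, and a pathway is called "distinguishing" when the target organism has strictly more of its KOs than any other organism. All identifiers (`KO_1`, `pathway_A`, `org1`, ...) and function names are hypothetical placeholders.

```python
from typing import Dict, Set


def pathway_profile(organism_kos: Set[str],
                    pathway_to_kos: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count, for each pathway, how many of the organism's KOs map onto it."""
    return {pathway: len(organism_kos & members)
            for pathway, members in pathway_to_kos.items()}


def distinguishing_pathways(target: str,
                            genomes: Dict[str, Set[str]],
                            pathway_to_kos: Dict[str, Set[str]]) -> Dict[str, int]:
    """Return pathways in which `target` has strictly more mapped KOs
    than every other organism, together with the target's KO count."""
    profiles = {name: pathway_profile(kos, pathway_to_kos)
                for name, kos in genomes.items()}
    result = {}
    for pathway in pathway_to_kos:
        others = [profiles[name][pathway] for name in genomes if name != target]
        if others and profiles[target][pathway] > max(others):
            result[pathway] = profiles[target][pathway]
    return result


if __name__ == "__main__":
    # Toy annotation tables with placeholder identifiers.
    genomes = {
        "org1": {"KO_1", "KO_2", "KO_3"},
        "org2": {"KO_1"},
        "org3": {"KO_3", "KO_4"},
    }
    pathways = {
        "pathway_A": {"KO_1", "KO_2"},
        "pathway_B": {"KO_3", "KO_4"},
    }
    print(distinguishing_pathways("org1", genomes, pathways))
```

A count-based comparison like this ignores pathway size and annotation quality, so in practice the counts would at least need to be normalized or tested statistically before drawing conclusions.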

We apply this computational pipeline to two different case studies. The first investigates kefir, a fermented milk product made with kefir grains, which consist of a complex microbial community embedded in a polysaccharide matrix. These communities consist of dozens of species (Walsh et al., 2016) whose metabolic capacities are largely elusive. Studies of the kefir community using metagenomic barcoding already showed that Lactobacillus was the most abundant genus, specifically the species Lactobacillus kefiranofaciens, Lactobacillus buchneri and Lactobacillus helveticus (Nalbantoglu et al., 2014). We expect that knowledge of their metabolism will provide more insight into their interactions in kefir and, therefore, we investigated genomes of 30 organisms isolated from kefir for their potential metabolism.

The second application of the pipeline is in understanding the role of L. plantarum MW-1 in winemaking, by a functional comparison of microbial communities in three varieties of wine. Microbial activities are crucial in the formation of wine flavor and aroma. A prerequisite for improving winemaking is to understand the dynamics of the microbial communities in wine and the interactions that take place during the fermentation (Tempère et al., 2018). The alcoholic fermentation (AF) at the initial stage of winemaking is performed mainly by Saccharomyces cerevisiae. Subsequently, Oenococcus oeni, owing to its overall resistance to the harsh conditions of wine fermentation, such as high alcohol concentrations, is the best candidate to start an MLF (Ribéreau-Gayon et al., 2006a,b). Various studies indicate the possibility of using alternative MLF starters. L. plantarum strains have received interest for this role (Hernandez et al., 2007; Testa et al., 2014), due to their characteristic fermentation profile. To investigate the influence of L. plantarum MW-1 on the development of the microbial communities, we followed its inoculation in three different wine varieties (Bobal, Tempranillo, and Airen) from La Mancha, Spain, 2013 (one inoculated and two control fermentations per variety; **Figures S5, S6**). The point of inoculation was chosen to be at the start, to give precedence of MLF over AF. In this way, a reduction of total fermentation time is obtainable, and inhibition of L. plantarum by high alcohol levels is avoided. We used metagenome shotgun time-series from these fermentations to study the community. Although next-generation sequencing (NGS) has recently been applied in food research and in particular in wine fermentation (Kioroglou et al., 2018; Stefanini and Cavalieri, 2018), the use of metagenomic shotgun sequencing, which allows direct identification and comparison of the functional potential of a microbial community and its members, is not yet fully exploited (Morgan et al., 2017a; Sternes et al., 2017; Zepeda-Mendoza et al., 2018).

### 2. MATERIALS AND METHODS

### 2.1. DNA Extraction and Genome Sequencing of Kefir Isolates

Two milliliters of the culture were pelleted at 15,000 rpm in a table centrifuge. The pellet was suspended in 600 µl TES buffer (25 mM Tris; 10 mM EDTA; 50 mM sucrose) containing 20 mg/ml lysozyme (Sigma-Aldrich, cat# 62971) and incubated for 30 min at 37°C. The samples were then crushed with 0.3 g glass beads (Sigma-Aldrich, cat# G1277, 212–300 µm) at 4 m/s five times for 20 s using the FastPrep-24 instrument (MP Biomedicals). Then 150 µl of 20% SDS was added, and after 5 min incubation at room temperature the tubes were centrifuged at maximum speed for 2 min. The supernatant was digested with 10 µl proteinase K (20 mg/ml) for 30 min at 37°C and proteins were precipitated with 200 µl potassium acetate (5 M) for 15 min on ice. The samples were then centrifuged for 15 min at 4°C and the supernatant was subjected to phenol/chloroform extraction. DNA was precipitated by adding two volumes of ice-cold isopropanol and incubating for 20 min at −20°C, followed by washing with 70% ethanol at 4°C. DNA quality was checked on agarose gel.

Kefir species were identified by Sanger sequencing of the 16S/ITS (internal transcribed spacer) region, using the primers S-D-Bact-0515-a-S-16 (GTGCCAGCMGCNGCGG) and S-\*- Univ-1392-a-A-15 (ACGGGCGGTGTGTRC) (Klindworth et al., 2012). Unique isolates were sequenced using the Illumina HiSeq 2000 platform at EMBL genomics core facility (Heidelberg, Germany) with 100 bp paried-end reads. The A5-miseq pipeline was used for quality-based trimming and filtering, error correction and de novo assembly (Coil et al., 2015). The assembled genome was annotated using Prokka version 1.11 (Seemann, 2014).

### 2.2. Sampling and Sequencing of Wine Fermentations

Wine was sampled in the autumn of 2013 at Bodegas Purisima Concepcion (La Mancha, Spain) before the fermentation (day 0), during fermentation (days 1, 2, 3, 4, 7, and 14) and at the end of the fermentation (day 21). Samples of the white wine were taken from the top of the concrete tank by rapidly lowering a single-use 250 mL baby bottle to 1 m depth using a rope and slowly bringing it back to the top. The wine was decanted into a 50 mL Falcon tube and put directly in a –50 °C freezer. To avoid the grape skin cap, the red wine was sampled from the valve at the bottom of the tank, after flushing the valve to avoid collecting residual wine. This was also done after racking of the wine. Precautions were taken to minimize contamination: samples were handled wearing gloves, which were changed between replicates; aluminum foil was applied on the work station and changed between replicates; and filter pipettes were used throughout.

For DNA isolation, cells were pelleted from 50 mL of wine by centrifugation at 4,500 g for 10 min and subsequently washed three times with 10 mL of phosphate buffered saline (PBS) at 4 °C. The pellet was mixed with G2-DNA enhancer (Ampliqon, Odense, Denmark) in 2 mL tubes and incubated at room temperature for 5 min. Subsequently, 1 mL of lysis buffer (20 mM Tris-HCl pH 8.0, 2 mM EDTA and 40 mg/mL lysozyme) was added to the tube and incubated at 37 °C for 1 h. An additional 1 mL of CTAB/PVP lysis buffer (50) was added to the lysate and incubated at 65 °C for 1 h. DNA was purified from 1 mL of lysate with an equal volume of phenol-chloroform-isoamyl alcohol mixture (49.5:49.5:1), and the upper aqueous layer was further purified with a MinElute PCR Purification kit and the QIAvac 24 Plus (Qiagen, Hilden, Germany), according to the manufacturer's instructions, and finally eluted in 100 µL DNase-free water.

Prior to library building, genomic DNA was fragmented to an average length of 400 bp using the Bioruptor XL (Diagenode, Inc.), with a profile of 20 cycles of 15 s of sonication and 90 s of rest. Sheared DNA was converted to Illumina-compatible libraries using the NEBNext library kit E6070L (New England Biolabs) and the blunt-ended library adapters described by Meyer and Kircher (2010). The libraries were amplified in 25-µL reactions, with each reaction containing 5 µL of template DNA, 2.5 U AccuPrime Pfx Supermix (Invitrogen, Carlsbad, CA), 1X AccuPrime Pfx Supermix, 0.2 µM IS4 forward primer and 0.2 µM reverse primer with a sample-specific 6 bp index. The PCR conditions were 2 min at 95 °C to denature the DNA and activate the polymerase, 11 cycles of 95 °C for 15 s, 60 °C annealing for 30 s, and 68 °C extension for 40 s, and a final extension at 68 °C for 7 min.

The quality and quantity of the libraries were measured using the high sensitivity DNA analysis kit on the Bioanalyzer 2100 (Agilent technologies, Santa Clara, United States), and the libraries were pooled at equimolar concentration. Sequencing was performed on the Illumina HiSeq 2500 in PE100 mode and MiSeq in 250PE mode following the manufacturer's instructions.

### 2.3. General Workflow of Functional Computational Analysis

The general workflow that we follow is illustrated in **Figure 1**. The starting point of the analysis is gene annotation to determine orthologous genes (Gabaldón and Koonin, 2013), for which we use BlastKoala and GhostKoala (Kanehisa et al., 2016) through web services provided by KEGG (Kanehisa et al., 2013). These web services map genes to KEGG Orthologs (KO's), which represent groups of orthologous genes that are linked to a molecular-level function. Based on their KO content, the organisms and samples can be clustered. This process yields several groups with distinct characteristics that are determined using diverse data mining techniques and that mainly, but not exclusively, concern the metabolic potential. Finally, these characteristics enable the formulation of specific hypotheses about the physiological properties of species and of a community as a whole in individual samples. Other methods for the analysis of genome information on the functional level exist, such as MG-RAST (Meyer et al., 2008) and Megan (Quince et al., 2017). HUMAnN (Abubucker et al., 2012) was the first to incorporate microbial pathway abundances for metagenomic data. We chose to apply a custom pipeline in order to be generic and to allow high versatility throughout the analysis. Moreover, the use of the published BlastKOALA and GhostKOALA from KEGG (Kanehisa et al., 2016) provides an annotation that is up to date with the KEGG database. Alternatively, eggNOG (Huerta-Cepas et al., 2016) provides a strong framework for orthology annotation. All figures were produced using base R packages (R Core Team, 2016), ggplot2 (Wickham, 2016) and pheatmap (Raivo, 2019).

BlastKoala and GhostKoala in this study. We use the gene orthologs (KO's) to cluster species and samples based on their KO content. From the individual clusters we extract the characteristic features, which lead to educated predictions about the functional potential of individual species and of the community present in a sample. In the case of isolates, the predictions are confirmed using MetaDraft, which does not rely on KO's.

### 2.4. Metagenomic Sequence Preprocessing

Quality control and filtering were applied to all paired-read data using FastQC v0.11.4 (Andrews, 2010) before and after the application of Trim Galore v0.4.1 (Andrews, 2012) and Cutadapt v1.9.1 (Martin, 2011), tools for quality and adapter trimming. Subsequently, full-length small subunit (SSU) rRNA gene sequences were reconstructed using EMIRGE (Miller et al., 2011) with the SILVA 123 SSURef Nr99 database (Pruesse et al., 2007). Taxonomy was assigned to the resulting SSU sequences using the SINA Alignment Service (Pruesse et al., 2012). The SSU sequences were clustered into OTUs at 97% identity using UCLUST (Edgar, 2010), and the estimates of relative taxon abundances provided by the program were added and normalized accordingly. A chimera check was performed using UCHIME (Edgar, 2016). For both tools the QIIME interface was used (Caporaso et al., 2010). Afterwards, the OTUs were arranged into a BIOM table with a custom R script (R Core Team, 2016) to allow further analysis.

### 2.5. Sequence Binning

For each grape variety, the metagenome shotgun samples were merged to achieve deep coverage and were assembled with IDBA-UD, the Iterative De Bruijn graph de novo Assembler for short-read sequencing data with highly Uneven sequencing Depth (Peng et al., 2012). The resulting contigs were binned with MaxBin 2.0 (Wu et al., 2014, 2015), which clusters the sequences into draft genomes (bins) using tetranucleotide frequencies and sequence coverage. For differential coverage, all the metagenome samples belonging to fermentations of the same grape variety were used. Furthermore, bin taxonomy assignments were carried out following the multi-metagenome pipeline (Albertsen et al., 2013). MaxBin assesses the quality of the resulting bins, using the occurrence of essential genes to calculate a completeness score for each bin.
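The idea behind a completeness score based on essential genes can be illustrated with a minimal sketch. It assumes that completeness is simply the fraction of a predefined set of essential (marker) genes detected in a bin; the marker names, bin contents, and the function `completeness` are hypothetical, and MaxBin's actual scoring is not reproduced here.

```python
def completeness(bin_genes, essential_genes):
    """Fraction of the essential (marker) genes detected at least once in a bin."""
    essential = set(essential_genes)
    return len(essential & set(bin_genes)) / len(essential)


# Toy input (hypothetical marker genes and bins, for illustration only).
essential_genes = ["marker_1", "marker_2", "marker_3", "marker_4"]
bins = {
    "bin_001": ["marker_1", "marker_2", "marker_3", "marker_4", "other_gene"],
    "bin_002": ["marker_2", "marker_2", "other_gene"],
}

scores = {name: completeness(genes, essential_genes) for name, genes in bins.items()}
for name, score in sorted(scores.items()):
    print(f"{name}: {score:.2f}")
```

Duplicate hits to the same marker are counted once here; a real tool may also use duplicates as a signal of contamination.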

### 2.6. Gene Annotation

The gene annotation was carried out using BlastKoala and GhostKoala (Kanehisa et al., 2016), with the database "genus\_prokaryotes" for the kefir isolates and the databases "genus\_prokaryotes" or "genus\_prokaryotes plus family\_eukaryotes" for the metagenomic samples. While protein fasta files can be submitted directly to BlastKoala when isolates are examined, a re-assembly with IDBA-UD was necessary before submission of metagenome samples (Peng et al., 2012). To predict the open reading frames (ORFs), we used Prodigal (Hyatt et al., 2012) with parameterization for metagenome data. The predicted ORFs were then used as input for GhostKOALA, which provides the KO (KEGG Orthology) assignments. The effect of different sequencing depths on the number of predicted ORFs was also investigated (**Figure S7**).

### 2.7. Calculation of Feature Matrices and Clustering

Using the output from BlastKoala and GhostKoala, several feature matrices were calculated. In the case of microbial isolates, a feature matrix $K$ of dimensions $n \times m$ is constructed, where $n$ is the number of KO's and $m$ is the number of isolated species. The entries $k_{ij}$ are 1 if KO $i$ is present in species $j$ and 0 otherwise. An $r \times m$ feature matrix $P$ was also calculated, whose $r$ rows and $m$ columns correspond to KEGG pathway IDs and isolated species, respectively. The entries $p_{ij}$ represent the number of KO's present in pathway $i$ for species $j$. To account for the different pathway sizes, $p_{ij}$ is normalized with respect to the total number of KO's in pathway $i$.
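The two isolate-level matrices can be sketched as follows. This is a minimal illustration, assuming that the KO assignments of each species are available as Python sets and that a KO-to-pathway mapping has been obtained from KEGG; the species names, KO identifiers, and the helper names `presence_matrix` and `pathway_coverage` are hypothetical and are not taken from the pipeline itself.

```python
def presence_matrix(kos_by_species):
    """Return (ko_ids, species, K) with K[i][j] = 1 if KO i is present in species j."""
    species = sorted(kos_by_species)
    ko_ids = sorted(set().union(*kos_by_species.values()))
    K = [[1 if ko in kos_by_species[sp] else 0 for sp in species] for ko in ko_ids]
    return ko_ids, species, K


def pathway_coverage(kos_by_species, kos_by_pathway):
    """Return (pathways, species, P) with P[i][j] = fraction of the KOs of pathway i found in species j."""
    species = sorted(kos_by_species)
    pathways = sorted(kos_by_pathway)
    P = [
        [len(kos_by_pathway[pw] & kos_by_species[sp]) / len(kos_by_pathway[pw]) for sp in species]
        for pw in pathways
    ]
    return pathways, species, P


# Toy input (hypothetical KO sets, for illustration only).
kos_by_species = {
    "species_A": {"K00001", "K00002", "K00003"},
    "species_B": {"K00002"},
}
kos_by_pathway = {
    "map00010": {"K00001", "K00002"},
    "map02060": {"K00003", "K00004", "K00005", "K00006"},
}

ko_ids, species, K = presence_matrix(kos_by_species)
pathways, _, P = pathway_coverage(kos_by_species, kos_by_pathway)
```

Rows of both matrices are features (KOs or pathways) and columns are species, so either matrix can be passed column-wise to a similarity measure for clustering.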

For the analysis of the metagenomic data, an $n \times m$ feature matrix $G$ was constructed by calculating sequence abundance per KO and summing these per genus. The entries $g_{ij}$ equal the number of sequence reads of genus $i$ present in sample $j$. To account for variability in sequence reads per sample, the entries $g_{ij}$ were normalized with respect to the number of sequence reads per sample $g_j$ and multiplied by 1 million ($\frac{g_{ij}}{g_j} \times 10^6$). We also took into account the inoculation of Lactobacillus plantarum and further normalized all samples using the complement ($1 - g_{\mathrm{lactobacillus}}$) of the Lactobacillus genus abundance ($\frac{g_{ij}}{1 - g_{\mathrm{lactobacillus}}}$).
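The two normalization steps can be sketched as below. The sketch makes two assumptions that are not stated in the text: the number of reads per sample is taken as the column sum of the matrix, and the Lactobacillus abundance in the second step is expressed as a fraction of that column sum. The function names and the toy counts are hypothetical.

```python
def reads_per_million(G):
    """Scale each sample (column) of a genus-by-sample count matrix to reads per million."""
    n_rows, n_cols = len(G), len(G[0])
    totals = [sum(G[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[G[i][j] / totals[j] * 1e6 for j in range(n_cols)] for i in range(n_rows)]


def correct_for_inoculum(G, genera, inoculated_genus="Lactobacillus"):
    """Divide every entry by (1 - relative abundance of the inoculated genus) in its sample.

    Assumes the inoculated genus never makes up 100% of a sample.
    """
    n_rows, n_cols = len(G), len(G[0])
    row = genera.index(inoculated_genus)
    totals = [sum(G[i][j] for i in range(n_rows)) for j in range(n_cols)]
    fraction = [G[row][j] / totals[j] for j in range(n_cols)]
    return [[G[i][j] / (1.0 - fraction[j]) for j in range(n_cols)] for i in range(n_rows)]


# Toy counts (hypothetical): column 0 is an inoculated sample, column 1 a control.
genera = ["Lactobacillus", "Saccharomyces", "Oenococcus"]
G = [
    [500, 0],
    [400, 900],
    [100, 100],
]

rpm = reads_per_million(G)
corrected = correct_for_inoculum(rpm, genera)
```

In the toy inoculated sample, the non-Lactobacillus genera sum to one million after the correction, which puts them on the same scale as in the control sample.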

Another feature matrix $A$ was calculated in which entries $a_{ij}$ equal the number of sequence reads mapped to a KO-genus combination $i$ present in sample $j$. This matrix yields a very large number of features and, consequently, very detailed information.
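The difference between the genus-level matrix and the KO-genus matrix is only the key used for aggregation. A minimal sketch, assuming that the annotation output has been parsed into `(sample, genus, ko, reads)` records (a hypothetical intermediate format, as are the sample names and counts):

```python
from collections import defaultdict


def aggregate(records):
    """Sum read counts per genus and per (KO, genus) combination for every sample."""
    per_genus = defaultdict(lambda: defaultdict(int))
    per_ko_genus = defaultdict(lambda: defaultdict(int))
    for sample, genus, ko, reads in records:
        per_genus[genus][sample] += reads
        per_ko_genus[(ko, genus)][sample] += reads
    return per_genus, per_ko_genus


# Toy records (hypothetical values, for illustration only).
records = [
    ("day0", "Saccharomyces", "K00001", 10),
    ("day0", "Saccharomyces", "K00002", 5),
    ("day0", "Lactobacillus", "K00001", 3),
    ("day1", "Saccharomyces", "K00001", 7),
]

per_genus, per_ko_genus = aggregate(records)
print(len(per_genus), "genus features;", len(per_ko_genus), "KO-genus features")
```

Even in this toy case the KO-genus key produces more features than the genus key, which is the effect described above on a much larger scale.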

Finally, a feature matrix $PM$ is used to explore biological implications by mapping KO's to KEGG pathways. Here, the $m$ columns correspond to the samples taken during the fermentations, whereas the $n$ rows correspond to KEGG pathway IDs tagged with genera. The entries $pm_{ij}$ represent the number of KO's present in pathway $i$ in sample $j$. To account for the different pathway sizes, $pm_{ij}$ was normalized with respect to the number of KO's per pathway $i$.

Clustering analysis was performed using affinity propagation, which is a graph-based approach (Frey and Dueck, 2007; Bodenhofer et al., 2011). Pearson correlation was chosen as the similarity measure in most cases and Bray-Curtis similarity in a few cases. The general workflow to assess the most suitable number of clusters started with high exemplar preference values, which led to a very large number of clusters. Application of agglomerative clustering on the resulting affinity propagation clusters, using the R package apcluster (Bodenhofer et al., 2011), allowed inspection of the corresponding dendrogram (**Figure S1**). A cutoff was then decided manually, and affinity propagation was rerun repeatedly to achieve the desired number of clusters.
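The similarity matrices that serve as input to affinity propagation can be sketched as follows. The sketch only covers the two similarity measures named above; the affinity propagation and agglomerative clustering steps (done with apcluster in R) are not reproduced. The sample names and abundance vectors are hypothetical, and Bray-Curtis similarity is taken as one minus the Bray-Curtis dissimilarity, which is an assumption about the exact definition used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long numeric vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def bray_curtis_similarity(x, y):
    """One minus the Bray-Curtis dissimilarity of two non-negative abundance vectors."""
    return 1.0 - sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


def similarity_matrix(samples, measure):
    """Pairwise similarities between samples, returned with the sample order used."""
    names = sorted(samples)
    return names, [[measure(samples[a], samples[b]) for b in names] for a in names]


# Toy feature vectors per sample (hypothetical abundances).
samples = {
    "s1": [10, 0, 5],
    "s2": [20, 0, 10],
    "s3": [0, 8, 1],
}

names, S_pearson = similarity_matrix(samples, pearson)
_, S_bray = similarity_matrix(samples, bray_curtis_similarity)
```

Pearson correlation ignores differences in overall scale between two samples (s1 and s2 correlate perfectly), whereas Bray-Curtis similarity penalizes them, which may explain why one or the other suits a particular matrix better.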

### 2.8. Feature Selection

The R package Boruta (Kursa and Rudnicki, 2010) was used to obtain a reliable ranking of feature importance and to select only discriminative features for different classification tasks. This algorithm is a wrapper around Random Forest (Breiman, 2001) that performs randomization tests. Features with confidence of importance above 0.99 (the default value in Boruta) were treated as informative. The maximal number of importance source runs was increased to 2,000 and in some cases to 5,000. As input, one of the 75 × z feature matrices described above was used (where 75 corresponds to the number of samples), with z varying from around 1,016 to 228,256 features depending on the matrix. For example, when summing up all KO abundances per genus, the resulting matrix is 75 × 1,016. When using KO-genus combinations as features, on the other hand, the matrix extends to 75 × 228,256 after filtering. For supervised machine learning, a response vector Y is used in addition to the input feature matrix X. Here we used prior knowledge of the samples and constructed a response vector based on red or white wine varieties (two classes) or on the individual grape varieties (three classes).
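The randomization idea behind Boruta can be illustrated with a short sketch: each feature column is copied and permuted to create a "shadow" feature that is unrelated to the response by construction, and a real feature is only considered informative if its Random Forest importance repeatedly exceeds that of the best shadow feature. The sketch below covers only the construction of the shadow features and of the response vector; the Random Forest fitting and the statistical test are not reproduced, and the function names and toy values are hypothetical.

```python
import random


def add_shadow_features(X, seed=0):
    """Append to every row a permuted copy of each column (the shadow features)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    shadows = []
    for j in range(n_cols):
        column = [X[i][j] for i in range(n_rows)]
        rng.shuffle(column)  # breaks any association with the response
        shadows.append(column)
    return [list(X[i]) + [shadows[j][i] for j in range(n_cols)] for i in range(n_rows)]


def response_vector(sample_varieties, by="variety"):
    """Class labels per sample: the grape variety itself, or red versus white."""
    red = {"Bobal", "Tempranillo"}
    if by == "variety":
        return list(sample_varieties)
    return ["red" if variety in red else "white" for variety in sample_varieties]


# Toy samples-by-features matrix (hypothetical abundances) and sample annotation.
X = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 40.0],
]
varieties = ["Airen", "Bobal", "Tempranillo", "Airen"]

X_extended = add_shadow_features(X)
Y_two_class = response_vector(varieties, by="colour")
Y_three_class = response_vector(varieties, by="variety")
```

Because a shadow column contains exactly the same values as its original column, it has the same distribution but no systematic relation to the response, which makes it a suitable reference for importance scores.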

### 2.9. Computational Validation

#### 2.9.1. Validation of KEGG Functional Annotation With MetaDraft

As only around 50% of the genes can be mapped to KO's (see **Table S1**) when analyzing kefir isolates, it is unclear how much information is lost by this mapping compared to using all genetic information. We therefore created template models for selected KEGG pathways and then used MetaDraft (see section **S7**; **Figure S24**) to determine genes that are present in an organism. For a given pathway, all reactions were retrieved along with their corresponding genes that are found in organisms belonging to the phylum Firmicutes, using the Python package BioServices (Cokelaer et al., 2013). Within MetaDraft, the AutoGraph method (Notebaart et al., 2006) is used, which is a sequence-based orthology approach, independent of functional annotation. It is therefore suitable to serve as an independent method to validate the results obtained using KO's.

#### 2.9.2. Validation of Computational Findings in Metagenomics

In metagenomics, a computational validation was performed using 16S rRNA reconstruction and binning, which aims to reach the species level of taxonomy. It therefore provides extra confidence for the hypotheses generated at the genus level with the basic computational pipeline. Moreover, an extra computational validation was performed on the concluding results from the pathway enrichment analysis in the LAB comparison. By removing all nearly identical sequences (below 99% amino acid similarity) of reconstructed bins and complete isolate genomes of interest (L. plantarum) from the metagenome samples, the potential exclusive contribution of, for example, the high PTS potential of L. plantarum can be determined. In this way, a shift in the functional potential of the community induced by a single species can be identified.

### 2.10. Assessing Motility of *Acetobacter*

Motility of Acetobacter was tested on MRS/whey agar (26 g MRS broth from OXOID, 16 g agar, 500 ml water and 500 ml kefir whey, 48 h fermentation). The plates were incubated for 3 to 4 days at 30 °C. Motility was regarded as positive when the cultures spread into the agar and around the spotted colony. Growth only at the spotted area was rated negative. Motility was observed already after 1 day for all four Acetobacter isolates. Growth on YPDA for up to 4 days at 30 °C revealed no motility.

## 3. RESULTS

### 3.1. Grouping of Genera Based on Presence of KO's

We isolated and sequenced 33 organisms from kefir communities (see section 2.1 for details). To identify discriminative factors between species, we first focused only on the presence and absence of KO's per species and clustered the species based on their KO content using affinity propagation. Hierarchical clustering on top of this result identified eight distinct clusters that separate, and in some cases subdivide, the genera of Lactobacilli, Lactococci, Rothia, Acetobacter, Staphylococci and Micrococci (**Figure 2**). See section 2.7 and **Figure S1** for details. This result shows that the KO content alone already has discriminative power and can also lead to non-trivial results, as not only organisms of the same genus group together but also organisms of different genera. The interpretation of the results is, however, not straightforward, as the molecular functions assigned to the KO's cannot easily be translated into predictions about physiological characteristics that distinguish the clusters. Therefore, further analysis is required, as described below.

### 3.2. KEGG Pathway Coverage Discriminates Two Groups of *Lactobacilli*

To understand the clustering results better, we mapped the KO's to the level of KEGG pathways and calculated pathway coverage (the number of KO's of a pathway that are present in an organism, divided by the total number of KO's in that pathway; see section 2.7). Pathway coverage was subsequently used as input for another clustering. The resulting hierarchical clustering shown in **Figure 2** is similar to the one obtained based only on KO presence, except for the Lactobacilli. Whereas these form a single cluster in the previous dendrogram, they are distributed over two clearly separated clusters when using pathway coverage.

To identify the pathways that discriminate the two groups of Lactobacilli, we determined all pathways that have a high standard deviation with respect to their coverage. They are shown in **Figure 3**. The most notable differences are associated with amino acid metabolism: In L. kefiranofaciens, histidine, phenylalanine, tryptophan and tyrosine metabolism is completely absent while the remaining Lactobacilli all have KO's associated with the synthesis pathways for these amino acids. Conversely, L. kefiranofaciens has 27 entries on the phosphotransferase system (PTS) pathway map, whereas the remaining Lactobacilli have at most 7 KO's on this map (**Figure S2**).
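Selecting pathways by the spread of their coverage can be sketched as follows. The coverage values, the pathway and organism names, and the cutoff of two pathways are hypothetical, and the population standard deviation is used here as an assumption, since the text does not specify the estimator.

```python
import statistics


def most_variable_pathways(coverage, top=2):
    """Rank pathways by the standard deviation of their coverage across organisms."""
    spread = {
        pathway: statistics.pstdev(list(values.values()))
        for pathway, values in coverage.items()
    }
    return sorted(spread, key=spread.get, reverse=True)[:top]


# Toy pathway coverage per organism (hypothetical values between 0 and 1).
coverage = {
    "pathway_A": {"org_1": 0.9, "org_2": 0.1, "org_3": 0.1},
    "pathway_B": {"org_1": 0.5, "org_2": 0.5, "org_3": 0.5},
    "pathway_C": {"org_1": 0.0, "org_2": 0.6, "org_3": 0.3},
}

discriminating = most_variable_pathways(coverage, top=2)
print(discriminating)
```

A pathway with identical coverage in all organisms (pathway_B in the toy data) has zero spread and is never selected, regardless of how high its coverage is.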

### 3.3. Identifying Discriminating Signaling Pathways and Structural Components

This method is not restricted to metabolism but can also make predictions about structural and signaling components represented in KEGG pathways. By identifying the pathways that show the highest standard deviation with respect to their coverage between a representative of each of the clusters, we found that only Acetobacter has KO's associated with flagella assembly (**Figure 3**). Acetobacter also has the highest pathway coverage for bacterial chemotaxis (**Figure 3**), which is related to oxygen sensing. Since these organisms are strict aerobes (Sievers and Swings, 2015), both observations would be in agreement with the hypothesis that they use chemotaxis to move along oxygen gradients, and possibly also along gradients of their carbon and energy source. Motility of Acetobacter, consistent with the presence of flagella, was experimentally confirmed (see section 2.10).

### 3.4. Results of KO Annotation Are Consistent With Systematic Pathway Reconstruction

These analyses show that it is possible to create hypotheses about metabolic capacities and structural properties in a fast manner using annotated genomes, in this case annotated with KO's. As only around 50% of the coding sequences can be mapped to KO's (**Table S1**), there is the possibility that important reactions which do not have KO's associated with them are missed. Therefore, we confirmed the results shown in **Figure 3** using an approach that does not rely on KO's but uses only sequence information. For the KEGG maps containing the histidine synthesis pathway and the phenylalanine, tyrosine and tryptophan synthesis pathways, respectively, we created stoichiometric models by retrieving all genes associated with reactions in the respective pathways that belong to organisms of the phylum Firmicutes, which also covers the genus Lactobacillus. Subsequently, InParanoid (O'Brien et al., 2005) was used to find orthologs in sequences of the kefir isolates, and the corresponding reactions were identified and compared to the reactions associated with the KO's present. The results obtained in this way are consistent with the BlastKoala output (**Figures S3, S4**; see section 2.9 for details). However, the analysis is far more time-consuming than running BlastKoala, even if only these two pathways are considered.

### 3.5. Dynamics of Genera in Wine Fermentations

The metagenome of each sample was assembled into contigs and scaffolds (see section 2.6). The open reading frames (ORF's) on these sequences were identified and annotated with KO's using GhostKoala. An overview of the dynamics of abundances of genera was obtained by summing the KO coverage, i.e., the number of reads mapped to the ORF corresponding to the KO, per genus, in each of the samples (**Figure 4**). Although our basic computational pipeline aims to explore the functional potential of the community, in metagenomics the overview of abundance dynamics can be obtained without extra workload. The table of genera abundances was normalized, and genera with a high standard deviation of abundance across the samples were kept (see section 2.7). A few notable patterns appeared. Firstly, the Lactobacillus genus is highly abundant in the samples inoculated with L. plantarum. However, the abundance of Lactobacillus diminished over time when inoculated in the two red grape varieties, Bobal and Tempranillo, whereas in the white grape variety, Airen, it was highly abundant and the abundance increased during the fermentation. Furthermore, Lactobacillus was also present in the Airen controls, in contrast to the control fermentations of the red varieties. Secondly, the abundance of Lactobacillus in the Airen variety seems to correlate negatively with the abundance of two genera (Aspergillus and Sclerotinia), which are spoilage molds. Thirdly, the abundance of Lactobacillus is positively correlated with multiple genera such as Pediococcus, Enterococcus, and Oenococcus (see **Figure 4**). Fourthly, some genera are present in fermentations of all three grape varieties, like Pseudomonas, Azotobacter, Vitis and Saccharomyces. Fifthly, some genera occur in fermentations of one variety only, such as Pantoea and Gluconobacter in Airen, Dyella and Rhodanobacter in Tempranillo, and Bradyrhizobium and Acetobacter in Bobal (see section **S5**; **Figures S17**, **S19**, **S20**, for a systematic investigation of discriminative genera and the corresponding pathways for each wine variety). Finally, the observation of Saccharomyces and Vitis (grape) DNA is in agreement with the prior knowledge that during the alcoholic fermentation Saccharomyces abundance is high and that grape skins are only added at the start of the red wine fermentations and not in the white wine fermentations.

### 3.6. Clustering of Samples Based on KO Abundance in Genera

The mapped data were used to create a table of the KO abundances per genus, which increases the feature space substantially relative to summing these numbers per genus as done above. The samples were clustered using affinity propagation on the Pearson correlation matrix of this table (see section 2.7). This resulted in a high-resolution grouping of samples (**Figure 5**), evidently better than when using reconstructed small subunit (SSU) rRNA abundances (see **Figure S10**). The microbiomes of the red and white grape varieties could be distinguished, as well as three different stages of fermentation separating the samples of the initial grape must phase, the samples during fermentation, and bottled or final samples of the time series. Finally, the samples of the Airen variety inoculated with L. plantarum formed a highly correlated separate cluster. The robustness of the clustering was tested by removing major genera (Lactobacillus, Oenococcus and Saccharomyces) and a potential artifact (Vitis) from the data and reapplying the clustering. The main groups remained essentially unchanged after this procedure (**Figure S15**).

### 3.7. *L. plantarum* Has the Highest PTS Potential Among the Community

To confirm that the Lactobacillus genus pattern identified so far is indeed the result of the added L. plantarum MW-1 strain, we applied 16S rRNA reconstruction and binning (see sections **S2–S4**; **Figures S8**, **S9**, **S11–S14**). As a result, we obtained a reconstruction of the 16S rRNA genes of L. plantarum. Moreover, the L. plantarum draft genome was successfully binned with a high completeness score. Using a few well-reconstructed genomes from the binning process, we demonstrate that our method can also be used on metagenomic bins. We compared the L. plantarum isolate strain with the reconstructed Lactobacillus brevis genome bin from the Airen fermentations and the three reconstructed Oenococcus oeni genome bins, one from each grape variety. The comparison revealed that the L. plantarum and L. brevis bins had a higher metabolic potential than the three Oenococcus bins, especially with regard to the amino acid metabolism, PTS and sulfur relay system KEGG pathways (see **Figure 6A**). Using metagenomic assembly annotations, the coverage of Lactobacilli PTS stood out when L. plantarum was present in the fermentations (see **Figure 6B**, top). The same effect was observed for genes mapped to amino acid metabolism. Moreover, in addition to Saccharomyces, Pantoea, Komagataeibacter, Gluconobacter, Erwinia, and Asaia were found to be in the top ten genera with high coverage of amino acid metabolism (**Figure 6B**, bottom). Interestingly, Boruta feature selection analysis assigns the latter five genera as discriminative for Airen against Bobal and Tempranillo (**Figure S16**).

## 4. DISCUSSION

The examples demonstrating computational analysis on functional and metabolic level show that it is possible to characterize organisms or samples based on KO annotation of genomes, and that hypotheses concerning the physiology and roles of organisms can be derived. This approach is especially useful when studying complex communities. It aims at grouping and contrasting of species by a global comparison of functions. It thereby provides evidence for groups of organisms that might play similar roles, or points to their differences and putative specific roles that they might play in a community. Our computational pipeline can be used in several ways in the research of microbial communities.

When genome sequences of individual community members are available, they can be easily characterized in terms of their functional potential. This is particularly relevant for communities that are not well described. As an example, the Acetobacter species stood out among the kefir isolates by the fact that they possess structural genes for the assembly of flagella, as well as a chemotaxis signaling system possibly involved in oxygen sensing (**Figure 3**). Since their motility was confirmed experimentally, these observations suggest an important role for chemotaxis of this species in kefir. Indeed, Acetobacter is mostly present in kefir milk, and less in the semi-solid grains, which is in accordance with this hypothesis (Marsh et al., 2013).

Another important observation was that L. kefiranofaciens, a dominant organism in kefir (Walsh et al., 2016), stands out among the Lactobacilli because of the absence of biosynthesis pathways for a number of amino acids. This species will therefore most likely have several amino acid auxotrophies. Hence, the organism will depend on free amino acids and peptides in milk, which can be present in fresh milk, are released by extracellular enzymatic degradation of milk protein or are produced by other organisms. Whichever way, these auxotrophies will play an important role in the ecology of kefir fermentation.

One should, however, keep in mind that the characterization only concerns genotypic potential. Whether and under which conditions the same genotypic potential also results in identical phenotypes will have to be examined in experiments. We anticipate that the absence of a pathway is more conclusive than its presence, as it is most likely context and media dependent whether genes of a pathway are expressed. We strongly believe that this approach provides more insights than a clustering based on gapfilled genome-scale stoichiometric models. To accurately close gaps in pathways, one would have to determine an organism-specific biomass composition and grow the individual species under several different conditions, e.g., to identify auxotrophies and carbon sources that can be utilized, which is very time and resource consuming. It is also very challenging from an experimental point of view, as species can be hard to cultivate in isolation. Alternatively, one could automatically gapfill all the models on a defined medium without experimental validation, but then one might miss auxotrophies that can lead to metabolic interactions, and the added value of the gapfilling is more than questionable. The presented method focuses only on the gene-associated reactions, avoiding all unnecessary overhead, and allows a fast selection of interesting species that can then be examined further in experiments.

Computational analysis was further applied to metagenome data of wine fermentations to explore the effect of the introduction of an L. plantarum strain on community composition and dynamics. Furthermore, the dataset, although limited, also allowed an initial exploration of differences between communities in red and white wine fermentations. Together with the functional annotation, GhostKoala also provides taxonomic assignment at the genus level, which allows not only the exploration of the functional potential of the community, but also a straightforward investigation of the dynamics of genus abundances.

We thus readily found evidence to support the hypothesis that successful inoculation of a new species into a community is, in the case of wine, primarily an effect of medium composition, and may be determined by fermentation with or without skins. Nevertheless, the effect of microbial community interactions such as competition or collaboration cannot be discarded. The experimental results supported this hypothesis (see section **S1**; **Figures S21**, **S22**). Studies on the closely related species Lactobacillus hilgardii and Pediococcus pentosaceus indicated that phenolic compounds from grape skins could be involved (García-Ruiz et al., 2009). Therefore, the identification of the mechanism behind the inhibition by phenolic compounds, as well as the selection of strains resistant to these compounds, could play a key role in the use of organisms other than O. oeni for MLF in red wines.

The use of annotated metagenomes allowed a fast overview of the community abundance dynamics, such as the time-dependent abundance level per genus, the presence of common genera in different microbiomes and the identification of unique genera in the microbiomes of grape varieties. In addition, we identified putative positive and negative correlations with L. plantarum, suggesting for example that L. plantarum may inhibit the growth of fungi (Aspergillus and Sclerotinia), as has been observed before (Valerio et al., 2009; Tropcheva et al., 2014; Lipińska et al., 2016).

By binning metagenomics data and using these to investigate KEGG pathway enrichment, we showed that L. plantarum is highly enriched in PTS transport components compared to the other microorganisms in the wine communities. Only a few other metabolic conversions are exclusively found in L. plantarum: Fructoselysine/Glucoselysine → Fructoselysine/Glucoselysine 6-phosphate, N-Acetyl-galactosamine → N-Acetyl-galactosamine 6-phosphate, and Galactosamine → Galactosamine 6-phosphate (see section **S6**; **Figure S18**). These unique properties could play a role in growth of the community.

The Shannon index reveals substantial differences in microbial diversity between the white and the two red varieties (**Figure S23**). The relative abundance of S. cerevisiae reaches up to 90% in the red wine fermentations, whereas in the white wine fermentations it reaches up to 60%. Also, Pantoea and Erwinia from the Enterobacteriaceae family and Asaia, Gluconobacter and Komagataeibacter from the Acetobacteraceae family are exclusively found in the white wine fermentations. These genera are known to be relevant for winemaking (Marzano et al., 2016; Morgan et al., 2017a), in particular acetic acid bacteria for their capacity to oxidize ethanol to acetic acid (Gomes RJ, 2018). Yet, their potential function inside wine communities is not fully explored. We have shown that these five genera have high coverage of metabolic pathways involved in amino acid metabolism. Amino acids, together with ammonium salts, are major nitrogen sources present in grapes, and are essential for microbial growth (Waterhouse, 2016). Moreover, the composition of amino acids seems to influence wine aroma (Hernández-Orte et al., 2002; Styger et al., 2011). Studies have therefore already examined the effect of microorganisms on amino acid composition during AF (S. cerevisiae; Fairbairn et al., 2017) and MLF (O. oeni and L. plantarum; Pozo-Bayón et al., 2005). With this in mind, we suggest that the five genera mentioned above are candidates for future investigation.
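For reference, the Shannon index used in the diversity comparison above is $H = -\sum_i p_i \ln p_i$, where $p_i$ is the relative abundance of taxon $i$. A minimal sketch follows; the natural logarithm is an assumption (the base is not stated in the text), and the two abundance profiles are hypothetical toy values that only illustrate how dominance by one taxon lowers the index.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero abundance."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


# Toy communities (hypothetical): one strongly dominated, one more even.
dominated = [90, 5, 5]
more_even = [60, 20, 20]

h_dominated = shannon_index(dominated)
h_more_even = shannon_index(more_even)
print(f"dominated: {h_dominated:.3f}, more even: {h_more_even:.3f}")
```

The index is zero for a community with a single taxon and reaches its maximum, the logarithm of the number of taxa, when all taxa are equally abundant.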

## AUTHOR CONTRIBUTIONS

CM and WG conceived the methodology, wrote the code, and performed the analyses. SB, YK, and KP sequenced the genomes of the kefir microorganisms. MA-K, HS, and SS carried out the sequencing of wine metagenomes. CM and NE carried out the inhibition experiments on L. plantarum. CM, WG, DM, and BT wrote the paper.

### FUNDING

MicroWine: This study was funded by the Horizon 2020 Programme of the European Commission within the Marie Skłodowska-Curie Innovative Training Network MicroWine (grant number 643063).

### ACKNOWLEDGMENTS

We thank Herwig Bachmann, Frank Bruggeman, Elke Brockmann, Esther Kuiper, Raissa Novais, Ana Rute Neves, and Ulisses Nunes da Rocha for discussions. We thank Domaine Kikones and Boutari wineries for providing samples.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.01347/full#supplementary-material

## REFERENCES


**Conflict of Interest Statement:** MA-K, HS, SS, and NE were employed by the company Christian Hansen A/S.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Melkonian, Gottstein, Blasche, Kim, Abel-Kistrup, Swiegers, Saerens, Edwards, Patil, Teusink and Molenaar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genomic Comparison of Lactobacillus helveticus Strains Highlights Probiotic Potential

Alessandra Fontana<sup>1</sup>†, Irene Falasconi<sup>1</sup>†, Paola Molinari<sup>1</sup>, Laura Treu<sup>2</sup>\*, Arianna Basile<sup>2</sup>, Alessandro Vezzi<sup>2</sup>, Stefano Campanaro<sup>2,3</sup>‡ and Lorenzo Morelli<sup>1</sup>‡

<sup>1</sup> Department for Sustainable Food Process – DiSTAS, Università Cattolica del Sacro Cuore, Piacenza, Italy, <sup>2</sup> Department of Biology, University of Padua, Padua, Italy, <sup>3</sup> CRIBI Biotechnology Center, University of Padua, Padua, Italy

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Julio Villena, CONICET Centro de Referencia para Lactobacilos (CERELA), Argentina

Giorgio Giraffa, Research Centre for Animal Production and Aquaculture (CREA), Italy

#### \*Correspondence:

Laura Treu laura.treu@unipd.it

†These authors have contributed equally to this work as co-first authors

‡These authors have contributed equally to this work as co-last authors

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 21 January 2019 Accepted: 03 June 2019 Published: 26 June 2019

#### Citation:

Fontana A, Falasconi I, Molinari P, Treu L, Basile A, Vezzi A, Campanaro S and Morelli L (2019) Genomic Comparison of Lactobacillus helveticus Strains Highlights Probiotic Potential. Front. Microbiol. 10:1380. doi: 10.3389/fmicb.2019.01380

Lactobacillus helveticus belongs to the large group of lactic acid bacteria (LAB), which are the major players in the fermentation of a wide range of foods. LAB are also present in the human gut, which has often been exploited as a reservoir of potential novel probiotic strains, but several parameters need to be assessed before establishing their safety and potential use for human consumption. In the present study, six L. helveticus strains isolated from natural whey cultures were analyzed for their phenotype and genotype in exopolysaccharide (EPS) production, low pH and bile salt tolerance, bile salt hydrolase (BSH) activity, and antibiotic resistance profile. In addition, a comparative genomic investigation was performed between the six newly sequenced strains and the 51 publicly available genomes of L. helveticus to define the pangenome structure. The results indicate that the newly sequenced strain UC1267 and the deposited strain DSM 20075 can be considered good candidates for gut-adapted strains due to their ability to survive in the presence of 0.2% glycocholic acid (GCA) and 1% taurocholic and taurodeoxycholic acid (TDCA). Moreover, these strains had the highest bile salt deconjugation activity among the tested L. helveticus strains. Considering the safety profile, none of these strains presented antibiotic resistance phenotypically and/or at the genome level. The pangenome analysis revealed genes specific to the new isolates, such as enzymes related to folate biosynthesis in strains UC1266 and UC1267 and an integrated phage in strain UC1035. Finally, the presence of maltose-degrading enzymes and multiple copies of 6-phospho-β-glucosidase genes in our strains indicates the capability to metabolize sugars other than lactose, which is related solely to dairy niches.

Keywords: lactic acid bacteria, Lactobacillus helveticus, bile salts tolerance, exopolysaccharides, antibiotic resistance, comparative genomics, probiotics

#### INTRODUCTION

Lactic acid bacteria (LAB) play a key role in the production of various fermented foods and are also found in diverse environments, such as soil and the human gut, the latter often being considered a reservoir of potential novel probiotic strains. Lactobacillus helveticus was first described by Orla-Jensen in 1919 as Thermobacterium helveticum, in which the prefix "thermos" referred to the high temperature used for the production of Emmental, the initial isolation source of the bacterium (Naser et al., 2006). In addition to Swiss-type cheeses, strains belonging to this species are present in natural whey cultures from Italian long-ripened cheeses (e.g., Parmigiano Reggiano and Grana Padano) (Giraffa et al., 2000), strongly suggesting that the primary habitat of this species is the dairy environment. However, some strains of L. helveticus contain specific probiotic features in their genome. For example, an immune protection effect related to S-layer proteins has been demonstrated in L. helveticus M92 and NS8 (Beganović et al., 2011; Rong et al., 2015). Additionally, high production of exopolysaccharides (EPSs) and bacteriocins has been studied in the MB2-1 and KLDS1.8701 strains, respectively (Li et al., 2015a,b). Moreover, the inhibition of Campylobacter jejuni invasion was demonstrated for L. helveticus R0052 (Wine et al., 2009), whereas the MTCC 5463 strain possesses genes for adhesion and aggregation, including mucus-binding proteins (Senan et al., 2015b). Furthermore, a close relatedness exists between L. helveticus DPC 4571 and Lactobacillus acidophilus NCFM found in the gastrointestinal tract (GIT), which exhibit 98.4% sequence identity for the 16S rRNA gene (Callanan et al., 2008). An important environmental factor for bacterial niche specialization is the type of sugar available (Slattery et al., 2010). For instance, lactose is the main sugar present in dairy niches, whereas maltose is typically found in environments where by-products of starch metabolism are present, such as the gut. The presence of the enzyme maltose-6-phosphate glycosidase, as well as multiple copies of glucosidase genes, can be putative indicators of a gut-adapted microorganism (Slattery et al., 2010). Most common probiotic strains contain two copies of the gene encoding α-1,6-glucosidase (Cremonesi et al., 2012; Møller et al., 2012, 2014). However, only one copy has been found in the dairy strain L. helveticus DPC 4571 (Slattery et al., 2010).

Another important element allowing adaptation to a certain niche is the capacity to deal with stressful conditions, such as those present in the gut environment. For example, survival at the low pH of the stomach is fundamental, as are tolerance to bile salts and the presence of a functional bile salt hydrolase (BSH). Bile acids originate from cholesterol and can be found in a conjugated form with either glycine or taurine. Their toxicity toward bacterial cells relies on their surfactant-like nature, which induces intracellular acidification or disrupts cell membranes (Begley et al., 2005). Deconjugation of bile salts by BSH activity increases their recovery via passive absorption through the colonic epithelium (Ridlon et al., 2006). A frameshifted, non-functioning bsh has been found in L. helveticus DPC 4571 (Slattery et al., 2010). In contrast, the presence of a bile acid-inducible operon, containing one bsh gene and two choloylglycine hydrolases, has been highlighted in L. helveticus MTCC 5463 (Senan et al., 2015b). Nonetheless, other genes are involved in the mechanisms underlying bile salt tolerance. For example, in the probiotic strain L. acidophilus NCFM, the genes induced in the presence of bile include some major facilitator superfamily (MFS) members, as well as permease and ATPase subunits of ABC transporters (Pfeiler and Klaenhammer, 2009). All of these genes were annotated as members of the multidrug resistance (MDR) family, which plays a role in defending against inhibitory compounds by ejecting a wide variety of substrates from the cell, such as antibiotics, bile salts, and peptides.

In addition to bile salt tolerance and hydrolysis, the ability to colonize the GIT by forming biofilms is also fundamental (Branda et al., 2005). In this context, EPSs play a key role, contributing to the structural diversity of the cell wall of Lactobacillus spp. (De Vuyst and Degeest, 1999). Moreover, the presence of other cell surface factors, such as S-layer proteins, which are not present in all Lactobacillus spp., can promote adherence and immunostimulation mechanisms and be involved in competitive pathogen exclusion (Åvall-Jääskeläinen and Palva, 2005; Lebeer et al., 2008). For instance, surface-layer extracts from L. helveticus R0052 have been shown to inhibit the adhesion of Escherichia coli O157:H7 to epithelial cells (Johnson-Henry et al., 2007). S-layer-related genes have also been found in other strains of L. helveticus, namely, CNRZ 892, MIMLh5, M92, NS8, and MTCC 5463 (Callegari et al., 1998; Beganović et al., 2011; Taverniti et al., 2013; Rong et al., 2015; Senan et al., 2015b).

Considering the safety profile characterizing a putative probiotic strain, analysis of the antibiotic resistance pattern is important. A wide range of antibiotic resistance has been found in many L. helveticus strains isolated from dairy products, including Grana Padano and Provolone cheese starters (Fortina et al., 1998; Frece et al., 2009). Specifically, resistance to rifampicin, chloramphenicol, kanamycin, lincomycin, streptomycin, polymyxin B, and rifamycin has been highlighted.

According to the literature, many of the 51 deposited L. helveticus strains have proven probiotic capability, namely, R0052 (Mohammadi et al., 2018), KLDS1.8701 (Li et al., 2017), CAUH18 (Yang et al., 2016), MB2-1 (Li et al., 2015a), MTCC 5463 (Senan et al., 2015a), H9 (Chen et al., 2015), M92 (Beganović et al., 2013), and D75 and D76 (Roshchina et al., 2018). Specifically, R0052 is characterized by the production of mucus-binding proteins and surface-layer proteins; CAUH18, by EPS formation and cell aggregation properties; and M92, H9, D75, and D76, by their proteolytic activity and bacteriocin production. However, as the remaining strains may have unidentified phenotypic probiotic features, all of the 51 available L. helveticus strains were included in the comparative analysis described in the present study. The observed adaptation of L. helveticus strains to different ecological niches, such as the gut and dairy, suggests the need for more in-depth investigation at both the genomic and phenotypic levels, which can be useful for gaining insights into the evolutionary history of this species. Moreover, the demand for new interesting strains for industry-driven applications in cheese ripening and health-promoting products opens further perspectives for L. helveticus (Giraffa, 2014).

In the present study, the phenotypes and genotypes of six L. helveticus strains isolated from natural whey cultures were analyzed to highlight specific features for possible application as probiotics. Specifically, EPS production, S-layer-related genes, low pH and bile salt tolerance, BSH activity, and antibiotic resistance were evaluated. In addition to the search for properties of gut-adapted strains, a comparative genomic investigation was performed between the newly sequenced strains and the 51 publicly available genomes of L. helveticus.

#### MATERIALS AND METHODS

#### Whole Genome Sequencing of L. helveticus Strains

Genomic DNA of six L. helveticus strains (UC1035, UC1266, UC1267, UC1275, UC1285, and UC3147) was extracted with the E.Z.N.A.® Bacterial DNA Kit (Omega Bio-tek, United States). The quality of the extracted DNA was checked by agarose gel electrophoresis (0.8%), and the DNA was then quantified with the Qubit fluorometer (Life Technologies, Carlsbad, CA, United States). Genomic DNA was sequenced using the Illumina MiSeq technology (2 × 150 bp). Reads were filtered and assembled with CLC Genomics Workbench v. 5.1 (CLC Bio, Aarhus, Denmark) using CLC's de novo assembly algorithm with a k-mer size of 63 and a bubble size of 60, as previously described (Zhu et al., 2019). Only scaffolds longer than 1 kb were considered for further analyses.
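The length filter applied to the assembled scaffolds can be sketched as follows. This is a minimal illustration in plain Python, not the procedure used in the study (which relied on CLC Genomics Workbench); the function names and the toy FASTA records are assumptions of the example, and "longer than 1 kb" is interpreted as strictly more than 1,000 bases.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def filter_scaffolds(records, min_length=1000):
    """Keep scaffolds strictly longer than min_length bases."""
    return [(h, s) for h, s in records if len(s) > min_length]

# Toy assembly (hypothetical): scaffolds of 1,200, 1,000, and 1,100 bases.
toy_fasta = [">s1", "ACGT" * 300, ">s2", "ACGT" * 250, ">s3", "A" * 500, "C" * 600]
kept = filter_scaffolds(read_fasta(toy_fasta))
print([header for header, _ in kept])
```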

#### Comparative Genome Analysis

Fifty-one L. helveticus strain genomes were downloaded from the NCBI microbial genome database<sup>1</sup> (December 2018). Genome metrics (e.g., genome size, N50, and number of scaffolds) were determined using CheckM (v1.0.7) (Parks et al., 2015). Gene prediction and annotation were performed using Prokka (v1.12) (Seemann, 2014) trained on Lactobacillus annotations deposited in the NCBI database. Annotation was refined with EggNOG (v4.5.1) using eggNOG-mapper (Huerta-Cepas et al., 2016), with the protein sequences predicted by Prokka as input. All the L. helveticus genomes were uploaded to the RAST server and annotated using SEED and the RAST gene caller (Overbeek et al., 2014). Annotations obtained from Prokka and RAST in tabular and GenBank formats were uploaded and made available on SourceForge<sup>2</sup>.
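Of the genome metrics listed above, N50 is the least self-explanatory: it is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The sketch below shows one common way to compute it; it is an illustration with made-up scaffold lengths and is not the CheckM implementation.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("at least one scaffold length is required")

# Hypothetical scaffold lengths (kb); total = 20, so half = 10.
print(n50([2, 3, 4, 5, 6]))
```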

Annotation results were downloaded, and using in-house developed Perl scripts (Treu et al., 2018), the number of genes present in each SEED category was determined considering both the "first" and "second" levels (**Supplementary Data S2**). The pangenome was predicted with Roary (v3.11.2) (Page et al., 2015) using as input the annotation files previously generated, and results were visualized using the script "roary\_plots.py". "Unique" and "new" genes derived from the pangenome analysis were determined using the create\_pan\_genome\_plots.R script of the Roary package. The core genomes of L. helveticus were aligned using Parsnp (v1.2) (Edgar, 2004; Bruen et al., 2006; Price et al., 2010; Treangen et al., 2014), producing variant (SNP) calls and a core genome phylogeny. An additional verification of the genes present in the six strains sequenced in the current project was performed independently of the assembly, considering all the pangenome sequences obtained from Roary. More specifically, a representative gene sequence was collected for each gene cluster, giving priority to the sequences of the complete strains downloaded from the NCBI database; shotgun Illumina reads obtained for each strain were aligned to this "pangenome database" using Bowtie 2 (v2.2.4) (Langmead and Salzberg, 2012), and the coverage of each gene was determined using the pileup.sh tool of the BBTools package (**Supplementary Data S2**). The files in "newick tree" format obtained with Prokka and Parsnp were re-rooted on strain KLDS1.8701 and decorated using iTOL (Letunic and Bork, 2016). Genome sequences were aligned using progressiveMauve to identify strain-specific regions and their presence in the genome (Darling et al., 2010). The strain CAUH18 deposited in RefSeq was used as the reference for genome comparison in **Supplementary Figure S1**. The presence of antibiotic resistance genes (ARG), prophages, bacteriocins, and plasmids was evaluated in the strains sequenced in the present study. ARG analysis was performed using RGI (v3.2.1) (Jia et al., 2017) with the parameter "–loose\_criteria = no." Integrated prophages were investigated using PHASTER (Arndt et al., 2016). The presence of Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) was evaluated with CRISPRCasFinder (Couvin et al., 2018) (**Supplementary Data S3**). Bacteriocins were searched for using BAGEL4 (de Jong et al., 2006). The presence of plasmid sequences was evaluated using plasmidSPAdes (v3.13.0) (Antipov et al., 2016). The presence of BSH and penicillin-V acylase (PVA) genes was verified using hmmsearch (Eddy, 2011); briefly, protein sequences of BSH and PVA were recovered from the NCBI database and aligned using ClustalW (v2.1) (Thompson et al., 2002), a hidden Markov model was built with hmmbuild (v3.1b1) (Eddy, 1998), and protein sequences obtained from Prokka were analyzed using hmmsearch (**Supplementary Data S3**).

#### EPS Production Evaluation

The ability to produce EPS was tested by inoculating our six isolates in de Man Rogosa Sharpe (MRS)-lac agar in which glucose was replaced by 2% of lactose (Torino et al., 2005). Ropy phenotype was examined by picking the colonies with a sterile loop and observing the formation of a filament when the loop was lifted (Ruas-Madiedo and de los Reyes-Gavilán, 2005).

#### Bile Salts and Low pH Tolerance Assays

Bile salt tolerance was assessed by cultivating the bacterial strains in the presence of increasing concentrations of the following bile salts: glycocholic acid (GCA), glycodeoxycholic acid (GDCA), taurocholic acid (TCA), and taurodeoxycholic acid (TDCA). The concentrations tested were 0.01, 0.1, 0.2, 0.5, 1.0, and 2.0% w/v for each compound. Briefly, overnight cultures were centrifuged, washed twice with PBS, and adjusted to a final Abs<sub>600</sub> of 1. Ten microliters of the cell suspension was inoculated in a 96-well plate containing 190 µl of MRS supplemented with different concentrations of each bile salt and then incubated for 24 h at 45°C. After incubation, Abs<sub>600</sub> was measured, and the results were expressed as the percentage of growth in the presence of bile salts relative to the control grown without the addition of any compound (Shinoda et al., 2001).
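The way the tolerance results are expressed, as growth relative to the bile-free control, can be written as a one-line calculation. The sketch below is illustrative only: the optional blank subtraction and all absorbance values are assumptions of the example and are not taken from the study.

```python
def relative_growth(abs_treated, abs_control, abs_blank=0.0):
    """Growth with bile salt as a percentage of the bile-free control.

    abs_blank (uninoculated medium) is an assumed, optional correction.
    """
    control = abs_control - abs_blank
    if control <= 0:
        raise ValueError("control absorbance must exceed the blank")
    return 100.0 * (abs_treated - abs_blank) / control

# Hypothetical Abs600 readings after 24 h for one strain.
control_reading = 1.2
readings = {0.01: 1.1, 0.1: 0.9, 0.2: 0.6, 0.5: 0.3}  # % w/v -> Abs600
for conc, value in sorted(readings.items()):
    print(conc, round(relative_growth(value, control_reading), 1))
```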

The test for tolerance to low pH was carried out following the protocol of Prasad et al. (1999) with minor changes. Strains were cultivated overnight at 45°C, and then 0.2 ml was centrifuged, washed twice with 1 ml of NaCl 5 g/L, and resuspended in 2 ml of an acid solution composed of NaCl 5 g/L and 25 mM glucose at pH 3 to have approximately 10<sup>8</sup> CFU/ml. A 96-well plate was filled with 200 µl of each acid solution inoculated with the different strains and incubated at 45°C. Aliquots of 100 µl of each sample were collected at 0, 60, 120, 180, and 300 min of incubation and used for bacterial counts.

<sup>1</sup>https://www.ncbi.nlm.nih.gov/genome/microbes/

<sup>2</sup>https://sourceforge.net/projects/lactobacillus-helveticus/

#### BSH Activity

Bacterial strains were grown overnight at 45°C in 50 ml of MRS, centrifuged for 10 min at 10,000 × g at 4°C, washed twice with 0.1 M sodium-phosphate buffer pH 6.8, and resuspended in the same buffer. The obtained suspension was sonicated for 60 s using a CV17 sonicator (VibraCell, Sonics and Materials Inc., Newtown, CT, United States) and centrifuged to remove cell debris (Noriega et al., 2006) in order to obtain cell-free extracts. BSH activity was quantified according to the two-step method described by Tanaka et al. (1999). In the first reaction, conjugated primary/secondary bile salts were incubated in a reaction mix with the different cell-free extracts to release amino acids from the bile salts. These amino acids were quantified in the second reaction as follows: 5 µl from the BSH reactions, diluted five times with 0.5 M sodium-citrate buffer pH 5.5, was mixed with 110 µl of the ninhydrin reagent, incubated for 14 min at 97°C in a PCR thermal cycler, and cooled down to 4°C. Standard curves were made with 5 µl of glycine in place of the sample. Abs<sub>570</sub> was measured after 30 min in an Epoch™ Spectrophotometer (Biotek, Winooski, VT, United States). BSH activity was expressed in U/ml, where one unit of BSH activity was defined as the amount of enzyme that liberated 1 µmol glycine from GDCA per minute (Jiang et al., 2010).
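The conversion from ninhydrin absorbance to enzyme units can be sketched as a linear standard curve followed by the unit definition given above (1 U = 1 µmol glycine released per minute). The example below is a simplified illustration: the standard-curve points, the reaction time, and the extract volume are hypothetical, and the curve is assumed to be expressed directly in µmol of glycine released in the whole BSH reaction, so that dilution steps are already accounted for.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bsh_units_per_ml(abs570, slope, intercept, reaction_min, extract_ml):
    """U/ml, with 1 U = 1 µmol glycine liberated per minute."""
    glycine_umol = (abs570 - intercept) / slope
    return glycine_umol / (reaction_min * extract_ml)

# Hypothetical glycine standards (µmol) and their Abs570 readings.
standards_umol = [0.0, 1.0, 2.0]
standards_abs = [0.1, 0.6, 1.1]
slope, intercept = fit_line(standards_umol, standards_abs)

# Hypothetical sample: Abs570 of 0.6, 10 min reaction, 0.1 ml of extract.
print(bsh_units_per_ml(0.6, slope, intercept, reaction_min=10, extract_ml=0.1))
```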

#### Antibiotic Resistance Assay

The susceptibility profiles of the isolated strains to gentamicin, kanamycin, streptomycin, neomycin, tetracycline, erythromycin, clindamycin, and chloramphenicol were determined by broth microdilution using VetMIC plates for LAB (VetMIC Lact-1; National Veterinary Institute, Uppsala, Sweden). The inoculum for the test was prepared by picking colonies from fresh cultures grown on MRS agar plates (Difco, Detroit, MI, United States) and suspending them in sterile saline solution (NaCl 9 g/L) to reach an optical density corresponding to McFarland standard 1. The suspension was then diluted 1:1,000 in LAB susceptibility test medium (LSM), composed of 90% Iso-Sensitest broth and 10% MRS broth; 100 µl of the final bacterial suspension was added to each well of the VetMIC plates. The plates were then incubated at 37°C for 48 h under anaerobic conditions. The minimum inhibitory concentration (MIC) was defined as the lowest antibiotic concentration at which no growth was observed (Huys et al., 2010). Results were compared to the cutoff values established by EFSA (European Food Safety Authority, 2012). L. helveticus DSM 20075 was used as the reference strain.
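The MIC definition above translates directly into a small routine: given the growth outcome in each well of a dilution series, take the lowest concentration without visible growth and compare it with a cutoff. The sketch below is illustrative; the concentrations, the growth readings, and the cutoff value are hypothetical placeholders and are not EFSA values or results from this study. A strain is treated as resistant when its MIC exceeds the cutoff.

```python
def mic(growth_by_conc):
    """Lowest concentration (mg/L) with no visible growth, or None if
    growth occurred at every tested concentration."""
    inhibited = [conc for conc, grew in growth_by_conc.items() if not grew]
    return min(inhibited) if inhibited else None

def is_resistant(mic_value, cutoff):
    """Resistant when the MIC is above the cutoff; None means the MIC lies
    above the tested range and is treated as resistant here."""
    return mic_value is None or mic_value > cutoff

# Hypothetical two-fold dilution series: concentration (mg/L) -> growth seen?
wells = {0.5: True, 1: True, 2: True, 4: False, 8: False, 16: False}
HYPOTHETICAL_CUTOFF = 16  # placeholder, not an EFSA value

value = mic(wells)
print(value, is_resistant(value, HYPOTHETICAL_CUTOFF))
```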

#### Statistical Analysis

To determine significant differences (P < 0.05) between the strains in relation to bile salt concentration, two-way analysis of variance (ANOVA) followed by the Bonferroni multiple comparisons test was performed. One-way ANOVA followed by Tukey's multiple comparisons test was carried out to compare BSH results. Both analyses were performed using GraphPad Prism 5 (GraphPad Software, La Jolla, CA, United States).
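For readers unfamiliar with the tests named above, the following sketch computes the F statistic of a one-way ANOVA and applies a Bonferroni adjustment to a list of p-values. It is a plain-Python illustration of the underlying arithmetic, not a reproduction of the GraphPad Prism analysis; it does not cover the two-way design or Tukey's test, and the measurements are invented.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and replicated measurements")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical activity measurements for three strains (three replicates each).
strains = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(one_way_anova_f(strains))
print(bonferroni([0.01, 0.6]))
```

The F statistic is then compared with the F distribution for the reported degrees of freedom to obtain the P value.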

### RESULTS AND DISCUSSION

The average genome size and GC content of the six newly sequenced L. helveticus strains were 1.9 Mb and 36.7%, respectively (**Supplementary Data S1**). Gene finding and annotation resulted in an average of 2,090 coding sequences (CDSs), with a coding density of 84.7%.

#### Comparative Genomics Analyses

Comparative genomic analysis showed that the L. helveticus pangenome can be considered "open," since nearly 30 new genes are added for each additional genome considered (**Figure 1**). This suggests a remarkable range of phenotypic variability between strains, conferred by a very flexible genetic content and by the presence of strain-specific (unique) genes in each genome. This genomic heterogeneity and the high number of publicly available L. helveticus strains hampered a graphical representation of all the aligned genomes. Therefore, an additional investigation of the pangenome was performed to identify the entire set of strain-specific genes. Computational mining of genome sequences aimed to favor the selection of strains for biotechnological use (probiotic potential), to investigate niche association, and to study phylogenetic correlation. The 57 genomes were grouped in terms of strain isolation and geographic localization to test for specific genome associations and functional gene groups. The defined niche categories based on geographical localization were as follows: Canada (n = 1), China (n = 7), Croatia (n = 1), Europe (n = 2), France (n = 6), India (n = 1), Italy (n = 11), Russia (n = 2), Swiss Confederation (n = 20), Tajikistan (n = 1), and United States (n = 5). According to the isolation source, strains were classified as follows: commercial dietary supplement (n = 5); dairy product (n = 27), divided into cheese, fermented milk, and raw milk; human (n = 3); industrial dairy starter (n = 2); malt fermentation (n = 1); natural whey culture (n = 14); and "not available" (n = 5) (**Supplementary Data S1**).
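The notion of an "open" pangenome rests on counting how many previously unseen gene clusters each additional genome contributes. The sketch below illustrates that count for a single, fixed order of genomes; Roary's plots are based on many random orderings, so this is a simplified illustration, and the gene cluster identifiers are invented.

```python
def new_genes_per_genome(genomes):
    """For genomes added in order, count gene clusters not seen before.

    genomes: list of sets of gene-cluster identifiers.
    """
    seen = set()
    added = []
    for genes in genomes:
        added.append(len(genes - seen))
        seen |= genes
    return added

# Hypothetical gene-cluster content of four strains.
toy_genomes = [
    {"g1", "g2", "g3"},
    {"g1", "g2", "g4"},
    {"g1", "g3", "g5", "g6"},
    {"g1", "g2", "g7"},
]
print(new_genes_per_genome(toy_genomes))
```

If the number of new clusters does not approach zero as genomes are added, the pangenome is described as open.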

From the phylogenetic tree, our strains were divided into two clusters based on SNPs (**Figure 2A**): the first was composed of UC1275, UC1285, and UC3147, and the second was composed of UC1035, UC1266, and UC1267. The former cluster also comprises FAM1450, ATCC 12046, MTCC 5463, and UC1156. All of the strains in this group were isolated in Europe, except MTCC 5463, which was isolated in Asia. MTCC 5463 is also an outlier in regard to the isolation source, as all strains in the cluster are of dairy origin but MTCC 5463 was isolated from the vaginal mucosa. The second cluster consists of CIRM-BIA 104, Lh12, M92, M3, ATCC 10386, CGMCC 1.1877, CIRM-BIA 101, and DSM 20075 in addition to the already mentioned strains. This cluster is more homogeneous than the other one because all of the strains were isolated in Europe and derived from a dairy environment. In both clusters, a probiotic strain is present, namely, MTCC 5463 in the first cluster (Senan et al., 2015b) and M92 in the second cluster (Beganović et al., 2011). Notably, all of the strains maintained the same distribution among the clusters considering both whole genome SNPs and the orthologous gene content (**Figure 2B**).

The Roary pangenome pipeline identified more than 8,000 different orthologous groups of proteins, which were organized into four classes according to the number of strains (out of 57) sharing each orthologous group: "Core" (56 or 57 strains), "Soft-core" (54 or 55 strains), "Shell" (8–53 strains), and "Cloud" (fewer than eight strains) (**Figure 3A**). In the Roary analysis, some gene clusters were associated with the newly sequenced strains (**Figure 3B**). These gene clusters were inspected to understand which peculiarity they bestow on each strain; the most interesting are highlighted in **Figure 3B**. Specifically, five gene clusters were identified. Cluster 1, which was identified in UC1266 and UC1267, is characterized by the presence of enzymes belonging to the shikimate pathway responsible for folate and aromatic amino acid (phenylalanine, tyrosine, and tryptophan) biosynthesis (Herrmann and Weaver, 1999). The production of B vitamins, such as folate (vitamin B9), by some strains of lactobacilli is considered to have a beneficial effect on the host in the case of vitamin deficiency (Wegkamp et al., 2004; Santos et al., 2008). Cluster 2, which was identified only in UC1267, is characterized by the presence of GARS, a mono-functional enzyme involved in purine biosynthesis (Kanai and Toh, 1999). Cluster 3, which was identified in UC1035, is typified by poly-gamma-glutamate (PGA) biosynthesis proteins. PGA allows bacteria to survive at high salt concentrations and may also be involved in virulence (Candela and Fouet, 2006). Cluster 4, which was also found in UC1035, is composed of phage proteins, the ArpU family of phage transcriptional regulators and holins (Wang et al., 2000). The analysis revealed the presence of a putative integrated prophage that was not identified by PHASTER. 
In the alignment obtained using Mauve software, we indeed identified a 40,500-bp strain-specific region in UC1035, absent in the reference genome (CAUH18), as shown in **Supplementary Figure S1** (region 3). Finally, Cluster 5, which was specific to UC1285, is characterized by enzymes involved in aromatic compound catabolism.
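The four pangenome classes defined above depend only on how many of the 57 strains carry a given orthologous group, so the assignment can be sketched as a simple threshold rule. The thresholds below are the ones stated in the text; the function name and the example counts are assumptions made for illustration.

```python
from collections import Counter

N_STRAINS = 57  # number of genomes in the comparison

def classify_cluster(n_strains_with_gene):
    """Assign an orthologous group to a pangenome class by strain count."""
    if not 1 <= n_strains_with_gene <= N_STRAINS:
        raise ValueError("strain count must be between 1 and 57")
    if n_strains_with_gene >= 56:
        return "Core"
    if n_strains_with_gene >= 54:
        return "Soft-core"
    if n_strains_with_gene >= 8:
        return "Shell"
    return "Cloud"

# Hypothetical strain counts for a handful of orthologous groups.
toy_counts = [57, 56, 55, 54, 53, 8, 7, 1]
print(Counter(classify_cluster(c) for c in toy_counts))
```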

### Functional Categories Related to Probiotic Capabilities

Some specific features considered crucial for a gut-adapted microorganism were investigated more deeply at the genome level in order to assess the potential probiotic capabilities of each strain. Specifically, the presence of mobile genetic elements, epithelial adherence and aggregation features, stress response mechanisms, and host adaptation-related genes were evaluated. Considering the "mobile genetic elements" and "adhesion and aggregation" categories, the L. helveticus strains were comparable in terms of gene content, with seven and eight genes on average, respectively (**Figure 4**). The newly sequenced strains exhibited a similar number of adhesion- and aggregation-encoding genes compared to the other investigated strains (**Figure 4**). In relation to the "stress response" category, all 57 genomes showed a high number of genes (ranging from 69 to 89; **Figure 4**). Of particular interest is the microbial capability to tolerate acidic pH and surfactant-like molecules, such as bile salts. Among our strains, UC1035, UC1266, and UC3147 had the highest number of genes in the stress-related category (83, 81, and 81 genes, respectively; **Figure 4**). The second most abundant category in terms of gene content (from 26 to 37) was "host adaptation" (**Figure 4**). Considering all the strains, the MTCC 5463 and ATCC 12046 genomes showed a high number of genes (37 and 35, respectively); among the newly sequenced strains, this feature characterizes UC1266 and UC1285 (33 genes each).

#### Mobile Genetic Elements

In bacteria, mobile genetic elements such as prophages, integrases, and insertion sequences (ISs) are primary contributing factors to genetic diversity and niche adaptation. Among the six strains sequenced in this study, UC1285 has the highest number of mobile genetic elements, containing nine genes (**Figure 4**). The six UC strains included genes encoding prophages and integrases (**Supplementary Data S4**), as previously identified in L. helveticus MTCC 5463 (Senan et al., 2015b). Specifically, the "prophage DNA packaging protein NU1," a "phage-associated protein," a "group II intron-encoded maturase," a "putative integrase-recombinase," and some "integrases" were identified (**Supplementary Data S4**). RAST annotation identified additional genes related to the "phages and prophages" category in all newly sequenced strains, with UC1035 and UC1266 having the highest number (14 and 11 genes, respectively). A similar profile was identified in CIRM-BIA 103, CIRM-BIA 104, CIRM-BIA 953, CNRZ32, some FAM strains (13019, 14275, 14499, 19188, 19191, 23285, 8102, and 8106), Lh12, Lh23, and MB2-1. Phage-related sequences were detected recently in other dairy isolates (i.e., FAM 8105, FAM 8627, and FAM 22155) (Schmid et al., 2018). The strains MTCC 5463, ATCC 12046, FAM 1450, and Lh23 have the highest number of genes related to this category (ranging from 13 to 16 genes, some of them in multiple copies), including ISs (**Supplementary Data S4**). The presence of mobile genetic elements, such as ISs, is associated with the genomic instability of a strain because it promotes chromosomal rearrangements, such as deletions, duplications, and inversions (Mahillon and Chandler, 1998; Callanan et al., 2008). The lack of these features in the six L. helveticus strains analyzed in this study highlights a potentially high genomic stability, which is considered relevant for quality assurance of a probiotic strain (Sybesma et al., 2013).

#### Epithelial Adherence and Aggregation Features

Good adherence capacity is generally assumed to be a desirable trait for probiotic lactobacilli, as it can increase the gut residence time, improve efficiency of pathogen exclusion, and facilitate interactions with host cells. This latter feature is relevant for the protection of epithelial cells or immune modulation (Lebeer et al., 2008). Among the different factors involved in epithelial adherence and aggregation, EPSs generally play a role in the non-specific interactions of lactobacilli with abiotic and biotic surfaces. In this regard, EPSs seem to play a more specific role in the formation of microcolonies and biofilms (Branda et al., 2005).

Genome mining confirmed the results of the experimental EPS assay, which was performed only on the newly sequenced isolates, but also revealed that six of the 51 deposited strains (ATCC 10386, CGMCC 1.1877, CIRM-BIA 101, DSM 20075, M3, LMG 22464) appear to lack the genes coding for proteins involved in EPS biosynthesis. Five of these strains are phylogenetically related according to the SNP-based phylogenetic tree (**Figure 2A**). Unlike most of the examined strains, four of the newly sequenced strains (i.e., UC1266, UC1275, UC1285, and UC3147) carry genes encoding dTDP-4-dehydrorhamnose reductase, which converts dTDP-6-deoxy-L-mannose into dTDP-4-dehydro-6-deoxy-L-mannose. This suggests that EPSs in the newly sequenced strains could be synthesized from the precursor dTDP-rhamnose or the precursors converted through the Leloir pathway (UDP-glucose, UDP-galactose) (Barreto et al., 2005). However, previous studies found that some strains of L. helveticus produce EPSs using lactose as a substrate (Robijn et al., 1995; Stingele et al., 1997; Torino et al., 2001; Li et al., 2014). Among our isolates, the ropy phenotype was detected exclusively in UC1275 (data not shown). This phenotypic characteristic could be associated with the presence of two extra copies of dTDP-4-dehydrorhamnose reductase, as well as the gene coding for dTDP-4-dehydrorhamnose 3,5-epimerase, which was also found in probiotic strain R0052. Moreover, four glycosyltransferase genes were found only in the UC1275 strain, together with the epsIM gene (**Supplementary Data S4**).

According to genome mining, mucus-binding proteins were also identified (**Figure 5** and **Supplementary Data S4**). These kinds of proteins have already been recognized for their importance in adhesion to the intestinal mucosa layer and may assist L. helveticus in binding to intestinal mucus, especially in the small intestinal tract, and in protecting epithelial cells. Together with EPSs, the production of mucus-binding proteins could indicate a putative use of this species as a probiotic, especially in the treatment of small intestinal bacterial overgrowth (SIBO), as suggested by Klopper et al. (2018). This class of proteins was found in all of the genomes of our strains, as well as in 44 out of 51 genomes under investigation, including the probiotic strains MTCC 5463, M92, and R0052 (**Supplementary Data S4**).

In the anchoring of mucus-binding proteins to the bacterial cell wall, sortases play a key role (Kleerebezem et al., 2010). Genes encoding sortase proteins were identified in all of the analyzed L. helveticus genomes, except FAM14499 (**Supplementary Data S4**).

Considering the S-layer proteins, their relevance to Lactobacillus spp. in supporting microbial persistence in the gut was assayed previously (Grosu-Tudor et al., 2016). These proteins can also interact with the cellular receptor dendritic cell-specific intercellular adhesion molecule-3-grabbing nonintegrin (DC-SIGN; CD209) (Prado Acosta et al., 2016), preventing infection by pathogenic bacteria through a process of competitive exclusion (Zhang et al., 2017). Interestingly, genomic analyses revealed a higher number of S-layer genes (n = 10) in the UC1285 isolate, compared to the other L. helveticus strains, which contained six genes on average (**Supplementary Data S4**).

#### Stress Response Mechanisms

Bile salt tolerance was phenotypically evaluated in the six newly sequenced strains. The selected L. helveticus strains exhibited similar bile salt tolerance patterns (**Figure 6**). The strains under investigation were tolerant to all of the tested compounds (i.e., were able to grow at 0.2% of each bile salt), except for GDCA, which inhibited all of the strains at the minimum concentration (0.1%). The effectiveness of GDCA was confirmed in L. acidophilus NCFM, which was unable to grow at a concentration higher than 0.05% (McAuliffe et al., 2005). L. helveticus UC1267, UC1285, and DSM 20075 demonstrated the highest tolerance, with a survival rate higher than 50% in the presence of TCA at 1% concentration and after exposure to TDCA. Statistical analyses confirmed a significant difference in the bile salt tolerance exhibited by these strains (P < 0.05) compared to the others.

The BSH activity in cell extracts of the six L. helveticus strains is reported in **Figure 7**. DSM 20075 and UC1267 had the highest bile salt deconjugation activity and UC1035 had the lowest. A significant difference (P < 0.05) was found in the BSH activity of UC1267 and DSM 20075 compared to the other strains. However, in the six strains, the level of BSH was lower than the values reported in the literature for other Lactobacillus species (Liong and Shah, 2005) and Bifidobacterium longum (Tanaka et al., 2000). These findings suggest weak BSH activity and contrast with the results reported by Jiang et al. (2010). In the latter study, BSH activity associated with deconjugation of GDC in L. helveticus Lh1 was not detected. Nevertheless, our results are in accordance with Tanaka et al. (1999), who reported a low incidence of BSH activity in typical dairy bacteria species, such as L. helveticus and Lactobacillus delbrueckii. This outcome is in contrast with isolates from mammalian intestines, which are all BSH-active strains (e.g., L. acidophilus, Lactobacillus gasseri, and Lactobacillus johnsonii). As the standard annotation approach was not effective in identifying the presence of BSH genes and the closely related PVA genes, a dedicated analysis was performed. A hidden Markov model-based procedure (O'Flaherty et al., 2018) targeting the identification of the BSH and PVA gene repertoire was applied. BSH was present in 31 strains (E-value < E-99), including all of the newly sequenced strains and DSM 20075, which had high bile salt deconjugation activity. Moreover, the BSH proteins detected in our strains were identical to DSM 20075 in terms of both length and amino acid residues. In contrast, PVA was identified in all strains only with an E-value higher than E-99, with the most significant results obtained for 24 strains (including the newly sequenced strains). Considering other features related to bile salt tolerance, genome annotations revealed the presence of a cyclopropane fatty acid synthase-encoding gene in all strains. This protein has been related to the bacterial ability to counter the high bile acid content typical of the gut environment (Grogan and Cronan, 1997). Finally, transporters of the MFS were previously found to be overexpressed in response to bile exposure in L. acidophilus (Pfeiler et al., 2007). Genes encoding this defense mechanism against inhibitory compounds were also found in the newly isolated L. helveticus strains (**Supplementary Data S4**).
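The E-value cutoff used above to separate confident BSH hits from weaker PVA hits can be illustrated with a minimal sketch. It covers only the downstream filtering step, not the HMM procedure of O'Flaherty et al. (2018); the function name, the strain labels, and the E-values are hypothetical and invented for illustration, and only the threshold (E-value < 1E-99) is taken from the text.

```python
from typing import Dict, List, Tuple

# Threshold taken from the text: hits with E-value < 1e-99 are treated as confident.
EVALUE_CUTOFF = 1e-99


def split_hits_by_evalue(
    hits: List[Tuple[str, float]], cutoff: float = EVALUE_CUTOFF
) -> Dict[str, List[str]]:
    """Partition (strain, best E-value) pairs into confident and weak hits."""
    result: Dict[str, List[str]] = {"confident": [], "weak": []}
    for strain, evalue in hits:
        key = "confident" if evalue < cutoff else "weak"
        result[key].append(strain)
    return result


# Hypothetical strains and E-values (assumptions, not data from the study).
example_hits = [("strain_A", 1e-150), ("strain_B", 3e-120), ("strain_C", 2e-40)]
classified = split_hits_by_evalue(example_hits)
print(classified)
```

In this toy input, the first two strains would be counted as carrying the gene, whereas the third would be reported only as a weak match.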

A deep evaluation at the genomic level highlighted features involved in the adaptation to stress induced by gut transit. All of the analyzed strains possess genes responsible for acid tolerance, the heat and cold shock response, and oxidative and general stress (**Supplementary Data S4**). Proteins involved in establishing proton motive force, such as multisubunit F<sub>0</sub>F<sub>1</sub> ATPase, Na<sup>+</sup>/H<sup>+</sup> antiporters, and H<sup>+</sup>/K<sup>+</sup>-exchanging ATPases, are present in all 57 strains and may be putatively involved in pH homeostasis (Senan et al., 2015b). This is expected because some of these proteins play a key role in the basal functioning of the cellular machinery. The phenotypic assay performed to test low pH tolerance (i.e., pH 3) revealed the survival of all the considered strains up to 3 h, except for UC1266 and UC1285 (**Supplementary Figure S2**). In contrast, genome mining indicated that other genes have a more scattered distribution within the L. helveticus species. Among the histidine kinase signal transduction systems, a specific gene (WP\_003633555.1) was found to be common in all six newly sequenced strains, as well as another 19 dairy isolates. This HAMP domain-containing histidine kinase can be a useful molecular marker for identification of L. helveticus strains associated with dairy ecological niches.

#### Host Adaptation

Antagonism and cooperation for space and resources both contribute to relationships between lactobacilli and other gut microorganisms. For instance, the production of proteins belonging to the cell wall hydrolase/autolysin class can often be related to the control mechanisms of microbial populations sharing the same ecological niche (Salazar and Asenjo, 2007). As suggested previously (Slattery et al., 2010), the presence of maltose-degrading enzymes, along with multiple copies of glucosidase genes, can be considered a putative indicator of gut-adapted microorganisms. Therefore, it is of great interest to highlight the presence of multiple copies (seven to nine copies) of the 6-phospho-β-glucosidase gene in all our strains (**Supplementary Data S4**), with UC1035 including an extra oligo-1,6-glucosidase gene. These genes were also found in the human-origin probiotic strains D75 and D76, isolated from the gut of a healthy Russian child (Roshchina et al., 2018), and MTCC 5463, isolated from a vaginal swab of a healthy Indian female (Senan et al., 2015a). Considering maltose, a gene coding for maltose O-acetyltransferase was found in all of our isolates except for UC1266 (**Supplementary Data S4**). However, among the newly sequenced strains, only UC1035 encodes a complete maltose transporter (MalEFG), which is also present in the probiotic strain R0052 and in the proposed probiotic strains M92 and MTCC 5463 (**Supplementary Data S4**).

#### Antibiotic Resistance and Defense Mechanisms

The results of the antibiotic resistance assessment and EFSA breakpoints are presented in **Table 1**. All strains were highly sensitive to most of the tested antibiotics, demonstrating that they can be used as safe starter strains. The absence of ARG within the chromosomal sequences of our strains and the lack of plasmids were further confirmed by RGI analysis. Thus, the tested strains can be considered safe in terms of antibiotic resistance transferability.

Regarding the CRISPR-Cas system, RAST annotation identified CRISPR-associated genes (Cas) in four of the newly sequenced strains, with UC1035 having five genes and UC1275, UC1285, and UC3147 having two genes. The highest number of genes in this category was exhibited by the H9 strain (seven genes), whereas ATCC 10386, CGMCC 1.1877, CIRM-BIA 101, DSM 20075, LH99, and M3 all contained six genes. Additionally, CRISPRCasFinder identified CRISPRs in all newly sequenced strains, with UC1266 showing the highest number (three). Strains UC1275, UC1285, and UC3147 presented two CRISPRs, whereas UC1035 and UC1267 had only one repeat (**Supplementary Data S3**). Among the other 51 strains analyzed, the highest CRISPR content was exhibited by D75 and D76 (four sequences), followed by H9 and KLDS1.8701 (three sequences); all these strains have proven probiotic capabilities (Chen et al., 2015; Li et al., 2017; Roshchina et al., 2018).

The presence of bacteriocin-producing genes was also evaluated, because it is an important probiotic trait for bacterial competition in a complex microbial environment, such as the human gut. Bacteriocins may directly inhibit the invasion of competing strains or pathogens, or modulate the composition of the microbiota and influence the host immune system by enhancing human health (Dobson et al., 2012). The genomic analysis highlighted the presence of genes encoding bacteriocins. Specifically, enterolysin A was found in four of our strains (UC1035, UC1266, UC1285, and UC3147), as was helveticin (UC1266, UC1267, UC1275, and UC3147; **Supplementary Data S4**), both belonging to class 3 bacteriocins, which are high-molecular-weight and heat-labile antimicrobial proteins (Alvarez-Sieiro et al., 2016). Additional bacteriocin genes were found in our strains, with UC1266, UC1267, UC3147, and UC1285 having the highest content (four and three genes, respectively; **Supplementary Data S2**, **S4**). The scattered distribution of bacteriocin-encoding genes in the examined strains was also confirmed by the association of these genes with strain-specific regions, as shown by the pangenome analysis. Regarding the bacterial secretion system, bacteriocin ABC transporters were found only in UC1266 (two genes) and UC1285 (three genes) (**Supplementary Data S2**). Nevertheless, some of the genes encoding components of the Sec-dependent transporter system were found in all of the strains except for SecG, which was absent in UC1275 (**Supplementary Data S2**). The Sec-dependent transporter system has also been found to be involved in bacteriocin secretion (Herranz and Driessen, 2005).

#### CONCLUSION

This study describes a genome-centric strategy to select strains of L. helveticus for probiotic purposes. To address this intent, a large-scale genomic analysis was performed by considering all of the publicly available genomic sequences, along with six newly sequenced strains isolated from natural whey cultures. The genome-based investigation of probiotic features was also supported by specific phenotypic assays.

Pangenome analysis revealed gene clusters specifically present in the new isolates, including enzymes responsible for folate biosynthesis in UC1266 and UC1267. The correlation between BSH activity and bile salt resistance was confirmed regarding TCA and TDCA. Indeed, the strains with the highest enzyme activity (DSM 20075 and UC1267) were also the most resistant to these bile salts. However, considering GCA and GDCA bile salts, this correlation fails, indicating the involvement of other mechanisms of resistance to bile salts in addition to BSH activity. This hypothesis is also supported by the encoded BSH proteins in the new isolates with low bile salt tolerance. Indeed, their BSH proteins are identical in both length and amino acid residues to those of the tolerant strains DSM 20075 and UC1267. Concerning the safety profile of the new strains, the lack of antibiotic resistance, both phenotypically and genotypically, was assessed. Two strains, UC3147 and UC1285, were characterized by the highest number of bacteriocin genes among the new isolates. Considering EPS production, the ropy phenotype was detected exclusively in UC1275, probably related to the dTDP-rhamnose reductive pathway. Finally, the presence of maltose-degrading enzymes and multiple copies of the 6-phospho-β-glucosidase gene in our natural whey culture isolates indicates the capability to metabolize sugars other than lactose.


Considering both the phenotypic and genotypic properties of the investigated L. helveticus strains, more pronounced adaptability to the gut environment was shown in the newly sequenced strain UC1267 and in DSM 20075. Specifically, the highest BSH activity and bile salt tolerance, the presence of maltose-degrading enzymes, and multiple copies of glucosidase genes highlighted their potential to survive in the GIT. Moreover, the presence of multiple genes encoding bacteriocins and a complete pathway for folate production could also be involved in the health-promoting effects on the host. Further studies will be necessary to test their probiotic efficacy in vivo.

### AUTHOR CONTRIBUTIONS

All authors performed the analysis, prepared the manuscript, and contributed to editing and critical reviewing.

#### FUNDING

This work was supported by Fondazione Cariplo and Regione Lombardia under the project "Cremona Food-LAB."

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.01380/full#supplementary-material


FIGURE S1 | Genomic alignments. The genomes of the six L. helveticus strains sequenced in the present study were aligned considering CAUH18 as a reference. Some relevant genomic regions identified in the six strains and absent in CAUH18 are highlighted and a summary of the genes included is reported in the right part of the figure. Local co-linear blocks (syntenic regions) identified are represented by different colors.

FIGURE S2 | Low pH tolerance (pH 3) of the six newly sequenced L. helveticus strains and DSM 20075.

DATA S1 | Dataset A. "Newly sequenced strains" worksheet. Genome properties and details regarding the six newly sequenced L. helveticus strains. Dataset B. "NCBI strains" worksheet. Genome properties and details regarding the 51 L. helveticus strains publicly available.

DATA S2 | Dataset A. "Genes distribution" worksheet. Gene "presence/absence" in all 57 L. helveticus strains determined using Roary and correspondent coverage calculation based on raw sequences alignment. Dataset B. "SEED annotation" worksheet. Number of genes identified for each SEED category in all L. helveticus strains.

DATA S3 | Dataset A. "BSH hmm results" worksheet. Results of the HMM analysis performed to identify genes encoding bile salt hydrolase (BSH). Dataset B. "PVA hmm results" worksheet. Results of the HMM analysis performed to identify genes encoding Penicillin-V Acylase (PVA). Dataset C. Results of the analysis performed using CRISPRCasFinder to identify CRISPR systems.

DATA S4 | Dataset A. "Adhesion and aggregation genes" worksheet. Results from genome mining as presence/absence of genes known to be related to adhesion and aggregation features in all L. helveticus strains. Dataset B. "Host adaptation" worksheet. Results from genome mining as presence/absence of genes known to be related to host adaptation features in all L. helveticus strains. Dataset C. "Mobile genetic elements" worksheet. Results from genome mining as presence/absence of genes known to be related to mobile genetic elements features in all L. helveticus strains. Dataset D. "Stress response" worksheet. Results from genome mining as presence/absence of genes known to be related to stress response features in all L. helveticus strains. Dataset E. "Highlight stress response" worksheet. Genes responsible for stress response are grouped in general categories. Dataset F. Results of Bagel4 for the UC strains.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Fontana, Falasconi, Molinari, Treu, Basile, Vezzi, Campanaro and Morelli. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genomic Insights Into the Distribution and Evolution of Group B Streptococcus

*Swaine L. Chen1,2 \**

*1 Division of Infectious Diseases, Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore, 2 Infectious Diseases Group, Genome Institute of Singapore, Singapore, Singapore*

#### *Edited by:*

*Pierre Renault, Institut National de la Recherche Agronomique (INRA), France*

#### *Reviewed by:*

*Bernard Beall, Centers for Disease Control and Prevention (CDC), United States Ran Wang, Beijing Academy of Agricultural and Forestry Sciences, China Pattanapon Kayansamruaj, Kasetsart University, Thailand*

> *\*Correspondence: Swaine L. Chen slchen@gis.a-star.edu.sg*

#### *Specialty section:*

*This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology*

*Received: 06 April 2019 Accepted: 11 June 2019 Published: 28 June 2019*

#### *Citation:*

*Chen SL (2019) Genomic Insights Into the Distribution and Evolution of Group B Streptococcus. Front. Microbiol. 10:1447. doi: 10.3389/fmicb.2019.01447*

*Streptococcus agalactiae*, also known as Group B Streptococcus (GBS), is a bacterium with truly protean biology. It infects a variety of hosts, among which the most commonly studied are humans, cattle, and fish. GBS holds a singular position in the history of bacterial genomics, as it was the substrate used to describe one of the first major conceptual advances of comparative genomics, the idea of the pan-genome. In this review, I describe a brief history of GBS and the major contributions of genomics to understanding its genome plasticity and evolution as well as its molecular epidemiology, focusing on the three hosts mentioned above. I also discuss one of the major recent paradigm shifts in our understanding of GBS evolution and disease burden: foodborne GBS can cause invasive infections in humans.

Keywords: group B Streptococcus, *Streptococcus agalactiae*, foodborne disease, review, genomics, molecular epidemiology, whole genome sequencing

### INTRODUCTION

*Streptococcus agalactiae* was first described in 1887 as a common bacterium infecting the udders of cattle, causing a disease termed mastitis (Nocard and Mollereau, 1887). This led to significantly lower milk production (Keefe, 1997; Ruegg, 2017), thus the species name *agalactiae* (from the Greek: a-, no; galactos, milk). Milk production can be reduced by over 20% in infected cows, and infection, prior to the institution of active control measures, could conservatively affect 15–40% of all cows (Shaw and Beam, 1935). It came to be known by its other common name, Group B Streptococcus, with the Lancefield classification in 1933 (Lancefield, 1933), where the "Group B" references the species-specific carbohydrate "substance C" from Streptococci (Lancefield, 1928) recognized by rabbit immune sera (this substance C is associated with the cell wall and distinct from the polysaccharide capsule). Today, GBS is well known as the most common cause of neonatal meningitis, which is further classified into early onset (<7 days after birth) and late onset (7 days to 3 months after birth) (Berardi et al., 2007; Edwards et al., 2011; Nanduri et al., 2019). Transmission to the newborn can be vertical, through contact with mucous membranes, or through ingestion of infected amniotic fluid (Verani et al., 2010). Consistent with this, it is a frequent (20–40% of individuals) colonizer of the human gastrointestinal tract and the reproductive tract of women, based on two prevalence studies in a North American university (Bliss et al., 2002; Manning et al., 2004). However, GBS is also an increasingly common cause of severe invasive disease, typically in immunocompromised individuals and the elderly, since the 1960s (Farley et al., 1993; Schuchat, 1998; Farley, 2001; Francois Watkins et al., 2019).

These well-known facts about the history and medical importance of GBS parallel several deeper themes of GBS biology. The dual names by which we refer to this bacterium echo a split in how we have studied its biology in humans, cattle, and other animals. The rise of human infections, first in neonates and more recently in adults, matches a theme of ongoing evolution and niche expansion for the species. Finally, the shift in associated species from cattle to humans foreshadows additional potential species jumps that are apparently continuing to this day. GBS is now known to be widely distributed among diverse species of mammals, reptiles, amphibians, and fish (such as dogs, cats, goats, elephants, frogs, crocodiles, dolphins, seals, llamas, and camels) (Edelstein and Pegram, 1974; Bishop et al., 2007; Delannoy et al., 2013; Tavella et al., 2018), not only colonizing but in many cases also capable of causing severe invasive disease. Of particular note, besides its importance in human and bovine medicine, GBS is a significant pathogen for aquatic species, including those of importance for food production (Amal et al., 2011). Streptococcal infections were responsible for an estimated US\$150 million in global losses in farmed tilapia in 2000 (close to 10% of the total value) (Amal et al., 2011). As with human and bovine infections, GBS infection in fish was first described relatively recently, with two seminal reports in 1958 and 1966 (Hoshina et al., 1958; Robinson and Meyer, 1966). Even from these first reports, GBS was noted to be a particular danger to farmed fish, being highly contagious and usually fatal (Hoshina et al., 1958; Robinson and Meyer, 1966).

GBS has therefore been studied in multiple contexts: human health, veterinary medicine, and agriculture. Research has thus been motivated by both health and economic goals, which naturally vary in importance across these different disciplines. Beyond the proximal questions of how GBS colonizes and causes specific diseases in specific hosts, GBS is an intriguing case study for the larger questions of how broad host range at the species level is maintained despite evidence of variation at the subspecies level. These larger evolutionary questions are again made more urgent by convincing evidence of ongoing adaptation and emergence of pathogenicity and resistance in GBS.

To tackle these specific questions about disease mechanisms as well as broader evolutionary questions, genomics is a natural fit. Recent years have seen an explosion of genomic data available for GBS, as is the case for all other bacteria. The transition to a post-genomic era for GBS holds an additional promise for a unified understanding across the medical and veterinary fields, which may lead to a fuller appreciation of the importance and impact of this versatile bacterium. In this review, I will focus on two primary topics: (1) GBS genome plasticity and evolution and (2) GBS molecular epidemiology related to geography and host range (focused on humans, cattle, and fish). I have included some contextual information drawn from non-genomic papers using other typing systems, but this review does not aspire to be complete in regard to the entire corpus of GBS studies.

### PRE-GENOMIC CLASSIFICATION SYSTEMS

GBS, like many other bacteria of medical importance, has long been recognized to have intraspecies variation that can be tracked with a variety of molecular methods. Early studies utilized immunological reactions, resulting in a (still commonly used) serotyping scheme consisting of 10 major serotypes (Ia, Ib, II-IX) (Edwards et al., 2011). Numerous other systems have been applied to GBS, including ribotyping (Huet et al., 1993), RAPD (Random Amplification of Polymorphic DNA) (Limansky et al., 1998), RFLP (Restriction Fragment Length Polymorphism) (Hauge et al., 1996), PFGE (Pulsed-Field Gel Electrophoresis) (Rolland et al., 1999), and MLEE (Multilocus Enzyme Electrophoresis) (Musser et al., 1989). The other major pre-genomic classification system (although, ironically, published after the first genome sequences became available) that still remains in common use today is MLST (Multilocus Sequence Typing) (Jones et al., 2003), due to its balance between ease of typing, portability between labs, and reasonable resolution (Maiden, 2006). From these initial pre-genomic studies, the general outlines of the population structure of GBS were inferred. The different major hosts (humans, cattle, and fish) have largely distinct populations of strains, with some notable exceptions that may be indicative of cross-species jumps. Furthermore, changes in the epidemiology of disease and responsible serotypes or MLST types have been noted (see Section "The Pan-Genome"). However, overall, prior to the genomic era, GBS could generally (though not exclusively) be classified based on its host species (Finch and Martin, 1984; Bohnsack et al., 2004; Sukhnanand et al., 2005; Evans et al., 2008; Pereira et al., 2010), which then could be subdivided into several (3–5) large groups of closely related strains (termed clonal complexes in the MLST nomenclature) that accounted for most diseases (**Table 1**).

TABLE 1 | Major MLST Clonal Complexes of GBS found in humans, cattle, and fish.


### GROUP B STREPTOCOCCUS GENOME PLASTICITY AND EVOLUTION

#### Initial Genome Sequences

The first full genome sequences of GBS were of the human isolates NEM316 (ST23, Serotype III) (Glaser et al., 2002) and 2603V/R (ST110, a single-locus variant of ST19; Serotype V) (Tettelin et al., 2002), both published in late 2002. These two genomes provided an unprecedented view of the organization and potential evolution of the GBS genome, which was initially gleaned from comparisons with genomes of *S. pyogenes* and *S. pneumoniae* strains. Generally, the GBS genome consists of a conserved "backbone" that is punctuated by 14–15 genomic islands of variable gene content (and many smaller islets). GBS was remarkable for its high number of tRNAs (80), ABC transporters (62), and signal transduction systems (17–20 two-component systems). In addition, there are multiple classes of mobile DNA elements that presumably contributed to disrupting, duplicating, and transferring genes. In particular, insertion sequences, prophages, and a triplicated integrated plasmid in NEM316 (denoted pNEM316–1) (Glaser et al., 2002) were major contributors to variation in gene content that was specific to GBS and that also varies among different GBS strains. With only a single full genome each to analyze, neither of these initial genome papers was able to directly identify large-scale chromosomal recombination occurring in GBS, though it was clear from comparative genome hybridization experiments that genes within the species-specific GBS islands were also more likely to vary among GBS strains (Tettelin et al., 2002).

#### The Pan-Genome

The structure of the GBS genomes, with a conserved backbone punctuated with islands of variable gene content, led directly to a simple hypothesis as to why different GBS strains may preferentially colonize or infect different hosts. GBS as a species might be defined by the conserved regions, while the variable islands, which often possessed features of genes important for virulence (Glaser et al., 2002; Tettelin et al., 2002), could carry genes that provided specific phenotypes important in different hosts or different disease settings. This structural stratification of gene conservation became more clear as more genomes were sequenced, not just for GBS but for other bacteria as well, most notably *E. coli* (Welch et al., 2002).

This represented, in a sense, a specific genomic extension to genetics. Clearly, different phenotypes for host specificity and disease would be traceable to genetic mechanisms; and now, perhaps, there was a structural genomic framework which would organize the adaptation and evolution of these traits. The first formalization of this idea was the pan-genome concept (though the potential *structural* aspect of genome organization was not noted). The seminal paper describing the concept of a pan-genome (the complete set of genes found across all individuals of a given species, i.e., the union of their gene content) was based on an analysis of eight GBS genome sequences (Tettelin et al., 2005). GBS therefore holds a special place in the early transition to the post-genomic era for bacteria. It was also the first organism described to have an "open" pan-genome; rarefaction analysis predicted that, even with an arbitrarily large number of genome sequences, every new genome sequence would contribute an extra 33 genes that had not previously been seen in any other GBS. The pan-genome concept was one of the first truly new results to arise from comparative genomics; it was a systematic rationalization of differing gene content that necessarily required the existence of multiple genome sequences, and it further gave rise to the companion concept of a core genome that consists of genes that are conserved across all individuals of a given species (Medini et al., 2005; Tettelin et al., 2005).

The initial ideas about core and pan-genomes were a remarkably useful organizational framework for thinking about a variety of issues relevant to bacteria and genomics, leading to their immense popularity. The pan-genome concept was closely related to ideas about genome plasticity, horizontal gene transfer, the concept of species for bacteria, and genome evolution (Medini et al., 2005). The conceptual simplicity of core (conserved) versus accessory (variable) genes in an organism was a natural fit for rationalizing differences in pathogenic potential, host range, and other variable phenotypes. Put simply, a core genome in some sense defined a species by providing conserved phenotypes and responses; of prime practical interest were those that medical microbiology leveraged to perform species identification in the lab. For an individual strain, the accessory genome (in other words, the subset of the pan-genome found in that individual above and beyond the core genome) could vary from other individuals, and would explain differences in observed phenotypes such as virulence; alternatively, as more genomes became sequenced, it became clear that sometimes gene set differences could also be related to different ecological niches, such as different geographical locations or host organisms. It was also further shown that large (up to 334 kb) chromosomal segments, including these islands, could be transferred horizontally between strains (Brochet et al., 2008).

The initial promise of the pan-genome concept for providing a holistic organizational framework for individual bacterial species, however, was difficult to fulfill. The initial eight GBS strains that were sequenced and analyzed were chosen for their representation of different serotypes and host organisms, both proxies for sampling the diversity of the species and its disease-relevant phenotypes (Tettelin et al., 2005). One implicit assumption embedded in this initial analysis was that new GBS genomes would sample similarly new subsets of the species diversity; returning to serotype or host organisms as proxies, larger data sets would inevitably begin to sample (at least some) very similar strains. In other words, the initial analysis looked at eight strains that were first chosen for sequencing because they were of different serotypes and MLSTs; they were chosen to represent the diversity of the species. With hundreds to thousands of genomes, however, an additional strain is unlikely to represent a divergent, previously unsampled clade or subclone. Thus the diversity modeled from eight relatively diverse strains may not be accurate when extrapolated to an arbitrary number of strains. Indeed, the high relatedness of many GBS isolates has become more clear in the observation of frequent clonal expansion of virulent or hypervirulent clones, especially among those causing disease (see examples below in the "Molecular Epidemiology" sections).

In addition, the original pan-genome paper inspired numerous similar analyses on other sets of genome sequences, and not all were limited to gene sequences (Lefébure and Stanhope, 2007; Liu et al., 2013; Kayansamruaj et al., 2015; Puymège et al., 2015; He et al., 2017; Wolf et al., 2018; Wang et al., 2018a); the result of an open pan-genome was consistently found. Beyond the use of genomics to calculate the sizes and "openness" of the core and pan-genomes, many studies performed additional analyses that provided several clear and general insights into overall GBS genome plasticity and evolution. GBS genomes have obviously evolved by recombination, likely driven by large-scale DNA transfers mediated by mobile elements (Bröker and Spellerberg, 2004; Brochet et al., 2008; Sørensen et al., 2010; Da Cunha et al., 2014). There is an interesting dichotomy between very clear evidence of large-scale recombination between different lineages of GBS (Springman et al., 2009; Da Cunha et al., 2014; Teatero et al., 2016; Campisi et al., 2016a) with very little recombination within expanding clones, which instead evolve largely by accumulation of mutations (Brochet et al., 2006; Flores et al., 2015; Almeida et al., 2017; Kalimuddin et al., 2017). Interestingly, there have been several noted instances of serotype switching, likely through recombination and after emergence of a successful lineage, which may confound earlier typing studies (Luan et al., 2005; Brochet et al., 2008; Martins et al., 2010; Bellais et al., 2012; Teatero et al., 2014; Neemuchwala et al., 2016; Wang et al., 2018a). These can occur through large-scale recombinations (most clearly seen in originally Serotype V ST1 strains that have converted to Serotype Ib, II, and IV through apparently single recombinations, ranging from 79 to 200Kbp, encompassing the capsule determining *cps* locus) (Neemuchwala et al., 2016). 
Notably, genomics provides the most clear view of this phenomenon, as previous studies using PFGE, MLST, lab-based serotyping, and sequencing of the cps locus estimated potential serotype switching events from 2 to 16% (Luan et al., 2005; Martins et al., 2010).

Therefore, the overall picture of GBS evolution is consistent with the hypothesis, most clearly articulated by the lab of Philippe Glaser, of a continuous generation of new lineages, through any mutational mechanism including mobile element activity, reductive evolution, or large-scale recombination, coupled with nearly clonal evolution of successful lineages, characterized by very little recombination and possibly by reductive evolution (Brochet et al., 2006, 2008; Lefébure and Stanhope, 2007; Sørensen et al., 2010; Rosinski-Chupin et al., 2013; Almeida et al., 2016). There are several additional strong results that provide insights into the mechanisms of subspecies adaptation. Notable examples are the consistent genome reduction in fish-adapted isolates (see Section "Group B Streptococcus Molecular Epidemiology in Fish") (Liu et al., 2013; Rosinski-Chupin et al., 2013; Delannoy et al., 2016); the consistent presence of the *scpA* and *lmb* virulence factors in human isolates (though they are still found, at lower frequency, in animal isolates) (Franken et al., 2001; Sørensen et al., 2010); and the development of metabolic modifications matched to expected nutrient sources, exemplified by the acquisition of the Lac.2 operon enabling lactose fermentation in cow-associated strains (Richards et al., 2013).

#### Group B Streptococcus Genome Sequencing Today

As of this writing (March, 2019), there are over 7,000 GBS strains for which genome sequencing data are publicly available in the GenBank Sequence Read Archive (SRA). As with other microorganisms, the number of data sets has grown exponentially over the past few years, and the literature contains reports of several notable survey studies that together have contributed several thousand data sets (Metcalf et al., 2017), though several of these appear not to have been published in manuscripts yet (**Table 2**). Furthermore, the advent of journals like Genome Announcements, which publishes only genome sequences without analysis, has led to a large growth in manuscripts describing single or multiple genome sequences, and genome sequencing is being more routinely used as a tool (as opposed to the main endpoint of a study) (**Table 3**).

#### Group B Streptococcus Molecular Epidemiology of Human Isolates

One of the major benefits that increases in genome sequencing throughput have delivered is definitive molecular epidemiology (Klemm and Dougan, 2016). The bacterial genomics field has variously described this also as phylogeography, phylogenomics, or global population structure studies; but the underlying concept of correlating strain relatedness with some other variable, such as isolation location, remains the same. The first major study to present enough sequences that were reasonably thought to represent nearly the full species diversity was published in 2014 (Da Cunha et al., 2014). This manuscript analyzed 229 strains, mostly (94%) isolated from humans, but also including 13 isolates from four other animals, together spanning five continents. This for the first time provided a global view of the species that could integrate the preceding MLST scheme. One prominent result from this analysis was the definitive conclusion that human GBS disease isolates arise from a limited number of clones. The clonal evolution and spread of individual clones had been strongly suggested by previous MLST studies, which had already identified clonal complexes that were variously associated with human and animal disease (**Table 1**). Interestingly, unlike several other bacterial pathogens, the distribution of GBS clones was not generally correlated with geography. There were some known exceptions [like the prevalence of CC26 in Africa (Brochet et al., 2009)], and the sampling of South America, Africa, and Asia was extremely low, both in this study and generally in the GBS literature (Dagnew et al., 2012; Johri et al., 2013; Kwatra et al., 2016). The overall conclusion from this first look, however, was that most of the major GBS CCs causing human disease had relatively low geographical stratification when compared with other surveyed bacteria.



A common theme for GBS, however, is that the general rules for the overall population are punctuated with notable exceptions. There has not been a similarly large-scale (i.e., species-wide) phylogeographic study of GBS since the Da Cunha 2014 publication. Instead, deeper analyses of individual clones, often with strong geographic biases, have added new perspectives to the distribution (and possibly evolution) of GBS in humans, and there has been encouraging recent progress in previously undersampled areas (not all using genomics) (Johri et al., 2013; Crespo-Ortiz et al., 2014; Dutra et al., 2014; Louthrenoo et al., 2014; Mitima et al., 2014; Belard et al., 2015; Bergal et al., 2015; Dangor et al., 2015; Eskandarian et al., 2015; Rivera et al., 2015; Villanueva-Uy et al., 2015; Wang et al., 2015, 2016; Le Doare et al., 2016; Seale et al., 2016, 2017; Sinha et al., 2016; Campisi et al., 2016b; Medugu et al., 2017; Slotved et al., 2017; Suhaimi et al., 2017; Veeraraghavan et al., 2017; Botelho et al., 2018; Emaneini et al., 2018; Guo et al., 2018; Li et al., 2018; Melo et al., 2018; Nkembe et al., 2018; Sigaúque et al., 2018; A'Hearn-Thomas et al., 2019; Lee et al., 2019; Mukesi et al., 2019; Nagano et al., 2019). However, as an example of the progress still needed, I could find only one study of human GBS each from Indonesia and the Philippines, and none from Vietnam, in PubMed (Wibawan et al., 1992; Villanueva-Uy et al., 2015); Indonesia is the fourth most populous country in the world, and all three of these countries have populations that are larger than any individual European country.

There are several remarkable examples of what appears to be a single dominant clone of GBS causing the majority of disease in specific locations. From 1992 to 2013, more than 90% (210/229) of the invasive serotype V isolates were closely related isolates of an ST1 clone (however, this study did not examine other serotypes) (Flores et al., 2015). A similar serotype-restricted study found a rising incidence of serotype IV isolates in Minnesota from 2004 to 2008 [8.4% of 1,160 patients (Diedrick et al., 2010) compared to <1% of nearly 3,000 isolates from 1993 to 2002 from four cities including Minneapolis-St. Paul (Ferrieri et al., 2004)]. Subsequent genomic studies, again encompassing strains from Minnesota as well as Manitoba and Saskatchewan, Canada, determined that 89% of the serotype IV isolates from 2010 to 2014 were from the same clone of ST459 (Teatero et al., 2015a). Interestingly, in the geographically distant Toronto, where serotype V isolates were dominated by ST1, 81% of serotype IV isolates collected from 2009 to 2012 belonged to just two STs, ST452 (CC23) and ST459 (CC1) (Teatero et al., 2015b), the same major STs previously found in Minneapolis (Diedrick et al., 2010). Additionally, there has recently been another dramatic reported expansion of a single clone causing human disease: ST283 (serotype III) in Southeast Asia (Ip et al., 2006; Kalimuddin et al., 2017) (see section "Emerging Group B Streptococcus Disease").

#### TABLE 3 | GBS genome sequencing reports.


#### Antibiotic Resistance

The headline result from the Da Cunha et al. (2014) analysis was that resistance to tetracycline in GBS drove its increased importance for human disease (Da Cunha et al., 2014). All of the major clonal complexes infecting humans had a high (>90%) rate of tetracycline resistance, mostly mediated by the *tetM* gene. The high rate of tetracycline resistance was thought to be due to initial acquisition through a mobile genetic element (*Tn916* or *Tn5801* in all but one strain) then expansion of a subsequent clone. Importantly, the insertion position of the transposon was identical within all strains of the expanded clones causing human disease. A Bayesian analysis predicted that the divergence date of the expanded tetracycline-resistant clones corresponded well with the introduction of tetracycline for clinical use in 1948 (Da Cunha et al., 2014). This raised the possibility that the increasing virulence of GBS, or at least the rise in GBS cases, was caused by the simultaneous selection for more virulent and more tetracycline-resistant strains (Da Cunha et al., 2014).

Rising resistance is a nearly universal feature of medically important bacteria. For GBS, this has been reported not only for tetracycline, but also for fluoroquinolones and aminoglycosides (Hays et al., 2016). Fortunately, beta-lactam antibiotics, particularly penicillins, which are first-line therapy for GBS, have remained highly effective, with large surveys documenting less than 1% of strains as resistant (Hays et al., 2016; Metcalf et al., 2017). Interestingly, vancomycin resistance has recently been reported for the first time in two GBS strains, through introduction of two different *vanG* elements found integrated at the same chromosomal locus in both strains (Srinivasan et al., 2014). Finally, macrolide resistance, most commonly measured for erythromycin, has also been rising, with rates measured since 2010 in the range of 14–59% (Lamagni et al., 2013; Da Cunha et al., 2014; Hays et al., 2016; Metcalf et al., 2017). Of great interest, a large survey in France found an exception to this trend, with rates of macrolide resistance falling from 47 to 30% between 2007 and 2014 (Hays et al., 2016). Many of these surveys leverage strong antibiotic susceptibility testing infrastructure in first-world countries; however, this is beginning to give way to genomic predictions that may eventually provide solutions for low resource settings. With regard to this, GBS-specific analyses for the prediction of antibiotic resistance and serotypes from genomic data were shown to provide high accuracy (Metcalf et al., 2017). Genomics currently cannot fully replace traditional antibiotic testing, as previously unknown (or rare) resistance mechanisms cannot be predicted from sequence data alone. However, genomics has the additional advantage of providing greater insight into the dynamics driving spread of resistance and changes in resistance rates.
For example, antibiotic resistance in GBS is largely mediated by resistance gene acquisition for all of the major antibiotic classes except fluoroquinolones and penicillins; fluoroquinolone resistance instead arises mostly by mutations in the *gyrA* and *parC* genes, and penicillin resistance likewise arises by mutation (Metcalf et al., 2017). In addition, there are multiple examples of rising resistance rates being associated with expansion of individual clones, as seen for tetracycline as described above (Da Cunha et al., 2014), erythromycin resistance in ST1 (Flores et al., 2015), and beta-lactam resistance in several examples of closely related isolates (Metcalf et al., 2017). Overall, therefore, genomics paints a picture similar to that described for clonal emergence above; antibiotic resistance is initially acquired through horizontal transfer or possibly recombination (also enabling acquisition of fluoroquinolone or penicillin resistance), followed by clonal expansion of successful lineages that drive increases in antibiotic resistance rates.
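The logic of predicting resistance from genomic data can be illustrated with a deliberately small rule-based sketch. This is not the method of Metcalf et al. (2017); the rule tables, function name, and input format are assumptions made for illustration, and they include only determinants named in the text above (*tetM*, *vanG*, and mutations in *gyrA* and *parC*). A real pipeline would test for specific resistance-associated mutations rather than treating any flagged gene as sufficient.

```python
from typing import Dict, Iterable, Set

# Hypothetical, intentionally incomplete rule tables (illustration only).
ACQUIRED_GENE_RULES: Dict[str, str] = {
    "tetM": "tetracycline",
    "vanG": "vancomycin",
}
# Assumption: the caller passes only genes already flagged as carrying a
# resistance-associated mutation; the sketch does not inspect the mutation.
MUTATED_GENE_RULES: Dict[str, str] = {
    "gyrA": "fluoroquinolone",
    "parC": "fluoroquinolone",
}


def predict_resistance(acquired_genes: Iterable[str],
                       mutated_genes: Iterable[str]) -> Set[str]:
    """Return the antibiotic classes predicted resistant under the toy rules."""
    predicted: Set[str] = set()
    for gene in acquired_genes:
        if gene in ACQUIRED_GENE_RULES:
            predicted.add(ACQUIRED_GENE_RULES[gene])
    for gene in mutated_genes:
        if gene in MUTATED_GENE_RULES:
            predicted.add(MUTATED_GENE_RULES[gene])
    return predicted


# A strain carrying tetM plus a flagged parC mutation.
print(sorted(predict_resistance({"tetM", "scpB"}, {"parC"})))
```

A lookup of this kind returns nothing for a determinant missing from its tables, which is the limitation noted above: previously unknown or rare resistance mechanisms cannot be predicted from sequence data alone.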

As in other bacteria, mobile genetic elements are often associated with more than just antibiotic resistance genes. Many of these additional genes have features that suggest they may be involved in virulence, such as surface attachment signals (LPXTG), predicted surface localization, homology to adhesin proteins, novel metabolic activities, or predicted secretion and toxicity. One well-described example is AlpST-1, a predicted surface-exposed adhesin protein that is encoded within the same mobile genetic element (denoted the RDF.2 MGE) as the *tetM*-carrying *Tn916* in a collection of 202 ST1, serotype V strains from the US and Canada (Flores et al., 2015). In this clone, the authors argue that the close genetic linkage between *tetM* and the AlpST-1 virulence gene could account for the association between tetracycline resistance and virulence, so the virulence would not be due to the tetracycline resistance *per se* (Flores et al., 2015).

The concept that antibiotic resistance is associated with strains with high pathogenic potential for humans (and other animals) is not controversial. Indeed, therapeutic and agricultural antibiotic usage is perhaps one of the strongest influences that humans have exerted on the makeup of our microbial environment. However, we typically associate antibiotic resistance with a loss of fitness in bacteria, which is then overcome by the strong selection pressure of antibiotic administration (though numerous counterexamples exist). Antibiotic resistance is generally not considered a virulence determinant in and of itself. The observation of high tetracycline resistance rates, independently acquired in multiple lineages of common human GBS strains (CC1, 10, 17, 19, 23, and 26), is indeed an incontrovertibly strong demonstration that tetracycline use has dominated the evolution of GBS in humans (Da Cunha et al., 2014). As in the case of ST1, however, the underlying pathogenic mechanisms, which can lead to a stronger understanding of disease and novel strategies for prevention (Flores et al., 2015), remain the central unanswered question for essentially all GBS lineages.

#### Group B Streptococcus Molecular Epidemiology in Cattle

In contrast to the idea that most of the GBS population was not strongly stratified by geography, it has long been known that GBS demonstrates relatively strong host specificity. In addition to human disease, the economic impact of lower milk production due to mastitis has driven substantial *S. agalactiae* research (Keefe, 1997; Ruegg, 2017). Substantial work, therefore, has also examined the potential commonality in the rise of human and cattle infections, most obviously mediated through milk and close contact with dairy farmers (Bliss et al., 2002; Bohnsack et al., 2004; Sukhnanand et al., 2005; Foxman et al., 2007; Manning et al., 2010).

The literature describing GBS that infects bovine hosts is extensive, in keeping with the initial *S. agalactiae* nomenclature (which refers to its effects on dairy cows). As with human isolates, pre-genomic techniques had already been used to sketch a general outline of the population structure. Bovine isolates generally fall into two main clusters, represented in the MLST scheme by CC67 and CC23. A significant minority of strains were found to fall into CCs that overlapped with human isolates, most prominently CC17 (Sørensen et al., 2010). As seen with human GBS, however, there are examples of recent changes in prevalent GBS clones which may differ based on geography. A recent rise in CC61 strains was noted in Portugal, with an origin estimated in the early 1990s (Almeida et al., 2016). From 84 bovine isolates collected from milk from 14 dairy farms in China between 2011 and 2016, all were either CC61 or CC103 (Pang et al., 2017). Interestingly, CC103 strains have also been noted in Denmark and Norway, but not the UK or US (Bisharat et al., 2004; Zadoks et al., 2011; Jørgensen et al., 2016).

The fact that GBS was known first as a bovine pathogen and only later noted to cause increasing rates of neonatal and adult disease led to a strong interest in exploring a potential zoonotic transmission from cows to humans (and *vice versa*). There was a suggestive increase in GBS colonization rate among university students who drank milk (*n* = 150), but this was not statistically significant (Bliss et al., 2002). A larger, longitudinal study (3-week intervals over 3 months) in a similar population found no significant association between GBS colonization and beef or milk consumption; notably, there were positive associations with sexual activity and fish consumption (Foxman et al., 2007). Another epidemiological study examined the colonizing GBS in humans who had regular close contact with cattle. Of 68 matched human-cattle stool samples, one set had the same GBS as measured by MLST, capsule, RAPD pattern, and antibiotic resistance profile. Of note, the cattle sampled were not symptomatic, and stool is not a typical sampling location in cattle. Perhaps more interestingly, human colonization was significantly associated with exposure to cattle in the previous week (Manning et al., 2010). Overall, therefore, transmission between humans and cattle seemed rare if not nonexistent.

The conclusion that the human pathogenic CC17 lineage was derived from a bovine GBS ancestor was therefore a dramatic result (Bisharat et al., 2004). This initial report was based on an analysis of MLST gene sequences. In what has become a repeated testament to the value of whole genome analyses, this initial MLST result has been challenged by subsequent genomic studies (Brochet et al., 2006; Sørensen et al., 2010; Richards et al., 2013). Comparative whole genome hybridization on microarrays to assess gene content in a collection of 75 strains from humans and multiple animal hosts indicated that the common ancestor of the bovine ST61 and ST17 strains was likely more similar to a human ST17 strain, implying the reverse transmission direction (Brochet et al., 2006). An examination of 15 genes (including the seven MLST genes) in addition to several virulence-associated traits in a representative set of 55 strains (drawn from 238 in total) showed differences between the human and bovine CC17 strains that were not captured by the MLST genes. Combined with differences between human and bovine strains with respect to the ability to ferment lactose, the presence of the human-associated *scpB* virulence gene, and the presence of the PI-1 pilus island, the authors concluded that human CC17 strains evolved separately, perhaps from a common diverse pool, from the bovine CC17 isolates (Sørensen et al., 2010). Finally, whole genome sequencing data from 202 strains isolated from human and animal hosts was used as part of an analysis aimed at understanding the factors responsible for survival in the bovine mammary gland. Acquisition of the Lac.2 operon, which enables fermentation of lactose (the primary sugar in cow milk), was a consistent feature of bovine strains.
In contrast, among ST17 strains of human and bovine origin, none of the human strains had the Lac.2 operon, which was interpreted as unlikely if a bovine strain was the ancestor of the human isolates (Richards et al., 2013). Notably, as with the association of human strains with presence of the *scpA* and *lmb* genes noted above, the association of the Lac.2 operon with bovine isolates, while strong, is not exclusive; 8/151 human strains in one survey carried Lac.2 (Richards et al., 2013), though for some strains, such as NEM316, the origin (possibly *S. gordonii*) may be distinct from the origin suspected in bovine strains (*S. dysgalactiae* subsp. *dysgalactiae*) (Richards et al., 2011). Therefore, with the benefit of larger scale genomic data, it still appears that bovine and human strains are largely (if not completely) separate, at least in terms of the severe disease-causing strains, though the possibility of overlap continues to be discussed (Lyhs et al., 2016; Wang et al., 2018b). There seem to be cases of colonization in both directions, but this appears to be transient in both cattle and humans (Jensen, 1982; Foxman et al., 2006). On a longer evolutionary scale, however, the genomic data are clear that some bovine- and human-adapted lineages do at least share common ancestors, most notably for CC17 and CC67 (Sørensen et al., 2010; Pang et al., 2017).

#### Group B Streptococcus Molecular Epidemiology in Fish

The next most extensively studied host organism for GBS is fish. Again, much of the motivation for this research has been economic interest, as Streptococcosis is a major disease affecting farmed fish (Amal et al., 2011). In fish, GBS are generally limited to certain MLST types, which leads to their occasional reference in this literature as biotypes, with a strong association with serotype. One early study proposed two new species, *S. shiloi* and *S. difficile*, as pathogens of fish in Israel starting in 1986 (Eldar et al., 1994). These were later resolved to *S. iniae* and *S. agalactiae* (serotype Ib), respectively (Eldar et al., 1995; Vandamme et al., 1997). Subsequently, the fish literature began to refer to biotypes of GBS, which persists today in the marketing for fish vaccines<sup>1</sup>. Biotype I corresponds to serotype Ia, ST7, and is mostly found in Asia. Biotype II corresponds to serotype Ib, ST260 or ST261 (and is referred to as CC552, though this has recently been refined), and is found generally throughout the globe (Delannoy et al., 2013; Paul, 2014; Munang'andu et al., 2016; Barony et al., 2017).

Similar to the situation with bovine GBS, one major topic of research has been the possibility of cross-species infections between humans and fish. Reliance on MLST again indicated that there was evidence for transmission: ST7 is found in both humans and fish (Evans et al., 2008; Delannoy et al., 2013), as well as other aquatic animals, and a large outbreak of fish infections in Kuwait Bay was thought to be due to human GBS isolates entering the water through sewage contamination (Jafar et al., 2008). Three genomic studies later examined human and fish ST7 strains more closely. In one study, the human and fish ST7 strains were very similar, with similar gene content and a uniformly low nucleotide divergence throughout the genome (Rosinski-Chupin et al., 2013). A second study noted that human and piscine ST7 strains were closely related based on genome content (CRISPR arrays, prophages, and virulence-associated genes) (Liu et al., 2013). The third study, which included fish strains from both CC7 and CC552, identified eight genomic regions, mostly located within genomic islands, containing genes associated with or specific to piscine strains, which were then confirmed by PCR screening across a larger collection of 43 isolates from various hosts (Delannoy et al., 2016). In addition, strains of ST23, which are common human pathogens, were unable to cause experimental infections in tilapia, while one of two human ST1 strains was equally pathogenic to tilapia compared with a *bona fide* piscine ST7 isolate (Delannoy et al., 2016; Wang et al., 2017). Overall, therefore, it appears that the populations of piscine and human isolates are largely separate, but at least some strains of CC7 may be able to infect both hosts. Another common theme is the idea that host specificity has evolved through genome reduction, where genes and pathways no longer needed in other environments (hosts such as humans) are lost once GBS specializes to colonize fish, most clearly demonstrated in CC7 and CC552 (Liu et al., 2013; Rosinski-Chupin et al., 2013; Delannoy et al., 2016). For example, several studies found multiple fish strains with genome sizes in the 1.7–1.8 Mbp range, compared with 2.0–2.2 Mbp for many human isolates (Liu et al., 2013; Rosinski-Chupin et al., 2013). This 200–300 Kbp difference accounts for over 100 otherwise "core" genes lost in fish isolates, which are enriched for carbohydrate transport and metabolism functions (Liu et al., 2013). Interestingly, another study of human ST19 strains found that at least some are able to cause experimental disease in tilapia, a phenotype that correlates with capsule type (Wang et al., 2018a).

<sup>1</sup> https://www.aquavac-vaccines.com/products/

One of the notable features of the research on piscine GBS is that the sampling covers a complementary geographic range to the human and much of the cattle GBS literature. Aquaculture is a rapidly growing business in South America, the Middle East, and Asia, particularly Southeast Asia; these are all areas that have been notably undersampled in human-focused GBS studies. In these areas, it appears that many of the major food fishes, found in both freshwater and saltwater, can be affected (including rainbow trout, seabream, tilapia, yellowtail, catfish, croaker, killifish, and pomfret) (Amal et al., 2011). Interestingly, our knowledge of the population structure of piscine GBS is actually stronger than that for human disease-causing isolates in many countries, for a variety of reasons including economics, infrastructure, and the assumption that seemingly globally distributed clones (for human disease) would be similarly dominant in unsampled regions. Accordingly, there are potentially emerging clones of piscine GBS that are being reported (such as Serotype III and IX in Southeast Asia and China, respectively) (Kalimuddin et al., 2017; Zhang et al., 2018).

#### Emerging Group B Streptococcus Disease

The emerging GBS clone in Southeast Asia, ST283 (Serotype III), neatly intertwines the previously mentioned themes of ongoing GBS evolution (leading to potential cross-species host jumps), the potential for some clones to have strong geographical associations, and the value of genomics for enabling integration across different disciplines. Emerging clones such as ST283 therefore challenge our understanding of GBS ecology, evolution, disease, and management.

In 2015, Singapore experienced an outbreak of foodborne GBS infections. Associated with consumption of a local raw fish dish (魚生, yu sheng), over 200 patients suffered severe invasive disease, with bacteremia, meningitis, and septic arthritis (Rajendram et al., 2016; Tan et al., 2016; Kalimuddin et al., 2017)<sup>2</sup>. Prior to this outbreak, GBS had never been thought to be transmitted by the foodborne route. In retrospect, there had been indications that food consumption was associated with gastrointestinal colonization (Bliss et al., 2002; Foxman et al., 2007), and there had been an example of lizards contracting GBS sepsis through consuming contaminated mice (Hetzel et al., 2003). Furthermore, it seems reasonable to attribute GI colonization of humans by any organism to oral consumption until proven otherwise. Given that GBS is a common GI colonizer of both men and women, then, one interpretation is that even early onset neonatal meningitis is ultimately a foodborne disease (*via* vaginal colonization through the GI tract, then infecting the newborn during birth).

Interestingly, the outbreak organism, ST283, is a common MLST type that causes aquaculture-associated fish Streptococcosis. An ST283 strain and a single locus variant, ST491, had been identified in farmed fish in Vietnam and Thailand, isolated in the early 2000s (Delannoy et al., 2013). ST283 strains had also been reported to cause invasive infections similar to those seen in the 2015 Singapore outbreak in otherwise healthy humans from Hong Kong and Singapore as early as 1998 (Wilder-Smith et al., 2000; Ip et al., 2006, 2016; Barkham et al., 2018). Investigating after the 2015 Singapore outbreak, it was found that 71% of freshwater fish sold for raw consumption (which was associated with the outbreak) in Singapore carried GBS, with 14% carrying ST283; in contrast, 9–33% of saltwater fish carried GBS, with none being ST283 (Chau et al., 2017). Singapore imports the vast majority (>90%) of its food, including fish<sup>3</sup>; indeed, 4.6% of fish samples were already positive for GBS, with 1% of fish positive for ST283, at entry ports to Singapore (Chau et al., 2017), leading to a suspicion that regional aquaculture fish may also be colonized by ST283 (while ST283 is known to cause disease outbreaks in farmed fish, the fish being imported and sold appeared healthy). Troublingly, a recent report has identified ST283 as a cause of GBS outbreaks in at least five fish farms in 2016–2017 from four different states in Brazil, and this is suspected to be due to import of fish from Southeast Asia (Leal et al., 2019).
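The relationship between ST283 and its single locus variant ST491 follows from how MLST defines a sequence type: an ordered profile of allele numbers, one per typed locus (seven genes in the GBS scheme, as noted earlier), with a single locus variant differing at exactly one of them. The sketch below shows that comparison. The allele numbers and function names are hypothetical placeholders chosen for illustration; they are not the actual ST283 or ST491 profiles.

```python
from typing import Tuple

# An MLST profile: one allele number per typed locus, in a fixed locus order.
Profile = Tuple[int, ...]


def locus_differences(a: Profile, b: Profile) -> int:
    """Count the loci at which two allelic profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same number of loci")
    return sum(1 for x, y in zip(a, b) if x != y)


def is_single_locus_variant(a: Profile, b: Profile) -> bool:
    """True when the profiles differ at exactly one locus."""
    return locus_differences(a, b) == 1


# Hypothetical seven-locus profiles (placeholder allele numbers).
profile_x: Profile = (1, 2, 3, 4, 5, 6, 7)
profile_y: Profile = (1, 2, 3, 4, 5, 6, 8)   # differs at the last locus only
profile_z: Profile = (9, 2, 3, 4, 5, 6, 8)   # differs from profile_x at two loci

print(is_single_locus_variant(profile_x, profile_y))
print(is_single_locus_variant(profile_x, profile_z))
```

Because the comparison uses only the typed loci, two isolates with identical or near-identical profiles can still differ elsewhere in the genome, which is why the MLST-based inferences described in earlier sections were later revisited with whole genome data.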

ST283 was originally described as an emerging cause of invasive infections in Hong Kong and Singapore (Ip et al., 2006). It was only identified as a potential foodborne pathogen in 2015 (Rajendram et al., 2016; Tan et al., 2016; Kalimuddin et al., 2017). Epidemiological data for human infections are not consistently available in many countries in Southeast Asia, but recent data suggest that the endemic incidence of ST283 infections outside Singapore may be comparable to or higher than the incidence at the peak of the Singapore outbreak (Barkham et al., 2019; https://outbreakwatch.blogspot.com/2018/07/proahedr-strep-group-b-singapore.html). If ST283 is also a foodborne pathogen outside Singapore, as recently suspected in Thailand (Kayansamruaj et al., 2018), this suggests the remarkable interpretation that GBS infections in Southeast Asia are actually first and foremost a foodborne illness (in stark contrast to the current paradigm of neonatal and immunocompromised disease found in current medical teaching). This further raises the possibility that invasive GBS disease caused by non-ST283 strains in human adults may be at least partially foodborne, expanding our understanding of GBS in general as a possibly long unappreciated zoonotic pathogen, with attendant implications for food safety. As food consumption is associated with vaginal colonization in women (Foxman et al., 2007), a further implication is that neonatal infections may also ultimately be a late sequela of foodborne consumption of GBS. Interestingly, there is a case report of a late-onset neonatal infection caused by a genomically indistinguishable strain that was consumed by the mother *via* placental pills (Buser et al., 2017).

<sup>2</sup> http://10minus6cosm.tumblr.com/post/132590510556/serious-group-b-streptococcalinfection-in-adults

<sup>3</sup> https://www.ava.gov.sg/

#### CONCLUSIONS AND OUTLOOK

GBS has a uniquely dynamic biology. At the overall species level, there is remarkably broad host range and global geographical reach. Standing in counterpoint to this broad generalism are numerous examples of strong host specificity and local geographical stratification. The source of this intraspecies heterogeneity likely traces back to similarly dynamic processes shaping the genome, with the capacity for large-scale chromosomal recombinations paired with highly independent, clonal evolution of individual successful lineages. GBS is not only an emerging pathogen in the traditional sense of rising incidence; a closer examination of its biology indicates that it is a continuously emerging pathogen that, in little more than a single human lifetime, has altered and continues to alter the tenets of epidemiology of human, bovine, and piscine colonization, infection, and economics. Since the first GBS genome sequences, genomics has been an ideal technology to capture the dynamism of this species. It therefore seems apt that GBS was the organism that birthed the concept of the pan-genome; its description as having an open pan-genome, with infinite possibility for evolution and adaptation, is a fitting metaphor for the recent history of GBS. GBS, in turn, provides a compelling argument for the continued progress of genomics and our need, as a community, to implement broad genomic monitoring in both developed and developing countries. The recent genomic history has demonstrated shifts in our understanding of GBS with respect to cattle, human neonates, human adults, aquacultured fish, and now the interface between fish and humans and food safety and economic development. Where else is GBS lurking? Further advances in genomics will hopefully enable us not only to reconstruct *post hoc* what GBS will do next, but to catch it in the early stages of its next evolutionary jump.

### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

#### FUNDING

The author was supported by the National Medical Research Council, Ministry of Health, Singapore (grant numbers NMRC/OFIRG/0009/2016 and NMRC/CIRG/1467/2017); the National Research Foundation, Singapore (NRF2018-ITS003-008); the Temasek Foundation Innovates through its Singapore Millennium Foundation Research Grant Programme; and the Genome Institute of Singapore (GIS)/Agency for Science, Technology and Research (A\*STAR).

#### ACKNOWLEDGMENTS

I would like to thank Timothy Barkham and Ruth Zadoks for their education on aspects of GBS microbiology and genomics; the members of the Chen lab for useful discussions about the ideas presented herein and for their constant dedication and support; and Hsu Li Yang and Koh Tse Hsien for their initial role, with Timothy Barkham, in convincing me to learn and care about GBS.




**Conflict of Interest Statement:** SC is a co-inventor on a patent application for a GBS ST283-specific test.

*Copyright © 2019 Chen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Accurate and Strict Identification of Probiotic Species Based on Coverage of Whole-Metagenome Shotgun Sequencing Data

Donghyeok Seol<sup>1,2†</sup>, So Yun Jhang<sup>1,3†</sup>, Hyaekang Kim<sup>1,2</sup>, Se-Young Kim<sup>4</sup>, Hyo-Sun Kwak<sup>5</sup>, Soon Han Kim<sup>5</sup>, Woojung Lee<sup>5</sup>, Sewook Park<sup>5</sup>, Heebal Kim<sup>1,2,3</sup>, Seoae Cho<sup>1</sup> and Woori Kwak<sup>1\*</sup>

<sup>1</sup> C&K Genomics, Songpa-gu, South Korea, <sup>2</sup> Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, South Korea, <sup>3</sup> Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, South Korea, <sup>4</sup> R&D Center, CTCBIO, Inc., Hwaseong-si, South Korea, <sup>5</sup> Division of Microbiology, Ministry of Food and Drug Safety, Cheongju-si, South Korea

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Rosario Lombardo, University of Trento, Italy

Yousef Nami, Agricultural Biotechnology Research Institute of Iran, Iran

> \*Correspondence: Woori Kwak asleo@cnkgenomics.com

†These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Received: 08 September 2018 Accepted: 08 July 2019 Published: 07 August 2019

#### Citation:

Seol D, Jhang SY, Kim H, Kim S-Y, Kwak H-S, Kim SH, Lee W, Park S, Kim H, Cho S and Kwak W (2019) Accurate and Strict Identification of Probiotic Species Based on Coverage of Whole-Metagenome Shotgun Sequencing Data. Front. Microbiol. 10:1683. doi: 10.3389/fmicb.2019.01683

Identifying the microbes present in probiotic products is an important issue in product quality control and public health. The most common methods used to identify genera containing species that produce lactic acid are matrix-assisted laser desorption/ionization–time of flight mass spectrometry (MALDI-TOF MS) and 16S rRNA sequence analysis. However, the high cost of operation, difficulty in distinguishing between similar species, and limitations of the current sequencing technologies have made it difficult to obtain accurate results using these tools. To overcome these problems, a whole-genome shotgun sequencing approach has been developed along with various metagenomic classification tools. Widely used tools include the marker gene and k-mer methods, but their inevitable false positives (FPs) hamper accurate analysis. We therefore designed a coverage-based pipeline to reduce the FP problem and to achieve a more reliable identification of species. The coverage-based pipeline described here not only shows higher accuracy for the detection of species and proportion analysis, based on mapping depth, but can also be applied regardless of the sequencing platform. We believe that the coverage-based pipeline described in this study can provide appropriate support for probiotic quality control, addressing current labeling issues.

Keywords: NGS, probiotics, lactic acid bacteria, whole genome shotgun sequencing, mapping coverage, identification, metagenomics

#### INTRODUCTION

In light of the trend toward increasing interest in health, many probiotic products are emerging. The global probiotics market exceeded 40 billion USD in 2017 and more than 12 million tons of these products are expected to be consumed by 2024<sup>1</sup>. Probiotics are now used not only for nutrition, but also for medical purposes, such as to promote the development of the infant immune system (O'Toole et al., 2017; Michelini et al., 2018). In this growing market, defective products are also increasing, which can pose some risks to consumers (Lewis et al., 2015). Although authorities such as the United States Food and Drug Administration (FDA) check all probiotic products before they permit their sale, they could pass products without knowing whether the bacteria in these products might be mislabeled. Thus, to safely manage the probiotic market, it is necessary to verify whether probiotic products actually contain the species mentioned on their labels. The genera concerned include Lactobacillus, Bifidobacterium, and Bacillus, which are referred to as genera containing species that produce lactic acid (GSLA) throughout this manuscript.

<sup>1</sup>https://www.gminsights.com/industry-analysis/probiotics-market

There are many ways to identify GSLA at the species level (Herbel et al., 2013), such as matrix-assisted laser desorption/ionization–time of flight mass spectrometry (MALDI-TOF MS) and 16S rRNA sequence analysis (Angelakis et al., 2011; Garcia et al., 2016). For MALDI-TOF MS, the initial cost is high (Wieser et al., 2012) and the approach to identifying species is library-based, which may lead to difficulty detecting species that are not listed in the spectral database (Singhal et al., 2015). Even if information is present in the database, accurately distinguishing similar species remains a challenge (Dušková et al., 2012; Bailey et al., 2013). Similarly, 16S rRNA sequences may be difficult to analyze because full-length 16S rRNA must be read for accurate profiling, and the sequencing must be carried out with high accuracy (Edgar, 2018a,b). Notably, the Illumina and Ion Torrent platforms are based on short read lengths of less than 400 bp (Hodkinson and Grice, 2014), which makes it difficult to compare 1,600 bp, the full length of the 16S rRNA gene, with sequences in public databases (Yang et al., 2011). Conversely, the PacBio and Nanopore platforms are capable of long-read sequencing over 2,000 bp, but with error rates of more than 10% (Rhoads and Au, 2015); thus, they are not suitable for comparing 16S rRNA at the 97% similarity level for species classification (Wagner et al., 2016). Although the circular consensus sequencing (CCS) method of PacBio can read the full length of 16S rRNA with high accuracy (Frank et al., 2016; Pootakham et al., 2017), it costs more than the common 16S amplicon method used on the Illumina platform.

As a solution to the above problems, the whole genome shotgun sequencing method has been proposed and widely applied in numerous microbial community analyses (Loman et al., 2012; Quince et al., 2017). One requirement for the whole-genome shotgun sequencing approach is metagenomic classification, which can follow various strategies (Breitwieser et al., 2017) including matching k-mers [e.g., Kraken (Wood and Salzberg, 2014), k-SLAM (Ainsworth et al., 2017), and CLARK (Ounit et al., 2015)], aligning to marker genes [e.g., MetaPhlAn 2 (Truong et al., 2015) and GOTTCHA (Freitas et al., 2015)] and translating into amino acid sequences [e.g., Kaiju (Menzel et al., 2016)]. These methods use a specific region of interest for detection instead of the whole genome, causing markers to lose their specificity. For example, if a new species is not available as a reference due to the absence of assembly data, but shares similar regions with other species due to horizontal gene transfer (HGT) (Hiraoka et al., 2016), the markers may detect other species. In addition, sequencing and assembly errors in the reference data can affect the detection of species, causing problems if it is necessary to rigorously determine the presence or absence of a species (Peabody et al., 2015).

In this study, we introduce a new GSLA classification pipeline that effectively reduces the false-positive (FP) rate using mapping coverage. The coverage criterion was the coverage yielded by alignment to the representative strain of a species. Because the classification pipeline was based on the whole genome, the accuracy of the proportion analysis based on mapping depth was high, and FPs at the species level were not present; thus, more reliable results were achieved than with other metagenomic classification methods. We expect that the coverage-based pipeline presented in this study will facilitate efficient quality control of probiotic products, as well as the relabeling of products with inaccurate information. Overall, application of our pipeline could make a positive contribution to public health.

### MATERIALS AND METHODS

Our pipeline consists of two stages: database construction and species detection. During the database construction stage, one-to-one coverage was calculated for each species of GSLA, and a representative strain was selected for construction of a database to detect that species. Based on coverage, the detection threshold was also determined. During the second stage, the probiotic metagenomic data were mapped to the database created in the first stage. Species exceeding the coverage threshold were recorded as the detected species. A more detailed explanation of the GSLA detection pipeline is provided in **Figure 1**.
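The detection stage can be pictured as a per-species coverage test. The following is a minimal Python sketch, not the authors' implementation: it assumes that "coverage" means the fraction of reference positions covered by at least one read, and that per-base depths have already been extracted from an alignment. The function names and toy depths are hypothetical; the threshold of 0.7137 is the value reported in this article.

```python
from typing import Dict, List


def breadth_of_coverage(depths: List[int]) -> float:
    """Fraction of reference positions covered by at least one read (assumed definition)."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d > 0)
    return covered / len(depths)


def detect_species(per_base_depth: Dict[str, List[int]], threshold: float) -> List[str]:
    """Return the species whose breadth of coverage exceeds the threshold."""
    return sorted(
        species
        for species, depths in per_base_depth.items()
        if breadth_of_coverage(depths) > threshold
    )


# Hypothetical toy depths for two representative genomes of ten positions each.
toy = {
    "species_A": [3, 5, 0, 2, 4, 1, 1, 2, 0, 6],  # 8 of 10 positions covered
    "species_B": [0, 0, 1, 0, 0, 0, 2, 0, 0, 0],  # 2 of 10 positions covered
}
detected = detect_species(toy, threshold=0.7137)
print(detected)
```

In this toy case only `species_A` passes, because a few reads landing on shared regions of `species_B` do not raise its breadth of coverage above the threshold.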

### Determination of the Representative Strain

The complete genomes of 126 species and 597 strains of GSLA were downloaded from the National Center for Biotechnology Information (NCBI<sup>2</sup>) (**Supplementary Table S1**). One-to-one pairs of average nucleotide identity (ANI) were obtained within species and filtered at a threshold of 95% identity. Illumina paired-end simulated data were generated using the ART simulator (art\_illumina) with the following parameters, based on the HiSeq 2000 platform (2 × 100 bp): mean DNA fragment size, 350 bp; read coverage, 100-fold; and standard deviation of DNA fragment size, 10 (Huang et al., 2012). Each strain was assigned in turn as the reference genome, one-to-one in the manner described above, and coverage was determined using bowtie2 with default settings (Langmead and Salzberg, 2012). After comparing the minimum coverage values obtained with different strains as the reference genome, the strain with the highest minimum coverage value was selected as the representative strain for that species. At this point, if subspecies existed within a given species, if any strain group had an ANI value less than 95% despite belonging to the same species, or if more than two groups clustered distinctly on the heatmap of all pairwise one-to-one ANI values, we selected additional representative strains. After that, the coverage threshold for detecting GSLA species was set to the lowest of the minimum coverage values of the representative strains selected for each species. A reference set was then constructed for GSLA classification by combining the representative strains into a multi-FASTA file. For the coverage criterion to be valid, the values obtained from mapping the sequence reads to only one representative strain and to all representative strains combined into a set must be similar, because this similarity indicates how accurately the sequence reads are aligned to the representative strain of the species to which they belong.

<sup>2</sup>http://www.ncbi.nlm.nih.gov/

*Figure caption (beginning truncated):* … 0.87). Thus, the strain with the highest coverage is the representative strain of that species. A reference database was constructed using the representative strains, and whole-metagenome shotgun sequencing data of probiotic products were aligned to it. Only species exceeding 0.7137 coverage were judged to be present in the probiotic product.
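The selection rule described here (the strain with the highest minimum coverage per species, then the lowest such value across species as the detection threshold) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the published code: the nested-dictionary layout, the function names, and all coverage values are hypothetical.

```python
from typing import Dict, Tuple

# coverage[ref][query]: coverage of reference strain `ref` by reads simulated
# from strain `query` of the same species (hypothetical layout).
Coverage = Dict[str, Dict[str, float]]


def pick_representative(coverage: Coverage) -> Tuple[str, float]:
    """Return (strain, minimum coverage) for the strain with the highest minimum coverage."""
    best_strain, best_min = "", -1.0
    for ref in sorted(coverage):  # sorted for a deterministic tie-break
        min_cov = min(coverage[ref].values())
        if min_cov > best_min:
            best_strain, best_min = ref, min_cov
    return best_strain, best_min


def detection_threshold(per_species: Dict[str, Coverage]) -> float:
    """Lowest of the representatives' minimum coverage values across all species."""
    return min(pick_representative(cov)[1] for cov in per_species.values())


# Hypothetical toy values for two species.
species_x: Coverage = {
    "s1": {"s2": 0.90, "s3": 0.80},
    "s2": {"s1": 0.95, "s3": 0.85},
    "s3": {"s1": 0.70, "s2": 0.75},
}
species_y: Coverage = {
    "t1": {"t2": 0.72},
    "t2": {"t1": 0.90},
}
rep_x = pick_representative(species_x)
threshold = detection_threshold({"X": species_x, "Y": species_y})
print(rep_x, threshold)
```

Choosing the strain by its worst-case coverage, rather than its average, favors a reference that no strain of the species maps to poorly.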

## Sequencing of Probiotic Products

We sequenced the GSLA species in six probiotic products: one with Illumina and five with Ion Torrent technology. Because the simulated data and the data from the NCBI Sequence Read Archive (SRA<sup>3</sup>) were based on the Illumina platform and produced reliable results, testing real data from a different sequencing platform, such as Ion Torrent, can reduce platform bias in the evaluation of our pipeline. With the Illumina platform, library preparation was carried out using the TruSeq Nano DNA LT Kit (Illumina), and sequencing was then conducted using the NextSeq 500 sequencer (Illumina) in paired-end read mode. The read length was 150 bp per read. With the Ion Torrent platform, the prepared libraries were sequenced using the Ion S5 sequencer (Ion Torrent) and the read length was 350 bp.

<sup>3</sup>http://www.ncbi.nlm.nih.gov/SRA

## Detection Ability Test

Whole-genome shotgun sequencing data for a single species were downloaded from NCBI SRA and mapped to a reference set to determine whether that bacterial species was present. If two or more bacteria were detected that could not be distinguished based on the ANI criterion, an additional analysis was conducted. In this additional analysis, all complete genomes of the species identified in the detection test were used as reference sequences and aligned using the bowtie2 options "-a" (search for all alignments) and "-a --score-min 'C,0,-1'" (search for all alignments with perfect match). The species with the highest resulting coverage was designated as the detected species.
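For readers who want to reproduce the two alignment modes, the sketch below assembles the corresponding bowtie2 command lines in Python without running them. The flags `-x`, `-1`, `-2`, `-S`, `-a`, and `--score-min` are standard bowtie2 options; the index prefix and file names are hypothetical placeholders, and the helper function is an illustration, not part of the published pipeline.

```python
import shlex
from typing import List


def bowtie2_command(index_prefix: str, reads_1: str, reads_2: str,
                    out_sam: str, perfect_only: bool = False) -> List[str]:
    """Build a bowtie2 paired-end command that reports all alignments (-a)."""
    cmd = ["bowtie2", "-x", index_prefix, "-1", reads_1, "-2", reads_2,
           "-S", out_sam, "-a"]
    if perfect_only:
        # Constant minimum score of 0: in bowtie2's default end-to-end mode the
        # best possible score is 0, so only perfect matches pass this filter.
        cmd += ["--score-min", "C,0,-1"]
    return cmd


# Hypothetical file names, for illustration only.
all_hits = bowtie2_command("casei_paracasei", "reads_1.fq", "reads_2.fq", "all.sam")
perfect = bowtie2_command("casei_paracasei", "reads_1.fq", "reads_2.fq",
                          "perfect.sam", perfect_only=True)
print(shlex.join(all_hits))
print(shlex.join(perfect))
```

Passing the list to `subprocess.run` would execute the alignment, assuming bowtie2 is installed and the index has been built with `bowtie2-build`.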

Next, to examine the detection capability for metagenomic data, we ran the pipeline on three types of data in turn: simulated data, SRA data, and real probiotic sample data. First, using simulated data, we created one large metagenome dataset by combining reads for 13 species obtained through ART simulation (**Supplementary Table S2**). Second, data for 10 different species obtained from SRA, which were all collected with the same platform and read length, were downloaded from NCBI and combined into one dataset (**Supplementary Table S3**). Finally, to examine the detection capability for bacteria in actual probiotics using whole-genome shotgun sequencing data, we used Illumina paired-end read data for 19 GSLA, and Ion Torrent platform data for 4–11 GSLA. We used Trimmomatic (TRAILING:30) for quality control. For Ion Torrent platform data, we used the TMAP aligner instead of bowtie2 as the alignment program, with the setting stage1 map4. For the 30 Gb × 2 of Illumina data, the required processing time was also measured with the file size reduced to 15 Gb × 2, 7.5 Gb × 2, 3 Gb × 2, and 1.5 Gb × 2 through random sampling.

Subsequently, the complete genomes of 19 GSLA species approved by the Ministry of Food and Drug Safety (MFDS; Korean Food & Drug Administration) as probiotics were used to calculate the proportional abundance of the species in the sample (**Supplementary Table S4**). All the reported strains of the 19 species at the complete genome level were concatenated by species to create one FASTA file per species. A reference dataset was then constructed for proportion analysis by combining all of these files into a multi-FASTA file. The species proportions were calculated as the relative ratio of the mapping depth of a given group divided by the average length of the sequences in that group. For the proportion analysis, only simulated and SRA data were used, and we combined 10 species in equal proportions of 10% each. For simulated data, we used a number of reads for each species that was proportional to the sequence length for that species, to simulate an actual product (**Supplementary Table S5**). For SRA data, we carried out additional analysis to identify the strain most similar to each downloaded SRA dataset, took a read count proportional to the sequence length of that strain, and then combined the reads into one dataset. We also used the same data as for the detection capability test. All of these detectability tests were repeated using several other metagenomic classifiers, namely MetaPhlAn 1 (Segata et al., 2012), MetaPhlAn 2 (Truong et al., 2015), Kaiju (Menzel et al., 2016), k-SLAM (Ainsworth et al., 2017), CLARK-S (Ounit and Lonardi, 2016), and KrakenHLL (Breitwieser and Salzberg, 2018), for a comparison of the results with those from our pipeline.
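The proportion calculation can be illustrated with a short Python sketch. It follows one plausible reading of the description, which is an assumption: the bases mapped to each species group are divided by the average genome length of that group, and the resulting values are normalized to sum to 1. The function name and the toy numbers are hypothetical.

```python
from typing import Dict


def species_proportions(mapped_bases: Dict[str, float],
                        avg_genome_length: Dict[str, float]) -> Dict[str, float]:
    """Length-normalized mapping depth per species, rescaled to sum to 1."""
    normalized = {
        species: mapped_bases[species] / avg_genome_length[species]
        for species in mapped_bases
    }
    total = sum(normalized.values())
    if total == 0:
        raise ValueError("no reads mapped to any species")
    return {species: value / total for species, value in normalized.items()}


# Hypothetical toy input: total mapped bases and average genome length (bp).
mapped = {"species_A": 6_000_000, "species_B": 3_000_000}
lengths = {"species_A": 2_000_000, "species_B": 3_000_000}
proportions = species_proportions(mapped, lengths)
print(proportions)
```

Dividing by genome length keeps a species with a large genome from appearing more abundant merely because it attracts more reads per cell.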

#### Data Availability

The sequencing data analyzed for this study are available via the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA508569. The documentation for the Python source code and the reference sequence index files used for detection and proportion analysis in this study are freely available from the GitHub repository<sup>4</sup> and Google Drive<sup>5</sup>.

#### RESULTS

#### Building a Representative Genome Set and Determining the Coverage Criterion

In this study, complete genomes for a total of 126 species and 597 strains of GSLA were downloaded from NCBI (**Supplementary Table S1**). Rather than using all 597 strains, we selected representative strains for each species to form a representative genome set, because of the high sequence identity among the genomes of strains within a species. The representative strain was the one with the highest minimum coverage in all pairwise comparisons between genomes of strains within a given species. Before analyzing the coverage data, ANI analysis was performed to verify whether the genomes represented the same species. In general, if ANI exceeds 95%, genomes can be classified as the same species (Goris et al., 2007). However, pairwise ANI calculations showed that some strain genomes did not exceed the ANI criterion despite being from the same species. In our analysis, Bacillus pumilus, Bacillus amyloliquefaciens, Lactobacillus casei, and Lactococcus lactis contained strains that could not be considered the same species, and which were instead classified into two groups based on 95% ANI (**Supplementary Figure S1**). For example, when mapping shotgun reads simulated from the genome of the L. casei type strain (GCF\_000019245.4) to a genome of L. casei (GCF\_000829055.1) in the other group, we found that the read mapping coverage was so low that the two cannot be regarded as the same species (**Supplementary Figure S2**; Fontana et al., 2018). Although L. casei did not have any officially named subspecies, the nine strains analyzed were divided into two groups consisting of seven and two strains, with the latter two strains being L. casei LC5 and L. casei ATCC 393. Similarly, B. pumilus had no officially named subspecies, but our result was consistent with previous work in which whole-genome phylogenetic tree analysis showed that B. pumilus was divided into two clades, one of which clustered with Bacillus altitudinis (Tirumalai et al., 2018). Moreover, L. lactis was classified into two groups based on 95% ANI; according to the NCBI accession numbers, these groups corresponded to the two subspecies L. lactis subsp. lactis and L. lactis subsp. cremoris (Salama et al., 1991). Another study also showed the presence of two subspecies of B. amyloliquefaciens, B. amyloliquefaciens subsp. amyloliquefaciens and B. amyloliquefaciens subsp. plantarum, in agreement with the result of the ANI analysis (Borriss et al., 2011). Therefore, the classification into two groups indicated subspecies within those species.

<sup>4</sup>https://github.com/asleofn/APD

<sup>5</sup>https://drive.google.com/drive/folders/1fOakwxOp7QbxQooi8pHYjfrKIbxPuryl
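Splitting the strains of one species into groups at an ANI cutoff can be sketched as follows. The article reads the groups from a heatmap of pairwise ANI values; the single-linkage grouping used here is a stand-in chosen for illustration, and the strain names, ANI values, and function name are hypothetical.

```python
from typing import Dict, FrozenSet, List


def group_by_ani(strains: List[str], ani: Dict[FrozenSet[str], float],
                 cutoff: float = 95.0) -> List[List[str]]:
    """Group strains so that any pair with ANI >= cutoff ends up in the same group."""
    groups: List[List[str]] = []
    unassigned = set(strains)
    for seed in strains:
        if seed not in unassigned:
            continue
        unassigned.discard(seed)
        group, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            linked = [s for s in sorted(unassigned)
                      if ani.get(frozenset((current, s)), 0.0) >= cutoff]
            for s in linked:
                unassigned.discard(s)
                group.append(s)
                frontier.append(s)
        groups.append(sorted(group))
    return groups


# Hypothetical pairwise ANI values (%) for four strains of one named species.
strains = ["a", "b", "c", "d"]
ani = {
    frozenset(("a", "b")): 98.7, frozenset(("c", "d")): 99.1,
    frozenset(("a", "c")): 80.2, frozenset(("a", "d")): 80.0,
    frozenset(("b", "c")): 79.9, frozenset(("b", "d")): 80.1,
}
groups_95 = group_by_ani(strains, ani, cutoff=95.0)
print(groups_95)
```

With the toy values above, the 95% cutoff yields two groups, which under the scheme described in this section would each receive their own representative strain.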

Unlike the species listed above, Bifidobacterium longum is reported to have three subspecies that separate at a different ANI criterion (Mattarelli et al., 2008). Interestingly, a 95% ANI cutoff defined B. longum as one species; however, it was classified into three subspecies when the cutoff was increased to 97% (**Supplementary Figure S1E**). In order to investigate whether B. longum should be divided into three subspecies based on ANI criteria for accurate subspecies classification, we first checked the NCBI accession number of each strain and confirmed that one of the three subspecies was B. longum subsp. infantis. The other two groups could not be identified based on the NCBI accession numbers; we therefore determined indirectly which subspecies they represented, using data for subspecies of B. longum downloaded from the SRA. As a result, the strains were divided into three groups according to the coverage standard: B. longum subsp. longum, B. longum subsp. suis, and B. longum subsp. infantis (**Supplementary Figure S3**; Mattarelli et al., 2008). Therefore, a total of 132 strains, including three strains of B. longum, two strains each of L. lactis, B. pumilus, B. amyloliquefaciens, and L. casei, and 121 strains of other individual species, were selected for the 126 species analyzed, and a representative genome set was constructed from these sequences (**Supplementary Table S6**).

When selecting the representative strain, the minimum coverage varied greatly depending on which strain was used. In the case of B. longum, for which 18 strains were reported, the minimum coverage was 0.7137 when the representative genome was used, while it fell to 0.5534 when a nonrepresentative genome from strain GCF\_000020425.1 was used (**Supplementary Figure S4**). In addition, the minimum coverage of 0.7137 was similar to the 70% criterion of DNA-DNA hybridization (DDH), which was used for experimental identification (Goris et al., 2007). Furthermore, the highest minimum coverage values for the representative strains ranged between 0.7137 and 0.993 across species (**Figure 2**). Although the minimum mapping coverage of B. longum was 0.7137, it increased to 0.8453 when representative strains from each subspecies were considered. However, because the value of 0.8453 was obtained without considering variants of other species that may or may not be present in the reference dataset, we set the lowest mapping coverage value obtained across all GSLA, 0.7137, as the baseline for species detection.

Having calculated both ANI and mapping coverage, we examined the relationship between them. The two measures were positively correlated in most species, but the strength of this correlation differed among species. For example, the coverage and ANI values for Enterococcus faecalis and Pediococcus pentosaceus were not related (**Supplementary Figure S5**).

Meanwhile, the baseline for species detection was assigned when reads were aligned to a single genome. The representative genome set contained 132 strains in total, but the results of read mapping coverage targeting a single genome could differ due to the presence of homologous regions between species. Thus, we checked whether the same results were obtained using only the representative strain versus the entire set of representative genomes as a mapping target. In this test, we used simulated reads of nine strains for two species, Lactobacillus helveticus and Lactobacillus brevis. No significant difference in mapping coverage was observed (<0.0017) on aligning each strain to the representative strain of the same species, or to the reference set containing all 132 strains (**Supplementary Table S7**).

#### No False Positive Results in Detection Ability Test

We performed a detection test to determine whether the representative genome set and the baseline were applicable to actual data rather than simulated data. In the detection ability test, four types of data were used. First, single-species data downloaded from NCBI SRA were tested; we then ran the program on data representing various GSLA species, in the order of simulated data, SRA data, and real data. For single-species data, we investigated the 19 probiotic GSLA species approved by the MFDS; in the SRA data for 16 of these species, only the one correct species was detected in each dataset. The maximum coverage of species other than the detected species was as low as 0.01–0.25, confirming that only one species was detected, without considering the possibility of false-negative (FN) results. In contrast, two species were identified in the SRA data of each of the following three species: L. casei, Lactobacillus paracasei, and L. helveticus (**Supplementary Figure S6**). The additional species detected in their data were L. paracasei, L. casei, and Lactobacillus gallinarum, respectively, each of which was considered to be the same species as the target based on the ANI criterion. In the case of L. casei, the reads in the dataset were generated by sequencing only a single strain of L. casei. Nonetheless, L. paracasei and L. casei share similar regions, so reads also aligned to L. paracasei, and the mapping coverage eventually exceeded our baseline for both species. Overall, of the 126 species analyzed, seven one-to-one pairs consisted of different species that were classified as the same species based on ANI (**Table 1**). To address this problem, an additional analysis was conducted using, as the reference genome set, the complete genomes of all species that are not distinguishable from one another based on ANI (i.e., L. casei – L. paracasei and L. helveticus – L. gallinarum). Reads were next mapped to all matching regions using the "-a" option of bowtie2, which reports all valid alignments of each read. Among the reference strains, the one with the highest coverage, i.e., that which is most similar to the genome from which the reads were generated, determined the detected species. As a result, all three species were accurately detected: L. paracasei with a coverage value of 0.9119, L. helveticus with 1, and L. casei with 0.9122 (**Supplementary Table S8**). Because this additional analysis searches for all alignments, the time required can vary greatly depending on the size of the reference dataset. In the case described above, it took about 50 min for 600 Mb × 2 of L. paracasei Illumina sequencing data to be aligned to the reference genome set containing 18 strains of L. casei + L. paracasei.

Testing our SRA data with MetaPhlAn 1, MetaPhlAn 2, CLARK-S, k-SLAM, Kaiju, and KrakenHLL resulted in multiple FPs, despite the use of single-species data. Seven and nine FPs were obtained from MetaPhlAn 2 and MetaPhlAn 1, respectively. Moreover, several hundred FPs occurred with CLARK-S, k-SLAM, and Kaiju. KrakenHLL suggests a threshold for the unique k-mer count that scales with the number of reads in the sample (unique k-mers = 2,000 × million reads), but up to 11 FPs were still found in the filtered results (**Table 2**).
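The read-scaled filter mentioned for KrakenHLL can be expressed as a small Python sketch. It reads the threshold as 2,000 unique k-mers per million reads in the sample, which is an interpretation of the formula given here; the input dictionary is a hypothetical summary of a classifier report, not KrakenHLL's actual output format, and treating the threshold as inclusive is also an assumption.

```python
from typing import Dict


def unique_kmer_threshold(total_reads: int, per_million: float = 2000.0) -> float:
    """Minimum unique k-mer count for a taxon, scaled to sequencing depth."""
    return per_million * total_reads / 1_000_000


def filter_taxa(unique_kmers: Dict[str, int], total_reads: int) -> Dict[str, int]:
    """Keep only taxa whose unique k-mer count reaches the scaled threshold."""
    threshold = unique_kmer_threshold(total_reads)
    return {taxon: count for taxon, count in unique_kmers.items()
            if count >= threshold}


# Hypothetical report: 5 million reads, two candidate taxa.
report = {"taxon_X": 250_000, "taxon_Y": 9_000}
kept = filter_taxa(report, total_reads=5_000_000)
print(kept)
```

Scaling the cutoff with read count reflects the idea that a genuinely present genome should contribute more distinct k-mers as sequencing depth grows, whereas a spurious hit tends to reuse the same few k-mers.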

Detection ability tests for single species did not allow detection of FNs or FPs, and thus showed perfect results. Nonetheless, if the data are complex due to a mixture of different species, high-identity problems may occur, such as increased coverage of species that are not included in the sample and increased FPs. Moreover, FNs may occur if the sample does not have sufficient coverage of a species that makes up a small proportion of the sample. Therefore, we processed simulated data, SRA, and real data to determine how accurately our pipeline detected species in complex data.

First, for the simulated data of 13 species combined using the ART simulator, our pipeline detected all 13 species, whereas all other classification programs produced FPs. The numbers of FPs obtained using MetaPhlAn 1, MetaPhlAn 2, KrakenHLL, k-SLAM, and Kaiju were 1, 2, 3, 20, and 2,847, respectively (**Figure 3A**). Despite the use of simulated data, one FN was found in each of the MetaPhlAn 1 and MetaPhlAn 2 results. Meanwhile, CLARK-S reported 100% Campylobacter curvus, for unknown reasons.

Second, based on the analysis of data for 10 species combined, our pipeline detected 10 species and demonstrated better results than other programs such as MetaPhlAn 1, MetaPhlAn 2 and KrakenHLL, which detected 41, 37, and 32 species, respectively. The other programs using k-mers or protein sequence data detected a much greater number of species.

ANI, average nucleotide identity.

Lastly, real data were analyzed using 19 species in Illumina data, and four to 11 species in Ion Torrent data. In the Illumina data, our pipeline detected 18 species, with one FN, whereas MetaPhlAn 1, MetaPhlAn 2, and KrakenHLL detected 19 species along with two, five, and one additional species, respectively (**Figure 3B**). Among the five Ion Torrent samples analyzed (**Figures 3C–G**), our pipeline yielded one FN in the Probiotics\_4 sample (**Figure 3E**). In MetaPhlAn 2, false detection occurred in the Probiotics\_5 and Probiotics\_6 samples; one FN species and two FPs were detected in Probiotics\_5 (**Figure 3F**), and four FP species in Probiotics\_6 (**Figure 3G**). Despite filtering based on the suggested criteria, KrakenHLL resulted in FPs across all five probiotic products, with one, one, one, two, and three FPs detected, respectively (**Figures 3C–G**). MetaPhlAn 1 showed similar performance to MetaPhlAn 2 and KrakenHLL based on data collected on the Illumina platform, but at least 300, and sometimes more than 1,000, FPs were obtained with the Ion Torrent data. CLARK-S, k-SLAM, and Kaiju exhibited more than 100 FPs in all of the tests described above, regardless of platform (**Figures 3B–G**).
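The FP and FN counts used throughout this comparison can be computed as in the sketch below. It assumes each tool's output has already been reduced to a set of species names and that the true composition of the sample is known; the function name and the placeholder species labels are hypothetical and do not correspond to any dataset in this study.

```python
def detection_metrics(expected, detected):
    """Count TP, FP and FN species and derive precision and recall."""
    expected, detected = set(expected), set(detected)
    tp = expected & detected   # species present and reported
    fp = detected - expected   # species reported but not present
    fn = expected - detected   # species present but missed
    precision = len(tp) / len(detected) if detected else 0.0
    recall = len(tp) / len(expected) if expected else 0.0
    return {
        "TP": len(tp),
        "FP": len(fp),
        "FN": len(fn),
        "precision": precision,
        "recall": recall,
    }


# Placeholder labels, not real results.
expected = {"sp_A", "sp_B", "sp_C", "sp_D"}
detected = {"sp_A", "sp_B", "sp_C", "sp_X"}
metrics = detection_metrics(expected, detected)
```

Reporting precision alongside the raw FP count makes tools with very different numbers of reported species easier to compare.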

#### High Accuracy of Proportion Analysis

To control the quality of probiotic products, it is essential not only to detect the presence of species, but also to determine their relative ratios. The cost of probiotic products varies based on the species present, and species that make up a small proportion of the total bacteria may gradually disappear from a product over time. For proportion analysis, the number of reads as a proportion of the genome size of each species was standardized so that the data showed the same ratio (i.e., 10%) for all 10 species of GSLA. As in the detection ability test described above, FP species appeared in all programs tested except for our pipeline; however, only the relative quantities of the 10 species of interest were compared, without consideration of the FP species. For all other programs, the proportions reported in their results were used, whereas for our pipeline the calculation was based on mapping depth. Using simulated data, the variance in proportions was 0.11 in our pipeline, versus 1.56 in MetaPhlAn 1, 1.75 in MetaPhlAn 2, 4.78 in k-SLAM, 2.72 in Kaiju and 2.76 in KrakenHLL (**Figure 4A**). As in the detection ability test, CLARK-S detected 100% C. curvus species. Using SRA data, the variance in proportions was 0.27 for our pipeline, 1.49 for MetaPhlAn 1, 2.15 for MetaPhlAn 2, 3.17 for k-SLAM, 2.12 for CLARK-S, 5.51 for Kaiju and 2.04 for KrakenHLL (**Figure 4B**).
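A depth-based proportion estimate of the kind described above can be sketched as follows. The exact formula and the variance estimator used in the study are not given in this section, so the sketch assumes that mean depth is mapped bases divided by genome length and that spread is measured as the population variance of the percentages; all names and numbers are hypothetical.

```python
from statistics import pvariance


def proportions_from_depth(mapped_bases, genome_lengths):
    """Percentage of each species, estimated from mean mapping depth."""
    # Dividing by genome length stops larger genomes from looking more abundant.
    depth = {sp: mapped_bases[sp] / genome_lengths[sp] for sp in mapped_bases}
    total_depth = sum(depth.values())
    return {sp: 100.0 * d / total_depth for sp, d in depth.items()}


# Hypothetical toy input: two species with different genome sizes.
genome_lengths = {"species_1": 2_000_000, "species_2": 3_000_000}
mapped_bases = {"species_1": 20_000_000, "species_2": 90_000_000}

proportions = proportions_from_depth(mapped_bases, genome_lengths)
spread = pvariance(list(proportions.values()))
```

For a mixture designed to contain equal proportions, a spread close to zero indicates that the estimated ratios match the design.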

#### Time Required for Species Detection

It is important to determine the number of reads and time required to detect the species when using any method because both time and monetary costs depend on the size of the dataset. Through random sampling, we controlled costs by reducing the size of the Probiotics\_1 dataset, which was the largest dataset (30 Gb <sup>∗</sup> 2) used in the detection ability test.


TABLE 2 | The results of single species data from the SRA.


The number of FP species is represented by "+."

line represents the precision of each classification.

Bifidobacterium bifidum was not detected at first, while Lactobacillus fermentum, with 0.6989 coverage, was not detected when the file size was reduced to 50%. As a result of a further reduction in file size, from 10 to 5%, three additional FNs appeared. At 5% of the original data set size, B. longum, L. paracasei, and Lactobacillus reuteri were not detected, with 0.713, 0.6936, and 0.5766 coverage, respectively (**Figure 5**). According to proportion analysis of these Illumina data, we confirmed that B. bifidum accounted for 0.01% of the sample, L. fermentum for 0.11%, B. longum for 1.05%, L. paracasei for 0.74%, and L. reuteri for 0.17%. Considering these results, we determined that at least 3 Gb <sup>∗</sup> 2 of Illumina paired-end data was required to detect species accounting for about 1% of the sample. When the file size was reduced, the time required for processing was also dramatically reduced: 452 min for 30 Gb <sup>∗</sup> 2 and 25 min for 3 Gb <sup>∗</sup> 2 (**Table 3**).
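The random down-sampling used in this test can be illustrated with the sketch below. The tool actually used for sampling is not named in this section, so this is only one possible approach; it works on in-memory lists for clarity, whereas real FASTQ files of this size would be streamed. The function name and read identifiers are hypothetical.

```python
import random


def subsample_pairs(reads_1, reads_2, fraction, seed=0):
    """Randomly keep a fraction of read pairs, keeping mates in sync."""
    if len(reads_1) != len(reads_2):
        raise ValueError("paired-end files must contain the same number of reads")
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    n_keep = round(len(reads_1) * fraction)
    # Sample indices once and apply them to both mates.
    keep = sorted(rng.sample(range(len(reads_1)), n_keep))
    return [reads_1[i] for i in keep], [reads_2[i] for i in keep]


# Hypothetical read identifiers standing in for FASTQ records.
reads_1 = [f"read{i}/1" for i in range(1000)]
reads_2 = [f"read{i}/2" for i in range(1000)]
sub_1, sub_2 = subsample_pairs(reads_1, reads_2, fraction=0.1)
```

Sampling indices rather than sampling each file independently is what keeps forward and reverse reads paired after the reduction.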

#### DISCUSSION

Our pipeline, which is based on mapping coverage, provides new criteria for determining the presence or absence of GSLA in a sample, adequately controlling for false detections and showing high accuracy in proportion analysis.

A benefit of using all available genome information is that it is possible to address problems such as structural variations in the genome of an individual species. However, when the same loci are present in several mapping targets due to homology, most short-read aligners randomly assign each such read to one of them, affecting the calculation of genome coverage for each species. Those reads can be mapped to all of these loci by adjusting the alignment parameters, but this increases mapping time significantly and adds reads artificially, leading to incorrect results in subsequent proportion analysis. Moreover, due to the difference among strains in the number of genomes available in the current database, it is difficult to set the coverage criterion for the detection test. According to the detection test, there was no difference in performance between the pipeline that used only the genomes of representative strains and that using all available genomes of all strains. Thus, to obtain data for proportion analysis in a shorter time without compromising detection ability, we utilized the representative genome set as reference data.

As this methodology uses only one representative genome for each species, the results can differ tremendously depending on which strain's genome is used as the representative genome. For example, B. longum had different minimum coverage when the representative genome or a non-representative genome from strain GCF\_000020425.1 was used. When the representative genome was used, the coverage was similar to the results of 70% DDH, whereas when the non-representative genome was used, the coverage was too low to be used as a criterion for mapping coverage. The criterion can be neither too high nor too low, because it determines detection ability. If it is too low, even species that should not be detected will be detected. Thus, a reasonable value, similar to the result of 70% DDH, was set as the coverage criterion for our pipeline. Additionally, this result confirms the importance of selecting a representative genome for species determination using our pipeline, and shows why we selected a representative genome for each species by calculating all pairwise minimum coverage values for all strains with available genomic data. The highest minimum coverage values for the representative strains vary across species. This result may have been caused by the myriad of genomic structural variants present in certain species (Lan and Reeves, 2000). For instance, the minimum one-to-one pairwise coverage value for B. longum increased when all representative strains were used, compared with aligning to only one representative strain, because the former accounts for structural variation.
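One way to read the selection rule described above is as a max-min choice over a pairwise coverage matrix: for each candidate strain, take its lowest coverage against any other strain of the species, then pick the strain whose lowest value is highest. The sketch below illustrates that reading only; the procedure itself is defined in the Methods, which are outside this section, and the function name, strain labels, and values are hypothetical.

```python
def select_representative(pairwise_coverage):
    """Pick the strain with the highest minimum pairwise coverage.

    pairwise_coverage[ref][query] is the coverage of genome `ref`
    obtained when data from strain `query` are mapped to it.
    """
    best_strain, best_min = None, -1.0
    for ref, row in pairwise_coverage.items():
        values = [cov for query, cov in row.items() if query != ref]
        if not values:
            continue  # a strain with no comparisons cannot be ranked
        worst_case = min(values)
        if worst_case > best_min:
            best_strain, best_min = ref, worst_case
    return best_strain, best_min


# Hypothetical coverage matrix for three strains of one species.
pairwise_coverage = {
    "s1": {"s2": 0.90, "s3": 0.80},
    "s2": {"s1": 0.85, "s3": 0.60},
    "s3": {"s1": 0.70, "s2": 0.75},
}
representative, baseline = select_representative(pairwise_coverage)
```

Under this reading, the worst-case coverage of the chosen strain is a natural candidate for the species-level detection baseline.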

During the process of species identification, two problems were observed: (1) strains that came from the same species separated into different species based on ANI criteria, and (2) two different species grouped together and classified as the same species. The first problem was solved by selecting an additional representative strain for each group that was divided based on 95% ANI. As a result, we were able to identify the strains at the species level regardless of which group they belonged to. However, other metagenomic classification tools such as MetaPhlAn showed a weakness in classifying such species. For example, in the detection ability test, most samples with L. casei had high coverage for the representative strain of the group containing seven strains. Meanwhile, in two samples, i.e., the simulated data and Probiotics\_5 on Ion Torrent, the coverage for the representative strain of the two-strain group, and not that for the seven-strain group, was greater than 0.7137. In these samples, MetaPhlAn showed false detections: MetaPhlAn 1 did not detect L. casei at all, and MetaPhlAn 2 detected Lactobacillus zeae as a FP instead of L. casei. This FP occurred because L. zeae falls under the L. casei group based on NCBI taxonomy, and the two strains were very similar to L. zeae (Kang et al., 2017). In the second case, the problem is that two species were detected even though the sample contained only one species. For example, L. gallinarum was detected in L. helveticus single-species data, because these species have high identity (Jebava et al., 2014). In other words, the two species shared reads used in the process of aligning. To prevent this issue, it was necessary to accurately classify the data through additional analysis. However, if the proportion of that GSLA in the product was low or if insufficient sequencing data were produced, both species may be undetected due to their shared reads. Therefore, in such situations, only one species per pair was included in the reference set to ensure sufficient coverage, and when that species was detected, accurate species detection was carried out through additional analysis. However, whether or not both species in a pair are present in a product remains to be addressed. It is therefore necessary to reclassify GSLA based on their genetic and phenotypic relatedness (Salvetti et al., 2018).

#### TABLE 3 | Data processing time required when the data set size was reduced (Min).

Our pipeline is based on mapping coverage, which is thought to have a positive correlation with ANI. As expected, the relationship was positive in most species, but some species, including E. faecalis and P. pentosaceus, showed no correlation. This result may indicate a limitation of ANI, as it only uses sequences with the best match in BLASTn after cutting the overall sequence into 1,020 bp fragments (Arahal, 2014). Furthermore, cases where the species classification was unclear based on ANI confirmed that ANI should be modified based on coverage or that a new method should be developed to address this problem (Rosselló-Móra and Amann, 2015; Varghese et al., 2015).

The classification programs used in this study required filtering of several FPs. Such filtering was easy when analyzing a single-species sample, but when multiple species were mixed, different filtering criteria were needed for accurate detection. That is, if information about the sample is not known, or if only a small amount of GSLA is present in the sample, the filtering value must be set blindly such that false detection cannot be controlled. This may lead to problems such as unresolved labeling errors.

As our pipeline involved the use of all reads mapped to the whole genome, the results of proportion analysis showed high consistency. Other classifications based on a specific sequence region of interest, such as those using k-mer value, had high variance values of two to three, showing that they are common for proportion analysis. Whole genomes were used to obtain more reliable results, which could be compared with identification data obtained using only 16S rRNA, as well as in the cases described above. For classification at the species level, it is difficult to obtain sufficient resolution with current sequencing technologies. Moreover, to conduct proportion analysis, a case-control study is the most commonly used method; furthermore, this method does not show errors when the amount of each species changes. However, targeting the 16S rRNA to determine the relative ratios of species is problematic because of differing numbers of 16S rRNA genes among species of microbes and the variation in copy numbers within species (Klappenbach et al., 2001).

In conclusion, we have shown that a pipeline using coverage was better in terms of coverage accuracy than other classification schemes. Constructing the reference dataset from representative strains was effective and allowed the pipeline to run with a reduced computational load. The reliable results obtained by our pipeline, with respect to GSLA detection (and proportions thereof) in probiotic products are expected to improve the quality of probiotics and associated safety management practices. Furthermore, although the microbes detected were limited to GSLA in this study, our pipeline can be extended to other microbes in the soil environment, viruses, and other microbial groups of interest.

#### AUTHOR CONTRIBUTIONS

WK designed the experiments, interpreted the data, and supervised the study. Funding, computing resources, and server time were granted by SC and HBK. DS performed the bioinformatic analyses, interpreted the data, and drafted the manuscript. HKK performed the experiments. S-YK provided the sequencing data for probiotic products and contributed to the discussion of the results. H-SK, SHK, WL, and SP provided critical comments and helped to direct the study. SYJ contributed to the revision and editing of the manuscript. All authors reviewed the manuscript.

### FUNDING

This work was supported by a grant (17162MFDS043 and 17161MFDS033) from the Ministry of Food and Drug Safety (MFDS) in 2018.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.01683/full#supplementary-material

#### REFERENCES

Ainsworth, D., Sternberg, M. J. E., Raczy, C., and Butcher, S. A. (2017). k-SLAM: accurate and ultra-fast taxonomic classification and gene identification for large metagenomic data sets. Nucleic Acids Res. 45, 1649–1656. doi: 10.1093/nar/gkw1248

I. Sutcliffe, and J. Chun (Cambridge, MA: Academic Press), 103–122.

Yang, X., Zola, J., and Aluru, S. (2011). "Parallel Metagenomic Sequence Clustering Via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clouds," in Proceedings of the Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International Parallel & Distributed Processing Symposium, Anchorage, AK.

**Conflict of Interest Statement:** DS, SYJ, HKK, HBK, SC, and WK were employed by company C&K Genomics. S-YK was employed by company CTCBIO, Inc.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Seol, Jhang, Kim, Kim, Kwak, Kim, Lee, Park, Kim, Cho and Kwak. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Comparative Genomics of Streptococcus thermophilus Support Important Traits Concerning the Evolution, Biology and Technological Properties of the Species

#### Edited by:

Nikos Kyrpides, Lawrence Berkeley National Laboratory, United States

#### Reviewed by:

Stefano Campanaro, University of Padua, Italy Anastasia Chasapi, Centre for Research & Technology Hellas, Greece

\*Correspondence:

Konstantinos Papadimitriou kpapadimitriou@aua.gr

#### †Present address:

Konstantinos Papadimitriou, Department of Food Science and Technology, Faculty of Agriculture and Foods, University of Peloponnese, Kalamata, Greece

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 09 August 2019 Accepted: 03 December 2019 Published: 20 December 2019

#### Citation:

Alexandraki V, Kazou M, Blom J, Pot B, Papadimitriou K and Tsakalidou E (2019) Comparative Genomics of Streptococcus thermophilus Support Important Traits Concerning the Evolution, Biology and Technological Properties of the Species. Front. Microbiol. 10:2916. doi: 10.3389/fmicb.2019.02916

Voula Alexandraki<sup>1</sup>, Maria Kazou<sup>1</sup>, Jochen Blom<sup>2</sup>, Bruno Pot<sup>3</sup>, Konstantinos Papadimitriou<sup>1</sup>\*† and Effie Tsakalidou<sup>1</sup>

<sup>1</sup> Laboratory of Dairy Research, Department of Food Science and Human Nutrition, Agricultural University of Athens, Athens, Greece, <sup>2</sup> Bioinformatics and Systems Biology, Justus Liebig University Giessen, Giessen, Germany, <sup>3</sup> Research Group of Industrial Microbiology and Food Biotechnology (IMDO), Department of Bioengineering Sciences (DBIT), Vrije Universiteit Brussel, Brussels, Belgium

Streptococcus thermophilus is a major starter for the dairy industry with great economic importance. In this study we analyzed 23 fully sequenced genomes of S. thermophilus to highlight novel aspects of the evolution, biology and technological properties of this species. Pan/core genome analysis revealed that the species has an important number of conserved genes and that the pan genome is probably going to be closed soon. According to whole genome phylogeny and average nucleotide identity (ANI) analysis, most S. thermophilus strains were grouped in two major clusters (i.e., clusters A and B). More specifically, cluster A includes strains with chromosomes above 1.83 Mbp, while cluster B includes chromosomes below this threshold. This observation suggests that strains belonging to the two clusters may be differentiated by gene gain or gene loss events. Furthermore, certain strains of cluster A could be further subdivided in subgroups, i.e., subgroup I (ASCC 1275, DGCC 7710, KLDS SM, MN-BM-A02, and ND07), II (MN-BM-A01 and MN-ZLW-002), III (LMD-9 and SMQ-301), and IV (APC151 and ND03). In cluster B certain strains formed one distinct subgroup, i.e., subgroup I (CNRZ1066, CS8, EPS, and S9). Clusters and subgroups observed for S. thermophilus indicate the existence of lineages within the species, an observation which was further supported to a variable degree by the distribution and/or the architecture of several genomic traits. These would include exopolysaccharide (EPS) gene clusters, Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs)-CRISPR associated (Cas) systems, as well as restriction-modification (R-M) systems and genomic islands (GIs). Of note, the histidine biosynthetic cluster was found present in all cluster A strains (plus strain NCTC12958<sup>T</sup> ) but was absent from all strains in cluster B. 
Other loci related to lactose/galactose catabolism and urea metabolism, aminopeptidases, the majority of amino acid and peptide transporters, as well as amino acid biosynthetic pathways were found to be conserved in all strains suggesting their central role for the species. Our study highlights the necessity of sequencing and analyzing more S. thermophilus complete genomes to further elucidate important aspects of strain diversity within this starter culture that may be related to its application in the dairy industry.

Keywords: lineage, horizontal gene transfer, genomic islands, milk, yogurt, cheese, pan genome, CRISPR

### INTRODUCTION

Lactic acid bacteria (LAB) include several species, which are extensively used as starters in dairy fermentations (Kongo, 2013). Among them, Streptococcus thermophilus constitutes a major starter for the dairy industry. It is primarily used in the production of yogurt, alongside with Lactobacillus delbrueckii subsp. bulgaricus, but also in the production of several cheese varieties, such as Feta and Mozzarella (Purwandari et al., 2007; Rantsiou et al., 2008; Anbukkarasi et al., 2013). S. thermophilus is the only species which was granted the generally recognized as safe (GRAS) status according to the Food and Drug Administration [FDA], 2007 and the qualified presumption of safety (QPS) status according to the European Food Safety Authority [EFSA], 2007 within the Streptococcus genus, which consists mainly of commensals and pathogenic species. As it is attested by the large number of pseudogenes identified in the genomes of the S. thermophilus strains sequenced so far, the species has undergone significant genome decay probably due to its adaptation to the dairy environment, which is particularly rich in nutrients (Bolotin et al., 2004; Hols et al., 2005; Goh et al., 2011). The regressive evolution of the species has led to genome reduction and simplification of its metabolism (Mayo et al., 2008). The latter is reflected in the deterioration of genes involved, among others, in sugar utilization. S. thermophilus has also lost typical streptococcal pathogenic features presumably through strain selection during domestication toward a starter culture (Bolotin et al., 2004; Hols et al., 2005; Goh et al., 2011; Papadimitriou et al., 2015b). Furthermore, the protocooperation with L. bulgaricus during the production of yogurt has further shaped the metabolic properties of S. thermophilus toward this symbiotic relationship (Mayo et al., 2008).

Typical technological features of S. thermophilus, such as milk acidification, lactose and galactose utilization, proteolytic activity and exopolysaccharide (EPS) production, contribute in shaping the organoleptic characteristics of the final products (Cui et al., 2016). In addition, the stress responses of the species define its performance under the unfavorable conditions prevailing during food production (Zotta et al., 2008; Cui et al., 2016). S. thermophilus also carries Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs)-CRISPR associated (Cas) (CRISPR-Cas) and restriction-modification (R-M) systems, which may contribute to competitiveness in microbial food ecosystems and resistance against bacteriophages and other parasitic DNA (Horvath and Barrangou, 2010; Dupuis et al., 2013). Moreover, genes in genomic islands (GIs), which have been acquired most probably through horizontal gene transfer (HGT) events, may ascribe a number of adaptive traits to S. thermophilus and could be related to technological characteristics, such as EPS production, bacteriocin biosynthesis, and protocooperation (Liu et al., 2009; Eng et al., 2011).

Another topic that has attracted some attention concerning S. thermophilus was the biodiversity of strains within the species. Original studies used typing techniques like random amplification of polymorphic DNA-PCR (RAPD-PCR) or pulsed-field gel electrophoresis (PFGE), while more recent ones used multilocus sequence typing (MLST) (Moschetti et al., 1998; Giraffa et al., 2001; Mora et al., 2002; Ercolini et al., 2005; Delorme et al., 2010, 2017; Yu et al., 2015). Application of MLST in S. thermophilus had to be optimized to increase discriminating power, given the fact that the species may exhibit limited genetic variability (Delorme et al., 2017). In this study the authors reported 116 sequence types and the existence of groups of strains based on phylogenetic analysis of concatenated sequences of housekeeping genes. Additional analysis revealed clustering of strains based on core genome and CRISPR spacer analysis of 25 sequenced strains (both complete and partial). The authors reported that the clustering based on MLST and whole genome analysis was in agreement but differed from that of CRISPR analysis. With the MLST scheme developed and the wide sample of S. thermophilus strains (n = 178), it was feasible to detect relationship between strains and geographic location.

Furthermore, due to the economic importance of S. thermophilus as a starter, a number of groundbreaking studies have been conducted in an attempt to elucidate the genetic basis behind the physiological and the metabolic properties of the species, which define its technological and probiotic potential. Comparative genomics of S. thermophilus was carried out early on and provided significant information about its adaptation to the milk environment and technological traits (Bolotin et al., 2004; Hols et al., 2005; Goh et al., 2011). However, these studies relied only on a limited number of genome sequences. The current accumulation of completely sequenced S. thermophilus genomes can increase the predictive power of comparative analysis and enhance the interpretation of the acquired data about the genome architecture, functionality and evolution. Furthermore, the advancement of bioinformatics tools and the demand of the dairy industry for novel starter strains render an updated analysis of the species essential. In the present study, the results of an in depth analysis of 23 complete S. thermophilus genomes are presented, focusing on main technological features of the species.

**Abbreviations:** ABC, ATP-binding cassette; ANI, average nucleotide identity; APC, amino acid-polyamine-organocation; blp, bacteriocin-like peptide; BPGA, bacterial pan genome analysis; COG, clusters of orthologous groups; CRISPR-Cas, clustered regularly interspaced short palindromic repeats-CRISPR associated; DRs, direct repeats; EFSA, European Food Safety Authority; EPS, exopolysaccharide; FDA, Food and Drug Administration; GABA, gamma-aminobutyric acid; GIs, genomic islands; GIT, gastrointestinal tract; GRAS, generally recognized as safe; GSH, glutathione; HGT, horizontal gene transfer; ISs, insertion sequences; KEGG, Kyoto Encyclopedia of Genes and Genomes; KOALA, KEGG orthology and links annotation; LAB, lactic acid bacteria; LCBs, local collinear blocks; MiGA, microbial genomes atlas; MLST, multilocus sequence typing; ORFs, open reading frames; PFGE, pulsed-field gel electrophoresis; PFL, pyruvate formate lyase; PFLA, pyruvate formate-lyase activating; PGAP, prokaryotic genome annotation pipeline; QPS, qualified presumption of safety; RAPD, random amplification of polymorphic DNA; RAST, rapid annotation using subsystem technology; R-M, restriction-modification; ROS, reactive oxygen species; SBSEC, Streptococcus bovis/Streptococcus equinus complex.

#### MATERIALS AND METHODS

#### Strains

The 23 S. thermophilus genomes designated as "complete" up to RefSeq release 88 were selected for analysis in this study (**Table 1**). The majority of S. thermophilus strains have been isolated from yogurt (strains LMG 18311, CNRZ1066, LMD-9, MN-ZLW-002, MN-BM-A01, KLDS SM, KLDS 3.1003, and ACA-DC 2) and milk (strains JIM 8232, SMQ-301, ND03, ND07, B59671, EPS, GABA, and NCTC12958<sup>T</sup>). Furthermore, three isolates, namely strains S9, MN-BM-A02, and CS8, were derived from traditional Chinese dairy products. More specifically, MN-BM-A02 was isolated from Fan, a traditional Chinese cheese-like product, while CS8 was isolated from Rubing, a Chinese fresh goat milk cheese. Finally, strains APC151 and ST3 were isolated from fish intestine and commercial dietary supplements, respectively.

#### Comparative and Evolutionary Genomics

ProgressiveMauve was used for the whole genome alignment of the 23 S. thermophilus strains analyzed in this study (Darling et al., 2010). The GenSkew online application was employed in the evaluation of the chromosomal inversions in strains EPS, MN-BM-A01, and MN-ZLW-002<sup>1</sup>. The pan/core genome analysis was performed with the bacterial pan genome analysis (BPGA) pipeline v.1.3 using USEARCH v.9.2.64 for clustering gene families (Edgar, 2010) with a 60% sequence identity cut-off and 20 random permutations of genomes to avoid any bias in the sequential addition of new genomes. The protein coding sequences assigned in the core, accessory and unique gene families were further analyzed for clusters of orthologous groups (COG) categories within the BPGA pipeline (Chaudhari et al., 2016). Alternatively, protein coding sequences of S. thermophilus strains were also analyzed for COG categories with eggNOG-mapper based on the eggNOG v.4.5 orthology database, as highlighted in the text (Huerta-Cepas et al., 2016, 2017). The EDGAR tool was also employed to assist analysis of orthologs whenever necessary, as well as for core genome phylogenetic analysis among S. thermophilus strains (Blom et al., 2016). For the latter, the alignments of the core gene sets were executed with MUSCLE and concatenated to one complete core alignment, which was used to generate the phylogenetic tree by the neighbor-joining method as implemented in the PHYLIP package. The consensus tree topology was verified by 100 bootstrap iterations. The EDGAR software was also exploited for the investigation of the relatedness among S. thermophilus strains through the construction of an average nucleotide identity (ANI) heat map. The ANI values were computed as described by Goris et al. (2007) and as implemented in the JSpecies package (Richter and Rossello-Mora, 2009). The resulting phylogenetic distance values were arranged in an ANI matrix, clustered according to their distance patterns and visualized as a color-coded heatmap, with dark and light orange for high and low similarity regions, respectively. Box Plot Generator was employed for the visualization of genome size differences between the two clusters of the S. thermophilus strains<sup>2</sup>. Statistical differences in genome size were assessed with the Mann–Whitney U Test for p < 0.05. The quality of the genome assemblies was evaluated with the microbial genomes atlas (MiGA) webserver (Rodriguez-R et al., 2018). The COG frequency and the accessory genes presence/absence heatmaps were generated in RStudio using the heatmap.2 function included in the gplots package<sup>3</sup>. Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology and links annotation (KOALA) was employed for K number assignment to S. thermophilus protein coding sequences (Kanehisa et al., 2016b), while KEGG Mapper tools were exploited for further processing of KO annotations (Kanehisa et al., 2016a). The PHASTER web server was used for the identification of putative prophages (Arndt et al., 2016). The comparison of the EPS gene clusters was performed with the Easyfig tool (Sullivan et al., 2011). The transporters were determined using the TransportDB database (Elbourne et al., 2017). The CRISPRs were identified with the CRISPRFinder web tool (Grissa et al., 2007), while comparison of the predicted spacers was performed with CD-HIT Suite (Huang et al., 2010). The REBASE database was used for verifying the R-M systems (Roberts et al., 2015). Finally, the GIs were obtained through the IslandViewer 4 web-based resource (Bertelli et al., 2017). For our analysis, GIs characterized as integrated by the IslandViewer tool were analyzed.
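The role of the random permutations in the pan/core genome analysis can be illustrated with the sketch below. It is not the BPGA implementation: it assumes that proteins have already been clustered into gene families (for example at the 60% identity cut-off mentioned above) and simply averages pan and core genome sizes over shuffled genome orders. The function name and the toy family sets are hypothetical.

```python
import random


def pan_core_curves(gene_families, n_permutations=20, seed=0):
    """Average pan and core genome sizes as genomes are added one by one.

    gene_families: dict mapping genome name -> set of gene family IDs.
    Returns (pan_means, core_means), one value per number of genomes added.
    """
    rng = random.Random(seed)
    genomes = list(gene_families)
    n = len(genomes)
    pan_totals = [0] * n
    core_totals = [0] * n
    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)  # a new addition order for every permutation
        pan, core = set(), None
        for i, genome in enumerate(order):
            families = gene_families[genome]
            pan |= families                       # union grows the pan genome
            core = set(families) if core is None else core & families
            pan_totals[i] += len(pan)
            core_totals[i] += len(core)
    pan_means = [t / n_permutations for t in pan_totals]
    core_means = [t / n_permutations for t in core_totals]
    return pan_means, core_means


# Hypothetical toy input: three genomes with lettered gene families.
toy = {"g1": {"a", "b", "c"}, "g2": {"a", "b", "d"}, "g3": {"a", "e"}}
pan_means, core_means = pan_core_curves(toy)
```

The final points of both curves do not depend on the order of addition; averaging over permutations only smooths the intermediate points, which are the ones used when judging whether a pan genome is approaching closure.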

### RESULTS AND DISCUSSION

#### General Genomic Features

The general genome features of the 23 S. thermophilus strains used in this study are presented in **Table 2**. The chromosome length of the strains ranges between 1.73 and 2.10 Mbp, with an average of 1.85 Mbp, while the % GC content is around 39.0. The number of genes varied between 1,847 and 2,237 including protein coding sequences that varied between 1,555 and 1,854. The percentage of pseudogenes ranged between 9.64 and 13.97%. These variations in genome size, gene and pseudogene content indicate important differences in both gene gain and gene loss events during the evolution of the different strains. It has been previously reported that S. thermophilus owns some of the smallest genomes within streptococci while Streptococcus salivarius some of the largest (Delorme et al., 2015). Based on the complete genome sequences within the salivarius group we found that the percentage of pseudogenes of S. salivarius (12 complete genomes) may reach up to 4%

<sup>1</sup>http://genskew.csb.univie.ac.at/

<sup>2</sup>https://plot.ly

<sup>3</sup>http://www.rstudio.org

TABLE 1 | Streptococcus thermophilus strains with complete genomes analyzed in this study.


while the percentage of pseudogenes of Streptococcus vestibularis NCTC12167, the only strain with a complete genome, was around 8%. These findings suggest a variable degree of evolution through genome decay within the group. Beyond the salivarius group, high percentages of pseudogenes have also been reported for Streptococcus macedonicus and Streptococcus infantarius that are also associated with the dairy environment (Jans et al., 2013a; Papadimitriou et al., 2014). A high number of pseudogenes has also been reported for certain strains of Streptococcus pneumoniae (see for example the studies by Junges et al., 2019; Scott et al., 2019). Interestingly, extensive genome decay seems to be compatible with adaptation in milk (Bolotin et al., 2004; Hols et al., 2005; Jans et al., 2013a; Papadimitriou et al., 2014) or a pathogenic lifestyle (Lerat and Ochman, 2005). Obviously more research is needed to appreciate the strains/species within streptococci that have evolved through reductive processes and to test whether this evolution path can be correlated with the niches they occupy.

Fourteen out of 23 strains carry 18 rRNA genes and the rest carry 15. Interestingly, strains with 18 rRNA genes also own a higher number of tRNA genes (ranging from 67 to 69) compared to strains with 15 rRNAs which own fewer tRNA genes (ranging from 55 to 57). A general comment that can be made about this difference is that strains with a higher number of rRNA and tRNA genes could potentially exhibit a higher growth/metabolic rate (Wassenaar and Lukjancenko, 2014).

Comparison of the chromosomal architecture of the 23 S. thermophilus strains was performed through full-length sequence alignments (**Supplementary Figure S1**). All strains were synchronized from the dnaA so as to simplify the alignment. Analysis revealed a high degree of conservation among different strains. However, strain-specific differences could also be detected. More specifically, low similarity regions, represented as white regions inside the local collinear blocks (LCBs), were found in all strains. Furthermore, many unique regions, represented as blank spaces between the LCBs, were also identified in all strains. In strain EPS a large inversion (1.47 Mbp) was present, while in strains MN-BM-A01 and MN-ZLW-002, a ∼300 kbp inverted region was identified between coordinates 768,310–1,068,868 and 740,416–1,040,999 bp, respectively. These inversions could be either genuine or could be ascribed to assembly artifacts. If the first is true, our observations may correspond to an inversion around the origin of replication for strain EPS, or to an inversion around the terminus of replication for strains MN-BM-A01 and MN-ZLW-002. Such inversions have been described before for bacterial genomes as part of their evolution (Eisen et al., 2000; Darling et al., 2008; Repar and Warnecke, 2017).
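The size of the two ∼300 kbp inverted regions follows directly from the reported coordinates, and a cumulative GC skew profile is one common way to locate the origin and terminus of replication when judging such inversions. The Python sketch below shows both under simplifying assumptions: the window size, function names and toy sequence are hypothetical, and GenSkew's exact windowing and step settings are not reproduced here.

```python
def span_length(start, end):
    """Length in bp of an inclusive coordinate range."""
    return end - start + 1


# Coordinates of the inverted regions as reported in the text.
len_mn_bm_a01 = span_length(768_310, 1_068_868)
len_mn_zlw_002 = span_length(740_416, 1_040_999)


def cumulative_gc_skew(seq, window=1000):
    """Return (window_start, cumulative_skew) for non-overlapping windows.

    The skew of a window is (G - C) / (G + C).
    """
    seq = seq.upper()
    points = []
    total = 0.0
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        total += (g - c) / (g + c) if (g + c) else 0.0
        points.append((start, total))
    return points


def skew_extremes(points):
    """Window starts of the cumulative minimum and maximum.

    The minimum is commonly read as the putative origin and the maximum
    as the putative terminus of replication.
    """
    min_pos = min(points, key=lambda p: p[1])[0]
    max_pos = max(points, key=lambda p: p[1])[0]
    return min_pos, max_pos


# Toy sequence (hypothetical): a C-rich half followed by a G-rich half.
toy = "CCCA" * 1000 + "GGGA" * 1000
toy_min, toy_max = skew_extremes(cumulative_gc_skew(toy, window=1000))
```

A skew profile that changes sign abruptly at an inversion boundary would be one hint of a misassembly, although this alone cannot distinguish a genuine rearrangement from an artifact.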

### Pan/Core Genome Analysis and Phylogenomics

The pan genome of the 23 S. thermophilus strains contains a total number of 2,516 genes, including 1,082 and 997 genes in the core and accessory genomes, respectively (**Figure 1A**). The number of genes in the accessory genome of each strain varied between 432 and 568 and a total of 437 unique genes (singletons) were identified in 14 strains (**Supplementary Table S1** and

TABLE 2 | General genome features of S. thermophilus strains with complete genomes analyzed in this study.


Cluster A strains start with strain JIM 8232 and end with strain ND03. Cluster B strains start with strain B59671 and end with strain ACA-DC 2. <sup>1</sup>Out of 111 essential genes. <sup>2</sup>Genome completeness as calculated by MiGA webserver considering 111 essential genes. <sup>3</sup>Corrected genome completeness considering 106 essential genes after the omission of glyS, proS, pheT, nahD, rpoC1 missing from all S. thermophilus genomes from the list of essential genes used by MiGA webserver.

**Figure 1A**). According to BPGA analysis, the b value of 0.14 in the power-law regression model is indicative of an open pan genome for S. thermophilus that is probably going to be closed soon (**Figure 1B**). This may also be supported by the fact that within the total of unique genes identified in S. thermophilus strains, 71% belong to three strains, namely KLDS 3.1003 (n = 41), JIM 8232 (n = 67), and NCTC12958<sup>T</sup> (n = 204), while strains APC151, ASCC 1275, CNRZ1066, CS8, MN-BM-A01, MN-BM-A02, ND03, ND07, and S9 have no unique genes (**Supplementary Tables S1, S2**). BPGA analysis also revealed the number of exclusively absent genes per strain (**Supplementary Table S1**). Core, accessory and unique genes were further classified into COG categories, as implemented within the BPGA pipeline (**Supplementary Figure S2**). The analysis revealed that approximately 90% of the core, 60% of the accessory and 40% of the unique genes were assigned to various COG categories, with the rest having no prediction. We then excluded the poorly characterized categories R and S from further analysis. The majority of core genes encode proteins involved primarily in housekeeping and metabolic processes. The three most abundant COG categories were J (translation, ribosomal structure, and biogenesis, 12.7%), E (amino acid transport and metabolism, 11.8%), and L (replication, recombination, and repair, 7.3%). In the case of the accessory and unique genes the categories with the highest percentages included categories E, L, and K (transcription) and L, K, and V (defense mechanisms), respectively. In general, accessory and unique genes encoded, among others, transposases, Cas proteins, R-M systems, glycosyltransferases, polysaccharide biosynthesis proteins, amino acid biosynthesis proteins, proteolytic enzymes, stress related proteins, as well as transporters which may contribute to strain-specific technological traits (see below).
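The gene counts quoted above can be cross-checked with simple arithmetic, and the power-law exponent can be illustrated with a log-log least-squares fit. The Python sketch below is only a minimal illustration: the synthetic curve (coefficient 1500, exponent 0.14) and the function name are assumptions made here, and BPGA's own regression procedure may differ. Under the usual Heaps'-law reading, an exponent between 0 and 1 suggests an open pan genome, with smaller values indicating slower growth.

```python
import math

# Counts reported in the text.
core, accessory, unique = 1082, 997, 437
pan_total = core + accessory + unique

# Unique genes of KLDS 3.1003, JIM 8232 and NCTC12958T.
top3_unique = 41 + 67 + 204
top3_share = top3_unique / unique


def fit_power_law(ns, ys):
    """Fit y = a * n**b by least squares on log-log axes; return (a, b)."""
    lx = [math.log(n) for n in ns]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


# Synthetic pan-genome curve for 1..23 genomes (hypothetical, not study data).
ns = list(range(1, 24))
ys = [1500 * n ** 0.14 for n in ns]
a_fit, b_fit = fit_power_law(ns, ys)
pan_genome_open = 0 < b_fit < 1
```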

The phylogenetic relationship among the S. thermophilus strains was determined based on the core genome of the strains and revealed two main clusters containing 15 (APC151, ASCC 1275, DGCC 7710, GABA, JIM 8232, KLDS 3.1003, KLDS SM, LMD-9, MN-BM-A01, MN-BM-A02, MN-ZLW-002, ND03, ND07, SMQ-301, and ST3; cluster A) and seven (ACA-DC 2, B59671, CNRZ1066, CS8, EPS, LMG 18311, and S9; cluster B) strains, respectively, while strain NCTC12958<sup>T</sup> was placed separately (**Figure 2**). Moreover, the ANI-based tree had a practically identical topology to that of the core genome phylogenetic tree (**Figure 3**). The only exception was strain KLDS 3.1003, which was placed in cluster B. A more detailed inspection of the potential differences between strains in the two clusters revealed that cluster A strains had larger genomes beyond 1.83 Mbp, while those in cluster B had smaller genomes (**Table 2** and **Supplementary Figure S3**). This difference was found to

be statistically significant (p < 0.05) suggesting that strains in the two clusters may have been separated by distinct gene gain and/or gene loss events. Within these two main clusters, subgroups of S. thermophilus strains could also be identified during both phylogenetic and ANI analysis. These subgroups include strains ASCC 1275, DGCC 7710, KLDS SM, MN-BM-A02, and ND07 (subgroup AI), MN-BM-A01 and MN-ZLW-002 (subgroup AII), LMD-9 and SMQ-301 (subgroup AIII), APC151 and ND03 (subgroup AIV), and finally CNRZ1066, CS8, EPS, and S9 (subgroup BI) (**Figures 2**, **3**). As already mentioned, core genome phylogeny was also previously performed in a dataset of 25 S. thermophilus strains employing genomes sequenced to a variable degree of completeness (Delorme et al., 2017). In this study 1,311 core proteins were reported. Of note, an earlier study was performed based on three S. thermophilus genome sequences reporting 1,487 core genes (Lefébure and Stanhope, 2007). Our

core genome was estimated to consist of 1,082 core proteins. This may suggest a more stringent selection of core proteins during our analysis. Despite the fact that several different strains were analyzed in our study and the study by Delorme et al. (2017), phylogenetic clustering of strains exhibited similarities supporting more or less the distinction we propose between cluster A and B strains and the subgroups observed within them. Differences in the topology of the two phylogenetic trees can be attributed to the different dataset of genomes analyzed as well as the different methods employed to construct the trees. The fact that we concentrated our analysis solely on strains with complete genome sequences presents an important advantage, since we were able to support clustering of strains based on the comparative genomic analysis of additional genomic traits as follows. Completeness of genome sequence is of utmost importance when the presence/absence of specific loci or their exact organization are the main factors for strain diversification.

The subgroups mentioned above appeared at high ANI values (>99.9%) which may suggest relatively subtle genomic differences. Such differences may indicate that strains of the same subgroup may be very similar but may deviate from the strict definition of clones. However, clonal relationships may be masked among strains due to aberrations in genome assembly that may come into play at such high ANI values (Burall et al., 2016). To avoid this pitfall, we investigated the quality of the assemblies of all S. thermophilus genomes analyzed in this study using the MiGA webserver (**Table 2**). Our analysis indicated that from the list of the 111 essential genes used to assess genome completeness by MiGA, five (i.e., glyS, proS, pheT, nhaD, and rpoC1) were systematically missing from all S. thermophilus genomes. This observation suggested that they do not belong to the gene pool of the species, which is also supported by data presented previously for essential genes in Firmicutes (Albertsen et al., 2013). We thus corrected the completeness score of the genomes by considering a total of 106 essential genes. Fifteen genomes received 100% genome completeness. Five genomes missed only secE, two missed secE plus an additional gene (rpiX or uvrB) and one missed only ychF, all receiving scores above 98.1%. The presence/absence frequency of secE may indicate that it is an accessory gene for S. thermophilus. In all cases the completeness scores of S. thermophilus genomes suggest perfect or nearly perfect assemblies. This is also corroborated by the quality scores for the genome assemblies that were all found "excellent" by the MiGA webserver.
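The corrected completeness scores follow from simple proportions of the 106 retained essential genes. The short Python sketch below works through the arithmetic; the function name is an assumption introduced here, and the calculation is a plain percentage rather than MiGA's internal scoring.

```python
def completeness(total_markers, missing):
    """Percentage of essential marker genes that are present."""
    return 100.0 * (total_markers - missing) / total_markers


# Uncorrected score for a genome lacking only the five genes absent
# from all S. thermophilus genomes (111-gene list).
uncorrected = completeness(111, 5)

# Corrected scores against the 106-gene list.
missing_none = completeness(106, 0)
missing_one = completeness(106, 1)   # e.g., only secE or only ychF
missing_two = completeness(106, 2)   # secE plus one additional gene

# Genome counts reported in the text should add up to all 23 strains.
genomes_accounted = 15 + 5 + 2 + 1
```

With two genes missing the corrected score is 104/106, or about 98.11%, which is the lowest value among the 23 genomes.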

Hierarchical clustering of the COG frequency heat map generated for all S. thermophilus strains also supported the existence of the clusters and subgroups mentioned above, with minor alterations (**Figure 4**). Strains GABA and B59671 were placed in opposite clusters, while strains of the BI subgroup were associated more loosely (i.e., not forming a distinct subgroup). The most abundant category in all strains was E, followed by J and L. The prevalence of the E category may support adaptation of S. thermophilus to milk and the necessity of the organism to use amino acids from the environment.

The presence/absence heat map of the accessory genes of S. thermophilus strains supported once again the existence of clusters A and B (**Figure 5A**). The analysis allowed the identification of genes, which may contribute to the grouping of the strains. As shown in the horizontal axis of the heat map, genes within clusters 4 and 6 are characteristic of clusters B and A, respectively. Moreover, genes of clusters 1, 2, 3, 5, and 4 seem to be present in specific subgroups, namely AII, AIV, AIII, AI, and BI, respectively. Further analysis of the accessory proteins, specifically of those involved in metabolic processes, revealed that cluster A strains (including NCTC12958<sup>T</sup> ) carry the entire set of genes responsible for the biosynthesis of histidine that are basically absent from cluster B (**Figure 5B**). Based on these findings it is plausible to state that strains of S. thermophilus exhibit lineage-type relationships.
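The idea of genes that are characteristic of one cluster can be sketched as a set operation on a presence/absence table. The Python example below uses hypothetical strain and gene labels (placeholders, not data from the study) and a deliberately strict definition: present in every strain of one cluster and absent from every strain of the other. The heat map clustering itself was done with heatmap.2 in R and is not reproduced here.

```python
def cluster_specific_genes(presence, cluster_in, cluster_out):
    """Genes present in all strains of cluster_in and in none of cluster_out."""
    in_all = set.intersection(*(presence[s] for s in cluster_in))
    in_any_other = set.union(*(presence[s] for s in cluster_out))
    return in_all - in_any_other


def presence_matrix(presence, strains, genes):
    """Binary matrix (rows: strains, columns: genes) suitable for a heat map."""
    return [[1 if g in presence[s] else 0 for g in genes] for s in strains]


# Hypothetical accessory gene content per strain (placeholder labels).
presence = {
    "strainA1": {"g1", "g2", "g3"},
    "strainA2": {"g1", "g2", "g4"},
    "strainB1": {"g1", "g5"},
    "strainB2": {"g1", "g4", "g5"},
}
cluster_a = ["strainA1", "strainA2"]
cluster_b = ["strainB1", "strainB2"]

a_specific = cluster_specific_genes(presence, cluster_a, cluster_b)
b_specific = cluster_specific_genes(presence, cluster_b, cluster_a)
matrix = presence_matrix(presence, cluster_a + cluster_b,
                         ["g1", "g2", "g3", "g4", "g5"])
```

In practice a looser criterion, such as presence in most strains of a cluster, may be more appropriate when pseudogenes or annotation differences are involved.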

#### Lactose and Galactose Metabolism

Streptococcus thermophilus preferentially ferments lactose over glucose (Geertsma et al., 2005). Lactose is the main carbohydrate of milk and therefore constitutes the primary carbon and

energy source for S. thermophilus, due to the adaptation of the microorganism to this particular niche (Bolotin et al., 2004; Hols et al., 2005; Goh et al., 2011). The genes implicated in the fermentation of lactose and galactose are organized in two adjacent operons (galRKTEM-lacSZ) (Vaughan et al., 2001). We found the complete locus in all S. thermophilus strains analyzed, with the exception of three strains in which lacS (strains B59671 and KLDS 3.1003) or galR (strain NCTC12958<sup>T</sup>) are putative pseudogenes (**Supplementary Table S3**). The importance of these inactivations needs to be experimentally investigated, but the high degree of conservation of the gal-lac gene clusters among the different S. thermophilus strains, both at sequence and organization levels, reveals their importance in the catabolism of lactose in milk. Apart from galE coding for the enzyme UDP-glucose 4-epimerase that is located in the Leloir gene cluster, a second or even a third distal galE gene was identified in certain strains (**Supplementary Table S3**). It has been demonstrated that the activity of this enzyme is positively correlated with the biosynthesis of precursors for EPS production in EPS producing Gal<sup>−</sup> S. thermophilus strains (Degeest and De Vuyst, 2000). Furthermore, the galactose moiety generated by the hydrolysis of lactose is translocated outside the cell via the dedicated antiporter LacS, which is implicated in the uptake of lactose in exchange for galactose (Vaughan et al., 2003). The majority of S. thermophilus strains are unable to metabolize both free and intracellularly produced galactose, probably either due to insufficient activities

of galK and galM genes or due to mutations in the galR-galK promoter region, which may interfere with the expression levels of the respective enzymes (De Vin et al., 2005; Vaillancourt et al., 2008; Anbukkarasi et al., 2014; Sørensen et al., 2016). Recently, Xiong et al. (2019b) demonstrated that the Gal<sup>+</sup> phenotype of S. thermophilus depends upon the expression of the gal operon, which can be widely affected by a single point mutation at the -9 box in the galK promoter. Since the accumulation of galactose in the medium by S. thermophilus may be important from a technological or nutritional perspective (Giaretta et al., 2018), we examined the presence of the mutation at the -9 box in the galK promoter in the strains analyzed. Accordingly, only B59671, CS8, EPS, and NCTC12958<sup>T</sup> seem to be able to catabolize galactose, as they carry the G to A mutation at position -9 of the -10 box that is associated with the Gal<sup>+</sup> phenotype (data not shown). However, experimental verification is required to validate this prediction.

#### Biosynthesis of EPS

One of the key technological properties of S. thermophilus is the production of EPS, which has been related to desirable textural properties and reduced syneresis in fermented dairy products (Lluis-Arroyo et al., 2014; Han et al., 2016). In a recent study, the EPS clusters of several strains were compared suggesting variations in the gene content of these loci (Cui et al., 2017). Our analysis revealed the presence of EPS gene clusters in all S. thermophilus strains examined. The size of

FIGURE 4 | Clusters of orthologous groups (COG) frequency heat map based on a two-dimensional hierarchical clustering. The horizontal axis corresponds to the percentage frequency of proteins involved in the respective COG functional categories: Information storage and processing: translation, ribosomal structure, and biogenesis (J), transcription (K), replication, recombination, and repair (L); cellular processes and signaling: cell cycle control, cell division, chromosome partitioning (D), cell wall/membrane/envelope biogenesis (M), cell motility (N), post-translational modification, protein turnover, chaperones (O), signal transduction mechanisms (T), intracellular trafficking, secretion, and vesicular transport (U), defense mechanisms (V); metabolism: energy production and conversion (C), amino acid transport and metabolism (E), nucleotide transport and metabolism (F), carbohydrate transport and metabolism (G), coenzyme transport and metabolism (H), lipid transport and metabolism (I), inorganic ion transport and metabolism (P), secondary metabolites biosynthesis, transport and catabolism (Q). The vertical axis shows the 23 S. thermophilus strains. Strains were grouped in two clusters (A,B). Subgroups within the clusters are also highlighted (AI, AII, AIII, AIV, and BI). Categories R and S, concerning poorly characterized proteins, were not included in the analysis.

the clusters ranged between 18,661 and 35,973 bp and the % GC content (34.3–36.4%) was found to be lower than the % GC content calculated for the complete genomes of all strains (**Supplementary Table S4**). All clusters are flanked by a purine-nucleoside phosphorylase gene (deoD) and a transporter protein gene as their boundaries (**Figure 6**). The alignment of the EPS loci showed that they are highly conserved at the 5′ and the 3′ ends and their differences are located mainly in the middle of the clusters. At the 5′ end, genes epsA, epsB, epsC, and epsD were found in all EPS gene clusters and their role has been associated with the regulation of eps genes and chain elongation of the EPS molecules (Cui et al., 2017). The adjacent epsE gene encoding a galactosyl-1-phosphate transferase was found in five out of 23 EPS gene clusters (strains CNRZ1066, CS8, EPS, S9, and SMQ-301). In the remaining EPS clusters, epsE seems to encode a glycosyl-1-phosphate transferase. These enzymes initiate the assembly of the EPS repeating components through the transfer of phosphorylated sugars to the undecaprenyl-phosphate lipid carrier on the cytoplasmic side of the bacterial membrane (Broadbent et al., 2003; Wu et al., 2014). The sugar is transferred to the outer side of the membrane and this translocation process is probably facilitated by a flippase protein (Manat et al., 2014). All cluster A strains, including strain NCTC12958<sup>T</sup>, carried one flippase coding gene with the exception of strain ST3 which carried two. In contrast, all strains from cluster B seem to lack the respective gene with the exception of strain B59671.
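The comparison of % GC content between an EPS locus and the rest of the chromosome can be expressed in a few lines of Python. The sketch below is only illustrative: the function name and the toy sequences are hypothetical, and ambiguous bases are simply ignored, which may differ from how the values in Supplementary Table S4 were computed.

```python
def gc_percent(seq):
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


# Toy sequences (hypothetical): an AT-rich "locus" and a more balanced "genome".
toy_locus = "ATATATGC" * 10
toy_genome = "ATGCATGCAT" * 10

locus_gc = gc_percent(toy_locus)
genome_gc = gc_percent(toy_genome)
locus_is_gc_poor = locus_gc < genome_gc
```

A locus whose GC content deviates from the genome average is often taken as one hint of horizontal acquisition, although this is not conclusive on its own.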

The genes downstream of epsE encode proteins with various functions related to EPS biosynthesis. Among them, glycosyltransferases are involved in the sequential transfer of nucleotide sugar moieties to the lipid carrier. Both the number

in cluster A or B as well as specific subgroups of strains is highlighted with blue or black frames, respectively. Presence/absence heat map and hierarchical clustering of S. thermophilus strains based on accessory genes with clusters of orthologous groups (COG) assignment involved in metabolism (categories C, E, F, G, H, I, P, and Q) (B). Colored areas represent the presence of genes in the respective S. thermophilus strains, while white areas indicate the absence of genes. Genes implicated in the biosynthesis of histidine are highlighted with a black frame. Of note, panel (B) is a composite figure of an excel generated table manually colored, while clustering was exported from RStudio. This was necessary to achieve clustering of genes grouped based on COG categories.

and the type of the respective genes in the EPS clusters are variable and may influence the composition of the produced EPS (Cui et al., 2017). The current analysis revealed the presence of transferases commonly encountered in S. thermophilus EPS clusters, such as glucosyltransferases, galactosyltransferases, and rhamnosyltransferases. A UDP-galactopyranose mutase involved in the synthesis of UDP-galactofuranose was identified in half of the EPS clusters. Interestingly, only strains JIM 8232, GABA, and NCTC12958<sup>T</sup> were found to carry a gene encoding a putative galactofuranosyl-transferase. Finally, genes implicated in the polymerization and translocation of the EPS repeating units have been also identified in all EPS clusters, as reported previously for S. thermophilus strains (Goh et al., 2011; Wu et al., 2014; Cui et al., 2017; Evivie et al., 2017).

Based on synteny, the EPS gene clusters can be categorized practically in distinct groups, supporting AI, AII, AIV, and BI subgroups. EPS clusters of strains KLDS 3.1003 and ST3 were highly similar to subgroup AI. Similarities in EPS clusters were also observed beyond lineages, as in the case of strains GABA and NCTC12958<sup>T</sup>. Certain EPS gene clusters, namely those of strains ACA-DC 2, LMD-9, LMG 18311, and B59671, presented higher structural variability due to the presence of many unique genes, which mostly encode hypothetical proteins and glycosyltransferases. These observations are in accordance with previous findings for strains LMD-9 and LMG 18311 (Goh et al., 2011). Of note, three recent studies have been performed to highlight molecular mechanisms of EPS production in strains ASCC 1275 and KLDS SM (Li et al., 2018; Padmanabhan et al., 2018; Wu and Shah, 2018), while a fourth study suggests a protective role of purified EPS isolated from strain MN-BM-A01 against colitis in mice (Chen et al., 2019). The lineage-like patterns we observed among EPS gene clusters could potentially

FIGURE 6 | Multiple sequence alignment of the exopolysaccharide (EPS) gene clusters of the 23 S. thermophilus strains after BLASTN analysis. Gray shading represents the % identity among the nucleotide sequences according to the color gradient presented at the lower right corner of the figure. Protein coding genes are highlighted in dark blue, putative pseudogenes in orange, the deoD in yellow, the transporter gene in green and the unique genes for each S. thermophilus strain in beige. Clusters and subgroups of strains are highlighted.

be useful for extrapolating findings from one strain to another. In all cases understanding of the EPS biosynthesis in S. thermophilus may allow a better selection of strains or even their engineering for improved dairy and probiotic products (Xiong et al., 2019a).

#### Proteolytic System


The proteolytic system of LAB has been extensively investigated. A number of studies have revealed the diversity of its components, i.e., cell-wall bound proteinases, peptide and amino acid transporters and peptidases, among various LAB species (Savijoki et al., 2006; Liu et al., 2010). In the present study, the proteolytic system of S. thermophilus strains was examined on the basis of the scheme published by Liu et al. (2010) and the recent work of Tian et al. (2018). The results acquired from the TransportDB database were also employed.

Due to the limited availability of free amino acids and peptides in milk, the degradation of caseins is essential for growth. In S. thermophilus, the cell-wall associated proteinase PrtS is implicated in the initiation of the proteolytic cascade (Hols et al., 2005; Goh et al., 2011; Tian et al., 2018). prtS is present in almost half of the strains examined. The analysis showed that the respective gene is present (intact or truncated) solely in cluster A strains with the exception of strains APC151, KLDS 3.1003, and ND03 (**Supplementary Table S5A**). As previously reported, PrtS presents 95% identity to the PrtS protein of Streptococcus suis and the distribution of prtS in S. thermophilus strains is infrequent in historical collections compared to industrial ones, indicating acquisition by lateral transfer in the species population (Delorme et al., 2010). PrtS has been related to the rapid growth of S. thermophilus in milk as a mono-culture and therefore to the rapid acidification of milk, which is a desirable technological trait. However, the sole presence of prtS is not sufficient for the rapid milk acidification by S. thermophilus. Milk acidification seems to be a complex phenotypic trait, which involves the overexpression of several genes (Galia et al., 2016). Furthermore, it was demonstrated that S. thermophilus strains, irrespective of the prtS+/<sup>−</sup> status, may present cell-associated extracellular peptidase activities. These activities, albeit weaker than that of PrtS, could probably provide amino acids essential for S. thermophilus growth (Hafeez et al., 2015). The extracellular presence of PepX aminopeptidase in S. thermophilus was recently suggested (Hafeez et al., 2019). Nevertheless, it has been supported that only prtS<sup>−</sup> S. thermophilus strains can perform protocooperation with L. bulgaricus (Settachaimongkon et al., 2014).

Several peptide and amino acid transporters of various families have been predicted in all S. thermophilus strains (**Supplementary Table S5B**). The majority of these transporters belong to the ATP-binding cassette (ABC) superfamily and include one oligopeptide Opp ABC transporter, one branched-chain amino acid ABC transporter, one glutamine ABC transporter, four amino acid ABC transporters, one spermidine/putrescine ABC transporter and one methionine ABC transporter. In a number of instances, the gene clusters of these transporters may contain putative pseudogenes and thus may not be functional. It has been previously reported that strain LMD-9 carries a second Opp ABC transporter, which is homologous to that of Bifidobacterium species (Goh et al., 2011). This transporter is also present in strains SMQ-301 and ST3. Strains B59671, GABA, and NCTC12958<sup>T</sup> have one extra amino acid ABC transporter, which displays high identity (90%) with the respective one of S. salivarius (data not shown). Furthermore, all strains carry four amino acid permeases of the amino acid-polyamine-organocation (APC) family. Additionally, strains ACA-DC 2, APC151, B59671, GABA, KLDS 3.1003, and ND03 carry a glutamate/GABA antiporter (gadC) (**Supplementary Table S5B**). The latter gene along with the glutamate decarboxylase gene (gadB) are responsible for gamma-aminobutyric acid (GABA) production. It was recently demonstrated that strain APC151 is a high-yield GABA producer (Linares et al., 2016). In strain KLDS 3.1003 a unique histidine/histamine antiporter has also been identified (hdcP) (**Supplementary Table S5B**). The respective gene is located adjacent to a unique histidine decarboxylase gene (hdcA) and along with hdcB forms the hdc cluster, probably acquired by HGT (see below), which has been previously described in two other strains of S. thermophilus (Calles-Enríquez et al., 2010).
From a physiological point of view, this gene cluster is probably implicated in cell protection under acidic conditions (De Angelis and Gobetti, 2011). The use of histamine-producing S. thermophilus strains should be avoided in dairy manufacture, since it has been demonstrated that hdcA<sup>+</sup> S. thermophilus used as starter in cheese production was associated with the accumulation of histamine in the final product (Gardini et al., 2012). One di-tripeptide transporter is present in all strains. A branched-chain amino acid permease and an amino efflux protein are also present in all strains, except for B59671 and ST3, respectively. The transport of the branched-chain amino acids leucine, isoleucine, and valine, as well as alanine, serine/threonine and glutamate/aspartate is probably facilitated by four symporters, three of them being present in all strains and only one in six strains (**Supplementary Table S5B**). In addition, a number of incomplete ABC transporters have also been predicted in all the strains analyzed (data not shown).

Besides PrtS, 12 highly conserved cytoplasmic peptidases have been identified in all strains, namely pepA, pepC, pepF, pepM, pepN, pepO, pepP, pepQ, pepS, pepT, pepV, and pepX (**Supplementary Table S5A**). Moreover, a number of peptidases, which have been identified in several LAB species, are missing from all S. thermophilus strains (Liu et al., 2010). More specifically, pyrrolidone-carboxylate peptidase (pcp) and proline peptidases pepI, pepR, and pepL are absent. Cysteine aminopeptidase (pepE/pepG) presents 40% identity with aminopeptidase C in all S. thermophilus strains, while a putative dipeptidase pepD is present but truncated in 14 S. thermophilus strains. It should be mentioned that the universal distribution of the majority of genes encoding proteins of the proteolytic system of S. thermophilus supports the essential role of the system.

#### Amino Acids Biosynthesis

The in silico analysis of amino acid biosynthetic pathways has been addressed in S. thermophilus (Hols et al., 2005). Experimental data for the species have been acquired for the biosynthesis of proline, branched-chain amino acids, glutamine and aspartate (Limauro et al., 1996; Garault et al., 2000; Monnet

et al., 2005; Arioli et al., 2007). Furthermore, Pastink et al. (2009) studied the amino acid metabolism and amino acid dependency of strain LMG 18311 through amino acid omission experiments, concluding that the minimal amino acid auxotrophy for the strain involves histidine and one of the sulfur-containing amino acids (methionine or cysteine). In some S. thermophilus strains amino acid requirements for growth involve at least four amino acids (Glu, Cys, His, and Met; Letort and Juillard, 2001). It seems that amino acid auxotrophy may be a strain dependent trait.

Most amino acid biosynthetic pathways are highly conserved in the 23 S. thermophilus strains (**Supplementary Figure S4** and **Supplementary Table S6**). Analysis of S. thermophilus protein coding sequences, based on KEGG orthology assignments and Hols et al. (2005), revealed that the majority of the amino acid biosynthetic pathways are present in all strains examined. Complete biosynthetic pathways in all S. thermophilus strains were predicted for threonine, cysteine, glycine, proline, glutamine, asparagine, phenylalanine, alanine, aspartate, and glutamate. Current annotations of all S. thermophilus strains in RefSeq with the prokaryotic genome annotation pipeline (PGAP) do not seem to support biosynthesis of lysine due to the absence of dapE, dapH, and dapF (**Supplementary Table S6**). An incomplete Dap-pathway was also reported for strains LMG 18311 (Hols et al., 2005) and LMD-9 (Goh et al., 2011). However, experimental evidence suggests biosynthesis of lysine in strains LMG 18311 (Pastink et al., 2009) and MN-ZLW-002 (Qiao et al., 2018) presumably through a complete Dap-pathway. We found that this discrepancy may be an artifact of annotation with the PGAP tool. Older S. thermophilus GenBank files, annotated with tools other than PGAP, included a locus with three genes, the second of which is identified as a (truncated) dapE (data not shown). In contrast, in the same locus, PGAP predicts a single gene corresponding to a putative M20 peptidase pseudogene (e.g., locus\_tag Y1U\_RS01580 in strain MN-ZLW-002). We also tested other annotation tools, like rapid annotation using subsystem technology (RAST; Aziz et al., 2008) and FGenesB (Solovyev and Salamov, 2011) that also supported a three-gene architecture in the same locus, suggesting that further investigation is required to resolve this matter.

The most striking difference in the biosynthesis of amino acids among the S. thermophilus strains examined concerns histidine. Hols et al. (2005) reported absence of this gene cluster in strains CNRZ1066 and LMG 18311 but its presence in strain LMD-9. As mentioned above, the respective pathway is complete in strains of cluster A and strain NCTC12958<sup>T</sup>, while strains of cluster B carry only one related gene, namely hisK (**Supplementary Table S6** and **Figure 5B**). Furthermore, several amino acid biosynthetic pathways seem to be incomplete in a number of strains. Analysis revealed that in strain B59671 several genes involved in amino acid biosynthesis are putative pseudogenes or absent compared to the other strains. In this strain the glutamate, serine, methionine and tyrosine biosynthetic pathways may be non-functional. Concerning the rest of the strains analyzed, incomplete biosynthetic pathways have been identified for methionine in NCTC12958<sup>T</sup> and ST3, arginine in MN-BM-A01, branched-chain amino acids in JIM 8232 and tryptophan in EPS (**Supplementary Table S6**).

In some cases, differences among genes involved in specific biosynthetic steps during amino acid biosynthesis have been also identified. In tryptophan biosynthesis, two adjacent genes, namely aroG1 and aroG2 (Hols et al., 2005), encoding 70% identical proteins, have been identified in all strains except for strains ST3, CNRZ1066, and CS8. The first strain carries only aroG1, while the last two only aroG2. These genes are involved in the first step of chorismate synthesis, an intermediate product during tryptophan biosynthesis. Concerning the biosynthesis of branched-chain amino acids, in all S. thermophilus genomes two ilvD genes have been identified; one belongs to the ilvDBNC operon, while the second is located remotely from the ilvDBNC locus and its functionality is yet to be studied (Hols et al., 2005). The ilvD within the operon is a putative pseudogene in most strains and it seems to be functional only in KLDS 3.1003, LMD-9, NCTC12958<sup>T</sup> , and SMQ-301. These observations need further experimental investigation.

#### Urea Metabolism

Streptococcus thermophilus is perhaps the sole species among the dairy LAB with the ability to hydrolyze urea, a phenotypic trait that adversely affects the milk acidification rate (Pernoud et al., 2004; Iyer et al., 2010). The urease gene cluster is highly conserved in all S. thermophilus strains analyzed and comprises 11 genes in the form of an operon of 8.2 kbp (**Supplementary Table S7**). It includes the acid-activated ureI gene, the structural genes ureABC, the accessory genes ureEFGD and the genes encoding the cobalt/nickel uptake system ureMQO (or cbiMQO) (Mora et al., 2004; Iyer et al., 2010). The ureI gene is located upstream of the structural genes and encodes a pH-dependent urea channel, which is probably activated to compensate for increased extracellular acidity. The ureABC genes encode the three structural subunits of the enzyme, with ureC encoding the large subunit and the remaining two genes encoding the two smaller subunits (Ninova-Nikolova and Urshev, 2013). The auxiliary genes ureEFGD encode metallochaperones involved in nickel metallocenter biosynthesis and the delivery of nickel ions to the active site of the urease. More specifically, the urease apoenzyme forms a complex with the UreD, UreF, and UreG proteins, which is activated by the addition of nickel, bicarbonate and the metallochaperone UreE (Sujoy and Aparna, 2013). The ureMQO system is probably responsible for the translocation of nickel ions into the bacterial cell, as indicated by functional analysis of the homologous genes in S. salivarius (Chen and Burne, 2003).

The physiological role of S. thermophilus urease has not been thoroughly evaluated. Although it is considered a response mechanism to acid stress, it has been demonstrated that urease is also produced at low levels at neutral pH (Mora et al., 2005). The ureolytic activity of S. thermophilus is probably related not only to the biosynthesis of essential amino acids, e.g., glutamine, but to the overall nitrogen metabolism of the species, with the expression of the ure operon depending on aspartate, glutamate, glutamine, and NH<sub>3</sub> concentrations (Monnet et al., 2005; Arioli et al., 2007). However, the rather uncommon urease-negative phenotype has also been reported for S. thermophilus strains, indicating that urease activity may not hold a vital role in milk fermentation (Mora et al., 2002). Recently, spontaneous urease-deficient mutants of S. thermophilus were isolated from S. thermophilus populations deriving from industrial yogurt starters. The stability of the mutated phenotype was confirmed, providing promising results regarding the potential use of urease-deficient strains as starters in dairy fermentations (Ninova-Nikolova and Urshev, 2013). However, in a recent study employing urease-deficient mutants it was suggested that urease activity is important for yogurt acidification and that its absence inhibits fermentation acceleration during protocooperation with L. bulgaricus (Yamauchi et al., 2019).

#### CRISPR-Cas Systems

The CRISPR-Cas systems are defense mechanisms widely distributed in prokaryotes, providing acquired immunity against foreign genetic elements like viruses and plasmids (Horvath and Barrangou, 2010). This immunity mechanism has been extensively studied in S. thermophilus, providing information concerning the environmental adaptability and the anti-phage activity of this microorganism (Sapranauskas et al., 2011; Louis et al., 2017; Hao et al., 2018). In addition, in certain studies spacers within CRISPR arrays in S. thermophilus were employed for assessing diversity among strains of the species (Horvath et al., 2008; Delorme et al., 2017). As mentioned above, Delorme et al. (2017) reported that MLST and whole genome based phylogeny differed from those inferred by CRISPR analysis. Here we revisit clustering of S. thermophilus strains based on CRISPR analysis in the context of complete genome sequences that allowed us further validation of the diversity scheme we propose in this study.

As reported previously (Horvath and Barrangou, 2010), up to four distinct CRISPR-Cas loci, i.e., CRISPR1, CRISPR2, CRISPR3, and CRISPR4 were identified in our S. thermophilus strains (**Supplementary Tables S8, S9A** and **Figure 7**). CRISPR1 and CRISPR3 both belong to Class 2/subtype II-A CRISPR-Cas systems, while CRISPR2 and CRISPR4 belong to Class 1/subtype III-A and Class 1/subtype I-E CRISPR-Cas systems, respectively (Horvath et al., 2008; Makarova et al., 2015; Hao et al., 2018). Furthermore, one putative orphan CRISPR array structure was predicted by CRISPRFinder in strains JIM 8232 and LMG 18311, characterized by the absence of adjacent Cas proteins. The direct repeats (DRs) of this array in JIM 8232 were identical to the DRs of CRISPR3 in other strains, suggesting that it must have owned the relevant Cas proteins originally and subsequently lost them. In contrast, the DRs of LMG 18311 in the orphan array did not match any other DRs.

CRISPR1 was found in 22 out of the 23 S. thermophilus strains analyzed here, with ACA-DC 2 carrying no CRISPR array despite retaining CRISPR-related genes (Alexandraki et al., 2017). CRISPR1 array size ranged between 760 and 2,805 bp. This size variability is associated with the number of spacers (11–42) in the arrays of the different strains. This is the largest CRISPR array within the S. thermophilus strains analyzed with the exception of strain ST3 (**Figure 7**) and it has been reported to be ubiquitous in S. thermophilus strains (Horvath et al., 2008). In strains B59671 and KLDS 3.1003 the gene coding the Cas9 protein is a putative pseudogene, indicating that the respective CRISPR-Cas systems might have been inactivated. Strains ASCC 1275, APC151, DGCC 7710, GABA, KLDS 3.1003, KLDS SM, LMD-9, MN-BM-A02, MN-ZLW-002, ND03, ND07, NCTC12958<sup>T</sup> , SMQ-301, and ST3 also carry CRISPR3. This CRISPR contains 8 to 26 spacers and in most cases is shorter than CRISPR1 (**Figure 7**). A higher activity for CRISPR1 in comparison to CRISPR3 has been experimentally validated (Horvath et al., 2008). In the case of CRISPR3, cas9 is a putative pseudogene in strain MN-BM-A01, indicating that the specific system may have been also inactivated. It should be emphasized that CRISPR1 is detected in both cluster A and B strains, while CRISPR3 is present only in cluster A (apart from strain JIM 8232) and it is totally absent from cluster B strains. Based on analysis of the LMD-9 genome sequence, Horvath et al. (2008) proposed that the entire CRISPR3-Cas system may have been deleted or inserted in S. thermophilus strains through a recombination event between a repeat present in the terminal repeat of CRISPR3 and a repeat close to serB which flanks the system from one side.
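The link between spacer number and array size noted above can be checked with simple arithmetic: an array with n spacers contains n + 1 direct repeats. The sketch below assumes uniform 36-bp repeats and 30-bp spacers; these lengths are illustrative assumptions and are not taken from this study.

```python
def array_length(n_spacers, repeat_len=36, spacer_len=30):
    """Length in bp of a CRISPR array with n spacers and n + 1 repeats.

    repeat_len and spacer_len are assumed, uniform values used only
    for illustration.
    """
    return n_spacers * (repeat_len + spacer_len) + repeat_len


shortest = array_length(11)  # 11 * 66 + 36 = 762
longest = array_length(42)   # 42 * 66 + 36 = 2808
```

Under these assumptions the estimates (762 and 2,808 bp) fall close to the reported range of 760 to 2,805 bp; small deviations would be expected if individual spacers differ slightly in length.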

CRISPR2 was found in strains ASCC 1275, DGCC 7710, GABA, KLDS 3.1003, KLDS SM, JIM 8232, LMD-9, LMG 18311, MN-BM-A02, ND07, SMQ-301, and ST3. Thus, CRISPR2 was present only in cluster A strains, apart from strain LMG 18311 which belongs to cluster B. Among the Cas proteins of CRISPR2, cas1 is a putative pseudogene in strains KLDS 3.1003 and LMG 18311, while cas10 is a putative pseudogene in strains LMD-9 and SMQ-301. Furthermore, the respective CRISPR-Cas systems of strains GABA and ST3 carry only three CRISPR-associated genes (cas1, cas2, and csm6) which indicates that they are incomplete. All these CRISPR2 systems carried a CRISPR array. However, additional "possible" CRISPR2 systems were predicted by CRISPRFinder owning an incomplete set of Cas proteins followed by a single spacer within two DRs (**Supplementary Table S9A**). Our findings suggest inactivation and/or degeneration of CRISPR2 in several strains. Horvath et al. (2008) reported that CRISPR2 may indeed be inactivated in certain strains, however Tamulaitis et al. (2014) were able to demonstrate its activity in at least another strain. CRISPR4 was identified in strains ASCC 1275, B59671, DGCC 7710, KLDS SM, MN-BM-A02, and ND07. Genes cse1 and cas2 in strains ASCC 1275 and B59671, respectively, are putative pseudogenes. Interestingly, the CRISPR4 was basically found in subgroup AI strains. Further subgrouping could be supported not only through the presence/absence of CRISPR-Cas systems, but also through the distribution of different spacers, as discussed below.
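The strain lists given above for CRISPR2, CRISPR3, and CRISPR4 can be turned into a presence/absence profile per strain, which makes co-occurrence of loci easy to query. The sketch below is a minimal illustration; the strain lists are transcribed from the text, while the dictionary layout and function names are assumptions.

```python
# Strain lists transcribed from the text above (CRISPR1 is omitted
# because it is present in 22 of the 23 strains).
CARRIERS = {
    "CRISPR2": {
        "ASCC 1275", "DGCC 7710", "GABA", "KLDS 3.1003", "KLDS SM",
        "JIM 8232", "LMD-9", "LMG 18311", "MN-BM-A02", "ND07",
        "SMQ-301", "ST3",
    },
    "CRISPR3": {
        "ASCC 1275", "APC151", "DGCC 7710", "GABA", "KLDS 3.1003",
        "KLDS SM", "LMD-9", "MN-BM-A02", "MN-ZLW-002", "ND03", "ND07",
        "NCTC12958T", "SMQ-301", "ST3",
    },
    "CRISPR4": {
        "ASCC 1275", "B59671", "DGCC 7710", "KLDS SM", "MN-BM-A02", "ND07",
    },
}


def profile(strain, carriers=CARRIERS):
    """Presence (True) or absence (False) of each locus in one strain."""
    return {locus: strain in strains for locus, strains in carriers.items()}


def strains_with_all(loci, carriers=CARRIERS):
    """Sorted strains that carry every locus named in `loci`."""
    return sorted(set.intersection(*(carriers[locus] for locus in loci)))


all_three = strains_with_all(["CRISPR2", "CRISPR3", "CRISPR4"])
```

Presence in such a table says nothing about activity: several of the listed systems contain putative pseudogenes or an incomplete set of cas genes, as described above.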

A total of 997 spacers were found in the confirmed CRISPR-Cas systems of the 22 S. thermophilus strains with 93% being assigned in CRISPR1 and CRISPR3. Analysis of the respective sequences revealed that 258 are unique among 11 strains, namely NCTC12958<sup>T</sup>, JIM 8232, GABA, KLDS 3.1003, ST3, B59671, LMG 18311, LMD-9, SMQ-301, S9, and EPS, while 253 appeared more than once in the CRISPR arrays. As shown previously, CRISPR arrays may be employed for assessing strain diversity within S. thermophilus (Horvath et al., 2008; Delorme et al., 2017). Indeed, looking into the architecture of the CRISPR arrays we could once more identify patterns that are not shared by all S. thermophilus strains but are specific to the grouping of strains we have already described. For example, CRISPR1 supports subgroups AI, AII, AIII, and AIV. Subgroup BI is partially supported, since only strains CNRZ1066 and CS8 share the same CRISPR array. CRISPR3 supports subgroups AI, AII, AIII, and AIV. CRISPR4 has a unique pattern of spacers for subgroup AI. As mentioned above, CRISPR2 is present only in cluster A strains, apart from strain LMG 18311 which belongs to cluster B, but the spacer pattern in the arrays could not distinguish any subgroup (**Figure 7**). Most spacers were unique for each subgroup and were present in a specific order in the array. This observation suggests that this part of the array was present in the common ancestor of these subgroups of strains. However, in certain instances, a specific spacer could be found common between two seemingly unrelated arrays belonging to different subgroups of strains. Most probably such spacers were acquired by the common ancestor of each subgroup due to exposure to the same exogenous DNA that resulted in the acquisition of the same part of sequence into the specific CRISPR array. Evidently, these spacers were identified only in arrays of the same class and subtype of CRISPR-Cas systems. Similar analyses of spacers to infer evolutionary relationships among S. thermophilus strains have been reported previously (Horvath et al., 2008). However, when looking solely at the architecture of the CRISPR array it is very difficult to distinguish between clones and complexes of very similar strains that are not actual clones.

FIGURE 7 | Spacer sequences alignment of the various clustered regularly interspaced short palindromic repeats-CRISPR associated (CRISPR-Cas) system types found in the 22 S. thermophilus strains. In the alignments only the spacer sequences have been used. In each type of CRISPR-Cas system each spacer is represented by the combination of a character and a font color. The spacers represented in black font with the letter U correspond to unique spacers. Spacers represented by the same combination of a character and a font color correspond to identical spacers. Spacers of CRISPR1 (A), CRISPR2 (B), CRISPR4 (C), and CRISPR3 (D).
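The distinction drawn above between unique spacers and spacers shared among arrays can be sketched with a simple counting scheme. This is a minimal illustration only: here a spacer is called shared when it occurs in more than one array, which is an assumed definition, and the strain names and sequences are hypothetical placeholders rather than data from this study.

```python
from collections import Counter


def classify_spacers(arrays):
    """Split spacers into those seen in one array and those seen in several.

    `arrays` maps a strain name to its ordered list of spacer sequences.
    """
    counts = Counter(s for spacers in arrays.values() for s in set(spacers))
    unique = {s for s, n in counts.items() if n == 1}
    shared = {s for s, n in counts.items() if n > 1}
    return unique, shared


def shared_between(arrays, a, b):
    """Spacers common to arrays a and b, in the order they occur in a."""
    in_b = set(arrays[b])
    return [s for s in arrays[a] if s in in_b]


# Hypothetical toy arrays; real spacers are longer.
toy_arrays = {
    "strainX": ["AAA", "CCC", "GGG"],
    "strainY": ["AAA", "CCC", "TTT"],
    "strainZ": ["ACG"],
}
unique, shared = classify_spacers(toy_arrays)
common_xy = shared_between(toy_arrays, "strainX", "strainY")
```

Keeping the order of shared spacers, as `shared_between` does, matters because a conserved run of spacers in the same order is what suggests inheritance from a common ancestor rather than independent acquisition.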

BLASTN analysis of the spacers showed that 317 sequences matched several different S. thermophilus bacteriophages (**Supplementary Table S9B**). Almost half of the spacers analyzed could be related to phages 7201, Sfi19, Sfi21, DT1, and Sfi11. This finding may indicate a high frequency of exposure of S. thermophilus to the specific phages. Finally, six spacers were highly identical to Lactococcus phages, while 12 spacers were highly identical to plasmids of Enterococcus faecium, S. suis, Streptococcus pyogenes, S. pneumoniae, Lactobacillus salivarius and Lactococcus lactis. These findings indicate that S. thermophilus has been found in the same environment with these bacteria. Furthermore, it could be hypothesized that at least some potential HGT events of plasmid donation toward S. thermophilus were aborted through the activity of CRISPR-Cas systems. Overall our findings are in agreement with previous results (Bolotin et al., 2005; Horvath et al., 2008).

#### R-M Systems and Prophages

Another immunity mechanism employed by prokaryotes against foreign DNA is provided by R-M systems. All S. thermophilus strains analyzed carry several R-M systems, classified into four types (Roberts et al., 2005, 2015; **Supplementary Table S10** and **Supplementary Figure S5**). The majority of strains carry one complete type I R-M system, with strains EPS, NCTC12958<sup>T</sup>, GABA, and KLDS 3.1003 carrying two. No type I R-M system was predicted for strain B59671, while in strains MN-ZLW-002, MN-BM-A01, and ST3 the predicted type I R-M system was incomplete due to the absence or inactivation of one or more of the necessary genes. This was also the case for additional predicted type I systems in several strains. Certain S. thermophilus strains carry at least one type II system, with strains LMD-9, MN-BM-A01, ND03, and APC151 owning three such systems. Unlike type I R-M systems, most type II systems seem to be complete and potentially active. One type III system is present in strains ACA-DC 2, CNRZ1066, CS8, EPS, S9, LMG 18311, NCTC12958<sup>T</sup>, and GABA. Finally, a type IV system has been predicted in almost half of the strains analyzed, which contains only a restriction enzyme that recognizes and cuts modified DNA.

A more detailed investigation revealed that type II and type III R-M systems are absent or inactivated from S. thermophilus strains of subgroup AI. For this reason, we wanted to examine whether the presence/absence pattern of R-M systems in S. thermophilus is lineage specific. As demonstrated in **Supplementary Figure S5** in several instances R-M systems are distributed on the chromosome in a manner that is characteristic for a potential lineage. This is particularly obvious for cluster A and more specifically for subgroups I, II, III, and IV. The R-M systems of strains in cluster B presented some similarities within the same subgroup, but they were more variable.

Despite the presence of the aforementioned defense mechanisms, complete prophages were also predicted for strains APC151, ND03, and NCTC12958<sup>T</sup> , while in the rest of the examined genomes only remnants of prophages have been identified (data not shown). In strains APC151 and ND03 the same prophage was predicted located within the EPS cluster of each strain. In strain NCTC12958<sup>T</sup> the intact prophage was previously described as phage 20617 by Arioli et al. (2018). Interestingly, the authors of this study demonstrated that the lysogenic strain NCTC12958<sup>T</sup> (DSM 20617<sup>T</sup> ) exhibited higher adhesion to solid surfaces and heat resistance compared to the phage-cured derivative strain, suggesting some competitive advantage due to the stable association of the phage and the host.

#### Genomic Islands

Genomic islands acquired through HGT can provide adaptive and technological traits to the host microorganism (Juhas et al., 2009). In silico prediction of HGT in S. thermophilus has been previously reported (Hols et al., 2005; Liu et al., 2009; Eng et al., 2011). In this study, the GIs predicted by IslandViewer 4 in S. thermophilus ranged from 5 to 23 per strain, with sizes between 3.5 and 58 kbp and variable GC content from 26.1 to 45.2% (**Supplementary Table S11**). A total of 253 GIs were predicted, 31 of which were unique in 11 strains. The rest of the GIs have been identified in at least two S. thermophilus strains, either complete or partial. Of note, the genomic array of ribosomal protein genes was predicted as part of a GI in a number of strains. This is a false positive result, since it has been reported that the nucleotide composition of these arrays differs significantly from that of the remaining protein-coding genes (Hols et al., 2005; Fernandez-Gomez et al., 2012). Thus, these GIs (14 in total) were excluded from further analysis (**Supplementary Tables S11, S12A**). Several GIs were found to be present in both cluster A and cluster B strains, while others were present in either cluster A or cluster B strains. The first type of GIs was most probably acquired earlier than the second, i.e., before clusters A and B were separated. In accordance with what has been reported above for other genomic features, certain subgroups of strains display a unique distribution pattern of specific GIs that can support subgroups AI–AIV and BI (**Supplementary Figure S6**).
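The GC content values reported above for the predicted GIs are a simple sequence statistic. A minimal sketch of how such a percentage could be computed is shown below; the function name and the decision to ignore ambiguous bases are assumptions, and this is not the procedure used by IslandViewer 4.

```python
def gc_percent(seq):
    """GC content of a DNA sequence as a percentage of its A/C/G/T bases.

    Ambiguous bases such as N are ignored (an assumed convention).
    """
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


example = gc_percent("ATGCNN")  # N is ignored, so 2 of 4 bases are G or C
```

A region whose GC content departs strongly from the genome average is only a hint of foreign origin; as the ribosomal protein gene arrays above show, native loci can also have atypical composition.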

BLASTN analysis of the predicted GIs could not always reveal a potential donor. Nevertheless, a number of GIs could be traced back to specific microorganisms (coverage >70%, identity >90%; **Supplementary Table S12B**). The majority of species acting as potential donors belongs to the Streptococcus genus but also to other LAB like L. lactis, Lactobacillus casei, and Leuconostoc gelidum. In these last three cases GIs present high identity to plasmids carried by these organisms. In detail, specific GIs in subgroups AI, AII, and one GI in strain NCTC12958<sup>T</sup> present high identity to plasmids pLd7/p229C of L. lactis subsp. lactis (Kelleher et al., 2017; Van Mastrigt et al., 2018), pBD-II/pLC2W of L. casei (Ai et al., 2011; Chen et al., 2011; Song et al., 2018) and plasmid 1 of L. gelidum subsp. gasicomitatum (Andreevskaya et al., 2016), respectively. It is interesting to highlight that strains of S. thermophilus seem to have also interacted with members of the Streptococcus bovis/Streptococcus equinus complex (SBSEC), namely S. macedonicus, S. infantarius subsp. infantarius, Streptococcus gallolyticus, and S. equinus. Members of the complex are established members of the gastrointestinal tract (GIT) of ruminants, while certain species like S. macedonicus and S. infantarius are increasingly associated with fermented foods, especially of dairy origin (Jans et al., 2013a,b; Papadimitriou et al., 2014, 2015a).
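The donor assignment criteria stated above (coverage >70%, identity >90%) amount to a filter over BLASTN hits. The sketch below assumes tab-separated hits with the columns query ID, subject ID, percent identity, and query coverage, in that order; this column layout, the function name, and the example rows are assumptions made for illustration.

```python
def filter_hits(lines, min_cov=70.0, min_ident=90.0):
    """Keep hits whose query coverage and identity exceed both thresholds.

    Each line is assumed to hold four tab-separated fields:
    query ID, subject ID, percent identity, query coverage.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        query, subject, ident, cov = line.split("\t")[:4]
        if float(cov) > min_cov and float(ident) > min_ident:
            hits.append((query, subject, float(ident), float(cov)))
    return hits


# Hypothetical rows, not results from this study.
example_rows = [
    "GI_a\tplasmid_1\t95.2\t88",
    "GI_b\tplasmid_2\t99.0\t40",   # fails the coverage threshold
    "GI_c\tgenome_3\t85.0\t95",    # fails the identity threshold
]
kept = filter_hits(example_rows)
```

Both thresholds are applied as strict inequalities to mirror the wording of the text, so a hit with exactly 70% coverage or 90% identity is discarded.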

A detailed investigation of the annotated features of S. thermophilus GIs revealed that they could be involved in EPS biosynthesis, in accordance with previous findings reported for strains CNRZ1066, LMD-9, and LMG 18311 (Liu et al., 2009). CRISPR-Cas and complete R-M systems have also been identified in GIs. These include CRISPR3 and CRISPR4 and type I and III R-M systems. In addition, the 38.5 kbp GI 9 contains most of the intact prophage in strain NCTC12958<sup>T</sup> (**Supplementary Table S12A**). Our analysis supports the presence of bacteriocin coding genes in the GIs of a number of strains. However, Hols et al. (2005) suggested that the activity of these antimicrobial peptides may not always be guaranteed due to the absence of genes coding for transport or immunity proteins or other differences. For example, the locus of a class II bacteriocin-like peptide (blp) was experimentally studied in strains CNRZ1066, LMG 18311, and LMD-9 and it was concluded that it is only functional in the last strain (Hols et al., 2005). In strain B59671, GI 5 carries genes of the blp gene cluster involved in the production of the bacteriocin thermophilin 110 (Renye et al., 2017). Finally, in GI 6 of strain GABA we found a locus containing several genes coding for leader peptides (including mutacin IV, BlpU, and bovicin 255), but transport or immunity proteins seem to be inactive or absent (**Supplementary Table S12A**). Moreover, several genes involved in amino acid transport have been found in the predicted GIs of S. thermophilus strains. Some of these include a glutamate:GABA antiporter in strains APC151, GABA, and ND03, a dicarboxylate/amino acid:cation symporter in strains APC151, KLDS 3.1003, MN-BM-A01, MN-ZLW-002, ND03, and ST3 and a complete amino acid ABC transporter in strains CS8, EPS, KLDS 3.1003, and S9. The hdc cluster of strain KLDS 3.1003 was also identified in a GI and BLASTN analysis revealed possible HGT from a satellite phage.
Furthermore, GI 7 of strain JIM 8232 corresponds to the biosynthetic gene cluster of histidine. As already mentioned, this region is also present in all cluster A S. thermophilus strains (plus strain NCTC12958<sup>T</sup>) but for unknown reasons it was assigned as a GI only in JIM 8232. BLASTN analysis revealed that this region presents high identity to the SBSEC member S. equinus (92%) supporting its potential acquisition by HGT in S. thermophilus chromosome. In addition, genes involved in fatty acid biosynthesis were identified in GIs of strains APC151, GABA, MN-BM-A01, MN-ZLW-002, and ND03, while stress response genes, e.g., coding for cold-shock proteins were also identified in a number of strains, including ASCC 1275, CNRZ1066, KLDS 3.1003, LMG 18311, ND03, and ST3. Finally, the gene cluster cbs-cblB-cysE involved in the metabolism of sulfur-containing amino acids has been previously suggested to have been transmitted by HGT from L. bulgaricus or Lactobacillus helveticus to S. thermophilus (Liu et al., 2009). Current analysis revealed that the respective cluster was predicted as part of a bigger GI in 17 S. thermophilus strains. More specifically, this GI along with the three genes were identified in strains APC151, GABA, KLDS 3.1003, LMD-9, LMG 18311, MN-BM-A01, MN-ZLW-002, ND03, and SMQ-301, while in strains ACA-DC 2, ASCC 1275, CNRZ1066, CS8, MN-BM-A02, ND07, S9, and ST3 the cysE is a putative pseudogene (**Supplementary Table S12A**).

It should be mentioned that Selle et al. (2015) identified four expendable GIs in the genome of strain LMD-9 with variable distribution in other sequenced strains. IslandViewer 4 did not predict GIs 1 and 2 reported in that study, while it detected GIs overlapping or included in GIs 3 and 4. These differences can be explained by the in silico methods employed to detect GIs. Selle et al. (2015) employed a strategy combining the location of potentially essential open reading frames (ORFs) and highly similar insertion sequences (ISs) which is distinct from the strategies employed by the tools included in IslandViewer 4.

### S. thermophilus Genes Implicated in Protocooperation With L. bulgaricus

The bacterial pair of S. thermophilus and L. bulgaricus is routinely employed in yogurt production. The mutually beneficial interaction between these bacteria in the yogurt ecosystem, known as protocooperation, is based on the exchange of metabolites and results in improved metabolic performance related to accelerated acidification, enhanced EPS production and abundance of aroma volatiles. Initially, S. thermophilus boosts the growth of L. bulgaricus by lowering the pH and providing formic, pyruvic and folic acid as well as carbon dioxide. Subsequently, L. bulgaricus stimulates S. thermophilus growth by producing peptides and free amino acids (Settachaimongkon et al., 2014). Transcriptome analysis of a mixed S. thermophilus and L. bulgaricus culture also supports that metabolites like formic and folic acid produced by S. thermophilus are utilized by L. bulgaricus as precursors in purine biosynthesis (Sieuwerts et al., 2010). S. thermophilus carries genes encoding pyruvate formate lyase (PFL) and pyruvate formate-lyase activating (PFLA) enzyme, while L. bulgaricus lacks these genes (Nishimura et al., 2013). Our analysis revealed the presence of both pfl and pflA in all S. thermophilus strains examined (**Supplementary Table S13**).

In addition, a number of studies have been performed concerning the role of PrtS produced by S. thermophilus during manufacture of dairy products, especially yogurt. For example, PrtS production may positively affect S. thermophilus growth in a pure culture, but it may be neutral in a mixture with L. bulgaricus strains producing the protease PrtB (Courtin et al., 2002). In a more recent study, it was demonstrated that only nonproteolytic S. thermophilus strains performed protocooperation with L. bulgaricus (Settachaimongkon et al., 2014). As already mentioned, the majority of cluster A strains carries prtS, while it is absent from all cluster B strains, indicating that the latter may be more appropriate for protocooperation. However, specific S. thermophilus strains carrying the prtS have been shown to exhibit weak or no PrtS activity (Galia et al., 2009; Cui et al., 2016). In our dataset in strains MN-BM-A01 and SMQ-301 prtS was found to be truncated, an observation that may support to a degree the findings by Galia et al. (2009). Furthermore, it was recently reported that prtS<sup>+</sup> strains may also present some technological advantages (Tian et al., 2018). We thus believe that more research is needed to establish the actual role of prtS regarding protocooperation.

The response of S. thermophilus to H<sub>2</sub>O<sub>2</sub> produced by L. bulgaricus has also been studied. It appears that there is an inverse correlation between iron intake by S. thermophilus and H<sub>2</sub>O<sub>2</sub> production by L. bulgaricus, and that S. thermophilus in the presence of H<sub>2</sub>O<sub>2</sub> regulates iron metabolism in order to diminish the production of harmful reactive oxygen species (ROS) (Herve-Jimenez et al., 2009; Sieuwerts et al., 2010). However, the results of two different studies are rather divergent. In one study, S. thermophilus genes related to iron transport were found to be upregulated in the presence of L. bulgaricus (Sieuwerts et al., 2010), while in another study they were downregulated (Herve-Jimenez et al., 2009). Only dpr (peroxide resistance protein) and fur (ferric transport regulator protein) were found upregulated in both studies. In silico analysis of the 23 S. thermophilus strains revealed that dpr and fur belong to the core genome, while the iron ABC transporter is absent from strains JIM 8232, MN-ZLW-002, ND03, APC151, MN-BM-A01, and ST3 (**Supplementary Table S13**).

A novel protocooperation relationship between S. thermophilus and L. bulgaricus in yogurt fermentation concerns the bi-functional glutathione (GSH) synthetase gene of S. thermophilus, which produces GSH (Wang et al., 2016). The respective gene was found to be conserved in all 23 S. thermophilus strains analyzed (**Supplementary Table S13**). In a recent study, it was demonstrated that GSH produced by S. thermophilus provided protection to both S. thermophilus and L. bulgaricus cells toward acid stress. Additionally, the secreted GSH could enhance the growth of L. bulgaricus (Wang et al., 2016). Finally, genes related to EPS production were found to be upregulated in both microorganisms in a mixed culture when compared to monocultures, and thus they may play an important role in the texture of the final product (Sieuwerts et al., 2010). Given the heterogeneity observed in the EPS gene cluster of S. thermophilus strains, no mechanistic insight could be inferred.

### CONCLUSION

Streptococcus thermophilus is a starter of great economic significance for the dairy industry, contributing to the production of dairy products consumed worldwide, like yogurt and cheeses. A number of studies have been published in an attempt to explore and interpret various features of the species biology related to its technological potential. This became more feasible during the last two decades with the sequencing of genomes of S. thermophilus strains. In this study we analyzed 23 fully sequenced genomes of S. thermophilus in order to examine features of the species related to technological and evolutionary traits. From the beginning of our study, it became evident that strains of S. thermophilus present some variability considering the properties of the genomes (e.g., size, gene content, % of pseudogenes, rRNA and tRNA content). Core genome and ANI phylogenetic analysis revealed a specific pattern of clustering of strains (**Figure 8**). A main observation was that most strains could be separated in two major clusters. Cluster A was characterized by larger genomes, the presence of prtS in the majority of strains, the inclusion of a histidine biosynthesis gene cluster, as well as the presence of certain CRISPR-Cas system types and specific GIs. Strains in cluster B differed from those in cluster A in all these aspects. These observations indicated the existence of at least two major lineages in S. thermophilus that appear at ANI values >98%. Further investigation suggested the presence of subgroups within the two clusters, i.e., subgroups AI–AIV and BI. The existence of these subgroups was also supported to a variable degree by COG analysis as well as by the presence/absence pattern of specific loci and/or their organization, i.e., EPS clusters, CRISPR arrays, R-M systems and GIs. Clustering of S. thermophilus strains based on the spacers of CRISPR arrays has been performed before (Horvath et al., 2008; Delorme et al., 2017). Given the fact that CRISPR arrays can provide a retrospective view of the history of each strain based on the parasitic DNA it was exposed to, spacer sequences of CRISPR1, which is present in practically all strains, support the existence of evolutionarily distinct lineages in S. thermophilus. Biodiversity within strains of S.
thermophilus has been previously suggested using CRISPR array and/or MLST clustering (Horvath et al., 2008; Delorme et al., 2010, 2017; Yu et al., 2015). In our opinion, clustering of strains according to CRISPR array architecture or even MLST has important advantages (e.g., the ability to screen many strains), but these approaches may derive more easily to the characterization of potential clonal strains due to the use of limited genomic information. In contrast, whole genome phylogeny based on core genes should be more robust, while analysis of complete genome sequences may provide even more information concerning the discrimination of strains based on loci beyond core genome, like accessory genes or even unique genes. The subgroups we describe appeared at ANI values well above 99%, an observation that could indicate that they derive from clonal strains. A closer investigation of the data presented in this study suggests in some cases differences among strains of the same subgroup. For example, this becomes obvious when considering the exact sizes of the chromosome of the strains, the exact gene content (including accessory genes but also genes that are exclusively absent from a specific strain). In some instances, differences were observed in the EPS clusters, the distribution of R-M systems and GIs of strains within the same subgroup. Even though the differences among strains of the same subgroup may be rather subtle thus justifying the high ANI values at which their relatedness appears, they diversify strains beyond the strict definition of clones. Our analysis concerning the genome assemblies of the strains suggested a quality level that may not interfere with the grouping scheme we describe. Nonetheless, apart from the differences identified among the strains, our analysis also validated common features or features beyond the clustering pattern mentioned above (**Figure 8**). These would include characteristic traits for the adaptation of S. 
thermophilus to milk, like the conserved gal-lac and urease operons, the extended arsenal of peptidases and amino acid/peptide transporters in parallel to genes related to protocooperation. The high percentage of pseudogenes has been related to the reductive evolution of S. thermophilus during adaptation to nutrient-rich dairy niches (Bolotin et al., 2004; Hols et al., 2005; Goh et al., 2011). This trait was also apparent in all strains analyzed here. Interestingly, features related to milk adaptation also seem to be present in APC151. The strain does not differ from the dairy strains, even though it was the only strain in our dataset that was isolated from a nondairy environment, i.e., the fish intestine. This was also suggested previously (Linares et al., 2017). This relatively odd observation highlights the need to study strains found in environments other than milk and dairy products to fully understand the evolution of the species. Finally, the pan genome of the species is not closed yet, suggesting that sequencing of additional strains will be important. Several new complete genomes have appeared in the databases since the initiation of our analysis (Proust et al., 2018; Renye et al., 2019), but more are required to further expand and validate any lineage-like patterns that may exist and could be related to the technological/probiotic repertoire of S. thermophilus.

### DATA AVAILABILITY STATEMENT

All datasets generated for this study are included in the article/**Supplementary Material**.

### AUTHOR CONTRIBUTIONS

VA and MK performed genome analysis and participated in the writing of the manuscript. JB and BP performed genome analysis. KP conceived the project, performed genome analysis, and participated in the writing of the manuscript. ET conceived the project and participated in the writing of the manuscript. All authors read and approved the final manuscript.

### FUNDING

The present work was co-financed by the European Social Fund and the National Resources EPEAEK and YPEPTH through the Thales project.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.02916/full#supplementary-material

### REFERENCES




to whole-genome sequence similarities. Int. J. Syst. Evol. Microbiol. 57, 81–91. doi: 10.1099/ijs.0.64483-0



Streptococcus thermophilus N4L. Microbiol. Resour. Announc. 7:e01029-18. doi: 10.1128/MRA.01029-18



**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Alexandraki, Kazou, Blom, Pot, Papadimitriou and Tsakalidou. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# State-Wide Genomic and Epidemiological Analyses of Vancomycin-Resistant Enterococcus faecium in Tasmania's Public Hospitals

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Ana R. Freitas, University of Porto, Portugal; Peter Kinnevey, Dublin Dental University Hospital, Ireland; Francesca Biavasco, Marche Polytechnic University, Italy

> \*Correspondence: Ronan F. O'Toole r.otoole@latrobe.edu.au

#### Specialty section:

This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology

Received: 20 September 2019 Accepted: 06 December 2019 Published: 15 January 2020

#### Citation:

Leong KWC, Kalukottege R, Cooley LA, Anderson TL, Wells A, Langford E and O'Toole RF (2020) State-Wide Genomic and Epidemiological Analyses of Vancomycin-Resistant Enterococcus faecium in Tasmania's Public Hospitals. Front. Microbiol. 10:2940. doi: 10.3389/fmicb.2019.02940

Kelvin W. C. Leong<sup>1</sup>, Ranmini Kalukottege<sup>2</sup>, Louise A. Cooley<sup>3,4</sup>, Tara L. Anderson<sup>3,5</sup>, Anne Wells<sup>5</sup>, Emma Langford<sup>6</sup> and Ronan F. O'Toole<sup>1,4,7</sup>\*

<sup>1</sup> Department of Pharmacy and Biomedical Sciences, School of Molecular Sciences, College of Science, Health and Engineering, La Trobe University, Albury-Wodonga, VIC, Australia, <sup>2</sup> Department of Microbiology, Launceston General Hospital, Launceston, TAS, Australia, <sup>3</sup> Royal Hobart Hospital, Hobart, TAS, Australia, <sup>4</sup> School of Medicine, University of Tasmania, Hobart, TAS, Australia, <sup>5</sup> Tasmanian Infection Prevention and Control Unit, Department of Health and Human Services, Hobart, TAS, Australia, <sup>6</sup> Department of Microbiology, Hobart Pathology, Hobart, TAS, Australia, <sup>7</sup> Department of Clinical Microbiology, Trinity College Dublin, Dublin, Ireland

From 2015 onwards, the number of vancomycin-resistant Enterococcus faecium (VREfm) isolates increased in Tasmania. Previously, we examined the transmission of VREfm at the Royal Hobart Hospital (RHH). In this study, we performed a state-wide analysis of VREfm from Tasmania's four public acute hospitals. Whole-genome analysis was performed on 331 isolates collected from screening and clinical specimens of VREfm. In silico multi-locus sequence typing (MLST) was used to determine the relative abundance of broad sequence types (ST) across the state. Core genome MLST (cgMLST) was then applied to identify potential clades within the ST groupings followed by single-nucleotide polymorphism (SNP) analysis. This work revealed that differences in VREfm profiles are evident between the state's two largest hospitals with the dominant vanA types being ST80 at the RHH and ST1421 at Launceston General Hospital (LGH). A higher number of VREfm cases were recorded at LGH (n = 54 clinical, n = 122 colonization) compared to the RHH (n = 14 clinical, n = 67 colonization) during the same time period, 2014–2016. Eleven of the clinical isolates from LGH were vanA and belonged to ST1421 (n = 8), ST1489 (n = 1), ST233 (n = 1), and ST80 (n = 1) whereas none of the clinical isolates from the RHH were vanA. For the recently described ST1421, cgMLST established the presence of individual clusters within this sequence type that were common to more than one hospital and that included isolates with a low amount of SNP variance (≤16 SNPs). A spatio-temporal analysis revealed that VREfm vanA ST1421 was first detected at the RHH in 2014 and an isolate belonging to the same cgMLST cluster was later collected at LGH in 2016. Inclusion of isolates from two smaller hospitals, the North West Regional Hospital (NRH) and the Mersey Community Hospital (MCH), found that ST1421 was present in both of these institutions in 2017. These findings illustrate the spread of a recently described sequence type of VREfm, ST1421, to multiple hospitals in an Australian state within a relatively short time span.

Keywords: Enterococcus faecium, whole genome sequencing, vancomycin, multi-locus sequence typing, single nucleotide polymorphism

### INTRODUCTION


Vancomycin-resistant Enterococcus faecium (VREfm) is an important antibiotic-resistant microorganism that can cause healthcare-associated infections (HAI) in patients receiving care. It was first recorded in Australia at a Melbourne hospital in 1994 (Kamarulzaman et al., 1995). By 2015, Australia exhibited one of the highest rates of vancomycin resistance in E. faecium in the world at 48.7–56.8% of clinical isolates (Australian Commission on Safety and Quality in Health Care (ACSQHC), 2017). From 2008 to 2015, the prevalence of VREfm in Tasmania was relatively low with an average of approximately 10 new VREfm isolates per quarter during that period (Wilson et al., 2017). However, by 2016 there was a marked increase to over 100 VREfm isolates collected on average per quarter (Wilson et al., 2017). The reasons underlying the abrupt rise in VREfm in the state have not yet been established. Previously, we applied whole-genome sequencing to examine VREfm at the Royal Hobart Hospital (RHH) and identified the major sequence types as vanB ST796 and vanA ST80 as well as their probable direction of transmission at the hospital (Leong et al., 2018a).

In this work, we examined VREfm on a state-wide basis to improve our understanding of this pathogen across Tasmania. We determined the genotypes of VREfm isolates collected at Tasmania's other public hospitals, the Launceston General Hospital (LGH), the North West Regional Hospital (NRH) and the Mersey Community Hospital (MCH), using multi-locus sequence typing (MLST), core genome MLST (cgMLST), and single-nucleotide polymorphism (SNP) analysis. We then combined genomic data with patient spatio-temporal information, which provided insights into the emergence and distribution of VREfm sequence types in the state.

### MATERIALS AND METHODS

#### VREfm Isolate and Epidemiological Data Collection

The Multi-Resistant Organism Screening and Clearance Protocol for the Tasmanian Health Services identifies VRE colonization in patients when a VRE-positive culture was obtained from a non-sterile site and VRE-specific antibiotic therapy was not administered by a clinician, and identifies VRE infection when a VRE-positive culture was obtained from either a sterile or non-sterile site and VRE-specific antibiotic therapy was administered by a clinician (Wilson et al., 2018). In accordance with the Australian Public Health Act 1997 (Act Parliamentary Counsel, 2016), the Tasmanian Infection Prevention and Control Unit (TIPCU) of the Department of Health and Human Services (DHHS) established the Healthcare Associated Infection Surveillance Program for the notification of new patient cases with VRE (Wilson et al., 2018). VREfm screening isolates were obtained from inpatients who underwent VRE screening under the following circumstances: direct transfers from any intrastate, interstate or overseas acute or long-term healthcare facility; patients with an overnight admission in the previous 3 months to any intrastate acute or long-term healthcare facility; patients with an overnight admission in the previous 12 months to any intrastate acute or overseas acute or long-term healthcare facility; patients with a "History-VRE" alert; patients with a self-reported or healthcare facility reported history of VRE; or patients identified to be a VRE contact. When patients presented with a VRE infection, the VRE isolates were classified as clinical isolates (Wilson et al., 2018).

For whole-genome sequencing, we collected VREfm samples from Tasmania's acute public hospitals based on the following criteria: all clinical isolates from 2014–2016, screening samples which overlapped with the clinical isolates collected with respect to patient admission and sample collection dates, and all VREfm isolates that tested positive for vanA vancomycin resistance. A total of 257 VREfm isolates, including both clinical and screening cases, were retrieved from patient samples collected between 2014 and 2016 at the RHH (approx. 500 beds) and LGH (approx. 300 beds). Isolates from the state's smaller hospitals, the NRH at Burnie (approx. 160 beds) and the MCH near Devonport (approx. 95 beds), were also included. Storage of VREfm isolates at these two hospitals did not commence until late 2016 and did not include clinical isolates during 2016. To investigate the epidemiology of the VREfm isolates from the NRH and MCH, a full calendar year of isolates (n = 74) collected in 2017 was analyzed.

Stored isolates were retrieved and cultured on blood agar plates at the Microbiology laboratories of the respective hospitals. Only one VREfm isolate per patient was included. At RHH and LGH, the Bruker Biotyper matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS) (Bruker Daltonic GmbH, Leipzig, Germany) was used to identify E. faecium and antibiotic susceptibility testing (AST) was performed using the EUCAST methodology<sup>1</sup>. For NRH and MCH, the bioMérieux Vitek MALDI-TOF MS (bioMérieux Australia Pty Ltd., Baulkham Hills, NSW, Australia) was used. AST was performed using the agar disc diffusion assay and zone diameters of inhibition were interpreted using the calibrated dichotomous sensitivity (CDS) test clinical breakpoint of 2 mm to differentiate between VRE sensitivity and resistance for Enterococcus species<sup>2</sup>. For all sites, identification of VRE was also determined by growth on VRE-selective agar, organism detection, and a final confirmation of the vancomycin-resistance locus type was obtained with the Cepheid Xpert® vanA/vanB assay (Xpert® vanA/vanB)<sup>3</sup>.

<sup>1</sup>http://www.eucast.org/clinical\_breakpoints/

Information was collected from the hospital's electronic medical record and infection control database which included patient admission and discharge dates, specimen type and collection date, patient ward location on date of specimen collection, and patient ward/hospital movements during hospitalization. Ethics approval for this study was obtained from the Tasmanian Health and Medical Human Research Ethics Committee (Reference# H0016214).

#### Genomic DNA Purification

Enterococcal isolates were sub-cultured in thioglycolate broth (TM0935, 15 mL) (Thermo Fisher Scientific, Waltham, MA, United States) at LGH, RHH, and the Hobart Pathology Laboratory for NRH and MCH. The isolates were analyzed at the School of Medicine, University of Tasmania, and the School of Molecular Sciences, La Trobe University, Australia. Genomic DNA was extracted and purified according to the protocol previously described by Gautam et al. (2019). Briefly, 1.5 mL of broth culture was centrifuged, and the cell pellet was resuspended in a mixture of lysozyme [30 µL lysozyme (50 mg/mL)] (Muramidase, VWR Chemicals, Radnor, PA, United States) and phosphate buffered saline (PBS) (600 µL) and incubated at 37°C for 1 h. Using the DNeasy Blood and Tissue Kit protocol (Qiagen, Hilden, Germany), 200 µL of lysate was used to initially extract 100 µL of DNA eluate. This was treated with 2 µL of RNase (100 mg/mL) (Qiagen, Hilden, Germany), incubated at room temperature for 1 h, and further purified using the High Pure PCR Template Preparation Kit (Roche, Basel, Switzerland) to achieve a final 50 µL of DNA eluate. The Qubit 2.0 Fluorometer (Life Technologies, Carlsbad, CA, United States) was used with the Qubit dsDNA (double-stranded DNA) HS (high sensitivity) Quantification Kit to measure the DNA concentration before diluting with purified water to a concentration of 0.2 ng/µL (input DNA).
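The final dilution step to 0.2 ng/µL follows the usual C1·V1 = C2·V2 relationship. The sketch below is only an illustration of that arithmetic; the function name, the 20 µL final volume, and the example Qubit reading of 8.0 ng/µL are assumptions and are not taken from the study.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul=0.2, final_volume_ul=20.0):
    """Return (stock_volume_ul, water_volume_ul) from C1 * V1 = C2 * V2.

    The 0.2 ng/µL target comes from the text; the 20 µL final volume is an
    assumed, illustrative default.
    """
    if stock_ng_per_ul <= 0:
        raise ValueError("stock concentration must be positive")
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is already more dilute than the target")
    stock_volume_ul = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return stock_volume_ul, final_volume_ul - stock_volume_ul


# Hypothetical Qubit reading of 8.0 ng/µL, diluted to 0.2 ng/µL in 20 µL.
stock_ul, water_ul = dilution_volumes(8.0)
print(f"{stock_ul:.2f} µL DNA + {water_ul:.2f} µL water")
```

In practice very small pipetting volumes are usually avoided by choosing a larger final volume or a serial dilution; the helper only shows the proportionality.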

#### DNA Library Preparation

The Nextera XT DNA Library Preparation Kit (Illumina Inc., San Diego, CA, United States) was used to generate the DNA libraries from an input DNA volume of 2.5 µL for whole-genome sequencing on an Illumina MiSeq platform. DNA dual-indexed libraries were generated using the Nextera XT 24 Index Kit (Illumina Inc., San Diego, CA, United States). After PCR amplification, the DNA amplicons were purified with the Agencourt AMPure XP beads (Beckman Coulter, Brea, CA, United States). The concentration of each amplicon was measured with the Qubit 2.0 Fluorometer (Life Technologies) and Qubit dsDNA HS Quantification kit, before normalization to create the pooled amplified library (PAL). Library quantification of the PAL was performed using the KAPA Library Quantification Kit (Kapa Biosystems Inc., Wilmington, MA, United States), and the concentrations were determined by qPCR. An appropriate dilution of the PAL was used according to the manufacturer's recommended protocol for loading into an Illumina MiSeq v2 (2 × 150-bp paired-end reads) cartridge for sequencing.

#### Genome Assembly and in silico Multi-Locus Sequence Typing

Raw FASTQ sequencing reads from the Illumina MiSeq whole-genome sequencing were processed using an assembly pipeline built in SeqSphere+ version 6 (Ridom GmbH, Münster, Germany)<sup>4</sup>. FastQC (Andrews et al., 2010) was used to perform a quality check of the read files to assess the sequencing quality scores, total number of reads, and GC content. For the removal of the Nextera XT index library adapters, Trimmomatic (Bolger et al., 2014) was applied to achieve an average Q score of 30 in a sliding window of 20 bases. The BWA plug-in in the SeqSphere+ software was used for the assembly of genome sequences of each isolate by mapping the paired-end reads to the complete reference genome of E. faecium DO (TX16\_NC-017960) (Qin et al., 2012). The resultant contiguous consensus sequences (contigs) were exported for in silico identification of the vancomycin-resistance (van) locus using the ResFinder server on the Centre for Genomic Epidemiology (CGE) online tool<sup>5</sup>. The settings used were a minimum sequence identity threshold of 90% and a genome length identity cut-off of 60%. The assembled genome sequences were queried against the MLST tool<sup>6</sup> from the CGE database to determine the sequence type of the isolates. The E. faecium MLST database<sup>7</sup> was also queried to confirm the sequence types.
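To make the windowed quality criterion concrete, the following is a simplified sketch of window-based trimming with the parameters quoted above (window of 20 bases, mean Phred score of 30). It is not Trimmomatic's implementation and the exact command line used in the study is not given; the function name and the example quality profile are hypothetical.

```python
def sliding_window_trim(qualities, window=20, min_mean_q=30.0):
    """Return how many leading bases of a read to keep.

    Windows are scanned from the 5' end; the read is cut at the start of the
    first window whose mean Phred quality drops below min_mean_q. Reads
    shorter than the window are treated as a single window.
    """
    n = len(qualities)
    if n == 0:
        return 0
    w = min(window, n)
    window_sum = sum(qualities[:w])
    for start in range(n - w + 1):
        if start > 0:
            # Slide by one base: add the new base, drop the one left behind.
            window_sum += qualities[start + w - 1] - qualities[start - 1]
        if window_sum / w < min_mean_q:
            return start
    return n


# Hypothetical 100-base read: 60 high-quality bases, then a low-quality tail.
read_quals = [38] * 60 + [10] * 40
print(sliding_window_trim(read_quals))
```

A running sum keeps the scan linear in read length, which matters little for one read but is the natural way to express a sliding mean.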

#### Genome Analysis and Phylogenetic Comparison

A core genome MLST (cgMLST) scheme for E. faecium has been defined in the cgMLST database<sup>8</sup> and imported into SeqSphere+. While conventional MLST is based on seven putative housekeeping genes, the cgMLST scheme utilizes 1,423 target genes thereby providing a higher level of discrimination between isolates (De Been et al., 2015). Distance calculations based on the number of allelic differences between isolates were used to detect clusters within given sequence types and this analysis was visualized using a minimum spanning tree in SeqSphere+.
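The allelic-distance idea can be sketched in a few lines. This is an illustration only: SeqSphere+ performs these calculations internally, the five-locus profiles and isolate names below are invented (the real scheme has 1,423 loci), and ignoring loci with missing calls in either isolate is an assumption about how missing data are handled.

```python
def allelic_distance(profile_a, profile_b):
    """Count loci whose allele numbers differ, skipping loci missing (None) in either profile."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


def minimum_spanning_tree(profiles):
    """Prim's algorithm over allelic distances; returns a list of (isolate_u, isolate_v, distance)."""
    names = sorted(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge from the tree to any isolate not yet in it.
        dist, u, v = min(
            (allelic_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, dist))
        in_tree.add(v)
    return edges


# Hypothetical allele profiles over five loci.
profiles = {
    "iso_A": [1, 4, 2, 7, 3],
    "iso_B": [1, 4, 2, 7, 3],
    "iso_C": [1, 5, 2, 7, 3],
    "iso_D": [2, 5, 9, None, 3],
}
for u, v, dist in minimum_spanning_tree(profiles):
    print(u, v, dist)
```

Isolates joined by edges of zero or very few allelic differences would be the candidates for a shared cluster.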

Pairwise SNP analysis was then conducted to establish the phylogenetic relationship between cgMLST-clustered isolates of VREfm from the state's public hospitals. Each isolate within a cluster was nominated in turn as the reference genome against which the raw FASTQ sequences of the other isolates were assembled and a core SNP alignment was generated using Snippy<sup>9</sup>. The presence of a SNP was defined using a minimum nucleotide variant frequency of 95% and a minimum read depth of 20. Gubbins<sup>10</sup> was used to process the resulting SNP alignment to predict regions of homologous recombination within each isolate cluster. This resulted in the generation of three separate SNP scores for the phylogenetic comparison of isolates: total number of SNPs; number of SNPs in homologous recombinant regions; and number of SNPs outside homologous recombinant regions. From the Gubbins output, a maximum-likelihood phylogenetic tree was also generated using PhyML with the generalised-time-reversible (GTR) model. The previously described recombination-filtered SNP threshold of ≤16 SNPs for VREfm was used as a guide for identifying clonally related or non-unique isolates (De Been et al., 2015; Schurch et al., 2018).

<sup>2</sup>http://cdstest.net

<sup>3</sup>http://www.cepheid.com/

<sup>4</sup>http://www.ridom.de/seqsphere/

<sup>5</sup>https://cge.cbs.dtu.dk/services/ResFinder/

<sup>6</sup>http://cge.cbs.dtu.dk/services/MLST/

<sup>7</sup>http://pubmlst.org/efaeciuim/

<sup>8</sup>https://www.cgmlst.org/ncs/schema/991893/

<sup>9</sup>https://github.com/tseemann/snippy
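The SNP definition and the ≤16 SNP guide can be expressed as a small filter. The sketch below only mirrors the thresholds stated in the text; it does not call Snippy or Gubbins, and the `VariantSite` record, its field names, and the example sites are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantSite:
    position: int
    depth: int                      # reads covering the site
    alt_reads: int                  # reads supporting the variant base
    in_recombinant_region: bool = False


def passes_snp_filter(site, min_depth=20, min_fraction=0.95):
    """Apply the stated thresholds: read depth >= 20 and variant frequency >= 95%."""
    if site.depth <= 0 or site.depth < min_depth:
        return False
    return site.alt_reads / site.depth >= min_fraction


def snp_scores(sites):
    """Return the three scores described above for one pairwise comparison."""
    called = [s for s in sites if passes_snp_filter(s)]
    recombinant = sum(1 for s in called if s.in_recombinant_region)
    return {
        "total": len(called),
        "recombinant": recombinant,
        "non_recombinant": len(called) - recombinant,
    }


def is_clonally_related(non_recombinant_snps, threshold=16):
    """Guide threshold on recombination-filtered SNPs (<= 16)."""
    return non_recombinant_snps <= threshold


# Hypothetical sites for one isolate pair.
sites = [
    VariantSite(100, depth=40, alt_reads=40),
    VariantSite(200, depth=25, alt_reads=23),   # 92% frequency, rejected
    VariantSite(300, depth=15, alt_reads=15),   # depth too low, rejected
    VariantSite(400, depth=30, alt_reads=29, in_recombinant_region=True),
]
scores = snp_scores(sites)
print(scores, is_clonally_related(scores["non_recombinant"]))
```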

To confirm any epidemiological linkage between phylogenetically related isolates, the genomic data were integrated with clinical data including date of hospital admission, VREfm screening, date of hospital discharge, and patient movement records. Spatio-temporal analyses derived from this information were used to infer phylogenetic relationships and identify possible, probable, or unlikely instances of VREfm transmission.
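One way to picture this integration step is to test whether two patients occupied the same ward at the same time and then combine that with the SNP distance between their isolates. The sketch below is an assumption-laden illustration: the study does not publish its decision rules, so the `classify_link` rule, the record layout, and the patient labels and dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WardStay:
    patient: str
    hospital: str
    ward: str
    start: date
    end: date


def overlap_days(a, b):
    """Days (inclusive) on which two different patients shared the same ward."""
    if a.patient == b.patient or (a.hospital, a.ward) != (b.hospital, b.ward):
        return 0
    latest_start = max(a.start, b.start)
    earliest_end = min(a.end, b.end)
    return max(0, (earliest_end - latest_start).days + 1)


def classify_link(snp_distance, shared_ward_days, snp_threshold=16):
    """Hypothetical rule: genomic relatedness plus ward overlap gives 'probable'."""
    if snp_distance > snp_threshold:
        return "unlikely"
    return "probable" if shared_ward_days > 0 else "possible"


# Invented stays for two patients on the same ward.
stay_1 = WardStay("patient_1", "LGH", "Medical", date(2016, 9, 1), date(2016, 9, 10))
stay_2 = WardStay("patient_2", "LGH", "Medical", date(2016, 9, 8), date(2016, 9, 20))
shared = overlap_days(stay_1, stay_2)
print(shared, classify_link(snp_distance=5, shared_ward_days=shared))
```

A real analysis would also need to consider indirect links such as shared staff, equipment, or sequential occupancy of the same bed, which a simple overlap test does not capture.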

### RESULTS

#### MLST Sequence Types and van Resistance Loci

A total of 257 VREfm isolates were collected from Tasmania's two largest public hospitals, LGH (n = 176) and RHH (n = 81), during 2014 (n = 18), 2015 (n = 40) and 2016 (n = 199). The higher number of isolates collected at the above hospitals in 2016 compared to 2014 and 2015 reflects the previously reported increasing number of VREfm isolates collected in Tasmania through this period (Wilson et al., 2017). A higher proportion of isolates were from clinical specimens at LGH (n = 54, 30.7%) compared to the RHH (n = 14, 17.3%). In silico analyses of the VREfm isolates were performed to determine their multi-locus sequence types (MLST) and confirm the vancomycin resistance (van) loci present.
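The counts in this paragraph are internally consistent, which a short check makes explicit. The snippet simply recomputes the reported totals and percentages from the figures given above; the helper name is an illustrative choice.

```python
def percent(part, whole, digits=1):
    """Percentage of part in whole, rounded to the given number of decimals."""
    return round(100 * part / whole, digits)


isolates = {"LGH": 176, "RHH": 81}        # isolates per hospital, 2014-2016
clinical = {"LGH": 54, "RHH": 14}         # of which from clinical specimens
by_year = {2014: 18, 2015: 40, 2016: 199}

# Both breakdowns should add up to the 257 isolates reported.
assert sum(isolates.values()) == sum(by_year.values()) == 257

for hospital, total in isolates.items():
    print(hospital, percent(clinical[hospital], total))
```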

The two hospitals shared seven common sequence types; however, isolates from LGH exhibited a wider range of sequence types (n = 17 STs) compared to RHH (n = 9 STs) (**Figure 1**). The dominant vanB-harboring sequence type at both hospitals was ST796. For vanA VREfm isolates, more isolates at the RHH belonged to ST80 (n = 15) than the recently described ST1421 (n = 10) (Leong et al., 2018b). At LGH, most of the vanA isolates were ST1421 (n = 26), and this sequence type accounted for 72.7% (8/11) of vanA clinical isolates at that hospital. While all of the clinical cases from RHH harbored only the vanB locus, there was a mixture of vanA and vanB resistance loci among clinical isolates from LGH (**Figure 1**).

All of the NRH and MCH isolates (n = 74) analyzed were obtained from screening specimens and none were clinical isolates. Nine sequence types were represented among the isolates. The two dominant sequence types were ST796 vanB (67.6%) and ST1421 vanA (24.3%) (**Figure 1**). ST1424 vanA was unique to MCH and was not isolated at NRH, LGH or RHH during the period analyzed.

#### cgMLST Analysis of Tasmanian VREfm Isolates

Core genome MLST resolved the dominant vanB sequence type in Tasmania, ST796, into one large cluster (n = 201) and three smaller clusters that differed by between 1 and 2 alleles, and two unique isolates (**Figure 2**). The dominant vanA sequence type at the RHH, ST80, separated into three cgMLST clusters that differed by up to 8 alleles, and four unique isolates. Application of cgMLST further differentiated the dominant vanA sequence type at LGH, ST1421, into three clusters which differ from one another by 1–2 alleles, and one unique isolate (**Figure 2**). The smallest cluster, Cluster 1, consists of three isolates from RHH and one isolate from LGH (**Figure 3**). Cluster 2 contains six isolates all of which were collected at LGH. The largest cluster, Cluster 3 (n = 39), includes isolates from all four of the state's public hospitals (**Figure 3**).

#### SNP Variant Analyses of Tasmanian VREfm Isolates

A further refinement of the ST1421 VREfm isolates was performed by analyzing the differences on a nucleotide-base level to identify clonally related isolates. The genome of E. faecium DO (TX16\_NC-017960) was used as the reference to generate a maximum-likelihood phylogenetic tree in PhyML based on recombination-filtered SNP differences between isolates (**Figure 4**). Firstly, there was good concordance between the SNP-based analysis and the cgMLST method with regard to the identification of clades of phylogenetically related isolates. For example, all of the ST1421 VREfm isolates positioned together based on the SNP differences. In addition, three main clusters of ST1421 isolates were evident from the phylogenetic tree (**Figure 4**). Closer examination of ST1421 revealed exact matches for isolate composition of Clusters 1 and 2 generated by both cgMLST and SNP analyses (**Figure 4**). Cluster 3 from cgMLST was further sub-divided into Clusters 3A and 3B by SNP analysis (**Figure 4**). The higher resolution of SNP variant analysis was then combined with epidemiological data on individual cases of VREfm from the four public hospitals in Tasmania. For comparison, SNP variant analyses were also conducted for the isolates belonging to sequence types ST80 (**Supplementary Figures S3**, **S4**) and ST796 (**Supplementary Figures S5**, **S6**). For the ST796 clusters, representative isolates (n = 31) were selected from the clades identified in the SNP-based phylogenetic tree (**Supplementary Figure S5**).

#### Genomic and Epidemiological Analyses of ST1421 Cluster 1

Cluster 1 contains four VREfm isolates with three collected at the RHH and one at LGH. The first patient in Cluster 1, 14S\_RHH008, was transferred to RHH from an out-of-state hospital on September 30, 2014 and the VREfm isolate collected represents the first confirmed isolate of ST1421 in Tasmania (**Figure 5**). Approximately 2 years after patient 14S\_RHH008 was discharged from RHH, other isolates also belonging to Cluster 1 were detected at the RHH and the LGH (**Figure 5**). At the RHH, isolates 16S\_RHH021 and 16S\_RHH046 were collected from VREfm colonized patients who were admitted to the Neurosurgery Ward and Specialist Surgery Ward, respectively. These isolates exhibited no cgMLST allele differences and ≤16 SNP differences with respect to each other and isolate 14S\_RHH008. This indicates that further members of this cluster may exist but were not available in the isolate set, which may in part be reflective of VREfm sampling being more limited during 2014 and 2015 compared to 2016.

<sup>10</sup>https://github.com/sanger-pathogens/Gubbins

#### Genomic and Epidemiological Analyses of ST1421 Cluster 2

Cluster 2 contains six isolates that were collected from patients admitted to LGH. The first patient in the cluster, patient 16S\_LGH063, was previously an out-patient at RHH on June 30, 2016 (**Figure 6**). On the patient's second admission to the Surgical Ward at LGH, a VRE screening test on August 28, 2016 returned positive. Patient 16S\_LGH063's subsequent admissions were to the Medical Ward at LGH and, on both occasions, the patient shared the ward with a second patient, 16S\_LGH095. Patient 16S\_LGH095 initially tested negative for VREfm upon admission but subsequently tested positive after sharing the Medical Ward with patient 16S\_LGH063 (**Figure 6**). 16S\_LGH095 then shared the Rehabilitation Ward with a third patient, 16S\_LGH097, who tested positive for VREfm after a previous negative result at an earlier hospital admission. The overlaps in both time and ward location in the hospital for patients 16S\_LGH063, 16S\_LGH095, and 16S\_LGH097, combined with no cgMLST allele differences and ≤16 SNP differences between their respective isolates, indicate that they belong to a clonally related outbreak of ST1421.

#### Genomic and Epidemiological Analyses of ST1421 Cluster 3A

The isolates in Cluster 3A were collected from patients who were admitted to LGH (n = 6), NRH (n = 4), and RHH (n = 3). There were a number of patients belonging to this cluster who had admissions to multiple hospitals, e.g., 16C\_LGH009, 16S\_LGH080, 16C\_LGH018, 16S\_RHH020, and 16S\_RHH060 (**Figure 7**). The first patient, 16C\_LGH009, was not screened for VREfm during the initial three admissions at NRH and MCH (**Figure 7**). The patient's admission on May 5, 2016 at NRH involved an inter-hospital transfer to the Intensive Care Unit at LGH where a clinical sample collected on May 28, 2016 was confirmed positive for VREfm. During an admission to MCH on August 15, 2016, patient 16C\_LGH018 was involved in two inter-hospital transfers to NRH (August 15, 2016) and LGH (August 27, 2016) (**Figure 7**). While staying in the Surgical Ward at LGH, the patient underwent three VREfm screenings. The initial two tests were negative; however, the third clinical isolate from patient 16C\_LGH018 tested positive for VREfm. Spatio-temporal analysis indicated that two other patients, 16S\_LGH074 and 16S\_LGH081, who shared the Surgical Ward with patient 16C\_LGH018, also tested positive for VREfm on September 5, 2016 and September 12, 2016, respectively. Pairwise SNP-based analysis revealed that the VREfm isolates from these patients differed by ≤16 SNPs (**Figure 7**). Additionally, patient 16C\_LGH018 was readmitted to MCH and also to NRH, providing opportunities for further dissemination of VREfm beyond the initial hospital where the infection was first confirmed (**Figure 7**). Pairwise SNP differences between the isolates in Cluster 3A are shown in **Supplementary Figure S1**.

#### Genomic and Epidemiological Analyses of ST1421 Cluster 3B

Of the 10 Cluster 3B isolates that were collected at LGH, seven were from patients who tested positive for VREfm after staying in the Surgical Ward (**Figure 8**). The first patient, 16S\_LGH103, was admitted to the Surgical Ward on June 1, 2016 and subsequently to the Intensive Care Unit where the patient tested negative for VREfm. However, during a second admission to the Surgical Ward at LGH, the patient tested positive for VREfm on October 30, 2016. Spatio-temporal analysis indicated that two other patients, 16S\_LGH106 and 16S\_LGH104, shared the Surgical Ward with patient 16S\_LGH103 and tested positive for VREfm on November 1, 2016 and November 3, 2016, respectively (**Figure 8**). Pairwise SNP-based analysis revealed that isolates from patients 16S\_LGH103 and 16S\_LGH106 differed by ≤16 SNPs and zero cgMLST alleles, indicating that they are clonally related (**Figure 8**).

As in the case of patient 16S\_RHH060 in Cluster 3A, patient 16C\_LGH019 from Cluster 3B underwent an inter-hospital transfer after testing positive for VREfm, providing a further example of the propensity for inter-institutional spread of the ST1421 sequence type in Tasmania. Pairwise SNP differences across Cluster 3B are shown in **Supplementary Figure S2**.

### DISCUSSION

In this study, we established the sequence types of VREfm isolated at the four public hospitals in Tasmania. Although the RHH and LGH shared ST796 as their dominant vanB sequence type, the two hospitals exhibited different profiles with respect to the other sequence types present among isolates collected from 2014 to 2016 (**Figure 1**). For example, while ST80 was the prominent vanA VREfm at the RHH, a limited number of ST80 isolates were collected at the LGH where, instead, the recently discovered ST1421 was more dominant. All of the clinical isolates at the RHH belonged to vanB sequence types, whereas both vanA and vanB sequence types were represented among clinical isolates obtained at the LGH. The identification of ST1421 in Tasmania appears to coincide with the change in Australia from a near-complete dominance by the vanB resistance locus among VREfm to an expansion of isolates that harbor the vanA resistance locus, from only 1.9% (2/107) of vancomycin non-susceptible E. faecium bloodstream isolates in 2011 to 43.0% (83/193) by 2016 (Coombs et al., 2014a, 2018). In this Tasmania-wide study, the collection of 331 clinical and overlapping-screening isolates consisted of 74.6% vanB (n = 247), 25.1% vanA (n = 83), and 0.3% vanAB (n = 1).

Our whole-genome sequence data were then applied to cgMLST and SNP analyses of the 331 VREfm isolates. This revealed the existence of three cgMLST clusters within ST1421 which resolved further into clusters (1, 2, 3A, and 3B) based on SNP-variant analyses (**Figure 4**). When we combined the genomic data with patient spatio-temporal information, a number of features of VREfm epidemiology in Tasmania became evident. Firstly, with regard to Clusters 2, 3A, and 3B, clonally related isolates which differed by ≤16 SNPs and zero cgMLST alleles were collected from patients who shared specific hospital wards at the same time, indicating potential intra-institutional transmission involving these patients (**Figures 6**–**8**). Secondly, with respect to Clusters 3A and 3B, patients who were confirmed positive for VREfm infection or colonization at one hospital, were subsequently transferred or re-admitted to another hospital in Tasmania which provided opportunities for onward inter-institutional spread of VREfm in the state (**Figures 7**, **8**). Lastly, Cluster 1 contains the first confirmed isolate of ST1421 in Tasmania. The patient was transferred from an out-of-state hospital to a Tasmanian hospital in 2014 shortly before testing positive for VREfm at the latter hospital. This indicates potential inter-state transmission of a newer VREfm sequence type. Subsequently, ST1421 started to be isolated at the other public hospitals in Tasmania from 2016 onward (**Figure 9**) and it is highly possible that this state-wide spread involved movements of VREfm-positive patients between locations when taking into account the collection of clonally related isolates across multiple hospitals.

the clusters determined by core-genome MLST (cgMLST) with Cluster 3 sub-dividing further into Clusters 3A and 3B.
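The way a pairwise SNP threshold groups isolates into putative transmission clusters can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes single-linkage grouping, takes only the ≤16 SNP guide value from the text, and uses hypothetical isolate names and distances.

```python
from itertools import combinations

SNP_THRESHOLD = 16  # recombination-filtered guide value cited in the text


def cluster_by_snp_distance(isolates, distances, threshold=SNP_THRESHOLD):
    """Group isolates by single linkage: two isolates share a group when a
    chain of pairwise distances <= threshold connects them."""
    parent = {name: name for name in isolates}

    def find(name):
        # Follow parent links to the representative of the group.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(isolates, 2):
        if distances[frozenset((a, b))] <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in isolates:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical isolates and SNP distances, for illustration only.
isolates = ["iso_A", "iso_B", "iso_C", "iso_D"]
distances = {
    frozenset(("iso_A", "iso_B")): 3,
    frozenset(("iso_A", "iso_C")): 12,
    frozenset(("iso_B", "iso_C")): 15,
    frozenset(("iso_A", "iso_D")): 48,
    frozenset(("iso_B", "iso_D")): 51,
    frozenset(("iso_C", "iso_D")): 40,
}
clusters = cluster_by_snp_distance(isolates, distances)
```

Here `iso_A`, `iso_B`, and `iso_C` fall into one group and `iso_D` stands alone. A genomic grouping of this kind only suggests relatedness; as in the text, it needs to be read alongside ward and admission data before transmission is inferred.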

Risk factors associated with VRE colonization have been found to include exposure to any antibiotic, diarrhea, and longer length of stay in hospital (Karki et al., 2012). Furthermore, intensive care admission, a higher burden of co-morbidities, and longer time to appropriate antibiotics have been associated with mortality in enterococcal bacteremia (Cheah et al., 2013). A study of 103 patients with confirmed VREfm infection or colonization found that 40% of patients remained positive in the first year of follow-up and that 23.3% were still positive in the fourth year of follow-up (Karki et al., 2013). While the investigators observed a downward trend in fecal carriage of VREfm over time, the findings revealed that, even in the absence of recent risk factors including hospitalization or antibiotic use, patients with a previous history of VREfm can harbor the pathogen for a period in the order of years. This implies that patients discharged from one hospital may still harbor VREfm, and therefore be potentially infectious, when admitted to another healthcare institution in another jurisdiction several months or even years later. The repatriation of a VREfm-positive patient has been linked to the regional spread of a sequence type, ST796, from a hospital in Melbourne, Australia to a hospital in Auckland, New Zealand (Mahony et al., 2018). In addition, a recent outbreak of VREfm in hospitals in Switzerland highlights the potential for new sequence types to move globally. ST796 had not been reported in Switzerland prior to 2017. However, between December 2017 and April 2018, four hospitals in the Canton of Bern isolated VREfm from 89 patients. Markedly, 77 out of the 89 isolates (86.5%) belonged to ST796 with the remaining isolates made up of ST117 (n = 6), ST78 (n = 4), ST555 (n = 2), ST17 (n = 1), and ST80 (n = 1) (Wassilew et al., 2018). The findings suggest a relatively recent introduction of ST796 into Switzerland and its subsequent establishment as a dominant sequence type.

(PhyML) phylogenetic tree. (B) Matrix of pairwise comparison of SNP differences between isolates expressed as: (i) Total number of SNP differences; (ii) Number of SNPs inside homologous recombination regions, and (iii) Number of SNPs outside of homologous recombination regions. The previously described recombination-filtered SNP threshold of ≤16 SNPs for VREfm has been used as a guide for identifying clonally related or non-unique isolates. (C) Overview of recombination-filtered SNPs between isolates. The numbers of different SNPs between the isolates are shown on the solid black connecting lines. SNP differences above the threshold of 16 SNPs are shown as blue dotted lines. (D) Spatio-temporal location of patients in Cluster 1 who tested positive for VREfm at the Royal Hobart Hospital and Launceston General Hospital. The movement of patients following admission to the Royal Hobart Hospital through to date of discharge are indicated with respect to time (x-axis) and hospital ward location (y-axis). Each line color represents an individual patient. Patient 14S\_RHH008 was admitted to RHH on September 30, 2014 and was screened for VREfm on October 1, 2014 from which a positive test was reported.

The origins of ST796 can be traced back to Australia, where it was first discovered in 2012, and subsequently identified as the source of a notable increase in VREfm colonization at a Melbourne neonatal intensive care unit in 2013 (Lister et al., 2015). By 2015, ST796 had become the dominant vanB sequence type among patient episodes of E. faecium bacteremia in Melbourne hospitals (Buultjens et al., 2017) displacing the previously endemic vanB sequence type ST203 (Coombs et al., 2014b). The ability of ST796 to establish relatively quickly in new geographical locations and out-compete existing strains of VREfm suggests the potential existence of inherent advantageous properties in this sequence type. Indeed, generation of a complete genome sequence for an ST796 isolate revealed that it likely evolved from an ST555-like ancestral progenitor through the acquisition of transposons Tn1549 and Tn916 conferring resistance to vancomycin and tetracycline, respectively, along with plasmids, prophages, cryptic genome islands, and chromosomal SNPs (Buultjens et al., 2017). Similarly, the recently described ST1421 VREfm strain has been identified as a variant of the ST17 strain due to the absence of the housekeeping gene, pstS, that is used for MLST (Andersson et al., 2019). Previous studies attributed this occurrence to multiple recombination events in Australia (Van Hal et al., 2018) and also the insertion of a Tn5801-like transposon into the tetM gene, an event commonly detected in vanA VREfm strains that have lost the pstS locus (Lemonidis et al., 2019). In addition to antibiotic resistance, it is believed that the new gene content has conferred adaptations to the healthcare environment on ST796. One such adaptation may include higher tolerance to isopropanol used in hospital hand hygiene products that was reported in a number of recently emerged sequence types of E. faecium (Pidot et al., 2018).

FIGURE 6 | Phylogenetic analysis of vancomycin-resistant Enterococcus faecium (VREfm) isolates from ST1421 Cluster 2. (A) A SNP-based maximum-likelihood (PhyML) phylogenetic tree. (B) Matrix of pairwise comparison of SNP differences between isolates expressed as: (i) Total number of SNP differences; (ii) Number of SNPs inside homologous recombination regions, and (iii) Number of SNPs outside of homologous recombination regions. The previously described recombination-filtered SNP threshold of ≤16 SNPs for VREfm has been used as a guide for identifying clonally related or non-unique isolates. (C) Overview of recombination-filtered SNPs between isolates. The numbers of different SNPs between the isolates are shown on the solid black connecting lines. SNP differences above the threshold of 16 SNPs are shown as blue dotted lines. (D) Spatio-temporal location of patients in Cluster 2 who tested positive for VREfm at the Launceston General Hospital. The movement of patients following admission to hospital through to date of discharge are indicated with respect to time (x-axis) and hospital ward location (y-axis). Each line color represents an individual patient. As illustrated, patient 16S\_LGH063 had admissions to both the Royal Hobart Hospital and Launceston General Hospital but was confirmed VREfm positive at the latter hospital.

FIGURE 7 | Phylogenetic analysis of vancomycin-resistant Enterococcus faecium (VREfm) isolates from ST1421 Cluster 3A. The previously described recombination-filtered SNP threshold of ≤16 SNPs for VREfm has been used as a guide for identifying clonally related or non-unique isolates. (A) Overview of recombination-filtered SNPs between isolates. The numbers of different SNPs between the isolates are shown on the solid black connecting lines. SNP differences above the threshold of 16 SNPs are shown as blue dotted lines. (B) Spatio-temporal location of patients in Cluster 3A who tested positive for VREfm. The movement of patients following admission to hospital through to date of discharge are indicated with respect to time (x-axis) and hospital ward location (y-axis). Each line color represents an individual patient. As illustrated, a number of patients had multiple admissions to more than one hospital over the time course.

FIGURE 8 | Phylogenetic analysis of vancomycin-resistant Enterococcus faecium (VREfm) isolates from ST1421 Cluster 3B. The previously described recombination-filtered SNP threshold of ≤16 SNPs for VREfm has been used as a guide for identifying clonally related or non-unique isolates. (A) Overview of recombination-filtered SNPs between isolates. The numbers of different SNPs between the isolates are shown on the solid black connecting lines. SNP differences above the threshold of 16 SNPs are shown as blue dotted lines. (B) Spatio-temporal location of patients in Cluster 3B who tested positive for VREfm. The movement of patients following admission to hospital through to date of discharge are indicated with respect to time (x-axis) and hospital ward location (y-axis). Each line color represents an individual patient. As illustrated, a number of patients had multiple admissions to more than one hospital over the time course.

A previous study has shown that hand-hygiene measures are only effective when used in combination with other interventions to control the transmission of VREfm (Wolkewitz et al., 2008). Environmental contamination remains an important factor in spread due to the ability of VREfm to persist on surfaces for prolonged periods of time (Wendt et al., 1998). A recent multi-center randomized trial, REACH, involving 1,700 environmental services staff and 6,100 overnight beds across 11 hospitals in Australia found that interventions with regard to improved cleaning techniques, disinfectant products used, staff training, auditing, and communication for routine hospital cleaning increased the percentage of cleaned frequent touch points from 55% to 76% in bathrooms and from 64% to 86% in bedrooms (Mitchell et al., 2019). Although colonizations were not assessed, the interventions were associated with a reduction in clinical VRE infections from 0.35 to 0.22 per 10,000 occupied bed days (Mitchell et al., 2019). An earlier study performed at one hospital in Melbourne found that the use of a bleach-based cleaning-disinfection program correlated with a decrease in both VRE colonizations in high-risk patients and VRE bacteremia cases (Grabsch et al., 2012).
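For readers less familiar with rates expressed per 10,000 occupied bed days (OBD), the short sketch below shows the unit conversion and the unadjusted relative change between the two rates quoted above. The function names and the example counts are hypothetical, and the calculation is plain arithmetic, not the statistical estimate reported by the trial.

```python
def rate_per_10000_obd(infections, occupied_bed_days):
    """Incidence expressed per 10,000 occupied bed days (OBD)."""
    if occupied_bed_days <= 0:
        raise ValueError("occupied_bed_days must be positive")
    return infections / occupied_bed_days * 10_000


def relative_reduction(rate_before, rate_after):
    """Fractional drop from rate_before to rate_after."""
    return (rate_before - rate_after) / rate_before


# Rates quoted in the text for clinical VRE infections.
reduction = relative_reduction(0.35, 0.22)

# Hypothetical counts, used only to show the unit conversion.
example_rate = rate_per_10000_obd(infections=7, occupied_bed_days=200_000)
```

With the quoted rates, `reduction` is roughly 0.37, that is, about a 37% unadjusted relative decrease.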

In summary, based on available evidence, it is apparent that the marked increase of VREfm in Tasmania involved factors that included the emergence of newer sequence types in the state and also the movement of infected or colonized patients between hospitals. This has important implications for VREfm control in Australia and further afield. Newly detected sequence types need to be carefully monitored and where necessary, targeted with enhanced strategies that include managing patients with transmission-based precautions. The Australian Guidelines for the Prevention and Control of Infection in Healthcare recommend measures such as placement of patient alerts and screening of patients with VREfm transferred within and between healthcare institutions (NHMRC, 2019). Coordination of efforts and knowledge between institutions is required when changes in genotypic profiles of dominant strains occur and when new sequence types emerge. For this, rapid routine identification of VREfm types, beyond standard vancomycin-resistance locus determination, is required. To be effective, this work will necessitate the use of whole-genome sequencing on a routine and real-time basis. Therefore, sequencing and bioinformatic protocols for VREfm will need to be standardized between laboratories to translate the technology from retrospective to real-time applications.

#### DATA AVAILABILITY STATEMENT

The datasets generated for this study can be found in the NCBI Sequence Read Archive repository, under the BioProject number PRJNA592871.

#### ETHICS STATEMENT

Ethics approval for this study was obtained from the Tasmania Health and Medical Human Research Ethics Committee (Reference# H0016214).

#### AUTHOR CONTRIBUTIONS

RO conceived the study, obtained funding, supervised the work, analyzed the data, and drafted the manuscript. KL performed the laboratory experimentation and genome sequence analysis, and drafted the manuscript. RK and EL retrieved isolates and matched epidemiological information. LC, TA, and AW assisted in planning the study and manuscript review.

#### FUNDING

Funding for the study was provided by the Tasmanian Community Fund (grant # 36Medium00014). Additional funding was provided by the Tasmanian Infection Prevention and Control Unit (TIPCU), Department of Health and Human Services, TAS, Australia.

#### ACKNOWLEDGMENTS

We gratefully acknowledge the assistance of the staff at the Royal Hobart Hospital (Ms. Megan Brough, Ms. Carol-Anne Eaton, Ms. Belinda McEwan, Ms. Kerry Osborne, and Ms. Rachel Thompson), and the Launceston General Hospital (Mr. Mark Green and Ms. Kathleen Wilcox) in the provision of infection control data and VREfm samples. KL was supported by a La Trobe University Research Training Programme Ph.D. Scholarship.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2019.02940/full#supplementary-material

#### REFERENCES

Resistance (AGAR) Australian Enterococcal Sepsis Outcome Programme (AESOP) Annual Report 2016. Commun. Dis. Intell. 2018:42.

vanA vancomycin-resistant Enterococcus faecium: a genome-wide investigation. J. Antimicrob. Chemother. 73, 1487–1491. doi: 10.1093/jac/dky074


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Leong, Kalukottege, Cooley, Anderson, Wells, Langford and O'Toole. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Progress of Multi-Omics Technologies: Determining Function in Lactic Acid Bacteria Using a Systems Level Approach

Shane Thomas O'Donnell<sup>1,2,3</sup>, R. Paul Ross<sup>1,2,3</sup> and Catherine Stanton<sup>1,3</sup>\*

<sup>1</sup> Teagasc Food Research Centre, Moorepark, Fermoy, Ireland, <sup>2</sup> Department of Microbiology, University College Cork – National University of Ireland, Cork, Ireland, <sup>3</sup> APC Microbiome Ireland, Cork, Ireland

Lactic Acid Bacteria (LAB) have long been recognized as having a significant impact ranging from commercial to health domains. A vast amount of research has been carried out on these microbes, deciphering many of the pathways and components responsible for these desirable effects. However, a large proportion of this functional information has been derived from a reductionist approach working with pure culture strains. This provides limited insight into understanding the impact of LAB within intricate systems such as the gut microbiome or multi strain starter cultures. Whole genome sequencing of strains and shotgun metagenomics of entire systems are powerful techniques that are currently widely used to decipher function in microbes, but they also have their limitations. An available genome or metagenome can provide an image of what a strain or microbiome, respectively, is potentially capable of and the functions that they may carry out. A top-down, multi-omics approach has the power to resolve the functional potential of an ecosystem into an image of what is being expressed, translated and produced. With this image, it is possible to see the real functions that members of a system are performing and allow more accurate and impactful predictions of the effects of these microorganisms. This review will discuss how technological advances have the potential to increase the yield of information from genomics, transcriptomics, proteomics and metabolomics. The potential for integrated omics to resolve the role of LAB in complex systems will also be assessed. Finally, the current software approaches for managing these omics data sets will be discussed.

Keywords: lactic acid bacteria, multi-omics, microbiome, genomics, transcriptomics, proteomics, metabolomics, meta-omics

#### Edited by:

Konstantinos Papadimitriou, Agricultural University of Athens, Greece

#### Reviewed by:

Jasna Novak, University of Zagreb, Croatia; Fabio Minervini, University of Bari Aldo Moro, Italy

\*Correspondence: Catherine Stanton, catherine.stanton@teagasc.ie

#### Specialty section:

This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology

Received: 08 July 2019; Accepted: 20 December 2019; Published: 28 January 2020

#### Citation:

O'Donnell ST, Ross RP and Stanton C (2020) The Progress of Multi-Omics Technologies: Determining Function in Lactic Acid Bacteria Using a Systems Level Approach. Front. Microbiol. 10:3084. doi: 10.3389/fmicb.2019.03084

#### INTRODUCTION

Sequencing the first whole genome of a bacterial strain, namely Haemophilus influenzae, in 1995 was a milestone in molecular biology for a number of reasons (Fleischmann et al., 1995), one of which was that it ushered in an information-rich era in which the volume of data produced exceeded what could be fully interpreted. Subsequent genomic data sets derived from bacteria elucidated gene functions, metabolic networks, biological pathways, microbial evolution and genome structure, significantly enhancing our understanding of bacterial function and potential (Kanehisa and Goto, 2000; Ley et al., 2006). However, deciphering relevant genetic information from the background genetic material was an almost impossible task. To combat this, more information was required. Information on the transcription of the genetic material and subsequent production of proteins was necessary. The targets of these proteins and the molecules they interact with had to be determined. Nuanced epigenetic triggers and the metabolites that are ultimately produced are all crucial to understanding the system as a whole. Indeed, much of the observed phenotype in a system can be explained in the context of these data sets when interpreted correctly.

Biological systems rely on the DNA – RNA – protein information transfer paradigm that determines the phenotype of an organism. Biologists have analyzed these "omes" for a number of years in the form of genomics, transcriptomics and proteomics. In addition to these, epigenomics and metabolomics have recently been used to answer specific questions relating to the many functions of an organism. Given the year on year advances in "omics" technologies, the volume of information that can be gathered in individual studies is expanding rapidly. Furthermore, the current high throughput nature of these techniques has increased accessibility to this information in terms of time and cost. This has placed many researchers in a situation where they can collect several omics data sets on the same experimental samples. In order to draw more comprehensive conclusions on biological processes these data sets must be integrated and analyzed as a holistic system.
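As a minimal sketch of what integrating data sets collected on the same samples can mean at its simplest, the example below merges per-gene measurements from several omics layers into one record per gene and then asks a question that no single layer can answer. All gene identifiers, values, and function names are hypothetical, and real integration pipelines involve normalization and statistics that are omitted here.

```python
def integrate_layers(**layers):
    """Merge per-gene measurements from several omics layers into one record
    per gene. Genes missing from a layer get None for that layer."""
    gene_ids = sorted(set().union(*(layer.keys() for layer in layers.values())))
    return {
        gene: {name: layer.get(gene) for name, layer in layers.items()}
        for gene in gene_ids
    }


def expressed_but_untranslated(table, rna_key, protein_key):
    """Genes with a transcript signal but no detected protein
    (a missing or zero protein value counts as not detected)."""
    return [
        gene for gene, row in table.items()
        if (row[rna_key] or 0) > 0 and not row[protein_key]
    ]


# Hypothetical gene identifiers and values, for illustration only.
genome = {"gene_1": True, "gene_2": True, "gene_3": True}
transcripts = {"gene_1": 120.0, "gene_2": 4.5}
proteins = {"gene_1": 35.0}

table = integrate_layers(genome=genome, transcripts=transcripts, proteins=proteins)
candidates = expressed_but_untranslated(table, "transcripts", "proteins")
```

In this toy case `gene_2` is transcribed but has no detected protein, while `gene_3` is present in the genome but silent, a distinction that is visible only once the layers are combined.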

Technologies involved in a multi-omics approach share several commonalities, some of which contrast with the original approach of molecular biologists. For the most part, molecular biology has utilized a reductionist approach to date. This methodology involves breaking a complex problem into its constituent parts and solving them individually. While reductionism has had significant successes, particularly when the experimental subject is controlled by a single component (Isberg and Falkow, 1985), or can be explained by interactions between single molecules (Krebs, 1940), it also has substantial limitations. These limitations arise because isolating the components of a complex system often obscures the nature of their role in that system. Avoiding this loss is the most significant advantage that omics technologies can have over a reductionist approach: maintaining the components in the system allows observation in a realistic environment where emergent properties can also be studied. These high-throughput, top-down methods also provide a phenomenal volume of data in comparison to the reductionist approach. For example, as much as six terabases of sequence data can be generated by processing samples in tandem on an Illumina NovaSeq 6000. Finally, omics technologies require significant computational infrastructure in the form of novel algorithms and software to process and analyze the information produced (Berger et al., 2013).

This has placed researchers in a situation where they must adapt to maximize the results from biological data. Biologists must acquire skills in managing and manipulating these vast quantities of data to resolve experimental questions outside the scope of the lab bench. An appreciation of the strengths and limitations of many of these technologies will allow researchers to answer more complex questions and generate more general conclusions. This is a constant arms race as the technologies that underpin the generation of these omic and meta-omic data sets are constantly advancing. The scope of analysis ranges from entire community samples of 10<sup>12</sup> microbes to exploring the components of single cells. Higher throughput machines are facilitating deeper sequencing than ever before. This sequencing depth is opening new potential use cases such as metatranscriptomics of the gut microbiome. Integrating these large data sets to provide a systems level view will reveal previously unattainable information on the individual microbes involved.

LAB are uniquely placed to take full advantage of these fundamental changes in approach. Notably, LAB have been the subject of extensive research exploring specific attributes and functions of isolated cultures (Noike et al., 2002; Leroy and De Vuyst, 2004). In-depth analysis of individual LAB strains has provided a wealth of information on their biological processes and functionality in a variety of publicly available databases. Multi-omics technologies stand ready to exploit this information on LAB to decipher and predict many functions of interest using a systems level approach. Inter-microbe interactions, particularly within starter cultures, are some of the first to witness the potential in these advances (Sattin et al., 2016; Sirén et al., 2019). Similarly, multi-omics technologies are currently being used to tackle more complex systems such as the gut microbiome and host-microbe interactions (Turroni et al., 2016; Huang et al., 2017; Wang et al., 2019). This emerging field is capable of providing a platform for more accurate functional prediction of LAB in a variety of complex environments.

In this review, we will discuss the current most popular combinations of different omics technologies to facilitate accurate functional prediction for LAB. The most recent advances in the relevant technologies will be mentioned, while the potential they hold for deciphering the final phenotype of these microbes will be assessed. Finally, the computational barriers associated with integrating complex and diverse data sets will be discussed.

#### THE POTENTIAL FOR FUNCTIONAL PREDICTION IN LAB USING OMICS TECHNOLOGIES

LAB are among the most industrially significant groups of bacteria. These versatile microbes have a variety of potential functions that are applied in many sectors. Food production, health promotion, production of antimicrobials and in vivo fermentation all see benefit from this group of microorganisms. These diverse functions encoded within single genomes are a source of valuable information available for exploitation. As a result, these processes have been studied extensively using in vitro, in vivo and more recently in silico techniques to determine the critical pathways underpinning the phenotypes. Analysis of these molecular pathways has resulted in more accurate use of LAB in commercial endeavors such as starter fermentation cultures and probiotic supplements (Leroy and De Vuyst, 2004; Muñoz-Atienza et al., 2013). Deciphering the underlying biological attributes associated with these critical microbes allows greater understanding of their current roles and may reveal new applications. However, this scrutiny has often focused on a single element of interest in isolation, instead of taking the entire system into account.

A more inclusive approach is possible with diverse sets of information interrogated by omics technologies. To date, many groups have utilized both omics and multi-omics data sets to advance this field and progress knowledge of LAB function (Lahtvee et al., 2011; Rebollar et al., 2016; Huang et al., 2017; Filannino et al., 2018; Ellepola et al., 2019). The consistent progression of new technologies and methodologies has resulted in many studies providing crucial information on the function of LAB. Progress in this manner also provides direction for future work with these microbes. Studies using multi-omics for mammalian or bacterial cells may pre-empt similar work with LAB. New methodologies incorporating omics technologies are often applicable to a variety of research topics. These studies will be discussed to gain insight into the potential use cases for LAB. Translating the relevant findings in multi-omics research will focus on what can be discovered in LAB within ecosystems they inhabit and their common commercial applications. The influence exerted by the new technological advancements will also be assessed for their direct effect on LAB.

Basic functions of LAB are well understood, as are the cellular processes underlying them. With this foundation of knowledge, research into LAB is in a prime position to exploit the coming wave of omics technologies. The sections below document the strengths and limitations of each omics technology, the impact of each omic data set on inferring LAB function to date, the relevant advancement in technology and how these new methodologies may accelerate future progress.

#### THE IMPACT OF GENOMICS ON LACTIC ACID BACTERIA

Genomic information has recently become essential when studying microbes in detail. These data sets can provide an immutable link to the organism that for the most part remains constant. In well studied bacteria, such as many LAB, fully sequenced genomes are readily available in a variety of species (Chenoll et al., 2015; Inglin et al., 2018). These may be used as reference genomes when assembling draft genomes of the strain at hand. Mining these genomes alone reveals information on all traits available to the microbe. Potential products are inferred from the genetic code and viable pathways can be determined analyzing the relevant genes. These pathways are categorized by their overarching function e.g., carbohydrate utilization pathways. Comparative genomic analysis between the generic pathways available to LAB and the strain of interest may highlight the unique functional capacity of the particular strain of interest (Makarova et al., 2006). De novo construction of genomic information is also possible. Knowledge of characteristic motifs and patterns in specific alignment of DNA bases e.g., Shine-Dalgarno, allows software to detect important genes such as cluster specific transcription factors and the promoters associated with said genes (Wolf et al., 2018). Such methods can be used in conjunction with comparative genomics to predict the production of difficult to detect molecules such as secondary metabolites in the form of antibiotics, toxins and immunosuppressives (Zerikly and Challis, 2009; Weber et al., 2018). Mining genomes for well-known genes is more straightforward using BLAST or DIAMOND (Altschul et al., 1990; Buchfink et al., 2014). Searching the DNA or protein sequence of the element of interest against your genome will provide probability-based results on its presence in the genome. This information is the bedrock on which the multi-omics integration is built.
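The motif-based detection described above can be illustrated with a deliberately simplified sketch that scans a sequence for a Shine-Dalgarno-like core followed, a few bases downstream, by a start codon. The motif variants, the 5 to 10 base spacer window, and the function name are assumptions made for illustration; dedicated gene-prediction software uses far richer statistical models.

```python
import re

# Simplified Shine-Dalgarno core variants (an assumption for this sketch).
SD_MOTIF = re.compile(r"AGGAGG|GGAGG|AGGAG")
START_CODONS = ("ATG", "GTG", "TTG")


def find_candidate_starts(sequence, min_spacer=5, max_spacer=10):
    """Return (motif_position, start_position) pairs where an SD-like motif
    lies min_spacer..max_spacer bases upstream of a start codon."""
    sequence = sequence.upper()
    hits = []
    for match in SD_MOTIF.finditer(sequence):
        for spacer in range(min_spacer, max_spacer + 1):
            start = match.end() + spacer
            if sequence[start:start + 3] in START_CODONS:
                hits.append((match.start(), start))
    return hits


# Hypothetical sequence: motif at index 4, seven-base spacer, ATG at index 17.
example = "TTTTAGGAGGTTTTTTTATGAAACCC"
hits = find_candidate_starts(example)
```

For the example sequence the function reports a single candidate, the motif at index 4 paired with the start codon at index 17. Such hits are candidates only and would normally be weighed against open reading frame length and homology evidence.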

Mining genomic information on its own may direct experimentation or suggest mechanisms for known functional attributes. This was exemplified by analysis of genomic data in Lactobacillus ruminis revealing the presence of functional flagellar apparatus in the form of 45 flagellar genes (Neville et al., 2012). A robust flagellar apparatus suggests that L. ruminis is a motile microbe and presents a mechanism for its proinflammatory tendencies. Despite not expressing flagella in culture media, strains with the genomic capacity to produce flagella were observed to partially recover this ability in vivo. Gene clusters for crucial mucus binding pili have been detected in several LAB (Douillard et al., 2013). This gene cluster explains L. rhamnosus' capacity to adhere to the intestinal mucosa (Kankainen et al., 2009). In a similar fashion, gene clusters that are capable of producing a broad range of bacteriocins have been reported. This information led to the observation that Lactobacillus salivarius outcompetes Listeria monocytogenes by means of these compounds (Corr et al., 2007).

Appropriate use of secondary metabolite software tools for analyzing the genome has resulted in the discovery of many novel antibiotics (Schulze et al., 2015; Tian et al., 2016). Incorporating software such as antiSMASH (Blin et al., 2017), PRISM (Skinnider et al., 2017) and GRAPE (Dejong et al., 2016) has facilitated mining of the genomic data sets for crucial biosynthetic gene clusters. A very similar process unlocks the potential in genomic data sets of LAB. These microbes are capable of producing a wide variety of diverse anti-microbial peptides (Stoyanova et al., 2012; Zacharof and Lovitt, 2012). Capitalizing on these powerful analysis tools can realize much of the potential that a genomic data set provides and determine many possible functions available to the bacterium in question. Researchers can forgo the culture based issues with screening for novel anti-microbial compounds and instead direct future experiments more accurately. This process was adeptly demonstrated in a study that identified putative bacteriocins (Singh N.P. et al., 2015). Twenty LAB genomes were assessed for relevant bacteriocin producing genes. Putative operons were identified leading to further characterization of novel bacteriocins. This simple process illustrates how genomic material can be exploited to identify traits of interest.

The utility of genomic information is on the cusp of a generational leap forward. Third generation sequencers, such as the Sequel II and the MinION, are set to remedy many of the intrinsic issues associated with second generation sequencers (Rhoads and Au, 2015; Lu et al., 2016). These are primarily a GC bias in fragmentation and amplification (Grokhovsky et al., 2011; Poptsova et al., 2014), short reads that make repeat regions difficult to sequence (Bovee et al., 2007; Aird et al., 2011; Genovese et al., 2013) and the substantial computational burden of reassembling genomes from a huge volume of short reads (Aldrup-Macdonald and Sullivan, 2014; Sims et al., 2014). Limitations associated with short read sequencing become apparent when analyzing large insertion sequences (Iranzo et al., 2014) or substantial rearrangements in chromosomal or circular genomic sequences (Darling et al., 2008; Sobreira et al., 2011). Transposable elements often contain genes of interest encoding traits such as antibiotic resistance and bacteriocin production (Klaenhammer, 1993; Toomey et al., 2009). In some cases, critical processes such as genome replication are also intrinsically linked to the presence of inverted repeats (El Kafsi et al., 2017), and structural variants arising from large genomic transfers between species often harbor crucial mechanisms (Kant et al., 2011). Progress in this area will have a significant impact on determining some of the poorly understood traits associated with LAB (Teusink and Molenaar, 2017). Similarly, it will be possible to shed more light on up-and-coming areas for LAB such as discovering novel bacteriocins (Perez et al., 2014). The ability to analyze far greater read lengths (>20 kb) has facilitated the characterization of clusters coding resistance to crucial antibiotics as well as accurate tracking of large translocations of patho-adaptive traits from commensals to pathogenic bacteria in the microbiota (Huang et al., 2016; Proença et al., 2017). LAB are likely to have gained traits from similar translocation events.
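Why read length matters for repeats and insertion sequences can be shown with a small sketch: a read helps place a repeat unambiguously only when it covers the whole repeat plus some unique flanking sequence on both sides. The 50-base anchor, the coordinates, and the function name below are hypothetical choices for illustration, not parameters of any particular assembler.

```python
def read_spans_repeat(read_start, read_length, repeat_start, repeat_length, anchor=50):
    """True when a read covers the whole repeat plus `anchor` unique bases on
    each side, which is what allows the repeat to be placed unambiguously."""
    read_end = read_start + read_length
    repeat_end = repeat_start + repeat_length
    return read_start <= repeat_start - anchor and read_end >= repeat_end + anchor


# Hypothetical coordinates: a 6 kb insertion sequence starting at position 10,000.
short_read = read_spans_repeat(9_950, 150, 10_000, 6_000)    # 150 bp read
long_read = read_spans_repeat(8_000, 20_000, 10_000, 6_000)  # 20 kb read
```

The 150 bp read ends inside the element and cannot anchor it on the far side, whereas the 20 kb read spans it with unique sequence on both flanks.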

However, this is the proverbial tip of the iceberg regarding the potential impact on LAB due to third generation sequencing. The most striking example of this is the extraordinary amount of information in the form of epigenetics that is lost due to the fragmentation and sequencing process in next generation sequencing. Epigenetic, post transcriptional modifications exert significant control on bacterial genomes resulting in altered phenotypes (Goldberg et al., 2007). Currently, there is a significant cost in both time and money associated with determining epigenetic marks (Kurdyukov and Bullock, 2016; Soto et al., 2016). Both Nanopore and PacBio report the presence of epigenetic modifications during sequencing (van Dijk et al., 2018). This will result in regular sequencing reporting epigenetic alterations to bases, in turn opening this added layer of complexity to a far greater audience (Casadesús and Low, 2006).

#### OVERVIEW OF OMICS TECHNOLOGIES

The following section reviews the existing omics technologies in the context of LAB. The advantages and disadvantages of each are summarized in **Table 1**.

#### Single Cell Genomics

Single cell sequencing approaches have become more frequently utilized throughout the past decade (Tang et al., 2009). This approach holds significant promise for several reasons, primarily due to its potential to decipher cellular differences within heterogeneous cell populations in any tissue or cell culture. Determining cell heterogeneity is an essential step in understanding the development, regulation and response to external influence in a population of cells. This natural heterogeneity is amassed and averaged in bulk sequencing approaches. Traditional sequencing removes much information that may indicate more nuanced reasons for phenotypes of interest. Many techniques have been developed to isolate and sequence these single cells in a cost-effective and high-throughput manner (Wang and Navin, 2015; Lan et al., 2017; Hwang et al., 2018). Microfluidics and Fluorescence-Activated Cell Sorting (FACS) are the most popular methods to date. FACS relies on tagging and isolating fluorescent cells by capitalizing on the charged nature of a fluorescently tagged cell (Gross et al., 2015). Microfluidics focuses on the precise combinations of oil, surfactants and cells to create a droplet containing a single cell (Lecault et al., 2012). These techniques are used in a variety of fields and are well suited to single cell sequencing. Furthermore, these techniques have been adapted to include lysing of cells and to incorporate sequencing materials within the droplets encapsulating the cell components.

Single cell sequencing technologies have focused on human cells to date. This is in part due to the ease involved in lysing them to release nucleic acids, enabling high throughput protocols. This issue is being addressed by linking higher throughput cell isolation methods, such as microfluidics, with suitable lysing protocols for bacterial cells (Liu et al., 2018). For this reason, however, LAB studies availing of high throughput analysis are not presently available. Despite this, proof of principle studies demonstrate the potential for LAB research in this area. Large sequencing attempts to explore the "microbial dark matter" of unknown areas of the tree of life (Rinke et al., 2013) in microbiome samples have been conducted. In a similar manner, single cell isolation techniques may also be employed to analyze the least abundant bacterial species within community samples. Minor community members have been observed within the fermented dairy product Koumiss using this approach (Yao et al., 2017). The protocol, described by Yao et al., 2017, involves diluting microbiome samples and sequencing single cells. This powerful, yet simple, technique can be exploited to analyze pools of bacteria that are known to have a specific output or phenotype in order to isolate the cells responsible. Analyzing minute quantities of DNA and RNA, sometimes as low as femtograms of material, is within the remit of these single cell techniques (Lasken, 2007). This knowledge has been utilized in environmental samples to isolate microbes of interest such as oil degrading microorganisms (Mason et al., 2012). Furthermore, single cell segmented filamentous bacteria were isolated using microfluidics from mouse gut microbiome samples (Pamp et al., 2012). This protocol provides an isolation method applicable for single cell LAB in community samples. Despite the lack of single cell sequencing studies on LAB, many of the techniques described are directly applicable to LAB.
These highlight the potential advances that are attainable in this area. The rapid progress of isolation technologies, lysing protocols and sequencing depth will provide a more stable platform for targeted single cell analysis of LAB (Gawad et al., 2016).

**Table 1** legend: The recent advances in their respective fields are included to illustrate the frequent progress in this area. As discussed below, many of the weaknesses that are mentioned may be overcome by integration of several of these omics technologies together.

#### Metagenomics

In contrast to single cell genomics, metagenomics provides community-based genome sequences of many diverse species simultaneously. This information allows correlation-based work to compare the abundance of particular gene families to the respective environment (Brown et al., 2011). Metagenomics provides an overview of species abundance in the microbiome and characterizes common metabolic pathways available in the ecosystem (Huttenhower et al., 2012). The contribution LAB make to the gene pool and functional processes can be discerned using metagenomic data. Armed with this knowledge, the potential role LAB play in the community can be determined. Metagenomics is regularly used to determine the microbial diversity in order to direct further analysis of the sample at hand. Zhang et al., 2016, used metagenome sequencing to study novel fermented foods. These insights into fermented foods revealed potentially interesting Lactobacillus strains that were then isolated from the samples (Zhang et al., 2016). This targeted approach to omics data is the most effective method when working with a single omics set. However, combining other omics data unveils a more dynamic image of the metagenome. This is most commonly applied when integrating genomic and transcriptomic data. These data can be indispensable to understanding the functional role members fill in a given system.

#### COMBINING TRANSCRIPTOMICS WITH GENOMIC DATA SETS

The combination of genomics and transcriptomics is one of the most common in addressing experimental questions. Combining transcript data with available genomic information provides an image of the intentions of the organisms, given a specific environmental situation. The integration of genomic and transcriptomic data (Curtis et al., 2012; Ju et al., 2012; Craig et al., 2013; Lappalainen et al., 2013) is frequently used across many fields, while merging metagenome and metatranscriptome data (Shi et al., 2010; Solbiati and Frias-Lopez, 2018) is becoming far more prevalent. Genomic and transcriptomic data sets have been combined regularly to offer insight into the role LAB play in food spoilage (Andreevskaya et al., 2015), potentially probiotic traits of LAB strains (Saulnier et al., 2011) and their ability to use alternative electron acceptors in order to respire instead of ferment (Brooijmans et al., 2009). Isolating strains with particular functions is readily facilitated by transcriptomics data. A systems approach was used to analyze the altered functional capacity of a mutated Lactococcus lactis strain utilizing genomic and microarray data (Chen et al., 2015). The genetic component responsible for its increased thermoresistance was determined using this combination of omics data. Transcriptome analysis of Lactobacillus strains causing beer spoilage was performed to determine the functional pathways which enable these microbes to enter the viable putative non-culturable (VPNC) state and thus survive in beer. Analysis of three Lactobacillus acetotolerans strains revealed that these strains were in a heightened stress state and had reduced gene expression levels in several other regular pathways such as metabolic processes, transport and enzyme activity. Understanding this process may afford future opportunities to prevent beer spoilage by inhibiting entry into the VPNC state (Liu et al., 2016).

Fundamental processes in LAB such as amino acid and carbohydrate metabolism have been advanced using transcriptomic data. Comparative transcriptomics has been imperative in understanding amino acid metabolism in Lactococcus lactis MG1363. A codY mutant strain was used to determine the role this gene plays in regulating more than 30 genes involved in metabolizing amino acids (den Hengst et al., 2005; Guédon et al., 2005). This strain was further analyzed using transcriptomic data to analyze its global regulatory networks during growth in milk (de Jong et al., 2013). Knowledge regarding the expression of critical genetic components in LAB such as the catabolite control protein A (CcpA) has seen considerable advancement using transcript data. Deep transcriptomic and physiological data were used to explore the differential expression between WT Lactobacillus plantarum and a CcpA mutant during growth phase on different carbohydrates (Lu et al., 2018). This study reports a substantial rearrangement in the carbohydrate metabolism regulatory network and sheds new light on the complexity of this process. It is clear that incorporating this data set into LAB research has already led to an increased understanding of these microbes.

#### Single Cell Transcriptomics

Transcriptomics at a single cell resolution is a relatively new field that may unravel many of the changes in transcription that occur through a cell's life span. These alterations are completely masked by bulk transcriptomics (Shapiro et al., 2013). With new technologies providing faster delineation of single cells (Klein et al., 2015) in conjunction with greater sequencing depth with machines such as the NovaSeq, large scale single cell transcriptomics is more accessible than ever. Progress in these complementary fields allows researchers to explore a previously unavailable aspect of cell state heterogeneity. Many single cell transcriptomic studies have focused on stem cells (Kolodziejczyk et al., 2015), embryos (Yan et al., 2013), tumors (Patel et al., 2014) and the nervous system (Zeisel et al., 2015). The use cases for this approach in these tissue types are apparent due to the advantages of delineating the differences between differentiated and non-differentiated cells. Comparing the responders to non-responders provides a greater opportunity to isolate mechanisms that stimulate desired responses (Hidalgo-Cantabrana et al., 2012; Shalek et al., 2013). This method is also efficacious when the genome and transcriptome sequencing are carried out simultaneously on the same cell (Macaulay et al., 2015). These researchers sequenced the DNA and RNA of single mammalian cells in parallel, demonstrating the current capacity of single cell technologies. A subpopulation of 10% within 172 single cells of human and murine origin was reported after analysis. Several genetic alterations between cells and large chromosomal translocation events were observed.
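The masking effect of bulk measurements can be illustrated with a few lines of arithmetic. The Python sketch below is purely illustrative: the transcript counts are invented, and the split of 17 cells out of 172 is only a rough stand-in for a minority subpopulation of about 10%, not data from any cited study.

```python
from statistics import mean

# Hypothetical population: a minority of cells expresses a gene strongly,
# while the remaining cells do not express it at all.
N_CELLS = 172
N_SUBPOPULATION = 17  # roughly 10% of the cells (assumed for illustration)

# Invented transcript counts for one gene in each single cell.
counts = [100] * N_SUBPOPULATION + [0] * (N_CELLS - N_SUBPOPULATION)

# A bulk experiment effectively reports one averaged value for the population.
bulk_average = mean(counts)

# Single cell data retain the fraction of cells that actually express the gene.
fraction_expressing = sum(count > 0 for count in counts) / N_CELLS

if __name__ == "__main__":
    print(f"bulk average per cell: {bulk_average:.2f}")
    print(f"fraction of expressing cells: {fraction_expressing:.2%}")
```

The bulk value alone cannot distinguish this scenario from one in which every cell expresses the gene at a uniformly low level, which is the distinction single cell transcriptomics is intended to recover.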

LAB are among the best understood constituents within the microbiome. However, it is difficult to determine the importance of their role in an ecosystem this large without a systems approach (Pessione, 2012; Waldor et al., 2015). Constituents of the microbiome are known for their ability to affect the impact of drug compounds and therapy (Lindenbaum et al., 1981), ferment and convert many components in our diet (Albenberg and Wu, 2014) and have significant impact on healthy brain function (Foster and McVey Neufeld, 2013). Single cell technologies may lead the way in deciphering the role of LAB in these functions of the microbiome.

#### Metatranscriptomics

Advances in next generation sequencing technology have reached a sequencing depth that facilitates more comprehensive metatranscriptomics of large community samples such as the gut microbiome (Bashiardes et al., 2016; Furnholm et al., 2017; Mehta et al., 2018). The current NovaSeq can produce 20 billion reads in a machine run. With this volume of reads, between 100 and 400 taxa can reach the read saturation required for the highest statistical power (Ching et al., 2014). This step forward provides a powerful tool for gut microbiota analysis and realizes a true systems biology approach to determining how this community reacts to environmental perturbations. Due to the large database of sequenced LAB genomes, LAB stand to gain the most from the analysis of ecosystems such as the gut microbiome with a systems level approach. This combination of metagenomic and metatranscriptomic data is just beginning to become relevant in larger ecosystems; however, significant results have already been attained.
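As a rough guide to what these figures imply, the short Python sketch below divides the quoted 20 billion reads evenly among 100 or 400 taxa. The even split is an assumption made only for illustration; real communities are strongly skewed, so abundant taxa receive far more reads and rare taxa far fewer.

```python
# Back-of-the-envelope read budget, assuming reads are shared evenly
# between taxa (an illustrative simplification, not a property of real samples).
READS_PER_RUN = 20_000_000_000  # 20 billion reads per run, as quoted in the text


def reads_per_taxon(total_reads: int, n_taxa: int) -> int:
    """Return the average number of reads available per taxon under an even split."""
    if n_taxa <= 0:
        raise ValueError("n_taxa must be a positive integer")
    return total_reads // n_taxa


if __name__ == "__main__":
    for n_taxa in (100, 400):
        print(f"{n_taxa} taxa: {reads_per_taxon(READS_PER_RUN, n_taxa):,} reads each")
```

Under this simplification each taxon receives between 50 and 200 million reads, which gives a sense of the depth behind the saturation estimate cited above.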

To date, this approach has been observed in smaller microbiome samples such as kimchi (Jung et al., 2013) and rumen (Kamke et al., 2016). This methodology was also used in a proof of principle analysis to determine the function of a critical microbe in bacterial vaginosis. Indeed, by utilizing metatranscriptomics, Lactobacillus iners was implicated in having a functional role in this disease, differentially expressing over 10% of its genome between healthy and disease states (Macklaim et al., 2013). Specific commercially important processes such as cheese ripening have also seen the impact of this approach. Despite using shallower metatranscriptome data, De Filippis et al., 2016 demonstrated temperature-driven functional changes in the cheese microbiome during ripening which had a significant impact on cheese maturation rate. They indicated that "processing-driven microbiome responses" can be altered to influence product quality and production efficiency (De Filippis et al., 2016). Expression data that can be tracked throughout the process and related back to the specific strains responsible are invaluable in important commercial processes such as cheese production. The potential to carry out similar research in large microbiome samples such as the gut, in a manner similar to De Filippis' experiment, is fast approaching.

Third generation sequencing platforms may have a direct impact on the RNA-seq field. Long read technologies perform better in the determination of unknown transcript abundance in single celled organisms (Tombácz et al., 2018), full-length splice isoforms with alternative splicing (Xu et al., 2017) and co-transcription of genes in a polycistronic fashion (Tardaguila et al., 2018). Long read sequencing has clear advantages in reducing the difficult reassembly and loss of contextual information associated with short read sequencing. However, the lower accuracy and sheer scale of some transcriptomics and metatranscriptomics projects keep them out of reach of current third generation sequencers.

Integrating metatranscriptomics with deep metagenomic data presents the ability to track a time-mediated response to specific changes in the environment. Be it antibiotic exposure, pathogenic infection or probiotic administration, a wealth of information on how the system is reacting to the alteration will be generated. Such data could be tracked back to each species and provide information on how to produce effective therapies in similar situations. This holistic approach to studying these bacteria in their natural habitat will reveal much about the production of compounds of interest and may result in interesting revelations about the transition between the microbiome's symbiotic and dysbiotic states.

#### ASSESSING THE UTILITY OF PROTEOMICS WITHIN MULTI-OMICS DATA

Proteomics investigates the complete set of proteins present in a cell, tissue or organism at a molecular level. Proteomics in its own right is a powerful analytical tool that has helped resolve many functional questions regarding LAB. Proteomic analysis has determined abundant compounds that are present in the transition between growth phases in LAB (Pessione et al., 2005) and has been used to study the metabolic interactions of LAB (Pessione et al., 2010). Proteomics has determined critical proteins involved in acid stress resistance in Lactobacillus casei comparing a known stress resistant mutant to the Wild Type (WT) strain (Wu et al., 2012). Assessing the complete set of secreted proteins in LAB has increased our understanding of how these bacteria interact with their environment (Zhou et al., 2010). Research to determine the capacity for LAB to resist osmotic stress, a critical trait of all microbes in challenging environments, has also progressed notably utilizing proteomics (Zhang et al., 2010). These examples serve to prove the flexibility and effectiveness of proteomics; however, this information is best used when combined with other omics data sets. Despite considerable advances in proteomic technologies of late (Michalski et al., 2012; Hein et al., 2013; Gillet et al., 2016), this omics data set is the limiting factor in relation to throughput when integrated with genomics and transcriptomics. Untargeted discovery proteomics is termed Data Independent Acquisition (DIA) and allows the most comprehensive combination with other omics sets (Hu et al., 2016). This process facilitates the tracking of genes to proteins in a manner that produces functional data (Tocchetti et al., 2015; Trapp et al., 2016; Kedaigle and Fraenkel, 2018).

Protein abundance is intrinsically linked to the mRNA levels discussed above; however, mRNA abundance does not correlate well with protein abundance in a system (Chen et al., 2002; Pascal et al., 2008; Vogel and Marcotte, 2012). Considering this disparity, and that proteins are the molecules that control almost all cellular processes, the benefit from integrating these technologies for a more complete image is clear (Griffin et al., 2002; Cox and Mann, 2007). When combined, these data sets can answer higher dimensional questions about large scale processes such as studying the metabolism of many products simultaneously in a holistic manner (Delmotte et al., 2010; Wang et al., 2013). Complex microbial interactions such as quorum sensing may be deciphered using this approach (Di Cagno et al., 2011). This process was carried out to assess the complex interplay between LAB strains in yogurt fermentations. Transcriptomics and proteomics were combined to understand how the strains present interacted to produce the desirable effects. By-products from a single strain stimulated growth in the coculture resulting in a reliable yogurt formation (Herve-Jimenez et al., 2009; Sieuwerts et al., 2010). Similarly, this information can be used to further understand specific traits of LAB such as their crucial ability to manage bile stress in the case of probiotic strains of Lactobacillus (Koskenniemi et al., 2011). Combining proteomics with genomics and transcriptomics allows more robust biomarkers and treatments to be determined. This is observed in traits of disease phenotypes in humans (Wheelock et al., 2013) or functional processes in bacteria (De Keersmaecker et al., 2006). Understanding the methods that bacteria use to interact with their environment through several omic data sets develops network links between these data sets.
Primary processes such as stress responses are frequent targets for a combinatory approach and may reveal critical information that would be lost without a holistic approach (Dressaire et al., 2011).
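The disparity between mRNA and protein abundance noted above is typically quantified with a rank correlation over paired measurements. The following Python sketch computes a Spearman correlation using only the standard library. The six paired values are invented for illustration, and the helper names `ranks`, `pearson` and `spearman` are hypothetical rather than taken from any of the cited studies.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined if either input has zero variance."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(ranks(x), ranks(y))


# Invented abundances for six genes (arbitrary units).
mrna = [10, 20, 30, 40, 50, 60]
protein = [5, 50, 8, 70, 12, 30]

if __name__ == "__main__":
    print(f"Spearman rho between mRNA and protein: {spearman(mrna, protein):.3f}")
```

A coefficient well below 1 for paired data of this kind is the sort of result that motivates measuring the proteome directly instead of inferring it from transcript levels.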

#### Single Cell Proteomics

Single cell analysis also makes an impression on the field of proteomics. Until recently, there were merely proof of principle publications describing the ability to identify minute concentrations of proteins available in a single cell (Jo et al., 2007; Rubakhin and Sweedler, 2007). Mass spectrometry, the primary method for proteomics research, is frequently used when tens of thousands of cells are available from which to extract proteins. However, a single cell has in the region of 1 × 10<sup>5</sup> protein molecules. With this in mind, it is clear why single cell mass spectrometry (MS) techniques will reveal only the most abundant of proteins present. Despite this, advances in this area have increased the scope of single cell proteomics considerably. Progress within the flow cytometry field has resulted in an increased variety of fluorescent markers (Krutzik and Nolan, 2006). Antibody based immunofluorescence confocal microscopy has been used in human cells to identify >12,000 proteins across multiple cell lines (Thul et al., 2017). Microfluidic image cytometry can now analyze activity of specific protein groups such as kinases (Sun et al., 2010) and using photocleavable DNA barcode-antibodies to quantify various proteins from single cells is also possible (Agasti et al., 2012; Ullal et al., 2014). These techniques have enabled targeted proteomics of single cells; however, to our knowledge no single cell proteomics work has been carried out on LAB. Tracking the mechanisms and rate at which single cells adapt to environmental exposures will help define the specific triggers, systems and pathways involved in these situations. Developing knowledge of these networks while avoiding the clouded nature of bulk analysis is the most effective method to increase the accuracy at which we can predict the function and reactions of these microbes.

#### Metaproteomics

Currently, metaproteomics struggles to match the comprehensive nature of metagenomics and metatranscriptomics. However, a recent study by Ting et al. (2017) depicts the current power of untargeted exploratory proteomics accurately. This group demonstrated that the difference between the more comprehensive nature of Data Independent Acquisition and more accurate Data Dependent Acquisition (DDA) is lessening. After developing a novel library free peptide detection method, PECAN, this group was capable of detecting 12,767 peptides within a sample, 6,221 of which were unique compared to the targeted approach. The untargeted DIA approach detected 83% of the peptides elucidated during the targeted DDA approach. The detection accuracy was impressive as ∼99.5% of the retention times were identical between both approaches. This is while simultaneously detecting more than twice the number of peptides, indicating its suitability for exploratory proteomics. Techniques such as this are facilitating more realistic, high throughput proteome analysis (Kim et al., 2014). This represents the future of larger scale metaproteomic work; however, it is still in its infancy. For this reason, targeted assessment of protein abundance and identification is more appropriate and useful than untargeted in its current state. As such, a targeted approach must often be taken when combining the information with metagenomic and metatranscriptomic data. It is feasible to choose a specific function or set of functions to analyze as part of the system-wide metaomics approach. Important traits associated with functional features of LAB can be interrogated further with proteomics in this manner (Hamon et al., 2011; Perez Montoro et al., 2018). Analyzing functional traits has developed a significant understanding of the proteome of the LAB group to date (De Angelis et al., 2016). Several studies in 2019 have combined metaproteomic data sets with the genomic counterparts to analyze fermented foods.
This combination has revealed much information regarding the central role played by LAB in these functional foods (Xie et al., 2019a,b). However, these metaproteomic approaches may only assess smaller scale community samples. Expanding this process to incorporate gut microbiome samples is beyond current proteomic technologies (Verberkmoes et al., 2009).

#### COMPLETING THE MULTI-OMICS PICTURE: METABOLOMICS

Metabolomics is the study of all metabolites produced in a given system. This omics technology is a natural progression from proteomics in that proteins are responsible for the presence of the majority of metabolites found in an organism. The far-reaching effect that metabolomics research may have is apparent, as all phenotypes are intrinsically linked to the metabolites involved in the system studied. This concrete connection between the phenotype and metabolome makes it an exciting data set to add to any research. The volume of information that can be gathered is substantial considering the sheer scale of all metabolites that may be present. Recent advances have resulted in more accurate systems than ever for delineating these compounds. Back-to-back Liquid Chromatography or Mass Spectrometry units (LC/LC-MS, LC-MS/MS) (Cui et al., 2018), High Temperature Ultra High Performance Liquid Chromatography (HT-UHPLC) (Yoshida et al., 2007; Sarrut et al., 2014) and nanoflow UHPLC-nanoESI-MS (Chetwynd et al., 2015) are filling in the gaps for previously difficult to detect compounds such as hydrophilic or minor metabolites. This progress results in untargeted metabolomics reaching the throughput necessary to combine it comprehensively with other omics data sets.

The utility of metabolomics technology on its own is obvious and has been used to make connections between gut related diseases and metabolites produced by gut microbes (Bingham, 1999; van Nuenen et al., 2004; Nicholson et al., 2005). Metabolomics has also been utilized to decipher functions of LAB such as their role in commercially important processes and to reveal information regarding their metabolism (Weckx et al., 2010; Wilson et al., 2012; Mozzi et al., 2013). In a study by Hong et al., 2011, probiotic LAB were introduced to a group of Irritable Bowel Syndrome (IBS) sufferers. NMR was used to determine the metabolic niche that these LAB fulfilled in the IBS sufferers by comparing them to a control group not receiving the probiotic (Hong et al., 2011). The resulting data suggested a dysregulation in energy homeostasis and liver function based on the metabolites present. The potential of metabolomic data sets is unlocked when they are linked to other components throughout the organism, such as proteins and RNA.

Many functions have been determined integrating these data sets as they provide a powerful platform to answer many research questions. This has been observed in several fields such as chemotherapy toxicity, toxicology, fungal phytopathology and heavy metal resistance in plants (Heijne et al., 2005; Tan et al., 2009; Singh S. et al., 2015; Wilmes et al., 2015). In LAB, this combination of technologies is useful when analyzing an organism-wide response such as growth efficiency and amino acid metabolism (Lahtvee et al., 2011) or the ability to adapt to ecological niches (Heinl and Grabherr, 2017). Conversely, these omics can be used with a specific end point in mind such as probiotic potential to treat a specific issue, providing a powerful protocol for selecting suitable candidates (Rebollar et al., 2016). A combination of genomics, transcriptomics and metabolomics has been used to analyze the microbial role in the fast growing area of milk whey. Milk whey has recently transitioned from a low value by-product to a high value commercial product. However, the regulation has lagged behind, resulting in many unknowns regarding the microbial composition and microbial by-products present in this milk whey. Omics data sets allowed Sattin et al., 2016, to report the reasons for poor whey quality, the potentially concerning compounds present and how to maintain higher commercial value (Sattin et al., 2016). Similar processes were described when assessing the role multi-omics data may play in food microbial interactions (Sattin et al., 2016; Ferrocino and Cocolin, 2017). Indeed, genomics, transcriptomics and metabolomics aided Turroni et al., 2016, in deciphering the complex interactions that allow Bifidobacterium strains to persist in the murine gut (Turroni et al., 2016). This valuable information leads to a greater understanding of the functional requirements for probiotic strains to survive this environment.

Exploiting multi-omics data sets is now a pivotal step in discovering novel antibiotics (Palazzotto and Weber, 2018). Albright et al., 2014, demonstrate a clear methodology for screening strains for novel antibiotic metabolites based on multi-omics data (Albright et al., 2014). **Figure 1** shows a generalized version of the method outlined in this paper. The flow begins with sequencing the genome of the strain of interest. This facilitates searching the genome for the presence of biosynthetic gene clusters that may indicate antibiotic production. The genome-enabled proteomics used in this study produced a >15 fold increase in the number of antibiotic peptides detected compared to regular proteomic analysis. After determining the relevant proteins produced, metabolites were assessed using data-dependent acquisition. Metabolites associated with the proteomic data set that represent potential antibiotic compounds are selected using LC-MS/MS. Isolated metabolites are compared to the relevant databases to rule out the known compounds. In this manner, novel natural products are isolated in an effective progression from strain to isolated compound using multi-omics data.
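The final step of this workflow, ruling out known compounds, can be sketched in a few lines of Python. The example below is a deliberately simplified illustration and not the procedure used by Albright et al.: it matches observed features to a reference list by mass-to-charge ratio alone within a parts-per-million tolerance, whereas real dereplication would also draw on fragmentation spectra and retention behavior. All compound names, m/z values and the 5 ppm tolerance are invented assumptions.

```python
def ppm_error(observed_mz: float, reference_mz: float) -> float:
    """Signed mass error of an observation relative to a reference, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def dereplicate(observed, known, tolerance_ppm=5.0):
    """Split observed features into known hits and candidate novel features.

    observed and known map a label to an m/z value. A feature counts as known
    when any reference value lies within tolerance_ppm of it.
    """
    hits, novel = [], []
    for feature, mz in observed.items():
        match = next(
            (name for name, ref_mz in known.items()
             if abs(ppm_error(mz, ref_mz)) <= tolerance_ppm),
            None,
        )
        if match is None:
            novel.append(feature)
        else:
            hits.append((feature, match))
    return hits, novel


# Hypothetical reference database and observed features (values are made up).
known_compounds = {"compound_A": 500.0000, "compound_B": 750.2500}
observed_features = {
    "feature_1": 500.0010,  # about 2 ppm from compound_A
    "feature_2": 612.3456,  # no nearby reference value
    "feature_3": 750.2600,  # about 13 ppm from compound_B, outside tolerance
}

if __name__ == "__main__":
    hits, novel = dereplicate(observed_features, known_compounds)
    print("known:", hits)
    print("candidate novel:", novel)
```

Features left in the candidate list would then be prioritized for isolation and structural characterization, mirroring the progression from strain to compound described above.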

#### Single Cell Metabolomics

Single cell metabolomics, although still in its infancy, can reveal the closest link to the phenotype of a single cell in a population. Sample volumes, low concentrations of analytes and sampling techniques are significant obstacles to this technology, which is based on minuscule amounts of material. However, in recent years, we have seen significant steps in accurate sampling of metabolites. A comprehensive review has been completed by Duncan et al., 2019, on the state-of-the-art techniques developed over the last three years for single cell metabolomics (Duncan et al., 2019). A primary concern of these techniques is the transient nature of metabolites. A cell with particularly fast metabolic turnover can change its metabolic profile in 0.3 s (Zhang and Vertes, 2018). This volatility means the cells must be maintained in the native environment and treated carefully to avoid artifacts of the sampling technique.

Live single cell mass spectrometry shows promise, providing rapid, direct analysis of targets while also producing a complete annotation of the results in less than one hour (Fujii et al., 2015). Fluorescence-based techniques relying on flow cytometry and direct microscopy are suitable methods for analyzing specific metabolites (Li et al., 2016; Mondal et al., 2017). More recent developments in single cell technologies such as microfluidics are also applicable in isolating cells based on their extracellular metabolite profile (Wang et al., 2014). With associated technologies progressing quickly, it is clear that single cell metabolomics will make large steps with regard to throughput and accuracy in the near future. However, no LAB based single cell metabolomics data are available in published literature. Despite the difficulty in accessing the metabolomic data of single cells, the utility of the information is clear. This analysis allows abundance of specific metabolites to be confidently linked to traits. Moreover, developing specific links will aid functional prediction in all similar microbes with metabolomics data.

The flexibility of these data sets is apparent and with the advent of more effective and comprehensive technologies, even more research will move to this culture independent, data driven approach. Further development in multi-omic databases will fuel research in improving the technologies related to each omics data set. Understanding the methods that bacteria use to interact with their environment through analysis of simultaneous omics data sets will result in the development of network links between the data sets themselves. Common links that are regularly observed between omics data sets will lead to greater molecular understanding and will be incorporated into traditional interaction networks. Extracting and depicting these complex interactions is an intrinsic issue associated with big data sets that multi-omics produce. Complex software pipelines play an important role in making sense of these data. To this end, we will briefly mention the more prominent software used for these integration processes.

#### DATA INTEGRATION AND COMPUTATIONAL LIMITATIONS

The integration of data sets generated in multi-omics research is by no means trivial. Each omics technology produces a different type of data, which complicates the analysis from the outset. This is further complicated by the sheer volume of information that must be sifted through, particularly with meta-omics data. Software pipelines have been developed to manage this seemingly impossible task. These programs create models that can predict outcomes when multi-omics data are provided as input. The outcomes analyzed are frequently disease states that can be described in terms of multi-omic data sets using these techniques. These pipelines generally fit into one of three approaches: transformation-based, concatenation-based or model-based integration.

These approaches tackle the problem with different methods, but all are valuable for integrating the data. However, each has limitations when handling these data, which manifest as difficulty in transforming differing data sets, in combining massive input matrices and in overfitting to training data. Most pipelines adopt one of these approaches as a broad starting point. Researchers then differ in the details of how they treat types of data and how they weight connections between their data points.

A creative use of a transformation-based approach has led to remarkable results in liver cancer survival prediction (Chaudhary et al., 2018). The model developed by Chaudhary et al. (2018) uses a deep learning autoencoder and a univariate Cox-PH model to choose features associated with survival. K-means clustering is applied to these features to determine survival-risk groups. The omics data sets are then ranked via an ANOVA and features common with the predicting set are chosen. This step is depicted in **Figure 2**, where omics data sets are transformed into survival-risk predicting sets. These ranks can then be compared and combined. The final survival-risk labels are generated from the top features chosen by this method (Chaudhary et al., 2018). By incorporating microRNA seq, RNA seq, methylation and genomic information, this study highlights how powerful multi-omics data can be when appropriate software is available to realize its potential.
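The flow of such a transformation-based pipeline can be sketched on toy data. The following is a minimal illustration only and is not the implementation of Chaudhary et al. (2018): a PCA projection stands in for the autoencoder bottleneck, the Cox-PH filtering step is omitted because the simulated data have no survival times, and the block names, sample sizes and effect sizes are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
hidden_group = np.repeat([0, 1], n // 2)   # toy "survival-risk" groups (assumed)

def toy_block(n_features, n_informative, shift=4.0):
    """Simulate one omics layer; only the first columns differ between groups."""
    X = rng.normal(size=(n, n_features))
    X[:, :n_informative] += shift * hidden_group[:, None]
    return X

def zscore(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_embed(X, n_components):
    """Linear stand-in for the autoencoder bottleneck (top principal components)."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def two_means(Z, n_iter=100):
    """K-means with k=2, seeded from the two extreme samples on the first axis."""
    centers = Z[[Z[:, 0].argmin(), Z[:, 0].argmax()]]
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([Z[labels == j].mean(axis=0) for j in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def anova_f(X, labels):
    """One-way ANOVA F statistic for every column of X across label groups."""
    groups = np.unique(labels)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for g in groups:
        Xg = X[labels == g]
        ss_between += len(Xg) * (Xg.mean(axis=0) - grand_mean) ** 2
        ss_within += ((Xg - Xg.mean(axis=0)) ** 2).sum(axis=0)
    df_between, df_within = len(groups) - 1, len(X) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical omics layers measured on the same samples.
blocks = {"rna": toy_block(10, 3), "methylation": toy_block(8, 3)}

# 1) Transform the stacked layers into a compact representation.
stacked = np.hstack([zscore(X) for X in blocks.values()])
embedding = pca_embed(stacked, n_components=2)

# 2) Cluster the representation into two risk groups.
risk_labels = two_means(embedding)

# 3) Rank the original features of each layer against the inferred labels.
top_features = {
    name: np.argsort(anova_f(X, risk_labels))[::-1][:3]
    for name, X in blocks.items()
}

# Cluster labels are arbitrary, so agreement is checked up to a label swap.
agreement = max(np.mean(risk_labels == hidden_group),
                np.mean(risk_labels != hidden_group))
print(top_features, agreement)
```

The key design point is that the labels are learned from the transformed representation, while the ANOVA ranking is carried out on the original features, so the selected features remain interpretable within each omics layer.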

An R based software, mixOmics, presents a recent example of a modified concatenation-based approach (Rohart et al., 2017), which it implements in its "Diablo" process. Diablo incorporates all the input variables into a single input matrix. During this process, interactions and associations between data sets are factored into the single input matrix. **Figure 3** represents a schematic of this approach, combining the different omics sets while taking interactions between data sets into account. This adds another level of complexity that suits multi-omics approaches specifically, creating a more powerful input matrix from which the model is derived. The software also contains another pipeline, namely "MINT", which is a model-based integration where data sets of the same type, e.g., transcriptomics, from different studies are combined to produce a model. Each type of data set predicts aspects of the phenotype, as illustrated in **Figure 4**. The predictions determined by each model are weighted based on their propensity to determine the phenotype. These input models are subsequently combined to generate a final prediction model (Rohart et al., 2017).
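The difference between concatenation-based and model-based integration can be made concrete with a small sketch. This is a generic illustration on simulated data and does not reproduce the mixOmics algorithms: ridge regression is used as a simple stand-in model, the weighting by training accuracy is one assumed choice among many, and the block sizes and effect sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
y = np.tile([0.0, 1.0], n // 2)          # hypothetical binary phenotype
train = np.arange(n) < n // 2            # first half trains, second half tests
test = ~train

# Two toy omics blocks; only the first column of each carries signal.
X_a = rng.normal(size=(n, 5))
X_a[:, 0] += 2.0 * y
X_b = rng.normal(size=(n, 8))
X_b[:, 0] += 1.5 * y

def standardize(X):
    """Scale columns using training-set statistics only."""
    mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
    return (X - mu) / sd

def ridge_fit(X, y_true, lam=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + penalty, A.T @ y_true)

def ridge_score(X, w):
    return np.column_stack([np.ones(len(X)), X]) @ w

def accuracy(scores, y_true):
    return float(np.mean((scores > 0.5) == (y_true == 1.0)))

blocks = [standardize(X_a), standardize(X_b)]

# Concatenation-based: one wide input matrix, one model.
X_cat = np.hstack(blocks)
w_cat = ridge_fit(X_cat[train], y[train])
acc_concat = accuracy(ridge_score(X_cat[test], w_cat), y[test])

# Model-based: one model per block, combined with performance-based weights.
models = [ridge_fit(B[train], y[train]) for B in blocks]
train_acc = np.array([
    accuracy(ridge_score(B[train], w), y[train]) for B, w in zip(blocks, models)
])
weights = train_acc / train_acc.sum()
ensemble_scores = sum(
    wt * ridge_score(B[test], w) for wt, B, w in zip(weights, blocks, models)
)
acc_model = accuracy(ensemble_scores, y[test])

print(X_cat.shape, weights, acc_concat, acc_model)
```

Concatenation lets a single model see every variable at once, at the cost of a wide input matrix, whereas the model-based route keeps each data set separate and only merges the predictions.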

Together, these software packages cover each of the major approaches to generating models for integrating omics data. Many developers are constantly updating and upgrading their platforms to incorporate the latest data sets and produce the most effective models (Medina et al., 2010; Alonso et al., 2015). Many concatenation-based approaches have been developed to address the nuances of different data sets (Fridley et al., 2012; Kim et al., 2013). Cancer survival prediction has seen significant success using transformation-based and model-based approaches. This progress has been observed in the use of multiple genomic data sets to accurately assess ovarian cancer survival and in predicting instigators of melanoma from gene expression data and chromosomal copy number variation (Akavia et al., 2010; Kim et al., 2013).

More powerful models are constantly being developed, incorporating ever expanding data sets while integrating even more types of omics data. These advances are essential to keep up with the rapidly increasing volumes of data and to address the current limitations associated with many original data sets (Pinu et al., 2019). Several studies have emphasized the overestimation of significant results and several contradictory outcomes in multi-omics data sets in many high impact publications (Ioannidis, 2005; Ioannidis and Trikalinos, 2005). In these cases, genuine heterogeneity within samples in genome wide association studies is considered statistically important disease specific information (Ioannidis et al., 2003; Kavvoura et al., 2008). Software tools must account for these issues, while avoiding the risk of false negatives caused by correcting the data too harshly. For more information on the challenges associated with integrating these data sets, the reader is directed to the following review articles (Palsson and Zengler, 2010; Gomez-Cabrero et al., 2014; Fondi and Lio, 2015). For the reasons detailed above, biologists bear some responsibility for mitigating these limitations by developing a working understanding of the software model most suitable for their experimental questions.
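The tension between overestimating significant results and correcting too harshly can be illustrated with two standard multiple-testing procedures. The text above does not name a specific correction, so the choice of the Bonferroni and Benjamini-Hochberg procedures here is an assumption made for illustration, and the p-values are invented.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the false discovery rate at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        last = np.max(np.nonzero(passed)[0])   # largest rank meeting its threshold
        reject[order[: last + 1]] = True
    return reject

# Invented p-values for ten hypothetical features.
p_values = [0.001, 0.008, 0.012, 0.019, 0.030, 0.2, 0.5, 0.7, 0.9, 0.95]

n_uncorrected = int(np.sum(np.asarray(p_values) <= 0.05))
n_bonferroni = int(bonferroni(p_values).sum())
n_bh = int(benjamini_hochberg(p_values).sum())
print(n_uncorrected, n_bonferroni, n_bh)
```

On these values the uncorrected threshold accepts the most features, Bonferroni the fewest, and Benjamini-Hochberg falls in between, which is the trade-off between false positives and false negatives described above.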

### CONCLUSIONS


The technologies responsible for generating omics information have seen considerable progress in recent years. The advent of new technologies, such as third generation sequencing, has the capacity to transform the level of information available to researchers. These advances have placed the integration of multiple omics data sets within reach of more scientists. This availability will result in substantial advances in all aspects of microbial work, from delineating specific functions to understanding their role in complex ecosystems. This review assesses the advantages of a high dimensional systems level approach when analyzing organisms and systems simultaneously. The fortuitous position that LAB research finds itself in is also discussed. LAB are a group of microbes with a wealth of data already available in a variety of databases. Due to this position, research focused on this group is poised to take full advantage of the progress in multi-omics research. As the field of molecular biology becomes a data intensive one, it is critical that biologists keep up with this trend. Researchers must develop skills in data processing, develop an understanding of the mechanisms behind the software they use and be flexible in incorporating new technologies into their workflows. This task is challenging, but the rewards available with multi-omics are substantial.

#### AUTHOR CONTRIBUTIONS

SO drafted the initial manuscript. CS and RR provided critique and corrections, and all authors worked together in the construction of the final review.

### FUNDING

We would like to acknowledge funding from the Department of Agriculture, Food and the Marine, Ireland (Project Ref. No: 15 F 602), and in part from a research grant from Science Foundation Ireland under Grant Number SFI/12/RC/2273.




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 O'Donnell, Ross and Stanton. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.