Conversion of Methionine to Cysteine in Lactobacillus paracasei Depends on the Highly Mobile cysK-ctl-cysE Gene Cluster

Milk and dairy products are rich in nutrients and are therefore habitats for various microbiomes. However, the composition of nutrients can be quite diverse, in particular among the sulfur containing amino acids. In milk, methionine is present in a 25-fold higher abundance than cysteine. Interestingly, a fraction of strains of the species L. paracasei – a flavor-enhancing adjunct culture species – can grow in medium with methionine as the sole sulfur source. In this study, we focus on genomic and evolutionary aspects of sulfur dependence in L. paracasei strains. From 24 selected L. paracasei strains, 16 strains can grow in medium with methionine as sole sulfur source. We sequenced these strains to perform gene-trait matching. We found that one gene cluster – consisting of a cysteine synthase, a cystathionine lyase, and a serine acetyltransferase – is present in all strains that grow in medium with methionine as sole sulfur source. In contrast, strains that depend on other sulfur sources do not have this gene cluster. We expanded the study and searched for this gene cluster in other species and detected it in the genomes of many bacteria species used in the food production. The comparison to these species showed that two different versions of the gene cluster exist in L. paracasei which were likely gained in two distinct events of horizontal gene transfer. Additionally, the comparison of 62 L. paracasei genomes and the two versions of the gene cluster revealed that this gene cluster is mobile within the species.


INTRODUCTION
Lactobacillus paracasei belongs to the LAB, which are gram-positive bacteria with lactic acid as the main product of the carbohydrate metabolism. It must be noted that the taxonomic status of L. paracasei has been debated during recent years (Collins et al., 1989;Wayne, 1994; Judicial Commission of the International Committee on Systematics of Bacteria, 2008). Consequently, in many cases, strains described as Lactobacillus casei are more closely related to L. paracasei ATCC 334 (previously known as L. casei ATCC 334) than to the species type strain ATCC 393. In this report, we use the species name L. paracasei, because the strains used in this study are more closely related to L. paracasei ATCC 334 (mean genome similarity 98.6%) than L. casei ATCC 393 (mean genome similarity 78.3%).
Lactobacillus paracasei bacteria are found in various habitats, such as plants, human body, and fermented food (Hammes and Hertel, 2006). It was suggested that the species L. paracasei has a large gene pool that allows the bacteria to adapt to the different habitats (Broadbent et al., 2012;Smokvina et al., 2013). The large gene pool is an asset for industrial applications and in particular for the dairy industry. Contrary to the natural habitats, the emergence of the milk and cheese as habitats is a rather recent event (Salque et al., 2013). These habitats are rich in nutrients like carbohydrates, proteins and lipids compared to natural habitats. However, the diversity of these nutrients is rather limited, as the main constituents of carbohydrates and proteins are lactose and caseins, respectively. Furthermore, in cheese the diversity of bacteria is high (Quigley et al., 2012) and therefore, the competition pressure is high which might be an advantage for fast growing bacteria with small genomes. It was proposed in earlier studies that the adaptation to the milk related habitats caused gene loss in L. paracasei (Cai et al., 2009).
The sulfur containing amino acids methionine and cysteine are a good example for the specificity of milk as a habitat. Bogicevic et al. (2013) described that the methionine concentration excesses the cysteine during the cheese ripening upto a factor of 25. Therefore the ability to convert methionine to cysteine is an important growth advantage. Several of the genes involved in the methionine and cysteine metabolism are already described in L. paracasei. It was shown that genes coding for cystathionine lyase are present in the genomes of several L. paracasei strains (Irmler et al., 2009). It was also shown that an O-acetylhomoserine sulfhydrylase (cysteine synthase) is coded in the genomes of many strains (Bogicevic et al., 2012a). Recent findings showed that a gene cluster of a cysteine synthase, a cystathionine lyase and a serine acetyltransferase (cysK-ctl-cysE gene cluster) is involved in the sulfur metabolism in L. paracasei (Liu et al., 2008;Bogicevic et al., 2012b). However, the genomic and evolutionary basis to convert methionine to cysteine has not been studied in detail in L. paracasei.
The aim of this study was to find genomic elements that allow L. paracasei strains to convert methionine to cysteine and to study their evolutionary origin. Therefore, we investigated strain specific genes using whole genome sequencing and tested if the strains can grow with methionine as the sole sulfur source. With the resulting data, we performed genetrait matching to find the phenotype causing genes. We found that the previously described gene cluster cysK-ctl-cysE (Bogicevic et al., 2012b) is the genomic element that allows a subpopulation of the L. paracasei strains to synthetize cysteine from methionine. Furthermore, we found evidence that the gene cluster is very mobile, and was possibly gained twice by the species by horizontal gene transfer and is mobile within the species.

Sequencing and Genome Assembly
We sequenced the genomes of 40 L. paracasei strains from the Agroscope culture collection originating from dairy products to an average depth of coverage of about 1200× (Supplementary  Table S1). In addition, we used the long-read technology of Pacific Biosciences to sequence six bacterial strains (FAM18149, FAM18121, FAM10859, FAM18099, FAM18101, and FAM3228) selected from the 40 strains. These six nearly complete genomes were used to order the contigs of the other genomes which allowed us to better perform comparative genomics tasks. It also allowed us to identify if a strain has plasmids. We applied a hybrid assembly with Illumina short reads and PacBio long leads, as the long reads generated with the PacBio C2 chemistry do not allow the use of the non-hybrid HGAP method (Chin et al., 2013). Nevertheless, we obtained almost complete genome assemblies, which consist of only a few scaffolds (3-21 scaffolds, Supplementary Table S1). The remaining 34 strains were assembled using Illumina short reads only. The median n50 of the assemblies is 63,528 bp which is comparable with the results of similar studies performed with L. paracasei (Broadbent et al., 2012;Smokvina et al., 2013). The scaffolds of the genome assemblies of these 34 strains were ordered according to the six nearly complete genome assemblies as well as publicly available complete genome assemblies. We performed ab initio annotation of all 40 genomes to identify the CDSs using Prokka (Seemann, 2014). The number of CDSs per genome ranges from 2,501 (FAM18132) to 3,078 (FAM3257) with a median of 2,930 CDSs (Supplementary Table S1).

Gene-Trait Matching
We tested 23 sequenced strains and the reference strain ATCC 334 of L. paracasei (Makarova et al., 2006) if they can grow in medium with methionine as sole sulfur source. We incubated the strains for 25 h in CDM with methionine as sole sulfur source and measured the OD600 of the 24 strains. The experiment was performed in triplicates. Strains that reached a mean OD600 of one or higher were considered as growth positive. In total, 16 out of 24 strains could grow under this condition (Figure 1). The strains were also grown in medium containing cysteine and methionine and all grew to an OD600 > 1 (data not shown). To compare the phenotype with the genotype, we first determined OGCs shared between the strains using OrthoMCL (Li et al., 2003). In total, we found 5,406 OGCs within the 24 strains. We tested all 5,406 OGCs using Fisher's exact test (Fisher, 1922) for overrepresentation in strains that can or cannot grow with methionine as sole sulfur source. This resulted in three OGCs that are significantly (p-adj. < 0.05) overrepresented in strains that can grow with methionine as sole sulfur source. These three genes build the cysK-ctl-cysE gene cluster and are neighboring in all 16 strains that can grow with methionine as sole sulfur source. The cysK-ctl-cysE gene cluster contains a cysteine synthase, a cystathionine lyase and a serine acetyltransferase (Bogicevic et al., 2012b). Worth mentioning is the fact that we found an ortholog of cysE in the strain FAM3257 that cannot grow without cysteine, whereas cysK FIGURE 1 | Growth of various L. paracasei strains in a chemically defined medium with methionine as the sole sulfur source. The bars and black lines represent the mean and standard deviation of the reached OD600 after 25 h of growth, respectively, of three independent biologically repeated measurements. Strains that reached an OD600 of one or more were consider as being able to grow in a medium with methionine as sole sulfur source. and ctl are exclusively present in strains that can grow without cysteine.

Phylogeny of the cysK-ctl-cysE Gene Cluster
To study the evolutionary history of the cysK-ctl-cysE gene cluster, we extended our dataset of 24 strains with 16 additional strains from the Agroscope strain collection in order to have a larger dataset. We searched the 40 sequenced L. paracasei genomes and the non-redundant nucleotide databases from GenBank for this gene cluster. Of the 40 strains, 11 strains carried the cysK2-ctl1-cysE2 and 13 strains carried the cysK3-ctl2-cysE3 gene cluster. In addition, we identified the gene clusters in 166 different nucleotide sequences of the non-redundant nucleotide database from GenBank, which were mainly from species found in fermented food. We built a phylogenetic tree of these sequences and found that the gene cluster of L. paracasei is spread over two separate branches (Figure 2A). A previous study described that two versions of the gene cluster exist in L. paracasei and they were called cysK2-ctl1-cysE2 and cysK3-ctl2-cysE3 (Bogicevic et al., 2012b), respectively. These do correspond to the gene cluster we have identified. Interestingly, cysK2-ctl1-cysE2 clusters with the gene clusters of S. thermophilus, L. helveticus, L. fermentum, and L. delbrueckii, whereas cysK3-ctl2-cysE3 clusters with the ones of L. rhamnosus, L. casei, L. gallinarum, and L. pseudomesenteroides. From the 166 sequences containing the gene cluster, 65 are from a completed chromosome or plasmid assembly. Interestingly, only the clusters that are closely related with cysK2-ctl1-cysE2 and in L. casei ATCC 393 (cysK3-ctl2-cysE3) are located on a plasmid (Figure 2A).
To study the genomic environment of the gene cluster in the different species, we depicted all neighboring genes of all complete chromosomes and plasmids ( Figure 2B). The figure shows that the neighboring genes are highly conserved, if the clusters are closely related. However, we also found that the gene cluster has many neighboring transposases in most Lactobacillus and Streptococcus species. In contrast, in the Leuconostoc and Lactobacillus plantarum, we found that almost no transposases are in close proximity to the gene cluster.
The Phylogeny of L. paracasei As we found two different versions of the cysK-ctl-cysE gene cluster, we studied the correlation of the presence of the two versions of the gene cluster with the phylogenetic relation of the strains. We extended our 40 strains with 17 published genomes from strains derived from diverse habitats (Broadbent et al., 2012) and the genomes of five previously published L. paracasei genomes, which are completely assembled and have a cysK-ctl-cysE gene cluster. We built GCs for all translated CDSs of all 62 strains using Roary (Page et al., 2015). Roary is more sensitive regarding paralog separation than OrthoMCL. Interestingly, the genes of the cysK2-ctl1-cysE2 and cysK3-ctl2-cysE3 gene clusters are not considered as orthologs in this analysis. In total, the size of the pan genome is 11,613 GCs and the size of the core genome is 1,305 GCs. Based on the core genome, we constructed a phylogenetic tree ( Figure 3A). We found that most strains that carry the cysK2-ctl1-cysE2 version of the gene cluster are (B) The colors of the different fields in the heat map represent the numbers of shared GCs of two strains of the accessory genome. Based on these numbers, the Euclidean distance was calculated to build the dendrogram. The dendrogram separates the strains into group 1 and group 2. The purple square indicates the strains located on the phylogenetic branch where the strains carry cysK2-ctl1-cysE2 gene clusters.
Frontiers in Microbiology | www.frontiersin.org located on a separated branch or appear in some single strains. The strains LcA, LcY, LC2W, W56, and BD-II do also carry the cysK2-ctl1-cysE2 gene cluster and build their own branch in the phylogenetic tree. However, the phylogenetic distance between these strains is very close and they might even be clones of each other. Worth mentioning is that the strain BL23 is also very closely related to these strains but does not have the gene cluster. We clustered the 62 strains based on their gene pool to study the diversity of the gene repertoire among the different strains. We constructed a dendrogram of the strains based on the number of shared GCs (Figure 3B) using the Euclidean distance as a measure (Warnes et al., 2014). The analysis resulted in a separation of the strains into a large group with 45 strains and smaller group with 17 strains. The comparison of the dendrogram based on the Euclidean distance of the gene content with the phylogenetic tree showed that the 17 strains from the smaller group lie on one branch in the phylogenetic tree (Figure 3A), which contains many strains with the cysK2-ctl1-cysE2 gene cluster. This shows that the strains adapted their genepools during the separation into the two groups.

Conservation of the Genomic Context of the cysK-ctl-cysE Gene Cluster in L. paracasei
We identified cysK-ctl-cysE as the main genomic element that allows L. paracasei to produce cysteine from methionine. To minimize the possibility that we have missed other important genes, we studied genes that might have been transferred within the species along with the cysK-ctl-cysE gene cluster. Therefore, we analyzed the surrounding genes of the gene cluster in more detail. As the gene cluster is flanked by transposases (Figure 2B), that may lead to breaks in the genome assembly, we searched for orthologs of the neighboring genes of completely assembled strains: BD-II, CAUH35, LC2W, LcA, LcY, and W56 for the cysK2-ctl1-cysE2 cluster and N1115 for the cysK3-ctl2-cysE3 cluster.
In contrast to the cysK2-ctl1-cysE2 cluster, the cysK3-ctl2-cysE3 is located on the chromosome and not a plasmid. For strain N1115, we found that most genes of the surrounding 70 kb are not present in any of the other 61 strains (Figure 4). The comparison with the Islandviewer 4 (Langille and Brinkman, 2009) showed that N1115 has predicted genomic islands from 341,661 to 363,478 and a second one from 377,342 to 408,141. Interestingly, we did not find larger parts of the genomic islands within the genomes of any of the other strain, not even in the strains that carry the cysK3-ctl2-cysE3 gene cluster.
In the phylogenetic analysis, we found that LcA, LcY, BD-II, W56, and LC2W are closely related. The comparison of the plasmids of these five strains shows that not only the gene cluster but also the other genes on the plasmids are conserved within these strains, except for LC2W (Figure 4). The plasmid of LC2W is much smaller than the plasmids of the other four strains. This indicates that the plasmid is rather unstable, as the difference in size is enormous between such closely related strains. By expanding the analysis to all strains that carry the cysK2-ctl1-cysE2, we found that barely any region of the plasmids is conserved in other strains. Next to the gene cluster also the first 20 kbp are conserved within most strains. Interestingly, a large part of the plasmid is also conserved within strains of the branch that do not carry the gene cluster (UW1, FAM18119, FAM18157, and FAM18113).

DISCUSSION
To be a successful bacteria species in a medium such as milk, which is rich in the amount of nutrients, but rather limited in its diversity, it is indispensable for the bacterium to metabolize these substrates into essential compounds. This is especially true for sulfur containing amino acids, as it was shown by Bogicevic et al. (2013) that methionine is 25 times more abundant than cysteine during the cheese ripening process. Therefore, the ability to convert methionine to cysteine is a selective growth advantage.
In this study, we found that the cysK-ctl-cysE gene cluster is the most important gene cluster that allows L. paracasei strains to grow with methionine as sole sulfur source. The genetrait matching analysis clearly showed that all three genes are required for the conversion of methionine to cysteine and that they are conserved as a unit. We conclude that the gain of the cysK-ctl-cysE gene cluster is sufficient for L. paracasei strains to acquire the ability to synthesize cysteine from methionine. A previous study proposed this role of the gene cluster for S. thermophilus and L. bulgaricus (Liu et al., 2009). Additionally, a model was published that explained the role of ctl in the volatile sulfur compounds production (Bogicevic et al., 2013). Finally, we recently published that the cysK-ctl-cysE is massively up regulated in L. paracasei, if no cysteine is present in the growth medium (Wüthrich et al., 2018).
As we searched for this gene cluster in other species, we found that it is not only present in L. paracasei, but also in many species that are also used in the food production. This is more evidence that this gene cluster may be transmitted between species by horizontal gene transfer, as it was already suggested in a previous study (Liu et al., 2009). In Leuconostoc species, we found that the gene cluster has almost no flanking transposases and was found in several species in the genus. Also, the genomic context was conserved within the species. Therefore, we propose that the gene cluster is not very mobile within the Leuconostoc and ancestor of this genus may also be the origin of this gene cluster. In contrast, all Lactobacilli, except for L. plantarum, show many transposases in the genomic region of the cysK-ctl-cysE gene cluster. We suggest that this may lead to a highly mobile gene cluster.
Within the species L. paracasei, we found that two different versions of the cluster exist (Figure 2A). We assume that these two versions were introduced in two different events of horizontal gene transfer, as they are closely related to gene clusters in other species and are separated within the phylogenetic tree of the studied L. paracasei strains (Figure 3A). While the cysK3-ctl2-cysE3 cluster is found in many branches of the phylogenetic tree, the cysK2-ctl1-cysE2 gene clusters are located on one branch ( Figure 3A). We also found the cysK2-ctl1-cysE2 gene cluster in the strains CRF28, FAM19404 as well as in LcA, LcY, BD-II, FIGURE 4 | Positions of unconserved regions. Visualization of conservation of the genes around cysK-ctl-cysE gene cluster. Orthologs were identified using the paralog aware algorithm Roary (Page et al., 2015). The X-axis represents the genomic position on the chromosome (upper panel) or the position on the plasmid (lower panel) of the different protein coding genes. Each green line represents the genes found in the genomes as ortholog. The position of the cysK-ctl-cysE gene cluster is indicated in red. The length of a branch represents the phylogenetic distance. The black dots at the branching events indicate a bootstrap support of at least 80%.
W56, and LC2W. Because we found that all identified versions of the cysK2-ctl1-cysE2 gene cluster are coded on a plasmid and therefore are mobile, we concluded that this gene cluster was horizontally transferred within the species.
The comparison of the gene pool of the strains of the branch with the strains carrying the cysK2-ctl1-cysE2 gene cluster and the rest of the 62 strains showed that they have a clearly separated gene pool (Figure 3B). A similar separation was found in studies about Oenococcus oeni (Campbell-Sills et al., 2015) and Streptococcus pneumoniae (Hilty et al., 2014) in which the authors concluded that it was induced by an alteration of the habitat. However, with the cysK2-ctl1-cysE2 gene cluster, we found a good example that there is an exchange of these separated groups by horizontal gene transfer.
Finally, we could show the differences between the two popular ortholog finding algorithms Roary (Page et al., 2015) and OrthoMCL (Li et al., 2003). While the genes of the cysK2-ctl1-cysE2 and the cysK3-ctl2-cysE3 were not considered as orthologs using Roary, they were classified as orthologs by OrthoMCL, although, the two genes cluster were not gained in the same evolutionary event. Therefore, we conclude that the strength of Roary is to cluster genes from the same evolutionary events, whereas the strength of OrthoMCL is to identify genes with similar functions (Ding et al., 2018). However, in this study we tested only the default parameters of the algorithms.

Bacterial Strains, Media, and Growth Conditions
Lactobacillus paracasei strains were obtained from the Agroscope culture collection in Liebefeld (Berne, Switzerland) and were maintained in MRS broth (De Man et al., 1960). The requirement for cysteine was assayed by inoculating a chemically defined medium (Christensen and Steele, 2003) devoid of cysteine. Optical density at 600 nm (OD600) was determined with a spectrophotometer (LKB Biochrom 4050 Ultrospec II).
The sequencing was performed on an Illumina HiSeq 2000 (Illumina Inc., San Diego, CA, United States) using TruSeq v3 chemistry.

Library Preparation and PacBio Sequencing
High molecular weight of DNA from the L. paracasei strains was sheared in a Covaris instrument (Covaris, Woburn, MA, United States) to 10 kb fragments, and the DNA size distribution was checked on a fragment analyzer (Advanced Analytical Technologies, Ames, IA, United States). Then, 5 µg of the sheared DNA was used to prepare a SMRTbell library using the PacBio DNA Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, United States), according to the manufacturer's recommendations. The library was sequenced using one SMRT cell v2 with C2 chemistry on a PacBio RSII system (Pacific Biosciences, Menlo Park, CA, United States) that was given a movie length of 90 min.

Genomes
The data (assembled genomes and annotation) of the 40 sequenced strains is available as BioProject under the number
The scaffolds were ordered using mauve (snapshot 2013-06-07, default parameters) (Darling et al., 2010) based on the most closely related reference genome from the dendrogram of Figure 2B.

Gene-Trait Matching
All CDSs of the different strains were clustered into OGCs using OrthoMCL (version 2.0.9, default parameters) (Li et al., 2003). Every OGC was tested for significant association with the growth phenotype in medium devoid of cysteine. Therefore, the presence or absence of every OGC was counted for the strains that grew and did not grow in medium devoid of cysteine, respectively. The p-values were calculated using Fisher's exact test (Fisher, 1922). The resulting p-values were corrected for multiple testing using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995).

Search of the cysK-ctl-cysE Gene Cluster in Other Species
To find the cysK-ctl-cysE gene cluster, the translated CDS sequences of the three genes of strains FAM18149 were aligned against the non-redundant nucleotide sequence of NCBI (March 2016) using tblastn (version 2.6.0+, default parameter) (Altschul et al., 1990). To find the cysK-ctl-cysE gene cluster in other genomes two selection steps were performed. First, the genome must have homology to all three genes within a window of 20 kbp. Second, the homologous parts must have the same gene order as the cluster in L. paracasei. We aligned all sequence using MAFFT (version 7.187, default parameters) (Sievers et al., 2011). Finally, we computed the phylogenetic tree using RaxML (version 8.2.9, options: -f a -m GTRGAMMA -# 1000) (Stamatakis, 2006) based on the merged multiple alignments.
We selected the corresponding CDSs from the core genome GCs that contained only a single ortholog from each of the studied bacterial strains. We aligned these CDSs of all GCs using MAFFT (version 7.187, default parameters) (Sievers et al., 2011). Finally, we computed the phylogenetic tree using RaxML (version 8.2.9, options: -f a -m GTRGAMMA -# 1000) (Stamatakis, 2006) based on the merged multiple alignments.

ETHICS STATEMENT
Ethical approval was not required for our study as we exclusively used natural bacterial strains.

AUTHOR CONTRIBUTIONS
All authors conceived and designed the study, and read and approved the final manuscript. SI and HB performed experiments. DW and RB performed bioinformatics analyses. DW, SI, and RB wrote the manuscript.

FUNDING
This work was supported by Agroscope and by the Canton of Bern.

ACKNOWLEDGMENTS
We thank Michèle Ackermann and Muriel Fragnière for providing excellent technical support and the Next Generation Sequencing Platform of the University of Bern for performing the high-throughput sequencing experiments.