Recovering genomic clusters of secondary metabolites from lakes: a Metagenomics 2.0 approach

Background Metagenomic approaches have become increasingly popular in the past decades due to the decreasing costs of DNA sequencing and the development of bioinformatics. So far, however, the recovery of long genes coding for secondary metabolism still represents a big challenge. Often, the quality of metagenome assemblies is poor, especially in environments with a high microbial diversity, where sequence coverage is low and the complexity of natural communities is high. Recently, new and improved algorithms for binning environmental reads and contigs have been developed to overcome such limitations. Some of these algorithms use a similarity detection approach to classify the obtained reads into taxonomical units and to assemble draft genomes. This approach, however, is quite limited, since it can only classify sequences similar to those available (and well classified) in the databases. In this work, we used draft genomes from Lake Stechlin, north-eastern Germany, recovered by MetaBat, an efficient binning tool that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. These genomes were screened for secondary metabolism genes, such as polyketide synthases (PKS) and non-ribosomal peptide synthetases (NRPS), using the Anti-SMASH and NAPDOS workflows.

Results With this approach we were able to identify 243 secondary metabolite clusters from 121 genomes recovered from the lake samples. A total of 18 NRPS, 19 PKS and 3 hybrid PKS/NRPS clusters were found. In addition, it was possible to predict the partial structure of several secondary metabolite clusters, allowing for taxonomical classifications and phylogenetic inferences.

Conclusions Our approach revealed a great potential to recover and study secondary metabolite genes from any aquatic ecosystem.

[...] CheckM). [...] clusters in all bins. [...] which NRPS, PKS and hybrid PKS/NRPS clusters were found. Red bar and pie: NRPS; blue bar and pie: type I PKS; green bars and pie: hybrid clusters (NRPS-PKS and PKS-NRPS).

A total of 43 condensation (C) domains were obtained from NRPS clusters. All these sequences were submitted to NAPDOS analysis. Figure 2a shows the classification of C [...]

[...] are biosynthetic assembly lines that include both PKS and NRPS components; PUFA: polyunsaturated fatty acids (PUFAs) are long-chain fatty acids containing more than one double bond, including omega-3 and omega-6 fatty acids; [...]

The screening for type I PKS resulted in 9 KS domain sequences. Most of them are classified as modular type I PKS (56%). All of them were submitted to NAPDOS and [...]

All the KS and C domains were also submitted to similarity analysis using BLASTP against the RefSeq database (Tables S4 and S5), and the best 3 hits of each sequence were [...]

We highlight 3 bins (with less than 35% contamination and more than 70% completeness) out of the 15 bins in which type I PKS and/or NRPS clusters were obtained, and explore their clusters.
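The bin selection criterion stated above (more than 70% completeness and less than 35% contamination) can be illustrated with a small filter. This is only a minimal sketch, not the pipeline used in the study: the `BinQuality` record and the `passes_quality` helper are hypothetical names, the completeness values for bins 193 and 1 are those quoted in the text, and the contamination values are placeholders because they are not reported in this section.

```python
from dataclasses import dataclass

# Thresholds as stated in the text.
MIN_COMPLETENESS = 70.0   # percent, exclusive lower bound
MAX_CONTAMINATION = 35.0  # percent, exclusive upper bound


@dataclass(frozen=True)
class BinQuality:
    """Hypothetical record holding quality estimates for one genome bin."""
    name: str
    completeness: float   # percent
    contamination: float  # percent


def passes_quality(b: BinQuality,
                   min_completeness: float = MIN_COMPLETENESS,
                   max_contamination: float = MAX_CONTAMINATION) -> bool:
    """Return True if the bin is complete enough and not too contaminated."""
    return (b.completeness > min_completeness
            and b.contamination < max_contamination)


bins = [
    # Completeness from the text; contamination is a placeholder value.
    BinQuality("bin_193", 73.37, 5.0),
    # Completeness from the text; contamination is a placeholder value.
    BinQuality("bin_1", 69.54, 5.0),
    # Entirely hypothetical bin that fails on contamination.
    BinQuality("bin_hypothetical", 90.0, 40.0),
]

selected = [b.name for b in bins if passes_quality(b)]
print(selected)
```

With these inputs only `bin_193` is retained: bin 1 falls just below the completeness threshold, and the hypothetical third bin exceeds the contamination threshold.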

In cluster 2 (ctg181), multiple domains of NRPS (with the 3 minimal modules) and [...]

[...] main annotations are given. CAL: co-enzyme A ligase domain, C: condensation, A: adenylation, E: epimerization, TE: termination, KR: ketoreductase domain, and ECH: enoyl-CoA hydratase.

All clusters show a high similarity with Pseudomonas proteins. Cluster 2 has a similarity of 92% with Pseudomonas synxantha BG33R, also conserving the gene synteny. The C domain sequences were submitted to NAPDOS analysis and 2 were classified as [...]

In cluster 3 (ctg415) (Figure 5), in addition to the NRPS domains, the following transporter [...]

Cluster 6 (ctg857, Figure 5) shows many NRPS domains, regulatory factors and transporter genes, including drug resistance genes, e.g. SMCOG1005 (drug resistance transporter, EmrB/QacA), SMCOG1044 (ABC transporter, permease protein) and [...] (confidence value 100).

In bin 193 (Mycobacterium, 73.37% completeness) a type I PKS cluster was identified [...] Two further clusters were recovered: one type III PKS and one unclassified. All [...] Maduropeptin and Neocarzinostatin pathways. Using clusterblast inside Anti-Smash it was not possible to find any similar cluster, but using BLASTP it was possible to find similarity [...]

In addition, there are 12 more bins with NRPS or type I PKS clusters, but with less than 70% completeness or more than 35% contamination. The bins 1 (69.54% completeness) [...] family, also from Burkholderiales order) shows one NRPS cluster. The Anti-Smash results for all bins are available in the Supplemental Information (SI 1).

The field of metagenomics has generated a vast amount of data in the last decades [34].

Most of the data is poorly annotated and little quality controlled when loaded into the [...] Recently, some new algorithms have been proposed to overcome these limitations and to [...] for all the genes, including promoters and transporters.

Screening the bins for secondary metabolite clusters, we can see that the most abundant cluster belongs to the terpene pathway (125 clusters) (Figure 1). This biosynthesis pathway is well known to be present in many plant and fungal genomes, but recently it was proposed to be also widely distributed in bacterial genomes. One study revealed 262 distinct terpene [...] suggested as a viable alternative to traditional antibiotics and can be used as narrow- [...]

In this study, we focused on 2 families of large modular secondary metabolite genes, type I [...] bond between two L-amino acids [50]. A previous study also found that the LCL class was [...] phylogenetic classification. It was also possible to recover 9 KS domains from the type I PKS clusters, 56% from the modular class and 22% from the hybrid PKS/NRPS class. Those classes are larger (with many copies of each domain) than the iterative ones, increasing the chances of being recovered by metagenomic approaches. Accordingly, the [...] phylogeny, and partial metabolite protein structure predictions.

[...] domains similar to Syringomycin; however, it is more likely that the product encoded by this cluster is functionally close to the latter pathway.

In bin 34, cluster 6, both C domains were classified as belonging to the Pyochelin [...] Pseudomonas genome, providing it with an "arsenal" of secondary products, increasing the likelihood of the Pseudomonas species succeeding in aquatic systems.

Bin 131 (unclassified bacteria) shows a PKS cluster and 3 domains. It was classified as [...] with an identity of 64%, suggesting that it encodes a new compound which has not previously been described.

[...] (NTM) that can be found in different environments, but both can also be opportunistic [...]

To assess the lifestyle of the bins (free-living or particle-associated), we calculated the relative abundance of the bins in every sample. A total of 158 bins with a significant difference between the 3 groups were found (Table S7); however, from the 15 bins on which [...] and Old_b9), accounting for 20-25% (bin 1) and 10-15% (bin 2) in these samples. The [...] monitoring program of IGB on Lake Stechlin, by the fact that these samples were collected during the occurrence of a massive cyanobacterial bloom.

Among the other bins containing PKS/NRPS clusters, bins 6, 7, and 8 (Planktothrix) show no significant difference between the FL and PA groups (p-value > 0.05), but they are clearly more abundant in NSF. A possible explanation is that the NSF samples were collected during a mesocosm experiment, whereas the other samples were directly derived from the lake.

Using the Metagenomics 2.0 approach, we were able to recover full megasynthases [...] limitations, e.g., the genomic coverage of less abundant organisms and the possibility of [...] pathways and their evolution in detail. Thus, allowing cloning and expressing these clusters [...] possible to obtain such sequences and to synthesise the full cluster for heterologous expression, skipping the cloning and functional screening process and saving considerable time and money. In addition, the current work highlights the great potential for the discovery of new metabolically active compounds in freshwaters such as oligo-mesotrophic Lake Stechlin. Further, the study of complete or near-complete genomes from uncultivated bacteria in the natural environment will enable us to better understand the multiple forms of interactions between species and how they compete for limiting natural resources.

The sequences generated for this study (metagenomic reads) were deposited in ENA (PRJEB22274 and PRJEB7963).

The authors declare no competing interests.

This study was supported by the Science without Borders Program (Ciência Sem Fronteiras), CNPq. DI and HPG were funded by the German Research Foundation (DFG) projects Aquameth (GR1540/21-1) and Aggregates (GR1540/28-1).

[...] revised it for significant intellectual content.