High-Throughput Sequencing and Unsupervised Analysis of Formyltetrahydrofolate Synthetase (FTHFS) Gene Amplicons to Estimate Acetogenic Community Structure

The formyltetrahydrofolate synthetase (FTHFS) gene is a molecular marker of choice to study the diversity of acetogenic communities. However, current analyses are limited due to lack of a high-throughput sequencing approach for FTHFS gene amplicons and a dedicated bioinformatics pipeline for data analysis, including taxonomic annotation and visualization of the sequence data. In the present study, we combined the barcode approach for multiplexed sequencing with unsupervised data analysis to visualize acetogenic community structure. We used samples from a biogas digester to develop proof-of-principle for our combined approach. We successfully generated high-throughput sequence data for the partial FTHFS gene and performed unsupervised data analysis using the novel bioinformatics pipeline “AcetoScan” presented in this study, which resulted in taxonomically annotated OTUs, phylogenetic tree, abundance plots and diversity indices. The results demonstrated that high-throughput sequencing can be used to sequence the FTHFS amplicons from a pool of samples, while the analysis pipeline AcetoScan can be reliably used to process the raw sequence data and visualize acetogenic community structure. The method and analysis pipeline described in this paper can assist in the identification and quantification of known or potentially new acetogens. The AcetoScan pipeline is freely available at https://github.com/abhijeetsingh1704/AcetoScan.

This physiological diversity and versatility enables acetogens to survive in different types of habitats and compete with other microorganisms (Drake et al., 1997; Drake and Küsel, 2003). However, the phylogenetic heterogeneity of acetogens hampers their detection and identification based on the most commonly used molecular markers. For example, designing acetogen-specific primers for the 16S rRNA gene is almost impossible (Drake, 1994; Drake et al., 2002, 2008, 2013; Lovell and Leaphart, 2005). Enzymes involved in the WLP, especially formyltetrahydrofolate synthetase (FTHFS), are structurally and functionally conserved, and the gene sequences of these enzymes have been used in numerous studies to target and identify potential acetogens in complex microbial communities (e.g., Lovell and Hui, 1991; Ragsdale and Wood, 1991; Lovell, 1994; Leaphart and Lovell, 2001; Lovell and Leaphart, 2005; Ragsdale and Pierce, 2008; Hori et al., 2011; Müller et al., 2016; Shin et al., 2016). Despite this, a high-throughput analysis pipeline for this group of organisms has not yet been established. Most, if not all, previous studies using FTHFS gene amplicons relied on clone library construction, low-throughput sequencing and manual evaluation against small reference datasets (Gagen et al., 2010; Henderson et al., 2010; Hori et al., 2011; Planý et al., 2019). This process is time- and resource-intensive and, in the absence of a high-throughput solution for analysis of FTHFS gene amplicons, not suitable for rapid analysis of a large number of samples (Leaphart and Lovell, 2001; Xu et al., 2009; Gagen et al., 2010). Lack of a dedicated information resource/database for acetogen-specific homology and taxonomic comparisons has also restricted the development of high-throughput methods (Lovell and Hui, 1991; Xu et al., 2009; Gagen et al., 2010; Planý et al., 2019).
Therefore, we recently published AcetoBase, an information resource for FTHFS data, which can assist in high-throughput analysis of acetogenic community diversity and taxonomic assignments of sequence reads from high-throughput sequencing platforms.
The goal of this study was to present a proof-of-principle for high-throughput sequencing of FTHFS gene amplicons. More specifically, the aim was to set up and validate our bioinformatics pipeline "AcetoScan" for unsupervised analysis of raw high-throughput sequence data using our previously developed resource "AcetoBase". For the analyses, samples from a biogas digester were selected. Biogas/anaerobic digester environments are well studied in terms of overall bacterial community composition, dynamics and acetogens. Previous studies with the bioreactors used in the present study have also confirmed a diverse acetogenic community, making them suitable for the evaluation (Westerholm et al., 2015; Müller et al., 2016).

Sample Collection and DNA Extraction
Samples were taken at five different time points (dates: 150303, 150414, 150519, 150709, and 151117) from a continuous stirred-tank biogas reactor (GR2) in the Anaerobic Microbiology and Biotechnology Laboratory, Swedish University of Agricultural Sciences, Uppsala. The reactor was operated with mixed food waste at 37 °C, an organic loading rate of 2.5 ± 0.42 g VS L⁻¹ day⁻¹ and NH₄⁺-N 5.4 g/L, while other operating parameters were as described for reactor D TE 37 by Westerholm et al. (2015). Genomic DNA extraction was performed with the FastDNA™ Spin Kit for Soil (MP Biomedicals, 2015) with an additional wash step with 500 µL 5.5 M guanidine thiocyanate (Sigma-Aldrich, 2020) for humic acid removal. Reactor and DNA samples were stored at −20 °C until further processing.

PCR Amplification, Library Preparation and Sequencing
Partial FTHFS gene amplicons were generated by the primer pairs and PCR protocol developed by  with the modifications FTHFS_fwd (5′-CCIACICCISYIGGNGARGGNAA-3′) and FTHFS_rev (5′-ATITTIGCIAAIGGNCCNSCNTG-3′). FTHFS amplicons were purified by E-Gel® iBase™ Power System (Invitrogen, 2012) and E-Gel® EX with SYBR® Gold II, 2% SizeSelect pre-cast agarose gels (Invitrogen, 2014). Sequencing libraries were prepared from 20 ng purified FTHFS amplicons using the ThruPLEX DNA-seq Prep Kit (Takara Bio USA, 2017) according to the manufacturer's protocol. Library preparation and sequencing were performed by the SNP&SEQ Technology Platform at the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory in Uppsala. Paired-end sequencing was performed on Illumina MiSeq with 300 base pairs (bp) read length using v3 sequencing chemistry (UGC, 2018).

Development and Implementation of the AcetoScan Pipeline
AcetoScan is a dedicated data analysis pipeline for FTHFS gene sequences. It is a fully automated pipeline for data filtering, taxonomic annotation, phylogenetic tree reconstruction and visualization of sequence data. Input sequences can be in fasta format or compressed fastq format. AcetoScan is primarily tested with raw sequence data in compressed fastq format from the Illumina MiSeq platform. AcetoScan allows the user to run the analysis with raw sequence data using the command acetoscan based on parameters of the user's choice (read type, quality threshold, clustering threshold, minimum cluster size, e-value and phylogenetic tree bootstraps). If the user does not specify any analysis parameters, default parameters are used instead (see AcetoScan user manual). The FTHFS gene amplicon generated with the primers used in the present study has an average length of ∼635 bp. Illumina MiSeq paired-end sequencing generates at most 2 × 300 bp (600 bp) of sequence per read pair, which is shorter than the amplicon, so the forward and reverse reads do not overlap and, after the quality control step, it is practically impossible to merge them. Therefore, AcetoScan is at present optimized for data analysis of only one type of sequencing reads at a time, either forward reads (R1) or reverse reads (R2).
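The read-merging constraint can be checked with simple arithmetic. The following is a minimal sketch, not part of AcetoScan; the function names and the minimum overlap of 10 bp are illustrative assumptions.

```python
def merge_overlap(amplicon_len: int, read_len: int) -> int:
    """Bases shared by the forward and reverse read of one amplicon.

    A negative value means the two reads leave an uncovered gap
    in the middle of the amplicon.
    """
    return 2 * read_len - amplicon_len


def can_merge(amplicon_len: int, read_len: int, min_overlap: int = 10) -> bool:
    """True if the read pair overlaps by at least min_overlap bases.

    min_overlap=10 is a hypothetical value chosen for illustration only.
    """
    return merge_overlap(amplicon_len, read_len) >= min_overlap


# ~635 bp FTHFS amplicon sequenced as 2 x 300 bp:
print(merge_overlap(635, 300))  # -35, i.e. a 35 bp gap between the reads
print(can_merge(635, 300))      # False
```

With these numbers the read pair falls about 35 bp short of spanning the amplicon even before any quality trimming, which is why forward and reverse reads are analyzed separately.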
AcetoScan carries out data analysis in four major steps (Table 1). In the first step, raw sequences are subjected to primer sequence trimming and quality filtering. The second step comprises dereplication, denoising, chimera filtering, operational taxonomic unit (OTU) cluster generation, filtering of non-target sequences, best-frame analysis and taxonomic assignments. In this step, AcetoScan uses the AcetoBase reference protein dataset for selection of the target sequences and taxonomic annotations; as listed in Table 1, filtering of non-target sequences and best-frame analysis rely on Bioperl (Stajich et al., 2002), NCBI-blast+ (Camacho et al., 2009) and AcetoBase. In the third step, multiple sequence alignment of the OTU sequences is performed, followed by generation of a phylogenetic tree. The last step is data visualization, in which all the results generated in steps two and three are rendered in different plots and interactive graphs. The AcetoScan pipeline can also analyze FTHFS gene sequences in fasta format. In addition to the command acetoscan, there are three commands which can be used to process the FTHFS sequence data. The command acetocheck takes fasta format sequences as input, filters out any non-FTHFS sequences and performs reading frame analysis to select the longest reading frame without internal stop codons. If the user wants to taxonomically annotate the FTHFS fasta sequences, the command acetotax can be used, which is acetocheck followed by taxonomic annotation of the filtered sequences. If a user wants to prepare a phylogenetic tree, the command acetotree can be used, which is acetotax followed by phylogenetic tree generation. These commands are implemented for better utilization of time and computational resources when analyzing large fasta sequence datasets. The AcetoScan pipeline can also be used as a Docker container for operating system independent and software installation free data analysis.
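To illustrate the reading frame analysis described for acetocheck, the sketch below translates a nucleotide sequence in all six frames with the standard genetic code and keeps the longest frame without an internal stop codon. It is a simplified stand-in written for this text, not the pipeline's own code, which relies on Bioperl and NCBI-blast+; the function names and the tie-breaking rule (first frame found wins) are assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown or ambiguous codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def best_frame(seq: str):
    """Return (frame, protein) for the longest frame without an internal stop.

    Frames are numbered +1..+3 (forward strand) and -1..-3 (reverse strand).
    Returns None if every frame contains an internal stop codon.
    """
    seq = seq.upper()
    best = None
    for strand, strand_seq in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            protein = translate(strand_seq[offset:])
            if "*" in protein[:-1]:  # a stop anywhere but the last codon
                continue
            if best is None or len(protein) > len(best[1]):
                best = (strand * (offset + 1), protein)
    return best


print(best_frame("TAAATGGCC"))  # frame +1 starts with a stop, so a reverse frame wins
```

In practice, frames of equal length cannot be told apart by this rule alone, which is one reason a comparison against a reference protein dataset such as AcetoBase is useful.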

In silico Mock Community Construction
To evaluate the AcetoScan pipeline for different analysis parameters, in silico mock communities were constructed. Three datasets were generated to target different taxonomic levels, i.e., phylum, family and genus. For the construction of the phylum-level mock community, the 11 most abundant phyla in AcetoBase were identified and full-length nucleotide sequences were retrieved. For the family-level mock community, full-length nucleotide sequences were collected from the 29 most abundant families in AcetoBase. Finally, for the genus-level mock community, full-length nucleotide sequences from 40 known acetogens (Supplementary Table S1, Singh et al., 2019) were collected from AcetoBase. To assess the effect of different sequence lengths, datasets of ∼635 nucleotides (nt), 300 nt and 150 nt were generated.
For all individual mock communities, the following procedure was adopted:
(1) Dataset full-length was created.
(2) Dataset trimmed was created by aligning the sequences to the sequence amplified by the primer pair from  and trimming the full-length sequence to the aligned length of ∼635 nt.
(3) Dataset 300 was created by trimming the dataset trimmed to a sequence length of 300 nt (equivalent to the maximum read length from Illumina MiSeq).
(4) Dataset 150 was created by trimming the dataset trimmed to a sequence length of 150 nt.
(5) All the fasta sequences in individual datasets were converted to fastq sequences using the program fastA2Q (Singh, 2019) and compressed in Samplename_L001_R1_001.fastq.gz format.
(6) Individual datasets were processed with the command acetoscan in the AcetoScan pipeline with cluster thresholds of 80% (default) and 100%.
(7) To check the reproducibility, analyses were repeated three times (data not shown).
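Steps 3 to 5 above can be sketched as follows. This is an illustrative stand-in for the length trimming and the fasta-to-fastq conversion, not the fastA2Q program itself; it assumes that trimming keeps the first N nucleotides and that every base receives the same placeholder quality (Phred 40, Phred+33 encoding). All function names are hypothetical.

```python
import gzip


def parse_fasta(text: str):
    """Yield (name, sequence) tuples from fasta-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def to_fastq(name: str, seq: str, length=None, phred: int = 40) -> str:
    """Format one record as fastq, optionally keeping only the first `length` nt.

    The constant quality string is an assumption: fasta carries no qualities.
    """
    trimmed = seq if length is None else seq[:length]
    quality = chr(phred + 33) * len(trimmed)
    return f"@{name}\n{trimmed}\n+\n{quality}\n"


def fastq_gz_bytes(fasta_text: str, length=None) -> bytes:
    """Convert fasta text to gzip-compressed fastq bytes."""
    records = (to_fastq(n, s, length) for n, s in parse_fasta(fasta_text))
    return gzip.compress("".join(records).encode("ascii"))


# Hypothetical 8 nt record trimmed to 4 nt:
data = fastq_gz_bytes(">otu1 example\nACGTAC\nGT\n", length=4)
print(gzip.decompress(data).decode("ascii"))
```

Writing the returned bytes to a file named in the Samplename_L001_R1_001.fastq.gz pattern would give input in the naming scheme used for the mock datasets.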

FTHFS Sequence Data Analysis With the AcetoScan Pipeline
The raw FTHFS gene amplicon sequence data retrieved from biogas reactor samples in compressed fastq format were used as input. The analysis was carried out separately for the forward (R1) and reverse (R2) reads using the command acetoscan with default parameters: max_length = 300, min_length = 120, Phred quality score = 20, clustering threshold = 80%, minimum cluster size = 2 and e-value = 1e-3. The analysis was also carried out for both read types separately, using a clustering threshold of 100%. All analyses presented in this study were performed on a Debian-based Linux operating system running kernel 5.3.4-40-generic in x86-64 architecture (Computer 1, Table 2).
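As a rough illustration of how the length and quality defaults listed above could act on a single read, the sketch below truncates a read to max_length, discards it if it is shorter than min_length and then applies a mean-Phred cut-off. How AcetoScan actually applies the quality threshold is not stated here, so the mean-quality rule, the Phred+33 encoding and the function name are assumptions.

```python
def filter_read(seq: str, qual: str, max_length: int = 300,
                min_length: int = 120, min_phred: int = 20, offset: int = 33):
    """Return the (possibly truncated) read, or None if it fails the filters.

    Assumed rule set, for illustration only:
      1. truncate to max_length,
      2. drop reads shorter than min_length,
      3. drop reads whose mean Phred score is below min_phred.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    seq, qual = seq[:max_length], qual[:max_length]
    if len(seq) < min_length:
        return None
    mean_quality = sum(ord(c) - offset for c in qual) / len(qual)
    if mean_quality < min_phred:
        return None
    return seq, qual


# A 350 nt read of uniformly high quality is kept but truncated to 300 nt.
kept = filter_read("A" * 350, "I" * 350)
print(len(kept[0]))
```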
To evaluate the computational performance of the AcetoScan pipeline, the analysis was performed on three computers with different specifications (Table 2).

Phylum-Level Mock Community Analysis
The mock community analysis at the phylum level for the dataset full-length resulted in the taxonomic assignment of OTUs with >96% accuracy at the 80% clustering threshold, while for the 100% clustering threshold the taxonomic assignment of OTUs was >99% accurate. For the 80% clustering threshold, only one phylum showed misidentification (3.4% of OTUs belonging to phylum Actinobacteria were misidentified as phylum Firmicutes). However, this misidentification was only 0.6% when the clustering threshold was set to 100%. The taxonomic assignments for the dataset trimmed was >99%

Family-Level Mock Community Analysis
For the family level, the dataset full-length analyzed with clustering thresholds 80% and 100% resulted in the identification and taxonomic assignment of OTUs with accuracy > 99%. In this analysis, particularly for the family Clostridiaceae, <5% of OTUs were misidentified as family Peptostreptococcaceae.
Similar observations were made for the dataset trimmed at both clustering thresholds. Analysis of the dataset 300 at both clustering thresholds resulted in a taxonomic classification of OTUs similar to that obtained for the dataset trimmed, except that 25% of OTUs were affiliated to the family Verrucomicrobiaceae. These could not be annotated with a valid family and were thus denoted "NA." The affiliation of this "NA" was traced back to class and order level as unclassified Verrucomicrobia. Analysis of the dataset 150 at clustering thresholds of 80 and 100% resulted in invalid taxonomic assignments, ranging from 5 to 24% in six out of 29 families (Supplementary Data 2).

Acetogen/Genus-Level Mock Community Analysis
When analyzing the dataset full-length from known acetogens, we observed that OTUs resulting from a clustering threshold of 80% were annotated to either the correct species or closest relative in the same genus. However, at a clustering threshold of 100%, all except three samples were accurately annotated. Of these, Butyribacterium methylotrophicum was correctly annotated as order Clostridiales, while Terrisporobacter glycolicus was correctly annotated as an

FTHFS Gene Amplicon Sequencing
Ultra-deep sequencing of five multiplexed samples on the Illumina MiSeq platform resulted in a sequence data size of 8.2 Gigabytes (compressed size 4.68 Gigabytes). A total of 6.5 million read pairs and, on average, 1.32 million read-pairs per sample were generated. The number of reads generated per sample and the number of reads per sample after quality filtering is presented in Supplementary Table T1.

Analysis of Forward Reads
Forward reads data were analyzed with the command acetoscan with clustering thresholds of 80 and 100%. This generated in total 577 OTUs belonging to 13 phyla and 168 genera, and 1171 OTUs belonging to 13 phyla and 176 genera, respectively (Table 3). The community composition at the family level [relative abundance (RA) > 1%] for both clustering thresholds was similar, except for Tepidimicrobium.NA (RA 1.06%), which was observed in sample GR2-151117 at the 80%, but not the 100%, clustering threshold. At the genus level, 20 genera with relative abundance > 1% were seen at the 80% clustering threshold, while 19 genera with relative abundance > 1% were seen at the 100% clustering threshold. At the 80% clustering threshold, Pyramidobacter was observed in samples GR2-150303 (RA 1.3%) and GR2-150519 (RA 1.35%) and Tepidimicrobium in sample GR2-151117 (1.05%) (Figure 1A). However, these two genera were not seen (RA > 1%) at the 100% clustering threshold, where instead Caldisalinibacter was observed in one sample, GR2-151117 (1.19%) (Figure 1B). At the species level (RA > 5%), community composition was similar to that at the genus level, with only five major species observed at both clustering thresholds. The RA of these species differed marginally between the clustering thresholds (Supplementary Data 4, 5).

Comparison of Forward and Reverse Reads Processing Results
Comparison of data processed at the 100% clustering threshold at the family level for both forward reads and reverse reads showed a very similar pattern for the top five families Clostridiaceae, Hungateiclostridiaceae, Ruminococcaceae, Syntrophomonadaceae and Thermoanaerobacterales family_IV Incertae Sedis. At the genus level, comparison of the forward and reverse reads showed a similar pattern for the four most abundant genera Flavonifractor, Pseudobacteroides, Syntrophomonas, and Mahella. However, genus Hungateiclostridium (RA 2-9%) was only recovered in forward reads results (Figures 1B, 2B) and Clostridium was only observed in reverse reads results (RA 2.9-13%). At the genus level, 24 genera were unique and observed only in the forward reads results, while five genera (Actinobaculum, Anaerobium, Anaerovibrio, Clostridiisalibacter, Paucisalibacillus) appeared uniquely in the reverse reads results (Figure 3A). Even though some differences were seen between forward and reverse reads, a density plot comparing the absolute abundance of the whole dataset from forward and reverse reads at the genus level did not show significant differences in Student's t-test (p = 0.147) (Figure 3B). Moreover, an additional independent two-group Mann-Whitney U test/Wilcoxon test (p = 0.363) and Kruskal-Wallis test (p = 0.363) did not show any significant difference in the absolute abundance of the forward and reverse read datasets at the genus level.
FIGURE 1 | Forward reads processing results: genus-level bar plot of genera with relative abundance > 1% for data processed at a clustering threshold of (A) 80% and (B) 100%. Genera with relative abundance < 1% are merged in the category 'Minor genera'.
FIGURE 2 | Reverse reads processing results: genus-level bar plot of genera with relative abundance > 1% for data processed at a clustering threshold of (A) 80% and (B) 100%. Genera with relative abundance < 1% are merged in the category 'Minor genera'.
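The merging of low-abundance genera used in the bar plots can be sketched as below. This is a minimal illustration rather than the plotting code of AcetoScan; the read counts are invented for the example, and treating a genus at exactly 1% as minor is an assumption.

```python
def relative_abundance(counts: dict) -> dict:
    """Convert read counts per genus into relative abundance in percent."""
    total = sum(counts.values())
    return {genus: 100.0 * n / total for genus, n in counts.items()}


def merge_minor(counts: dict, threshold: float = 1.0,
                label: str = "Minor genera") -> dict:
    """Keep genera with RA above threshold; pool the rest under `label`."""
    merged = {}
    for genus, pct in relative_abundance(counts).items():
        key = genus if pct > threshold else label
        merged[key] = merged.get(key, 0.0) + pct
    return merged


# Hypothetical read counts for one sample (not data from this study).
sample = {"Syntrophomonas": 900, "Mahella": 95, "Pyramidobacter": 5}
print(merge_minor(sample))
```

Because the cut-off is applied per dataset, a genus close to 1% can be plotted by name in one dataset and pooled in another, which is relevant when comparing forward and reverse reads.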

Computational Competence Comparison
The sequence data generated were used to compare the computational performance of the AcetoScan pipeline. Three computers with different specifications were used to process the raw sequence data, and the times required for the analysis were compared. The analysis took the shortest time (35 and 38 min) on computer 1 because of comparatively better hardware (see Table 2). The analysis times on computers 2 and 3 were generally similar for both forward and reverse reads data (∼1 h 28 min), except for reverse reads data processed on computer 2 (∼2 h 27 min). Complete computer specifications are provided in Table 2.

High-Throughput Sequencing of FTHFS Amplicons
To our knowledge, this is the first study to succeed in producing and analyzing high-throughput sequencing data from FTHFS gene amplicons with multiplexed samples. A previous attempt at multiplexed sequencing of the FTHFS gene has been reported (Planý et al., 2019), but the reliability and reproducibility of that analysis cannot be tested as no data were submitted to public repositories/databases. Most previous studies dealing with the FTHFS gene were based on clone library sequencing or terminal restriction fragment length polymorphism (T-RFLP) analysis (e.g., Pester and Brune, 2006; Hori et al., 2011; Westerholm et al., 2011). Both methods are used widely for microbial community analysis and are considered reliable. However, both are also low-throughput, time- and resource-intensive and have many shortcomings. For example, clone library sequencing can give long sequences of good quality but can be biased by ligation, transformation and colony selection. The T-RFLP method can be useful for community dynamics profiling of a large number of samples, but the resolution of the community may be limited, the reliability of restriction fragments can be questioned and there is no taxonomic information associated with the restriction fragments (Avis et al., 2006; Prakash et al., 2014; De Vrieze et al., 2018).
For high-throughput sequencing on Illumina platforms, forward and reverse reads are generally merged during data analysis. To merge the read pairs, the amplicon size must be less than twice the read length generated by the respective Illumina platform. If paired-end reads cannot be merged, single-end reads can be used for the taxonomic annotations (Menzel et al., 2016; Liu et al., 2020). In the case of the FTHFS gene, there has been a lack of a primer pair that can (1) target the high diversity in the whole microbial community and (2) generate amplicons ≤ 590 bp. Until now, several primer pairs have been published, but they are all limited in their ability to target the overall acetogenic community. Among those, primers developed by  covered more diverse FTHFS sequences as compared to other published FTHFS primer pairs. Thus, although these primers generate amplicons of ∼635 bp and therefore read pairs produced from these amplicons on the Illumina MiSeq platform cannot be merged, we used the primers from  for our high-throughput sequencing approach. The effect of un-paired forward and reverse reads on the resulting community profile and the effect of different read lengths on the reliability of taxonomic classification are further reviewed in the discussion.
The high-throughput sequencing of partial FTHFS gene amplicons in this study generated an average of 1.32 M fastq reads (pairs) per sample on Illumina MiSeq. For meta-barcoding studies, a sequencing depth of approximately 15,000 reads is sufficient to enumerate the community structure and diversity of taxa that are not considered rare (Kuczynski et al., 2010;Caporaso et al., 2012;Bukin et al., 2019). Thus, the sequence data generated in our study can be considered sufficient to illustrate accurate and statistically legitimate community composition across samples up to the genus level. With the method, it is possible to sequence large numbers of samples using different strategies, e.g., deep sequencing of fewer samples to find unknown members of the acetogenic community or sufficiently deep sequencing of a larger number of samples from different environments or time-series data to follow and illustrate community changes.
High-throughput sequencing data require a reproducible and accurate data analysis pipeline and a curated database for taxonomic classification. Specific bioinformatics skills are also required, together with computational resources for handling and processing the sequencing data. These skills and resources are not always easily accessible in a cross- or multi-disciplinary research environment and rarely outside the research arena in practical field applications. For this reason, we developed a new high-throughput automated analysis method/pipeline for unsupervised estimation of the acetogenic community.

Testing AcetoScan Pipeline With in silico Mock Community
To test the reproducibility, reliability and accuracy of AcetoScan, mock communities at phylum, family and genus level were constructed and analyzed using different clustering thresholds (80 and 100%) and different sequence lengths (full-length, trimmed, 300 and 150). Repeated analysis of the respective datasets showed reproducible (duplicate) results, where taxonomic assignments of mock communities reliably corresponded with the clustering thresholds and sequence lengths (data not shown for duplicate results). At family and genus level, the taxonomic affiliations were accurate at the 100% clustering threshold for the datasets full-length, trimmed and 300 nt. However, when the sequence clustering threshold was reduced to 80%, only the family level could be reliably classified. The sequence length of 150 nt could not be accurately and reliably used for rare taxa, but still illustrated the overall community dynamics of more abundant taxa and at higher taxonomic levels, i.e., phylum, class, and order.
The differences in community structure at the different clustering thresholds could easily be explained by the fact that any marker gene has its specific percentage clustering thresholds which are required for accurate classification at different taxonomic levels (Qin et al., 2014; Yang et al., 2014). Length of the FTHFS sequence, the origin of the sequence (level of sequence similarity or variation among different taxa) and type of sequence similarity search can also influence taxonomic classification at different clustering threshold levels. A percentage similarity threshold of 78% (class Clostridia) in translated nucleotide versus protein searches has been shown to be sufficient for taxonomic classification at the genus level of FTHFS sequences, and thus the 80% clustering threshold is used as the default cut-off value in the AcetoScan pipeline. The full-length and trimmed datasets helped understand the accuracy of classification, while the 300 and 150 datasets indicated the classification accuracy when using different sequencing platforms, i.e., PacBio, Oxford Nanopore, Illumina MiSeq, HiSeq, NextSeq, or NovaSeq series machines, which produce different read lengths.
Blast best-hit ambiguity and re-classification of taxa might also lead to incomplete or false classification, as observed for the acetogen mock community. At the 100% clustering threshold, Terrisporobacter glycolicus, a member of the family Peptostreptococcaceae, was assigned correctly at the family level but with no further classification, despite T. glycolicus being present in the AcetoBase reference dataset. This might be due to the blast best-hit ambiguity. Furthermore, [Butyribacterium] methylotrophicum, also present in AcetoBase, was assigned at order level as Clostridiales, probably because genus [Butyribacterium] has not yet been classified and named as a valid genus.
[Clostridium] sticklandii has been renamed Acetoanaerobium sticklandii, but its homotypic synonym is still in use, and might instead be taxonomically assigned as Acetoanaerobium noterae.

Analysis of High-Throughput Sequencing Data With AcetoScan
Processing FTHFS forward and reverse reads with AcetoScan resulted in similar community dynamics for both at clustering thresholds of 80% and also 100%. Aberrations in the taxonomic profiling, with certain taxa being present in one dataset and not in another (Figures 1B, 2B), have two possible explanations: (i) Low-abundance taxa identified in one dataset might be below the visualization threshold in the other dataset, and thus be merged. This is supported by the fact that the genera differing between forward and reverse reads had <1% relative abundance. There were no statistically significant differences (Student's t-test, p = 0.147) between the community profiles generated from the forward and reverse reads when absolute abundance at the genus level was compared, and thus the differences were negligible. (ii) Lack of read pair merging, resulting in individual processing of the respective read type leading to a slight deviation in the read count matrix. This could result in slight variations in the community structure presented in the plots for the respective sequence read types, although such variations might not be statistically significant in whole datasets. The AcetoScan pipeline can process long fastq sequences and does not require read pairs. It can, therefore, be used for data generated on sequencing platforms other than Illumina MiSeq or for fasta sequences generated by traditional clone libraries using the Sanger sequencing method. The AcetoScan pipeline is available for Debian Linux and macOS operating systems as well as a Docker container, and can perform the analysis even on a laptop computer with standard hardware specifications. Since a high-performance computer is not easily available to all, the ability of AcetoScan to run on a laptop computer is a strong benefit of our bioinformatics analysis pipeline.

The Vigilance for Acetogens
Acetogenesis is a complex physiological trait and not a phylogenetic/genomic property (e.g., Drake, 1994; Tanner and Woese, 1994; Küsel et al., 2001; Drake et al., 2002; Schuchmann and Müller, 2016; Singh et al., 2019). It should thus be referred to as a flexible functional characteristic, rather than a rigid group of a few taxonomic ranks. The FTHFS gene encodes an important enzyme of the WLP and has been used widely to assess the acetogenic population. It can also be present in other groups of microorganisms, such as syntrophic acid-oxidizing bacteria, sulfate-reducing bacteria and archaea. The mere presence of a FTHFS gene does not define a bacterium as an acetogen, as proposed in a recent publication by Kim et al. (2020). There are instances where the terms acetogens, homoacetogens and acetogenesis are still not properly understood and have been misused (Das and Ljungdahl, 2000; Küsel and Drake, 2011; Müller and Frerichs, 2013; Singh et al., 2019). The AcetoScan pipeline does not claim that a detected FTHFS OTU is an 'acetogen,' and we propose a three-step procedure which must be followed before an OTU can be defined as an acetogen: (1) identification of the bacterial candidate with the FTHFS gene; (2) identification of the WLP genes in the genome of the candidate; and (3) experimental validation of acetogenic metabolism in an environment which sustains acetogenesis. Steps 1 and 2 can be conducted using AcetoScan and AcetoBase. To avoid misuse of the terms acetogen, acetogenic community or acetogenesis, we suggest that an organism should be considered a candidate with the potential for acetogenesis if its genome contains the complete WLP or at least its main enzyme-encoding genes, i.e., formyltetrahydrofolate synthetase, acetate kinase and acetyl-CoA synthase/carbon monoxide dehydrogenase complex.
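The three-step procedure can be written down as a small decision rule. The sketch below is only an illustration of the logic proposed in this section; the function name, the status labels and the use of plain gene-name strings for genome annotations are assumptions.

```python
# Main enzyme-encoding genes of the WLP named in the text.
MAIN_WLP_GENES = frozenset({
    "formyltetrahydrofolate synthetase",
    "acetate kinase",
    "acetyl-CoA synthase/carbon monoxide dehydrogenase complex",
})


def acetogen_status(genome_genes, experimentally_validated: bool = False) -> str:
    """Classify an organism following the three-step procedure.

    genome_genes: iterable of annotated gene/enzyme names (hypothetical format).
    experimentally_validated: True only if acetogenic metabolism was shown
    experimentally in an environment that sustains acetogenesis (step 3).
    """
    genes = set(genome_genes)
    if "formyltetrahydrofolate synthetase" not in genes:
        return "no FTHFS gene detected"
    if not MAIN_WLP_GENES <= genes:           # step 1 only
        return "FTHFS-carrying organism"
    if not experimentally_validated:          # steps 1 and 2
        return "candidate with potential for acetogenesis"
    return "acetogen"                         # steps 1, 2 and 3


print(acetogen_status(["formyltetrahydrofolate synthetase"]))
```

The rule makes explicit that sequence data alone can place an organism no higher than the candidate level.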

Future Perspectives
Acetogens are among the most versatile organisms on the planet. This metabolic versatility lies in their ability to grow on the thermodynamic borderline and produce the organic precursor acetyl-CoA by reduction of CO2 (Drake and Küsel, 2003; Lever, 2012; Schuchmann and Müller, 2014). Some acetogens are reported to harbor a unique hydrogen-dependent CO2 reductase that has the highest biological hydrogen production rates known (Müller, 2019). In industrial processes, acetogens are used as microbial cell factories to trap CO2 and produce biofuels from syngas (Liew et al., 2016). In recent years, acetogenic bacteria have also been the focus of studies on microbial fuel cells, where electricity is generated from microorganism-powered batteries, and on microbial electrosynthesis systems, where surplus renewable electric power can be used to synthesize organic compounds (Nevin et al., 2011; Parameswaran et al., 2011; Scott and Yu, 2015; Saheb-Alam et al., 2018). Therefore, acetogens are important microorganisms for the circular bio-economy and for mitigating climate change (Oren, 2012; Liew et al., 2016; Müller, 2019; Wiechmann and Müller, 2019). Acetogens are also abundantly present in the human gut, but their role in human gut physiology and the gut-brain relationship requires further investigation (e.g., Gibson et al., 1990; Leclerc et al., 1997; Ohashi et al., 2007; Rey et al., 2010; Laverde Gomez et al., 2019). Moreover, some acetogens have been found to associate with plants in aquatic habitats and can fix atmospheric nitrogen (Küsel et al., 1999; Pester and Brune, 2006; Ohkuma et al., 2015). Therefore, acetogens are ubiquitous and prominent microorganisms in the ecosystem, and more focused and extensive studies are needed to decode their interconnection with humans and other organisms. Consequently, our method and AcetoScan can be of great importance in the exploration of potential acetogenic communities in different natural environments.

CONCLUSION
A novel pipeline, AcetoScan, was validated with several sets of in silico mock communities and successfully used for the analysis of the acetogenic community in multiplexed biogas reactor samples. AcetoScan is designed for rapid and accurate analysis of FTHFS amplicon sequence data and taxonomic annotation using AcetoBase with minimum or no user supervision and results in interactive and publication-ready graphs and plots from raw sequence data. AcetoScan does not necessarily require a high-performance cluster computer and analysis can be performed on a standard computer, with analysis time varying depending on the computer configuration. Our sequencing approach and AcetoScan analysis pipeline can become the method of choice for research on natural or constructed environments where the acetogenic community and its dynamics are important.

DATA AVAILABILITY STATEMENT
The FTHFS OTU sequences generated from analysis of high-throughput sequencing data have been submitted to AcetoBase. Accession numbers of the respective OTU dataset are presented in Table 3. The raw data generated by Illumina MiSeq have been submitted to NCBI SRA (study: SRP257947, run IDs: SRR11590656:SRR11590660) with BioProject accession number PRJNA627452 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA627452). The AcetoScan pipeline, with test data, user manual and instruction video, is available on the AcetoScan GitHub repository (http://github.com/abhijeetsingh1704/AcetoScan). The Docker image of the AcetoScan pipeline can be accessed on the Docker hub (https://hub.docker.com/r/abhijeetsingh1704/acetoscan).

AUTHOR CONTRIBUTIONS
ASi, ASc, and BM conceived the idea for the study. ASc acquired funding for the project. ASi performed the experiment and data analysis and developed the AcetoScan pipeline, together with JN and in discussion with ASc and BM. EB-R critically assessed the set-up of the AcetoScan pipeline. ASi wrote the manuscript with valuable help from all co-authors. All the authors contributed to the article and approved the submitted version.