<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Bioinform.</journal-id>
<journal-title>Frontiers in Bioinformatics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Bioinform.</abbrev-journal-title>
<issn pub-type="epub">2673-7647</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">893933</article-id>
<article-id pub-id-type="doi">10.3389/fbinf.2022.893933</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Bioinformatics</subject>
<subj-group>
<subject>Technology and Code</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Snaq: A Dynamic Snakemake Pipeline for Microbiome Data Analysis With QIIME2</article-title>
<alt-title alt-title-type="left-running-head">Mohsen et al.</alt-title>
<alt-title alt-title-type="right-running-head">Snaq QIIME2 Automation Pipeline</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Mohsen</surname>
<given-names>Attayeb</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1267868/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chen</surname>
<given-names>Yi-An</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/775952/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Allendes Osorio</surname>
<given-names>Rodolfo S.</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1715987/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Higuchi</surname>
<given-names>Chihiro</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Mizuguchi</surname>
<given-names>Kenji</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/775939/overview"/>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Artificial Intelligence Center for Health and Biomedical Research (ArCHER)</institution>, <institution>National Institutes of Biomedical Innovation, Health and Nutrition</institution>, <addr-line>Osaka</addr-line>, <country>Japan</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Institute for Protein Research</institution>, <institution>Osaka University</institution>, <addr-line>Osaka</addr-line>, <country>Japan</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/306576/overview">Keith A. Crandall</ext-link>, George Washington University, United States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/306807/overview">Eduardo Castro-Nallar</ext-link>, University of Talca, Chile</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/38036/overview">Jan Aerts</ext-link>, Amador Bioscience, Belgium</p>
<p>Hannes Peeters, Hasselt University, Belgium, in collaboration with reviewer JA</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Attayeb Mohsen, <email>attayeb@nibiohn.go.jp</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Genomic Analysis, a section of the journal Frontiers in Bioinformatics</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>01</day>
<month>07</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>2</volume>
<elocation-id>893933</elocation-id>
<history>
<date date-type="received">
<day>11</day>
<month>03</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>19</day>
<month>05</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Mohsen, Chen, Allendes Osorio, Higuchi and Mizuguchi.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Mohsen, Chen, Allendes Osorio, Higuchi and Mizuguchi</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Optimizing and automating a protocol for 16S microbiome data analysis with QIIME2 is a challenging task. It involves a multi-step process with multiple parameters and options that need to be tested and determined. In this article, we describe Snaq, a Snakemake pipeline that helps automate and optimize 16S data analysis using QIIME2. Snaq offers an informative file naming system and automatically performs the analysis of a data set, including downloading and installing the required databases and classifiers, all through a single command-line instruction. It works natively on Linux and Mac and on Windows through the use of containers, and is potentially extendable by adding new rules. This pipeline will substantially reduce the effort of issuing commands and prevent the confusion caused by the accumulation of analysis results due to testing multiple parameters.</p>
</abstract>
<kwd-group>
<kwd>snakemake</kwd>
<kwd>QIIME2</kwd>
<kwd>microbiome</kwd>
<kwd>16S</kwd>
<kwd>automation</kwd>
</kwd-group>
<contract-sponsor id="cn001">Ministry of Health, Labour and Welfare<named-content content-type="fundref-id">10.13039/501100003478</named-content>
</contract-sponsor>
</article-meta>
</front>
<body>
<sec id="s1">
<title>Introduction</title>
<p>The microbial content of a biological sample can be determined by sequencing and the subsequent bioinformatic processing and analysis of the sequenced data. 16S ribosomal RNA gene sequencing (<xref ref-type="bibr" rid="B18">Hugerth and Andersson, 2017</xref>) is one of the most intensively used approaches in microbiome research. It is also called amplicon sequencing since it incorporates the amplification of a specific DNA region (the 16S rRNA gene) in bacterial genomes using PCR. Accordingly, software tools such as QIIME 2 (<xref ref-type="bibr" rid="B6">Bolyen et al., 2019</xref>) and Mothur (<xref ref-type="bibr" rid="B30">Schloss et al., 2009</xref>), to name only two, have been developed for the processing and analysis of this type of data. For a detailed description of the 16S amplicon approach, please refer to (<xref ref-type="bibr" rid="B16">Go&#x142;&#x119;biewski and Tretyn, 2019</xref>); for a comparison between tools that can be used for its analysis, we recommend (<xref ref-type="bibr" rid="B6">Bolyen et al., 2019</xref>; <xref ref-type="bibr" rid="B25">Prodan et al., 2020</xref>).</p>
<p>QIIME2 (<xref ref-type="bibr" rid="B6">Bolyen et al., 2019</xref>) is a microbiome data analysis platform that targets amplicon (16S) data. It relies on third-party software programs implemented as plugins, such as feature-classifier (<xref ref-type="bibr" rid="B5">Bokulich N. A. et al., 2018</xref>) for taxonomic classification. QIIME2 is designed to facilitate seamless incorporation of new plugins, allowing developers to add new features easily<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref>.</p>
<p>QIIME2 plugins handle input and output through the definition of <italic>artifacts</italic>, i.e., compressed folders that contain both data files and metadata information. For example, raw sequence data can be imported to construct an artifact, which is later used by a specific plugin. In turn, the plugin produces a new artifact of a different type as output.</p>
<p>This approach makes it possible to change the order of the steps or insert new steps in the middle without extra effort, provided that the input and output follow the QIIME2 framework guidelines. It also makes combining multiple tools in sequence straightforward and reduces the programming skills required. Moreover, importing or exporting data from/to various formats, and visualizing the original data or the results, can be achieved easily.</p>
<p>Despite its multiple advantages, some difficulties arise when trying to automate data analysis using QIIME2. Even when the same data types and the same experimental technique (such as sample preparation procedure or sequencing technology) are used, the results of the analysis depend on multiple environmental and technical features, such as the length and quality of the sequenced data. As the choice of the bioinformatics tools and processing parameters (such as quality trimming threshold or sequence similarity) used for the analysis depends on the data set, every data set can be considered unique and in need of special treatment, making the analysis very difficult, if not impossible, to automate.</p>
<p>For example, the process of <italic>quality trimming</italic> is used to clean the data and improve the results by removing the nucleotides assigned with low confidence. Selecting the appropriate quality threshold value at which trimming should be performed depends on the data set, and usually requires trial and error. If a very stringent trimming threshold is adopted, a large amount of good data could be lost; alternatively, adopting a loose trimming threshold introduces low-quality data in the downstream analysis, affecting the quality and reliability of the results. Moreover, there are different tools and databases for data cleaning, identifying Operational Taxonomic Units (OTUs), and taxonomy assignment. The choice of these tools will affect the final result, and hence fine tuning is required in order to find the best options.</p>
<p>Usually, researchers need to investigate multiple options, which requires running the analysis several times and comparing the results to decide the best set of tools and parameters for the data under investigation. Such an optimization process requires a substantial effort and leads to the accumulation of several copies of the results with different sets of parameters, making the whole process rather inefficient and difficult to reproduce.</p>
<p>By combining the analysis strengths of QIIME2 with the flexibility in the definition of pipelines provided by Snakemake, here we introduce &#x201c;Snaq&#x201d;, a dynamic Snakemake pipeline for microbiome data analysis with QIIME2.</p>
<p>Snaq incorporates the definition of analysis rules with the definition of an expressive target file format, which together provide the functionality required to achieve the following when working with QIIME2:</p>
<p>1) Faster protocol optimization: By changing the name of the target file, the analysis workflow dynamically changes, allowing testing of different tools and parameters. This is crucial, as the analysis of 16S microbiome data with QIIME2 can be performed in multiple ways with numerous permutations of software and parameter choices, depending on the technology used and sequencing qualities, and the researcher&#x2019;s preference.</p>
<p>2) Full pipeline automation: Combining rule definition with an ad-hoc target file name, Snaq allows the execution of a full analysis pipeline through a single command instruction. As 16S microbiome data analysis with QIIME2 entails submitting multiple commands, this significantly reduces the number of commands and instructions that the user needs to know, allowing them to focus on the actual analysis and not the programming.</p>
<p>3) Handle data accumulation: Snaq automatically handles the (intermediate) data that are often generated as a result of multiple trial runs. Additionally, it avoids the duplication of intermediate result files when multiple executions of different analysis pipelines include identical intermediate stages.</p>
</sec>
<sec id="s2">
<title>Related Work</title>
<p>Various tools and efforts have been developed to make QIIME 2 more accessible and easy to use.</p>
<p>Fung et al. introduced the QIIME2 automation pipeline (QAP), a series of scripts that can be used to run multiple QIIME2 protocols (<xref ref-type="bibr" rid="B14">Fung et al., 2021</xref>). In addition, their paper gives detailed explanations of many steps and descriptions of their results. Multiple commands need to be executed to run the analysis using QAP; it also provides more options and follows different approaches than Snaq (<xref ref-type="bibr" rid="B14">Fung et al., 2021</xref>).</p>
<p>
<xref ref-type="bibr" rid="B12">Estaki et al. (2020)</xref> provided a comprehensive description of QIIME2. They also, with the help of Jupyter notebooks, provide examples of running end to end analysis using QIIME2.</p>
<p>In an effort closer to Snaq, Hu and Alexander implemented a Snakemake pipeline for QIIME2 analysis, designed to run with parameters specified through a configuration file (<xref ref-type="bibr" rid="B17">Hu and Alexander, 2020</xref>). Due to its design, changing parameters requires modifying the manifest and configuration files. Additionally, tasks like trimming and taxonomy assignment are not covered.</p>
<p>Dadasnake is another example of a Snakemake pipeline that automates DADA2 analysis outside the setting of the QIIME2 framework (<xref ref-type="bibr" rid="B31">Wei&#xdf;becker et al., 2020</xref>).</p>
<p>Also worth mentioning at this point is the Galaxy project, an open-source platform that allows users to do data analysis within the FAIR initiative (<xref ref-type="bibr" rid="B1">Afgan et al., 2018</xref>). Included in its directory of tools is q2Galaxy, a comprehensive interface for QIIME2 <ext-link ext-link-type="uri" xlink:href="https://github.com/qiime2/q2galaxy">
<monospace>https://github.com/qiime2/q2galaxy</monospace>
</ext-link>. q2Galaxy makes performing microbiome data analysis easier, especially when Docker is used for its installation.</p>
<p>Although the above mentioned tools are available and help automate the analysis of 16S data, none of them (with the only exception of q2Galaxy) provides an easy way to run an analysis multiple times as is usually required for optimization purposes. Also, they tend to have fixed steps and/or do not make it easy to change the sequence of steps and parameters used.</p>
<p>To address these issues, we propose a pipeline that makes it easy to modify the key parameters used by different tools, simply by modifying the target file name. We also expect our single-command approach to make it easier to run an analysis multiple times without the user having to worry about the handling of intermediate output results.</p>
</sec>
<sec id="s3">
<title>Implementation</title>
<p>Snakemake (<xref ref-type="bibr" rid="B20">Koster and Rahmann, 2012</xref>; <xref ref-type="bibr" rid="B23">M&#xf6;lder et al., 2021</xref>) is a Python dialect created for the specification of pipeline workflows. A Snakemake pipeline is specified through the definition of <italic>rules</italic>; where each rule typically has: <italic>input</italic> for the specification of input files; <italic>output</italic> for the specification of output files; and <italic>shell</italic> for the specification of the command used to produce the output based on the input.</p>
<p>The execution of a Snakemake pipeline is achieved via the definition of a single target file name. Snakemake will then determine the steps required to produce the target output based on its rules, the file name, and the application of the wildcards concept<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref>.</p>
<p>The wildcards concept facilitates passing parameters for any rule in the pipeline by inferring the parameter&#x2019;s value from the target file name. This feature of Snakemake is especially suitable for parameter optimization. There are other advanced features, such as caching processing results, to prevent doing the same analysis repeatedly (<xref ref-type="bibr" rid="B20">Koster and Rahmann, 2012</xref>).</p>
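<p>The way a parameter value can be read back from a file name may be easier to see in a small sketch. The following plain-Python snippet is only an illustration of the idea and is not Snakemake code; the pattern and the function name <monospace>infer_wildcards</monospace> are hypothetical, and the file name is modeled on the quality-trimming targets described later in this article.</p>
<preformat>
```python
import re

# Hypothetical pattern (an assumption, not taken from the Snaq Snakefile),
# modeled on targets of the form results/{dataset}/{dataset}+bb-t{threshold}.qza
TRIM_TARGET = re.compile(r"results/([A-Z]+)/([A-Z]+)\+bb-t(\d+)\.qza")


def infer_wildcards(target: str) -> dict:
    """Return the values encoded in a quality-trimming target file name."""
    match = TRIM_TARGET.fullmatch(target)
    # The dataset name must appear identically in the folder and the file name.
    if match is None or match.group(1) != match.group(2):
        raise ValueError(f"not a trimming target: {target}")
    return {"dataset": match.group(1), "threshold": int(match.group(3))}


print(infer_wildcards("results/AB/AB+bb-t18.qza"))
# prints {'dataset': 'AB', 'threshold': 18}
```
</preformat>
<p>In Snakemake itself this matching is performed by the workflow engine against the output patterns of the rules, so a sketch like this is useful only for reasoning about which values a given target name would carry.</p>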
<p>Snaq is made of three main components: 1) <monospace>Snakefile</monospace>, 2) <monospace>env</monospace> folder and 3) <monospace>scripts</monospace> folder. <monospace>Snakefile</monospace> is the file where all the required Snakemake rules are implemented; notice that these rules were carefully constructed so as not to conflict with each other and to make the whole process run smoothly. The <monospace>env</monospace> folder contains the definitions for the Conda (<xref ref-type="bibr" rid="B2">Anaconda, 2020</xref>) environments as a series of YAML files, while the <monospace>scripts</monospace> folder contains extra scripts required by Snaq to fill the gaps of the pipeline that are not covered by QIIME2 plugins.</p>
<p>Snaq takes advantage of QIIME2&#x2019;s command-line interface and available plugins and combines them with our implementation of new Python and R scripts. Then, by incorporating a descriptive file name convention and the rule-based structure of Snakemake, it makes possible the definition and execution of dynamically defined pipelines through a single terminal command.</p>
<p>Snaq can be used on personal computers or server environments. It works on Linux and Mac operating systems<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref>. It can also be run directly from the available Docker and Singularity containers. All analysis takes place in the Snaq home folder, where the input files need to be stored inside the <monospace>data</monospace> folder, and all results will be saved in the <monospace>results</monospace> folder.</p>
<sec id="s3-1">
<title>Descriptive File Convention</title>
<p>To make the pipeline versatile and easily modifiable, we adopted a convention of including all the key parameter values inside a target file name and called this scheme descriptive target file naming (<xref ref-type="fig" rid="F1">Figure 1</xref>). At the same time, other parameters are left as default. This means that Snakemake will parse the target file name and infer the sequence of steps and the parameter values used. Then the target file will be created accordingly.</p>
<p>To let Snakemake infer the required steps and their order, we used a predetermined output nomenclature for each stage (<xref ref-type="table" rid="T1">Table 1</xref>).</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Output nomenclature.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Term</th>
<th align="center">Explanation</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<monospace>fp-f{x}-r{y}</monospace>
</td>
<td align="left">fp: stands for the fastp application; x: the number of nucleotides to be cropped from the forward read; y: the number of nucleotides to be cropped from the reverse read</td>
</tr>
<tr>
<td align="left">
<monospace>bb-t{t}</monospace>
</td>
<td align="left">bb: stands for the bbduk application; t: the trimming threshold applied</td>
</tr>
<tr>
<td align="left">
<monospace>dd</monospace>
</td>
<td align="left">dd: stands for DADA2 algorithm</td>
</tr>
<tr>
<td align="left">
<monospace>cls-{x}</monospace>
</td>
<td align="left">cls: stands for taxonomy classifier, x: takes one of the values: &#x201c;gg&#x201d; for Greengenes, &#x201c;silva&#x201d; for SILVA classifier and &#x201c;silvaV34&#x201d; for SILVA classifier trained on V3 and V4 regions</td>
</tr>
<tr>
<td align="left">
<monospace>rrf-d{x}</monospace>
</td>
<td align="left">rrf: stands for rarefaction; x: the sampling depth used for rarefaction</td>
</tr>
<tr>
<td align="left">
<monospace>alphadiversity</monospace>
</td>
<td align="left">alpha diversity</td>
</tr>
<tr>
<td align="left">
<monospace>beta</monospace>
</td>
<td align="left">beta: stands for beta diversity</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The stages of analysis in the target file name are divided using the character &#x201c;&#x2b;&#x201d; (plus sign). For example, let us consider the case of the target file shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. Here, we are requesting Snaq to produce a summarized result (indicated by the extension .<monospace>zip</monospace>) for the input data located in folder <monospace>data/AB/</monospace>
<xref ref-type="fn" rid="fn4">
<sup>4</sup>
</xref>. Sequentially, <monospace>bb-t18</monospace> indicates a trimming stage with threshold value 18; <monospace>fp-f17-r21</monospace> indicates the use of fastp with a forward cropping value of 17 and a reverse cropping value of 21; <monospace>dd</monospace> indicates the use of the DADA2 algorithm; <monospace>cls-gg</monospace> requests a taxonomy classification using Greengenes; and finally <monospace>rrf-d10000</monospace> indicates the use of rarefaction with a sampling depth of 10,000. It is worth noting that the order of the analysis follows the order of stages.</p>
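<p>The structure of such a name can be illustrated by splitting it at the plus signs. The snippet below is a minimal sketch, not part of Snaq; the function <monospace>split_target</monospace> is hypothetical, and the example string is reconstructed from the stage-by-stage description above, so its exact match with the name shown in Figure 1 is an assumption.</p>
<preformat>
```python
from pathlib import PurePosixPath

# Reconstructed from the description in the text; assumed to correspond to
# the target file name of Figure 1.
EXAMPLE_TARGET = "results/AB/AB+bb-t18+fp-f17-r21+dd+cls-gg+rrf-d10000.zip"


def split_target(target: str) -> dict:
    """Split a descriptive target file name into dataset, stages and extension."""
    path = PurePosixPath(target)
    # The first "+"-separated field is the dataset; the rest are ordered stages.
    dataset, *stages = path.stem.split("+")
    return {"dataset": dataset, "stages": stages, "extension": path.suffix}


parsed = split_target(EXAMPLE_TARGET)
print(parsed["dataset"])    # AB
print(parsed["stages"])     # ['bb-t18', 'fp-f17-r21', 'dd', 'cls-gg', 'rrf-d10000']
print(parsed["extension"])  # .zip
```
</preformat>
<p>The list preserves the order in which the stages appear, which mirrors the statement that the order of the analysis follows the order of stages.</p>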
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Descriptive target file name example. A target file is required for Snaq to execute correctly.</p>
</caption>
<graphic xlink:href="fbinf-02-893933-g001.tif"/>
</fig>
<p>We believe that in most cases, users will use Snaq through the definition of a single target file and a single command-line instruction. However, intermediate result files can also be produced upon request by using the corresponding target file. For example, if a user only wanted to import the dataset into a QIIME2 artifact, this could be done by using <monospace>results/AB/AB.qza</monospace> as the target file name. Similarly, if trimming were to be added, the corresponding target file would be <monospace>results/AB/AB&#x2b;bb-t18.qza</monospace>, and so on. Notice that stages are typically added in a forward fashion; this means that later analysis stages cannot be added to the target file name without their previous stages also being part of it.</p>
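<p>The forward addition of stages can be sketched as a running prefix of the target file name. The helper below is hypothetical and covers only the stages that precede DADA2, where every intermediate result is a <monospace>.qza</monospace> artifact; later stages use additional suffixes and are not handled by this sketch.</p>
<preformat>
```python
def intermediate_targets(dataset: str, stages: list, extension: str = ".qza") -> list:
    """List the import target, then one target per stage added in order."""
    prefix = f"results/{dataset}/{dataset}"
    targets = [prefix + extension]
    for stage in stages:
        # Each new stage is appended to the name built so far.
        prefix = f"{prefix}+{stage}"
        targets.append(prefix + extension)
    return targets


for target in intermediate_targets("AB", ["bb-t18", "fp-f17-r21"]):
    print(target)
# results/AB/AB.qza
# results/AB/AB+bb-t18.qza
# results/AB/AB+bb-t18+fp-f17-r21.qza
```
</preformat>
<p>Any of the printed names could in principle be requested on its own as a target, which is how an intermediate result would be obtained without running the remaining stages.</p>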
<p>The multiple stages specified in the target file name define an execution pipeline, as the one shown in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Directed Acyclic Graph showing the key steps in the Snaq pipeline. Notice that the given target file will determine a specific traversal of the graph. Color of nodes is used to indicate the analysis stage, as proposed in the implementation section.</p>
</caption>
<graphic xlink:href="fbinf-02-893933-g002.tif"/>
</fig>
<p>In the following, we provide a detailed description of the pipeline stages and their corresponding<xref ref-type="fn" rid="fn5">
<sup>5</sup>
</xref> descriptive file conventions:<list list-type="simple">
<list-item>
<p>&#x2022; Import data: This stage imports FASTQ files from the source folder to a QIIME2 artifact (qza). Notice that, to avoid any confusion, the dataset needs to be named using only capital letters. The result of this step is stored in <monospace>results/AB/AB.qza</monospace>. The command required to run this step is: <monospace>snakemake --use-conda --cores 10 results/AB/AB.qza</monospace>. Hereafter, we will omit the command and options (<monospace>snakemake --use-conda --cores 10</monospace>), and focus only on the target file name.</p>
</list-item>
<list-item>
<p>&#x2022; Primer cropping: This stage uses fastp (<xref ref-type="bibr" rid="B10">Chen et al., 2018</xref>) to crop a specified number of nucleotides in both reads. The format of the target is <monospace>fp-fX-rY</monospace> where X represents the number of nucleotides to be cropped from the 5&#x2032; end of the forward reads (R1), and Y is the number of nucleotides to be cropped from the reverse reads (R2). For example, in order to add the cropping of 17 bases from the 5&#x2032; end of R1 and 21 bases from R2 to our previously loaded dataset, the target would be: <monospace>results/AB/AB&#x2b;fp-f17-r21.qza</monospace>
</p>
</list-item>
<list-item>
<p>&#x2022; Quality trimming: This stage uses bbduk (part of the bbmap tools) (<xref ref-type="bibr" rid="B7">Bushnell, 2021</xref>) to trim the low-quality section at the end of the reads in both R1 and R2. The target format for this step is <monospace>bb-tX</monospace>, where X represents the trimming threshold. To add quality trimming of reads with a threshold of 18, the target file name becomes:</p>
<list list-type="simple">
<list-item>
<p>
<monospace>results/AB/AB&#x2b;fp-f17-r21&#x2b;bb-t18.qza</monospace>
</p>
</list-item>
</list>
</list-item>
<list-item>
<p>&#x2022; Both primer cropping and quality trimming procedures are optional (can be omitted) and their order can be reversed. For example, the following target file names are also valid:</p>
<list list-type="simple">
<list-item>
<p>
<monospace>results/AB/AB&#x2b;bb-t18.qza</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21.qza</monospace>
</p>
</list-item>
</list>
</list-item>
<list-item>
<p>&#x2022; DADA2 algorithm: The DADA2 (<xref ref-type="bibr" rid="B9">Callahan et al., 2016</xref>) stage filters the reads, joins pairs, and removes chimeras, producing amplicon sequence variants (ASVs) that replace the OTUs of traditional clustering methods such as UCLUST (<xref ref-type="bibr" rid="B8">Callahan et al., 2017</xref>). As a result, three different outputs are generated: an ASV frequency table <monospace>(dd_table.qza)</monospace>, a table of representative sequences for ASVs <monospace>(dd_seq.qza)</monospace> and the statistics of DADA2 performance <monospace>(dd_stats.qza)</monospace>. Using any one of these targets will trigger the generation of all three files, for example:</p>
<list list-type="simple">
<list-item>
<p>
<monospace>results/AB/AB&#x2b;fp-f17-r21&#x2b;bb-t18&#x2b;dd_seq.qza</monospace>
</p>
</list-item>
<list-item>
<p>When the DADA2 algorithm stage is part of a longer pipeline, its inclusion in the target file name can be simply identified by using the word dd (see next section&#x2019;s target file name).</p>
</list-item>
</list>
</list-item>
<list-item>
<p>&#x2022; Taxonomy assignment: This stage uses the &#x201c;feature-classifier&#x201d; plugin (<xref ref-type="bibr" rid="B4">Bokulich NA. et al., 2018</xref>; <xref ref-type="bibr" rid="B29">Robeson et al., 2021</xref>) to predict the taxonomy of ASVs. Three different classifiers are available: Greengenes <monospace>(cls-gg)</monospace> (<xref ref-type="bibr" rid="B11">DeSantis et al., 2006</xref>; <xref ref-type="bibr" rid="B21">McDonald et al., 2011</xref>), SILVA <monospace>(cls-silva)</monospace> (<xref ref-type="bibr" rid="B26">Pruesse et al., 2007</xref>; <xref ref-type="bibr" rid="B15">Gl&#xf6;ckner et al., 2017</xref>), and SILVA trained on V3 and V4 regions <monospace>(cls-silvaV34)</monospace> (<xref ref-type="bibr" rid="B22">Mohsen, 2021</xref>). The resulting output can be generated either as a QIIME2 artifact <monospace>(cls-&#x3c;classifier&#x3e;_taxonomy.qza)</monospace> or as a tab-separated file <monospace>(cls-&#x3c;classifier&#x3e;_taxonomy.tsv)</monospace>. For the Greengenes classifier, the two target file name alternatives would be:</p>
<list list-type="simple">
<list-item>
<p>
<monospace>results/AB/AB&#x2b;fp-f17-r21&#x2b;bb-t18&#x2b;dd&#x2b;cls-gg_taxonomy.qza</monospace> and <monospace>results/AB/AB&#x2b;fp-f17-r21&#x2b;bb-t18&#x2b;dd&#x2b;cls-gg_taxonomy.tsv</monospace>
</p>
</list-item>
</list>
</list-item>
<list-item>
<p>&#x2022; Phylogenetic tree building: This step uses the fasttree algorithm (<xref ref-type="bibr" rid="B24">Price et al., 2010</xref>) and QIIME2 phylogeny plugin (<xref ref-type="bibr" rid="B27">qiime2, 2021</xref>) to produce a phylogenetic tree file in NWK format (using <monospace>fasttree.nwk</monospace> as target) or QIIME2 artifact (using <monospace>fasttree_rooted.qza</monospace> as target). Notice that, since the building of a phylogenetic tree can be done directly after the DADA2 algorithm, the following is a valid target file name: <monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21&#x2b;dd&#x2b;fasttree.nwk</monospace>
</p>
</list-item>
<list-item>
<p>&#x2022; Rarefaction: The inclusion of a rarefaction stage is indicated by using the <monospace>rrf-dX</monospace> target, where <monospace>X</monospace> represents the sampling depth as defined in (<xref ref-type="bibr" rid="B19">Hughes and Hellmann, 2005</xref>). Notice that rarefaction needs to be applied before alpha diversity, non-phylogenetic beta diversity measurements or the generation of biom tables. To apply rarefaction, the following target file names can be used to generate a QIIME2 artifact or a tab-separated value file, respectively:</p>
<list list-type="simple">
<list-item>
<p>
<monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21&#x2b;dd_table&#x2b;rrf-d10000.qza</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21&#x2b;dd_table&#x2b;rrf-d10000.tsv</monospace>
</p>
</list-item>
<list-item>
<p>In this step, the part _table was not omitted because rarefaction affects only the table and because in the following stages, this rarefied table is to be used to create biom tables and manta files.</p>
</list-item>
</list>
</list-item>
<list-item>
<p>&#x2022; Diversity measurement: At this stage, QIIME2 is used for the computation of alpha (simpson, chao1, shannon and observed features) and beta diversities. Whilst the target for alpha diversity is simply <monospace>alphadiversity</monospace>, different types of beta diversity are specified through the target <monospace>beta_&#x3c;type&#x3e;</monospace>, where <monospace>&#x3c;type&#x3e;</monospace> is one of the following: <monospace>braycurtis, jaccard, unweightedunifrac</monospace> or <monospace>weightedunifrac</monospace>. Sample target file names are as follows:</p>
<list list-type="simple">
<list-item>
<p>
<monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21&#x2b;dd&#x2b;rrf-d10000&#x2b;alphadiversity.tsv</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21&#x2b;dd&#x2b;cls-gg&#x2b;rrf-d10000&#x2b;beta_braycurtis.tsv</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21&#x2b;dd&#x2b;cls-gg&#x2b;rrf-d10000&#x2b;beta_weightedunifrac.qza</monospace>
</p>
</list-item>
<list-item>
<p>Notice that additional alpha and beta diversity measures can be added by modifying the scripts that define them and that can be found inside the <monospace>scripts</monospace> folder of Snaq.</p>
</list-item>
</list>
</list-item>
<list-item>
<p>&#x2022; Summary: Bearing in mind that users need to link the analysis made with Snaq to other software tools, we prepared a series of special targets that generate results ready to be used elsewhere:</p>
<list list-type="simple">
<list-item>
<p>&#x2022; Phyloseq: Generates a Phyloseq object in an RDS file that can be easily imported into an R environment for subsequent analysis steps. This object includes the ASV table, taxonomy, and phylogenetic tree without rarefaction.</p>
</list-item>
<list-item>
<p>&#x2022; Biom: Produces a biom table with taxonomy after rarefaction.</p>
</list-item>
<list-item>
<p>&#x2022; Manta: Produces Manta-ready input files that can be easily uploaded to Manta for results storage and further analysis.</p>
<p>Finally, a special <monospace>zip</monospace> file can be produced, as the one shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, that includes all content summarized in <xref ref-type="table" rid="T2">Table 2</xref>. A complete list of the files produced for an example analysis process is provided in <xref ref-type="sec" rid="s11">Supplementary Table S1</xref>.</p>
</list-item>
</list>
</list-item>
<list-item>
<p>&#x2022; Quality control: An additional, optional step that runs fastqc (<xref ref-type="bibr" rid="B3">Andrews, 2010</xref>) and/or multiqc (<xref ref-type="bibr" rid="B13">Ewels et al., 2016</xref>). Unlike other targets, the results of this stage are saved to a separate <monospace>quality</monospace> folder, in our example <monospace>results/AB/quality</monospace>. It is executed by using targets of the following form: <monospace>results/AB/quality/AB&#x2b;bb-t18/multiqc/</monospace> or <monospace>results/AB/quality/AB/multiqc/</monospace>.
</p>
<list list-type="simple">
<list-item>
<p>Quality control can be applied either to the original data <monospace>AB</monospace> or to any intermediate result obtained before the DADA2 step. As a special case, the target <monospace>results/AB/quality_summary/</monospace> combines the fastqc reports from all subfolders of the <monospace>quality</monospace> folder into a new <monospace>quality_summary</monospace> folder.</p>
</list-item>
</list>
</list-item>
</list>
</p>
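<p>The target names listed above follow a regular pattern, which the following Python sketch tries to make explicit. It is an illustration only: the helper <monospace>build_target</monospace> is hypothetical and not part of Snaq, and it assumes that the dataset name, every step token, and the output name are joined with a plus sign, as in the sample diversity targets.</p>
<preformat>
```python
from pathlib import PurePosixPath


def build_target(dataset, steps, output, results_dir="results"):
    """Join dataset, step tokens and output name into a target path.

    Hypothetical helper (not part of Snaq); it only mirrors the naming
    pattern of the sample targets shown in the text.
    """
    chain = "+".join([dataset, *steps, output])
    return str(PurePosixPath(results_dir) / dataset / chain)


# Step tokens copied from the sample targets in the text.
core = ["bb-t18", "fp-f17-r21", "dd"]
rarefy = "rrf-d10000"

# Alpha diversity target.
print(build_target("AB", core + [rarefy], "alphadiversity.tsv"))

# One beta diversity target per metric named in the text.
for metric in ("braycurtis", "jaccard", "unweightedunifrac", "weightedunifrac"):
    print(build_target("AB", core + ["cls-gg", rarefy], f"beta_{metric}.tsv"))
```
</preformat>
<p>Under these assumptions, the first printed name matches the alpha diversity sample target, and changing a single token (for example the rarefaction depth) yields the name of a different analysis.</p>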
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Summarized output. All file names are preceded by the parameters of the analysis.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Group</th>
<th align="center">File name</th>
<th align="center">Content</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="3" align="left">DADA2</td>
<td align="left">
<monospace>cls-gg_taxonomy.tsv</monospace>
</td>
<td align="left">tsv file of taxonomy assigned to ASVs</td>
</tr>
<tr>
<td align="left">
<monospace>table&#x002B;rrf-d10000.tsv</monospace>
</td>
<td align="left">Rarefied DADA2 features table</td>
</tr>
<tr>
<td align="left">
<monospace>dd_seq.tsv</monospace>
</td>
<td align="left">ASV sequences produced by DADA2</td>
</tr>
<tr>
<td align="left">Phyloseq</td>
<td align="left">
<monospace>phyloseq.RDS</monospace>
</td>
<td align="left">R phyloseq package object of data on ASV level</td>
</tr>
<tr>
<td rowspan="2" align="left">MANTA</td>
<td align="left">
<monospace>manta.tsv</monospace>
</td>
<td align="left">Taxonomy and abundance output formatted for upload to a MANTA database</td>
</tr>
<tr>
<td align="left">
<monospace>manta_tax.tsv</monospace>
</td>
<td align="left">Manta taxonomy ID and taxonomy names</td>
</tr>
<tr>
<td rowspan="2" align="left">Biom</td>
<td align="left">
<monospace>otu_tax.biom</monospace>
</td>
<td align="left">OTU BIOM table of the output collapsed to species level (ASV ignored)</td>
</tr>
<tr>
<td align="left">
<monospace>otu_tax_biom.tsv</monospace>
</td>
<td align="left">OTU BIOM table saved as tsv</td>
</tr>
<tr>
<td rowspan="5" align="left">Diversity</td>
<td align="left">
<monospace>alphadiversity.tsv</monospace>
</td>
<td align="left">Table of alpha diversity for all samples</td>
</tr>
<tr>
<td align="left">
<monospace>beta_braycurtis.tsv</monospace>
</td>
<td align="left">Braycurtis beta diversity</td>
</tr>
<tr>
<td align="left">
<monospace>beta_jaccard.tsv</monospace>
</td>
<td align="left">Jaccard beta diversity</td>
</tr>
<tr>
<td align="left">
<monospace>beta_unweightedunifrac.tsv</monospace>
</td>
<td align="left">Unweighted Unifrac distance between samples</td>
</tr>
<tr>
<td align="left">
<monospace>beta_weightedunifrac.tsv</monospace>
</td>
<td align="left">Weighted Unifrac distance between samples</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="s3-2">
<title>Installation</title>
<p>The only prerequisite on Linux or Mac is the installation of both Conda (<xref ref-type="bibr" rid="B2">Anaconda, 2020</xref>) and Mamba (<xref ref-type="bibr" rid="B28">QuantStack Scientific Computing, 2021</xref>). They are required to manage the running environment and to install QIIME2 and the other required tools automatically. We recommend running Snaq after activating the Snakemake environment installed using Conda. On Windows, Snaq runs inside a Docker container, so Docker is the only requirement. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the file structure after installation.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Folder structure of Snaq after installation (the classifiers and quality folders are created at a later stage if required). The contents of the data and results folders vary according to use after installation. Note that all data sub-folders must be named using capital letters.</p>
</caption>
<graphic xlink:href="fbinf-02-893933-g003.tif"/>
</fig>
<p>The input data of Snaq are paired-end FASTQ files. Snaq automatically distinguishes the two reads of each pair by one of two identifiers, <monospace>_R1_</monospace> or <monospace>_1.fastq</monospace>. If other identifiers are used, a manifest file needs to be prepared and saved as <monospace>results/AB/AB_manifest.tsv</monospace> following the QIIME2 manifest file instructions. If this file is present, the first step of creating a manifest file is skipped.</p>
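<p>For data sets with non-standard file names, a manifest could be prepared with a short script. The Python sketch below is not the code used by Snaq; it assumes the tab-separated paired-end manifest layout of QIIME2 (columns <monospace>sample-id</monospace>, <monospace>forward-absolute-filepath</monospace> and <monospace>reverse-absolute-filepath</monospace>), infers the reverse-read identifiers <monospace>_R2_</monospace> and <monospace>_2.fastq</monospace> by analogy with the forward ones, and takes the text before the identifier as the sample name. All function names are hypothetical, and the patterns should be adapted to the identifiers actually used.</p>
<preformat>
```python
import csv
import re
from pathlib import Path

# Forward identifiers are those named in the text; the reverse ones
# are an assumption made by analogy.
FORWARD = re.compile(r"(_R1_|_1\.fastq)")
REVERSE = re.compile(r"(_R2_|_2\.fastq)")


def read_direction(filename):
    """Return 'forward', 'reverse', or None for an unrecognised file."""
    if FORWARD.search(filename):
        return "forward"
    if REVERSE.search(filename):
        return "reverse"
    return None


def sample_id(filename):
    """Assumed convention: the sample name precedes the read identifier."""
    match = FORWARD.search(filename) or REVERSE.search(filename)
    return filename[: match.start()] if match else None


def manifest_rows(fastq_dir):
    """Collect (sample, forward path, reverse path) for complete pairs."""
    pairs = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        direction = read_direction(path.name)
        if direction is None:
            continue
        pairs.setdefault(sample_id(path.name), {})[direction] = str(path.resolve())
    return [
        (sid, reads["forward"], reads["reverse"])
        for sid, reads in sorted(pairs.items())
        if "forward" in reads and "reverse" in reads
    ]


def write_manifest(fastq_dir, manifest_path):
    """Write the rows as a tab-separated manifest file."""
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(
            ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]
        )
        writer.writerows(manifest_rows(fastq_dir))
```
</preformat>
<p>In this sketch, a call such as <monospace>write_manifest("data/AB", "results/AB/AB_manifest.tsv")</monospace> would write the manifest to the location named above; samples with only one read file are left out rather than guessed.</p>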
<p>Other classifiers can also be added to the <monospace>classifiers/</monospace> folder, and the file name can then be used in the target file name. For example, if a classifier is named &#x201c;abc&#x201d; and saved as <monospace>abc-classifier.qza</monospace>, it can be used with the target <monospace>results/AB/AB&#x2b;bb-t18&#x2b;fp-f17-r21&#x2b;dd&#x2b;cls-abc&#x2b;rrf-d1000.zip</monospace> without any further modifications.</p>
</sec>
</sec>
<sec sec-type="results|discussion" id="s4">
<title>Results and Discussion</title>
<p>Input data must be saved in a new folder, named after the dataset, inside the data folder <monospace>&#x3c;snaq folder&#x3e;/data/</monospace>. Dataset names should consist of capital letters only, without numbers or special characters, to avoid confusing them with the terms reserved for the different pipeline stages. Once input data is available and the first step of the analysis is executed, Snaq (through Snakemake) automatically builds the Conda environment required for that step and downloads the QIIME2 plugins specified in the corresponding environment YAML file. Environment description files are located in the <monospace>&#x3c;snaq folder&#x3e;/envs/</monospace> folder. This makes the installation of the necessary software and the download of taxonomy classifiers an automatic process that is performed only the first time it is required.</p>
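<p>The naming constraint on dataset folders can be checked before an analysis is started. The following Python sketch is an assumption-laden illustration, not part of Snaq: it reads &#x201c;capital letters without numbers or special characters&#x201d; as one or more ASCII uppercase letters, and the function name <monospace>is_valid_dataset_name</monospace> is hypothetical.</p>
<preformat>
```python
import re

# Assumption: a valid dataset name is one or more ASCII capital letters.
DATASET_NAME = re.compile(r"[A-Z]+")


def is_valid_dataset_name(name):
    """Return True when the whole name consists of capital letters."""
    return DATASET_NAME.fullmatch(name) is not None


# Names with lowercase letters, digits, or separators are rejected.
for candidate in ("AB", "GUT", "ab", "AB2", "A+B"):
    print(candidate, is_valid_dataset_name(candidate))
```
</preformat>
<p>Rejecting names that contain characters such as the plus sign or the hyphen keeps the dataset name distinguishable from the step tokens that follow it in a target file name.</p>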
<p>Although Snaq does not cover all the possible uses of QIIME2 and related platforms in 16S data analysis, it provides a complete pipeline that can be extended by adding new rules or modifying the existing ones. Moreover, the descriptive target file name strategy makes it easier for Snaq to decide which steps to run and which to skip. It also gives developers the freedom to modify the pipeline and add new rules alongside the current ones, as a different sequence of rules can be followed depending on the target file name.</p>
<p>Compared to the pipelines mentioned above, Snaq allows dynamic modification of key parameters by modifying the target file name. It also provides a more straightforward installation process and clear output. Moreover, Snaq allows running multiple data sets in the same pipeline setting by having multiple folders in the data folder.</p>
<p>The descriptive output file name gives considerable freedom to extend the pipeline. New tools are added through new rules in the Snakefile. For example, to add trimmomatic, the user simply needs to add the corresponding rule. For each new tool added to Snaq, the user needs to assign a unique identifier and identify any key parameters used by the tool. The identifier and parameters are then used in the Snakefile rule to specify the target that will later be used to define an execution pipeline<xref ref-type="fn" rid="fn6">
<sup>6</sup>
</xref>.</p>
<p>Continuing our trimmomatic example, let us use <monospace>tm</monospace> as the identifier and consider a single parameter. The rule in the Snakefile would then follow the structure shown in Code 1, whilst an example target name could be <monospace>results/AB/AB&#x2b;tm-p12&#x2b;bb-t18</monospace>, where <monospace>tm</monospace> identifies trimmomatic and <monospace>p12</monospace> specifies a value for its parameter.<list list-type="simple">
<list-item>
<p>
<monospace>rule NAME:</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>input:</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>"input-file", "other-input-file"</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>output:</monospace>
</p>
</list-item>
<list-item>
<p>
<monospace>"&#x3c;previous-step&#x3e;&#x2b;tm-pvalue.qza"</monospace>
</p>
</list-item>
<list-item>
<p>&#x2026;</p>
</list-item>
<list-item>
<p>
<monospace>shell: "&#x3c;command&#x3e; {input} {output}"</monospace>
</p>
</list-item>
</list>
</p>
<p>Code 1. Pseudo-code of a rule used for the incorporation of trimmomatic to Snaq.</p>
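<p>To show how such a target name carries the pipeline definition, the following Python sketch splits a target into its dataset and step tokens. It is only an illustration of the naming scheme: in Snaq the matching is done by Snakemake through the rule outputs, and the helper <monospace>parse_target</monospace> is hypothetical. The sketch assumes that steps are separated by a plus sign and that the identifier and its parameters are separated by hyphens.</p>
<preformat>
```python
from pathlib import PurePosixPath


def parse_target(target):
    """Split a target path into (dataset, [(identifier, [parameters])]).

    Hypothetical helper (not part of Snaq) that only mirrors the
    naming scheme described in the text.
    """
    name = PurePosixPath(target).name
    dataset, *tokens = name.split("+")
    steps = []
    for token in tokens:
        identifier, *params = token.split("-")
        steps.append((identifier, params))
    return dataset, steps


print(parse_target("results/AB/AB+tm-p12+bb-t18"))
# prints ('AB', [('tm', ['p12']), ('bb', ['t18'])])
```
</preformat>
<p>Each identifier in the result corresponds to one rule, so a new tool only needs an identifier that does not clash with the existing ones.</p>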
</sec>
<sec sec-type="conclusion" id="s5">
<title>Conclusion</title>
<p>We have introduced Snaq, a Snakemake pipeline for QIIME2 16S data analysis, including data QC and trimming.</p>
<p>Snaq wraps the QIIME2 processing of paired-end FASTQ files generated by Illumina sequencers in order to support automation and optimization and to manage data storage. It requires minimal installation and configuration effort and can run on all major operating systems. The user needs only a single command to run the pipeline, with the required analysis parameters defined in the target file name.</p>
<p>Snaq can be installed directly from GitHub into a user-specified location. Note that the installation directory needs enough free space for all input and intermediate data sets, together with the final results of any particular analysis. Free space is also required for the software and databases used in the analysis.</p>
<p>Snaq is designed to be dynamic through its custom target file naming system. Modifying key parameters within the target name also helps the user efficiently perform a series of iterative analyses, automatically reusing previously calculated intermediate steps and keeping track of results.</p>
<p>Snaq is easy to install and run. Moreover, it can be extended according to the users&#x2019; needs by adding new rules.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://github.com/attayeb/snaq">https://github.com/attayeb/snaq</ext-link>.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>Conceptualization: AM and KM; Methodology: AM, Y-AC, and KM; Software: AM; Validation: AM, CH, and Y-AC; Writing-Original Draft: AM; Writing-review and editing: AM, KM, Y-AC, and RA. All the authors reviewed the final draft and approved the submission.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This study was in part supported by The Ministry of Health and Welfare of Japan and Public/Private R&#x26;D Investment Strategic Expansion Program: PRISM (to KM).</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ack>
<p>We are grateful to Jonguk Park, Hitoshi Kawashima, and Lokesh P. Tripathi for their valuable suggestions and help.</p>
</ack>
<sec id="s11">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fbinf.2022.893933/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fbinf.2022.893933/full&#x23;supplementary-material</ext-link>.</p>
<supplementary-material xlink:href="DataSheet1.PDF" id="SM1" mimetype="application/PDF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>For details on the architecture of QIIME2 visit <ext-link ext-link-type="uri" xlink:href="https://dev.qiime2.org/latest/architecture/">https://dev.qiime2.org/latest/architecture/</ext-link>. Last accessed: 24 January 2022.</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>If a rule named &#x201c;all&#x201d; is defined, it is possible to run a predefined workflow without the specification of a target file.</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>Some tools required by QIIME2 do not run natively on Windows environments. Windows users could potentially use Snaq through containers, the Windows Subsystem for Linux, and/or virtual machines.</p>
</fn>
<fn id="fn4">
<label>4</label>
<p>Although the command indicates the location of the result files, it also maps automatically to the input folder with the given name under data/.</p>
</fn>
<fn id="fn5">
<label>5</label>
<p>All examples assume the use of an input dataset located in a source folder named data/AB.</p>
</fn>
<fn id="fn6">
<label>6</label>
<p>Details for Snakemake rule definition can be found at <ext-link ext-link-type="uri" xlink:href="https://snakemake.readthedocs.io/en/stable">https://snakemake.readthedocs.io/en/stable</ext-link>.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Afgan</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Batut</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>van den Beek</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bouvier</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Cech</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>The Galaxy Platform for Accessible, Reproducible and Collaborative Biomedical Analyses: 2018 Update</article-title>. <source>Nucleic Acids Res.</source> <volume>46</volume>, <fpage>W537</fpage>&#x2013;<lpage>W544</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gky379</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="web">
<collab>Anaconda</collab> (<year>2020</year>). <article-title>Anaconda Software Distribution</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://docs.anaconda.com/">https://docs.anaconda.com/</ext-link>
</comment> (<comment>Accessed December 20, 2022</comment>). </citation>
</ref>
<ref id="B3">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Andrews</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Fastqc: A Quality Control Tool for High Throughput Sequence Data</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/">https://www.bioinformatics.babraham.ac.uk/projects/fastqc/</ext-link>
</comment> (<comment>Accessed November 11, 2022</comment>). </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bokulich</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>Kaehler</surname>
<given-names>B. D.</given-names>
</name>
<name>
<surname>Rideout</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Dillon</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bolyen</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R.</given-names>
</name>
<etal/>
</person-group> (<year>2018b</year>). <article-title>Optimizing Taxonomic Classification of Marker-Gene Amplicon Sequences with QIIME 2&#x27;s Q2-Feature-Classifier Plugin</article-title>. <source>Microbiome</source> <volume>6</volume>, <fpage>90</fpage>. <pub-id pub-id-type="doi">10.1186/s40168-018-0470-z</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bokulich</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>Kaehler</surname>
<given-names>B. D.</given-names>
</name>
<name>
<surname>Rideout</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Dillon</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bolyen</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Knight</surname>
<given-names>R.</given-names>
</name>
<etal/>
</person-group> (<year>2018a</year>). <article-title>Optimizing Taxonomic Classification of Marker-Gene Amplicon Sequences with QIIME 2&#x27;s Q2-Feature-Classifier Plugin</article-title>. <source>Microbiome</source> <volume>6</volume>, <fpage>90</fpage>. <pub-id pub-id-type="doi">10.1186/s40168-018-0470-z</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bolyen</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Rideout</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Dillon</surname>
<given-names>M. R.</given-names>
</name>
<name>
<surname>Bokulich</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>Abnet</surname>
<given-names>C. C.</given-names>
</name>
<name>
<surname>Al-Ghalith</surname>
<given-names>G. A.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using QIIME 2</article-title>. <source>Nat. Biotechnol.</source> <volume>37</volume>, <fpage>852</fpage>&#x2013;<lpage>857</lpage>. <pub-id pub-id-type="doi">10.1038/s41587-019-0209-9</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Bushnell</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Bbmap Short Read Aligner, and Other Bioinformatic Tools</article-title>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="https://sourceforge.net/projects/bbmap/">https://sourceforge.net/projects/bbmap/</ext-link>
</comment>. </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Callahan</surname>
<given-names>B. J.</given-names>
</name>
<name>
<surname>McMurdie</surname>
<given-names>P. J.</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>S. P.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Exact Sequence Variants Should Replace Operational Taxonomic Units in Marker-Gene Data Analysis</article-title>. <source>ISME J.</source> <volume>11</volume>, <fpage>2639</fpage>&#x2013;<lpage>2643</lpage>. <pub-id pub-id-type="doi">10.1038/ismej.2017.119</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Callahan</surname>
<given-names>B. J.</given-names>
</name>
<name>
<surname>McMurdie</surname>
<given-names>P. J.</given-names>
</name>
<name>
<surname>Rosen</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>A. W.</given-names>
</name>
<name>
<surname>Johnson</surname>
<given-names>A. J.</given-names>
</name>
<name>
<surname>Holmes</surname>
<given-names>S. P.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>DADA2: High-Resolution Sample Inference from Illumina Amplicon Data</article-title>. <source>Nat. Methods</source> <volume>13</volume>, <fpage>581</fpage>&#x2013;<lpage>583</lpage>. <pub-id pub-id-type="doi">10.1038/nmeth.3869</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Gu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Fastp: An Ultra-Fast All-In-One FASTQ Preprocessor</article-title>. <source>Bioinformatics</source> <volume>34</volume>, <fpage>i884</fpage>&#x2013;<lpage>i890</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bty560</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>DeSantis</surname>
<given-names>T. Z.</given-names>
</name>
<name>
<surname>Hugenholtz</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Larsen</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Rojas</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Brodie</surname>
<given-names>E. L.</given-names>
</name>
<name>
<surname>Keller</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2006</year>). <article-title>Greengenes, a Chimera-Checked 16s rRNA Gene Database and Workbench Compatible with ARB</article-title>. <source>Appl. Environ. Microbiol.</source> <volume>72</volume>, <fpage>5069</fpage>&#x2013;<lpage>5072</lpage>. <pub-id pub-id-type="doi">10.1128/aem.03006-05</pub-id> </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Estaki</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Bokulich</surname>
<given-names>N. A.</given-names>
</name>
<name>
<surname>McDonald</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Gonz&#xe1;lez</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kosciolek</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>QIIME 2 Enables Comprehensive End-To-End Analysis of Diverse Microbiome Data and Comparative Studies with Publicly Available Data</article-title>. <source>Curr. Protoc. Bioinforma.</source> <volume>70</volume>, <fpage>e100</fpage>. <pub-id pub-id-type="doi">10.1002/cpbi.100</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ewels</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Magnusson</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lundin</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>K&#xe4;ller</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>MultiQC: Summarize Analysis Results for Multiple Tools and Samples in a Single Report</article-title>. <source>Bioinformatics</source> <volume>32</volume>, <fpage>3047</fpage>&#x2013;<lpage>3048</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btw354</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fung</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Rusling</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lampeter</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Love</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Karim</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bongiorno</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Automation of QIIME2 Metagenomic Analysis Platform</article-title>. <source>Curr. Protoc.</source> <volume>1</volume>, <fpage>e254</fpage>. <pub-id pub-id-type="doi">10.1002/cpz1.254</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gl&#xf6;ckner</surname>
<given-names>F. O.</given-names>
</name>
<name>
<surname>Yilmaz</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Quast</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gerken</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Beccati</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Ciuprina</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>25 Years of Serving the Community with Ribosomal RNA Gene Reference Databases and Tools</article-title>. <source>J. Biotechnol.</source> <volume>261</volume>, <fpage>169</fpage>&#x2013;<lpage>176</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbiotec.2017.06.1198</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Go&#x142;&#x119;biewski</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Tretyn</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Generating Amplicon Reads for Microbial Community Assessment with Next-Generation Sequencing</article-title>. <source>J. Appl. Microbiol.</source> <volume>128</volume>, <fpage>330</fpage>&#x2013;<lpage>354</lpage>. <pub-id pub-id-type="doi">10.1111/jam.14380</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>S. K.</given-names>
</name>
<name>
<surname>Alexander</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Pipeline to Run Qiime2 with Snakemake</article-title>. <comment>
<italic>GitHub Repository</italic>. Available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/shu251/tagseq-qiime2-snakemake">https://github.com/shu251/tagseq-qiime2-snakemake</ext-link> (Accessed on 09 04, 2021).</comment> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hugerth</surname>
<given-names>L. W.</given-names>
</name>
<name>
<surname>Andersson</surname>
<given-names>A. F.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Analysing Microbial Community Composition Through Amplicon Sequencing: From Sampling to Hypothesis Testing</article-title>. <source>Front. Microbiol.</source> <volume>8</volume>, <fpage>1561</fpage>. <pub-id pub-id-type="doi">10.3389/fmicb.2017.01561</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hughes</surname>
<given-names>J. B.</given-names>
</name>
<name>
<surname>Hellmann</surname>
<given-names>J. J.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>The Application of Rarefaction Techniques to Molecular Inventories of Microbial Diversity</article-title>. <source>Methods in Enzymology</source> <volume>397</volume>, <fpage>292</fpage>&#x2013;<lpage>308</lpage>. <pub-id pub-id-type="doi">10.1016/s0076-6879(05)97017-1</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>K&#xf6;ster</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Rahmann</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Snakemake&#x2013;a Scalable Bioinformatics Workflow Engine</article-title>. <source>Bioinformatics</source> <volume>28</volume>, <fpage>2520</fpage>&#x2013;<lpage>2522</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/bts480</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>McDonald</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Price</surname>
<given-names>M. N.</given-names>
</name>
<name>
<surname>Goodrich</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Nawrocki</surname>
<given-names>E. P.</given-names>
</name>
<name>
<surname>DeSantis</surname>
<given-names>T. Z.</given-names>
</name>
<name>
<surname>Probst</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2011</year>). <article-title>An Improved Greengenes Taxonomy with Explicit Ranks for Ecological and Evolutionary Analyses of Bacteria and Archaea</article-title>. <source>ISME J.</source> <volume>6</volume>, <fpage>610</fpage>&#x2013;<lpage>618</lpage>. <pub-id pub-id-type="doi">10.1038/ismej.2011.139</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mohsen</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <source>Qiime2 Classifiers</source>. <pub-id pub-id-type="doi">10.5281/ZENODO.5535616</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>M&#xf6;lder</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Jablonski</surname>
<given-names>K. P.</given-names>
</name>
<name>
<surname>Letcher</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>M. B.</given-names>
</name>
<name>
<surname>Tomkins-Tinch</surname>
<given-names>C. H.</given-names>
</name>
<name>
<surname>Sochat</surname>
<given-names>V.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Sustainable Data Analysis with Snakemake</article-title>. <source>F1000Res.</source> <volume>10</volume>, <fpage>33</fpage>. <pub-id pub-id-type="doi">10.12688/f1000research.29032.2</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Price</surname>
<given-names>M. N.</given-names>
</name>
<name>
<surname>Dehal</surname>
<given-names>P. S.</given-names>
</name>
<name>
<surname>Arkin</surname>
<given-names>A. P.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>FastTree 2--approximately Maximum-Likelihood Trees for Large Alignments</article-title>. <source>PLoS ONE</source> <volume>5</volume>, <fpage>e9490</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0009490</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Prodan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tremaroli</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Brolin</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zwinderman</surname>
<given-names>A. H.</given-names>
</name>
<name>
<surname>Nieuwdorp</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Levin</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Comparing Bioinformatic Pipelines for Microbial 16s rRNA Amplicon Sequencing</article-title>. <source>PLoS ONE</source> <volume>15</volume>, <fpage>e0227434</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0227434</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pruesse</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Quast</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Knittel</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Fuchs</surname>
<given-names>B. M.</given-names>
</name>
<name>
<surname>Ludwig</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Peplies</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2007</year>). <article-title>SILVA: a Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB</article-title>. <source>Nucleic Acids Res.</source> <volume>35</volume>, <fpage>7188</fpage>&#x2013;<lpage>7196</lpage>. <pub-id pub-id-type="doi">10.1093/nar/gkm864</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="web">
<collab>qiime2</collab> (<year>2021</year>). <article-title>q2/q2-phylogeny</article-title>. <comment>GitHub repository. Available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/qiime2/q2-phylogeny">https://github.com/qiime2/q2-phylogeny</ext-link>
</comment> (<comment>Accessed January 23, 2022</comment>). </citation>
</ref>
<ref id="B28">
<citation citation-type="web">
<collab>QuantStack Scientific Computing</collab> (<year>2021</year>). <article-title>Mamba, the Fast Cross Platform Package Manager</article-title>. <comment>GitHub repository. Available at: <ext-link ext-link-type="uri" xlink:href="https://github.com/mamba-org/mamba">https://github.com/mamba-org/mamba</ext-link>
</comment> (<comment>Accessed October 12, 2021</comment>). </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Robeson</surname>
<given-names>M. S.</given-names>
</name>
<name>
<surname>O&#x2019;Rourke</surname>
<given-names>D. R.</given-names>
</name>
<name>
<surname>Kaehler</surname>
<given-names>B. D.</given-names>
</name>
<name>
<surname>Ziemski</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Dillon</surname>
<given-names>M. R.</given-names>
</name>
<name>
<surname>Foster</surname>
<given-names>J. T.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>RESCRIPt: Reproducible Sequence Taxonomy Reference Database Management for the Masses</article-title>. <source>PLoS Comput. Biol.</source> <volume>17</volume> (<issue>11</issue>), <fpage>e1009581</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pcbi.1009581</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schloss</surname>
<given-names>P. D.</given-names>
</name>
<name>
<surname>Westcott</surname>
<given-names>S. L.</given-names>
</name>
<name>
<surname>Ryabin</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Hall</surname>
<given-names>J. R.</given-names>
</name>
<name>
<surname>Hartmann</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hollister</surname>
<given-names>E. B.</given-names>
</name>
<etal/>
</person-group> (<year>2009</year>). <article-title>Introducing Mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities</article-title>. <source>Appl. Environ. Microbiol.</source> <volume>75</volume>, <fpage>7537</fpage>&#x2013;<lpage>7541</lpage>. <pub-id pub-id-type="doi">10.1128/AEM.01541-09</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wei&#xdf;becker</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Schnabel</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Heintz-Buschart</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Dadasnake, a Snakemake Implementation of DADA2 to Process Amplicon Sequencing Data for Microbial Ecology</article-title>. <source>GigaScience</source> <volume>9</volume>, <fpage>135</fpage>. <pub-id pub-id-type="doi">10.1093/gigascience/giaa135</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>