<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Microbiol.</journal-id>
<journal-title>Frontiers in Microbiology</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Microbiol.</abbrev-journal-title>
<issn pub-type="epub">1664-302X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmicb.2020.02067</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Microbiology</subject>
<subj-group>
<subject>Technology and Code</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>KmerGO: A Tool to Identify Group-Specific Sequences With <italic>k</italic>-mers</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Wang</surname> <given-names>Ying</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/452253/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname> <given-names>Qi</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Deng</surname> <given-names>Chao</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Zheng</surname> <given-names>Yiluan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Sun</surname> <given-names>Fengzhu</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="corresp" rid="c002"><sup>&#x002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/47840/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Automation, Xiamen University</institution>, <addr-line>Xiamen</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision-Making</institution>, <addr-line>Xiamen</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California</institution>, <addr-line>Los Angeles, CA</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Qunfeng Dong, Loyola University Chicago, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Yanni Sun, City University of Hong Kong, Hong Kong; Junhua Li, Beijing Genomics Institute (BGI), China</p></fn>
<corresp id="c001">&#x002A;Correspondence: Ying Wang, <email>wangying@xmu.edu.cn</email></corresp>
<corresp id="c002">Fengzhu Sun, <email>fsun@usc.edu</email></corresp>
<fn fn-type="other" id="fn004"><p>This article was submitted to Evolutionary and Genomic Microbiology, a section of the journal Frontiers in Microbiology</p></fn>
</author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>08</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>11</volume>
<elocation-id>2067</elocation-id>
<history>
<date date-type="received">
<day>07</day>
<month>06</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>08</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x00A9; 2020 Wang, Chen, Deng, Zheng and Sun.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Wang, Chen, Deng, Zheng and Sun</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract>
<p>Capturing group-specific sequences between two groups of genomic/metagenomic sequences is critical for the follow-up identification of single nucleotide variants (SNVs), gene families, microbial species, or other elements associated with each group. In our study, a sequence that is present or rich in one group but absent or scarce in the other is considered a &#x201C;group-specific&#x201D; sequence. We developed a user-friendly tool, KmerGO, to identify group-specific sequences between two groups of genomic/metagenomic long sequences or high-throughput sequencing datasets. Compared with other tools, KmerGO captures group-specific <italic>k</italic>-mers (<italic>k</italic> up to 40 bp) with much lower computing resource requirements and a much shorter running time. For a 1.05 TB dataset (.fasta), KmerGO takes about 21.5 h (including <italic>k</italic>-mer counting) to return assembled group-specific sequences on a regular stand-alone workstation, using no more than 1 GB of memory. KmerGO can also capture trait-associated sequences for continuous traits. KmerGO uses multi-process parallel computing and provides both a graphical user interface and a command line on Linux and Windows, without requiring any pre-installed supporting environments, packages, or complex configuration. The group-specific <italic>k</italic>-mers or sequences output by KmerGO can serve as input to other tools for the downstream discovery of biomarkers, such as genetic variants, species, or genes. KmerGO is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/ChnMasterOG/KmerGO">https://github.com/ChnMasterOG/KmerGO</ext-link>.</p>
</abstract>
<kwd-group>
<kwd>group-specific <italic>k</italic>-mer</kwd>
<kwd>sequences comparison</kwd>
<kwd>high-throughput sequencing data</kwd>
<kwd>genomic comparison</kwd>
<kwd>metagenomic comparison</kwd>
</kwd-group>
<contract-sponsor id="cn001">National Natural Science Foundation of China<named-content content-type="fundref-id">10.13039/501100001809</named-content></contract-sponsor><contract-sponsor id="cn002">Natural Science Foundation of Fujian Province<named-content content-type="fundref-id">10.13039/501100003392</named-content></contract-sponsor>
<counts>
<fig-count count="4"/>
<table-count count="2"/>
<equation-count count="0"/>
<ref-count count="21"/>
<page-count count="8"/>
<word-count count="0"/>
</counts>
</article-meta>
</front>
<body>
<sec id="S1">
<title>Implementation</title>
<sec id="S1.SS1">
<title>Background</title>
<p>Rapid advances in high-throughput sequencing technologies have produced large volumes of shotgun genomic/metagenomic data. Comparing high-throughput sequencing data across phenotypes is critical for understanding the mechanisms behind their differences.</p>
<p>Short <italic>k</italic>-mer (<italic>k</italic> &#x003C; 15) based measures, such as <inline-formula><mml:math id="INEQ1"><mml:msubsup><mml:mi>d</mml:mi><mml:mn>2</mml:mn><mml:mi>s</mml:mi></mml:msubsup></mml:math></inline-formula>, <inline-formula><mml:math id="INEQ2"><mml:msubsup><mml:mi>d</mml:mi><mml:mn>2</mml:mn><mml:mo>&#x002A;</mml:mo></mml:msubsup></mml:math></inline-formula> and <italic>CVtree</italic>, calculate the dissimilarity between sequences or high-throughput sequencing samples (<xref ref-type="bibr" rid="B7">Jiang et al., 2012</xref>; <xref ref-type="bibr" rid="B11">Liao et al., 2016</xref>; <xref ref-type="bibr" rid="B18">Song et al., 2019</xref>) using global statistical models. Based on long <italic>k</italic>-mers (<italic>k</italic> &#x003E; 21), Mash (<xref ref-type="bibr" rid="B12">Ondov et al., 2016</xref>), Skmer (<xref ref-type="bibr" rid="B17">Sarmashghi et al., 2019</xref>), and Kmer-db (<xref ref-type="bibr" rid="B1">Deorowicz et al., 2018</xref>) use MinHash to approximate the Jaccard distance between pairs of sequences from a small, randomly sampled set of <italic>k</italic>-mers. However, these measures return only the dissimilarity between two datasets; they do not capture specific biomarkers associated with different phenotypes.</p>
<p>Long <italic>k</italic>-mers contain richer biological information and are able to depict specific signatures in nucleotide sequences (<xref ref-type="bibr" rid="B21">Wang et al., 2016</xref>). Therefore, <italic>k</italic>-mers with length &#x2265;20 bp have been used to identify biomarkers, such as sequences (<xref ref-type="bibr" rid="B2">Drouin et al., 2016</xref>; <xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>), genetic variants (<xref ref-type="bibr" rid="B6">Jaillard et al., 2018</xref>; <xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>; <xref ref-type="bibr" rid="B19">Standage et al., 2019</xref>), and genes (<xref ref-type="bibr" rid="B4">Han et al., 2017</xref>) specific to categorical phenotypes. <xref ref-type="bibr" rid="B15">Rahman et al. (2018)</xref> identified significantly differentially abundant <italic>31</italic>-mers between two human populations and then discovered single nucleotide polymorphisms (SNPs). Two long <italic>k</italic>-mer based GWAS tools were developed for bacterial genomes to detect <italic>de novo</italic> variants (<xref ref-type="bibr" rid="B6">Jaillard et al., 2018</xref>; <xref ref-type="bibr" rid="B19">Standage et al., 2019</xref>). For microbial communities, <xref ref-type="bibr" rid="B4">Han et al. (2017)</xref> predicted microbial genes in the gut associated with type II diabetes (T2D) by detecting differentially abundant <italic>21</italic>-mers. 
In our previous study, we developed a computational framework using <italic>40</italic>-mers (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>) to capture group-specific sequences between two groups of large-scale metagenomic datasets, including LC (Liver Cirrhosis)-associated (<xref ref-type="bibr" rid="B14">Qin et al., 2014</xref>), IBD (Inflammatory Bowel Disease)-associated (<xref ref-type="bibr" rid="B13">Qin et al., 2010</xref>) and WT2D (Type 2 Diabetes in Women)-associated (<xref ref-type="bibr" rid="B9">Karlsson et al., 2013</xref>) datasets. The assembled group-specific sequences have the discriminative power to separate samples of the disease group from those of the healthy group.</p>
<p>&#x201C;Group-specific&#x201D; refers to elements (<italic>k</italic>-mers, genes, species, genetic variants) that are present or rich in one group but absent or scarce in the other. Specifically, a group-specific <italic>k</italic>-mer in our study is a <italic>k</italic>-mer that, used as a single feature, separates the case and control groups with an accuracy higher than a preset threshold. Whatever the final group-specific elements are, the identification of group-specific <italic>k</italic>-mers is the common key step. It is also the most demanding step in computing time and resources. However, the tools developed in the studies mentioned above, MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>), HAWK (<xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>), Kover (<xref ref-type="bibr" rid="B2">Drouin et al., 2016</xref>) and Kevlar (<xref ref-type="bibr" rid="B19">Standage et al., 2019</xref>), require high memory, complex prerequisites (supporting environments and packages), and/or complicated deployment, as described in detail in <xref ref-type="table" rid="T1">Tables 1</xref>, <xref ref-type="table" rid="T2">2</xref>. KMC3 (<xref ref-type="bibr" rid="B10">Kokot et al., 2017</xref>) and GenomeTester4 (<xref ref-type="bibr" rid="B8">Kaplinski et al., 2015</xref>) offer set operations for <italic>k</italic>-mers. Our experiments show that they cannot return a <italic>k</italic>-mer frequency matrix and that, using a combination of set operations, they can obtain only strictly unique <italic>k</italic>-mers, i.e., those present in 100% of the samples in one group and absent from 100% of the samples in the other group. However, biological samples are highly diverse, and this strict restriction to unique <italic>k</italic>-mers would miss potentially useful <italic>k</italic>-mers that have different abundance profiles in the two groups, or that are present in most samples in one group and absent in most samples in the other group.</p>
<table-wrap position="float" id="T1">
<label>TABLE 1</label>
<caption><p>Installation and running requirements of the five tools.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center"><bold>Final purpose</bold></td>
<td valign="top" align="center"><bold>Installation requirements</bold></td>
<td valign="top" align="center"><bold>Operating system</bold></td>
<td valign="top" align="center"><bold>Running interface</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">KmerGO</td>
<td valign="top" align="center">Group-specific sequences</td>
<td valign="top" align="center">No prerequisites; No installation; One-click running.</td>
<td valign="top" align="center">Linux, Windows</td>
<td valign="top" align="center">Graphical User Interface; Command Line</td>
</tr>
<tr>
<td valign="top" align="left">MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>)</td>
<td valign="top" align="center">Group-specific sequences</td>
<td valign="top" align="center">Deployment of <italic>Spark</italic>, Python.</td>
<td valign="top" align="center">Linux</td>
<td valign="top" align="center">Command Line</td>
</tr>
<tr>
<td valign="top" align="left">HAWK (<xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>)</td>
<td valign="top" align="center">Group-specific genetic variants</td>
<td valign="top" align="center">R with <italic>foreach</italic> and <italic>doParallel</italic> packages; JELLYFISH; EIGENSTRAT; ABYSS.</td>
<td valign="top" align="center">Linux</td>
<td valign="top" align="center">Command Line</td>
</tr>
<tr>
<td valign="top" align="left">Kover (<xref ref-type="bibr" rid="B2">Drouin et al., 2016</xref>)</td>
<td valign="top" align="center">Group-specific <italic>k</italic>-mers, then mapped to genes</td>
<td valign="top" align="center">CMake; GNU C++ compiler; GNU Fortran; The HDF5 library; NumPy; Python 2.7.x; Python development headers; SciPy.</td>
<td valign="top" align="center">Linux, Mac</td>
<td valign="top" align="center">Command Line</td>
</tr>
<tr>
<td valign="top" align="left">Kevlar (<xref ref-type="bibr" rid="B19">Standage et al., 2019</xref>)</td>
<td valign="top" align="center">Group-specific genetic variants</td>
<td valign="top" align="center">Python 3 with <italic>networkx</italic> and <italic>khmer</italic> packages; <italic>pysam</italic> module; <italic>pandas</italic>, <italic>scipy</italic> and <italic>intervaltree</italic> libraries; BWA.</td>
<td valign="top" align="center">Linux</td>
<td valign="top" align="center">Command Line</td>
</tr>
</tbody>
</table></table-wrap>
<table-wrap position="float" id="T2">
<label>TABLE 2</label>
<caption><p>Testing the five tools on two groups of <italic>E. coli</italic> high-throughput sequencing datasets.</p></caption>
<table cellspacing="5" cellpadding="5" frame="hsides" rules="groups">
<thead>
<tr>
<td valign="top" align="left"></td>
<td valign="top" align="center"><bold>Memory peak</bold></td>
<td valign="top" align="center"><bold>Running time&#x002A;</bold></td>
<td valign="top" align="left"><bold>Number of group-specific <italic>k</italic>-mers</bold></td>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">KmerGO</td>
<td valign="top" align="center">305 MB</td>
<td valign="top" align="center">40 min</td>
<td valign="top" align="left">1,087 (ASS = 0.8); 6,156 (ASS = 0.7)</td>
</tr>
<tr>
<td valign="top" align="left">MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>)</td>
<td valign="top" align="center">50 MB</td>
<td valign="top" align="center">3 h</td>
<td valign="top" align="left">1,087 (ASS = 0.8); 6,156 (ASS = 0.7)</td>
</tr>
<tr>
<td valign="top" align="left">HAWK (<xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>)</td>
<td valign="top" align="center">3.91 GB</td>
<td valign="top" align="center">2.05 h</td>
<td valign="top" align="left">4,446</td>
</tr>
<tr>
<td valign="top" align="left">Kover (<xref ref-type="bibr" rid="B2">Drouin et al., 2016</xref>)</td>
<td valign="top" align="center">&#x003E;128 GB</td>
<td valign="top" align="justify" colspan="2">In the step <italic>dsk2kover</italic>, Kover was terminated by the workstation</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="justify" colspan="2">&#x2003;because the running required more than 128 GB memory.</td>
</tr>
<tr>
<td valign="top" align="left">Kevlar (<xref ref-type="bibr" rid="B19">Standage et al., 2019</xref>)</td>
<td valign="top" align="center">76.95 GB</td>
<td valign="top" align="justify" colspan="2">In the step <italic>Kevlar novel</italic>, it took Kevlar 6.7 h to process every 5,000,000 reads.</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="justify" colspan="2">&#x2003;Because the total number of reads in testing data is more than 297,000,000,</td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="justify" colspan="2">&#x2003;it would require about 400 h to finish the processing. We stopped the experiment.</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<attrib><italic>&#x002A;The running time of <italic>k</italic>-mer counting was excluded because each tool uses a different third-party <italic>k</italic>-mer counter.</italic></attrib>
</table-wrap-foot>
</table-wrap>
<p>Therefore, we developed a tool, KmerGO, to identify group-specific sequences between two groups of sequences or high-throughput sequencing datasets. We also extended KmerGO to capture trait-associated sequences for continuous traits, such as height, weight, and blood pressure. KmerGO offers a user-friendly graphical interface and runs with one click, without installation or configuration. KmerGO is computationally efficient: it uses a loser tree structure and multiple processes, has low memory requirements, and can run on a regular stand-alone server under Linux or Windows.</p>
</sec>
<sec id="S1.SS2">
<title>The Framework of KmerGO</title>
<p>KmerGO is developed in <italic>C++</italic> and <italic>Python</italic> and offers two running modes: a graphical user interface and a command line. As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, KmerGO includes four modules: producing a <italic>k</italic>-mer counting vector for each sample, obtaining the union of the <italic>k</italic>-mer counting vectors over two groups of samples (called the &#x201C;<italic>k</italic>-mer frequency matrix&#x201D; in our study), identifying group-specific <italic>k</italic>-mers, and assembling group-specific sequences. The modules that produce the <italic>k</italic>-mer frequency matrix and the group-specific <italic>k</italic>-mers run on multiple processes. The graphical interface of KmerGO is shown in the right panel of <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption><p>The diagram of KmerGO: KMC3 is adopted to obtain a <italic>k</italic>-mer counting vector for each sample. Each vector is split into <italic>n</italic> blocks, which are processed in multiple processes to calculate the union matrix over two groups of samples and to filter for group-specific <italic>k</italic>-mers. CAP3 is then used to assemble the group-specific <italic>k</italic>-mers into sequences. The right panel shows the graphical interface of KmerGO.</p></caption>
<graphic xlink:href="fmicb-11-02067-g001.tif"/>
</fig>
<sec id="S1.SS2.SSS1">
<title>Mode I: <italic>k</italic>-mer Counting</title>
<p>KMC3 (<xref ref-type="bibr" rid="B10">Kokot et al., 2017</xref>) is adopted to count the number of occurrences of each <italic>k</italic>-mer within the sequencing data, taking complementary <italic>k</italic>-mers into consideration. Only <italic>k</italic>-mers that occur at least a threshold number of times (default: 2) are kept. The <italic>k</italic>-mers are then sorted in lexicographic order by KMC3. This module produces a <italic>k</italic>-mer counting vector for each sample.</p>
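<p>The counting step above can be illustrated with a minimal in-memory sketch. It is not KMC3, which is a disk-based counter; it only mirrors the logic described here. The sketch assumes that &#x201C;taking complementary <italic>k</italic>-mers into consideration&#x201D; means counting each <italic>k</italic>-mer together with its reverse complement under one canonical form. The names <italic>canonical</italic> and <italic>count_kmers</italic> and the toy reads are hypothetical.</p>
<preformat>
```python
from collections import Counter

# Translation table for nucleotide complements (assumption: DNA alphabet only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k, min_count=2):
    """Count canonical k-mers over all reads.

    Keeps only k-mers seen at least min_count times and returns
    (kmer, count) pairs sorted in lexicographic order.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer).issubset("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return sorted((kmer, n) for kmer, n in counts.items() if n >= min_count)


# Hypothetical toy reads; real inputs are FASTA/FASTQ files.
reads = ["ACGTAC", "GTACGT", "ACGNAC"]
vector = count_kmers(reads, k=3)
print(vector)
```
</preformat>
<p>The sorted output matters for the next module: the union step relies on every sample vector being in the same lexicographic order.</p>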
</sec>
<sec id="S1.SS2.SSS2">
<title>Mode II: <italic>k</italic>-mer Frequency Matrix</title>
<p>In this module, the <italic>k</italic>-mer counts of each sample are normalized by the total number of <italic>k</italic>-mer occurrences in that sample&#x2019;s vector. Then all the <italic>k</italic>-mer vectors from the two groups are merged through <italic>union</italic> operations into a <italic>k</italic>-mer frequency matrix, with each <italic>k</italic>-mer as a row and each sample as a column, which is used for identifying group-specific <italic>k</italic>-mers. This is the most time-consuming step in most long <italic>k</italic>-mer based tools. In KmerGO, we adopt multi-process parallel computing and a loser tree algorithm to accelerate this step. The schematic diagram is shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. The sorted <italic>k</italic>-mer vectors are split across <italic>n</italic> processes (<italic>n</italic> from 1 to 256) based on the <italic>k</italic>-mer prefix in lexicographic order. For example, when <italic>n</italic> is 4, the <italic>k</italic>-mer frequency vectors are split by their initial nucleotide (A, C, G, or T). The split is implemented by moving the file pointers, using the <italic>multiprocessing</italic> package in <italic>Python</italic>. A <italic>k</italic>-mer loser tree is built and updated iteratively in each process to achieve fast <italic>k</italic>-mer comparison.</p>
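<p>As a hedged illustration of the normalization and prefix-based split described above, the sketch below works on small in-memory lists. KmerGO itself splits files by moving file pointers, which is not reproduced here. The function names and the toy vector are hypothetical, and the sketch assumes that <italic>n</italic> is a power of 4, so that <italic>n</italic> = 4 corresponds to a prefix of length 1 and <italic>n</italic> = 256 to a prefix of length 4.</p>
<preformat>
```python
from itertools import product


def normalize(vector):
    """Turn (kmer, count) pairs into (kmer, frequency) pairs that sum to 1."""
    total = sum(count for _, count in vector)
    return [(kmer, count / total) for kmer, count in vector]


def prefix_blocks(prefix_len):
    """List all nucleotide prefixes of the given length in lexicographic order."""
    return ["".join(p) for p in product("ACGT", repeat=prefix_len)]


def split_by_prefix(vector, prefix_len=1):
    """Group a sorted k-mer vector into one block per prefix (4**prefix_len blocks).

    Each block can then be merged by an independent worker process,
    because k-mers with different prefixes never need to be compared.
    """
    blocks = {prefix: [] for prefix in prefix_blocks(prefix_len)}
    for kmer, freq in vector:
        blocks[kmer[:prefix_len]].append((kmer, freq))
    return blocks


# Hypothetical counting vector for one sample.
sample = [("AAC", 6), ("CGT", 3), ("GGA", 1)]
blocks = split_by_prefix(normalize(sample), prefix_len=1)
print(blocks)
```
</preformat>
<p>Because the input vectors are already sorted, each block is itself sorted, so the blocks can be handed to the loser-tree merge without further preparation.</p>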
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption><p>The schematic diagram of producing the <italic>k</italic>-mer frequency matrix. The <italic>k</italic>-mer frequency vectors are split across <italic>n</italic> processes (<italic>n</italic> from 1 to 256) based on the <italic>k</italic>-mer prefix in lexicographic order. In each process, the loser tree is built and updated iteratively from the <italic>N</italic> sample frequency vectors. The winners of the loser tree are written to the <italic>k</italic>-mer frequency matrix.</p></caption>
<graphic xlink:href="fmicb-11-02067-g002.tif"/>
</fig>
<sec id="S1.SS2.SSS2.Px1">
<title>The description of the loser tree structure</title>
<p>The loser tree is a tournament binary tree (<xref ref-type="bibr" rid="B16">Sahili, 2004</xref>), which was originally designed for fast numerical comparison. An example of generating the frequency matrix using the loser-tree algorithm is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. For <italic>N</italic> samples, the first <italic>k</italic>-mer of each vector is read, and a binary loser tree is built with each <italic>k</italic>-mer as a leaf node. Two child nodes are compared; the winner (the smaller <italic>k</italic>-mer) is passed up to compete at the next level, and the loser (the larger <italic>k</italic>-mer) is kept in the parent node. The final winner and its frequency are written to the union frequency matrix. In each following update step, the leaf node of the previous winner is replaced by the next <italic>k</italic>-mer from the same sample. The corresponding loser nodes are then updated by comparing the new node with the nodes on its path to the root, level by level, and the final winner is written to the frequency matrix. This process is repeated until all the <italic>k</italic>-mers from all the samples have been written to the matrix.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption><p>A schematic example of obtaining the frequency matrix using the loser-tree algorithm, with a three-sample dataset. In step 1, the first <italic>k</italic>-mers of the three frequency vectors are popped out as the leaves of a loser tree. Because &#x201C;AAAAA&#x201D; = &#x201C;AAAAA,&#x201D; the &#x201C;AAAAA&#x201D; in <italic>sample 1</italic> is arbitrarily picked as the winner and the other one is kept as the loser in the Parent node. The winner &#x201C;AAAAA&#x201D; is then compared to the other leaf node &#x201C;AAAAG&#x201D;; the larger one, &#x201C;AAAAG,&#x201D; is the loser and is kept as the root node. The winner &#x201C;AAAAA&#x201D; and its corresponding frequency in <italic>sample 1</italic> are written to the frequency matrix. In step 2, the second <italic>k</italic>-mer &#x201C;AAAAT&#x201D; in <italic>sample 1</italic> is popped out to replace the previous winner node &#x201C;AAAAA.&#x201D; &#x201C;AAAAT&#x201D; is compared to the Parent node &#x201C;AAAAA&#x201D; in the second level; the &#x201C;AAAAA&#x201D; in <italic>sample 2</italic> is the winner, and the Parent node of this branch is updated to the loser &#x201C;AAAAT.&#x201D; The winner &#x201C;AAAAA&#x201D; also wins against the root node &#x201C;AAAAG,&#x201D; so the corresponding frequency of &#x201C;AAAAA&#x201D; in <italic>sample 2</italic> is updated in the frequency matrix. In step 3, the winner is &#x201C;AAAAC,&#x201D; which means that no other sample contains the <italic>k</italic>-mer &#x201C;AAAAA.&#x201D; The winner and its corresponding frequency are written to the frequency matrix.</p></caption>
<graphic xlink:href="fmicb-11-02067-g003.tif"/>
</fig>
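<p>The tournament described above can be sketched in a few lines of Python. This is an illustrative in-memory version written for this text, not KmerGO&#x2019;s implementation: the function name <italic>merge_to_matrix</italic>, the array layout of the tree, the tie-breaking rule, and the toy frequencies are all assumptions. Internal nodes store the index of the losing sample, and position 0 stores the overall winner.</p>
<preformat>
```python
def merge_to_matrix(vectors):
    """Union-merge sorted (kmer, freq) vectors into rows (kmer, [freq per sample])."""
    k = len(vectors)
    rows = []
    if k == 0:
        return rows
    iters = [iter(v) for v in vectors]
    heads = [next(it, None) for it in iters]  # current entry per sample, None if exhausted
    tree = [0] * k  # tree[0] = overall winner; tree[1:] = losers of internal matches

    def beats(a, b):
        """True if leaf a wins against leaf b (smaller k-mer wins, exhausted leaves lose)."""
        if heads[a] is None:
            return False
        if heads[b] is None:
            return True
        return heads[b][0] >= heads[a][0]

    def build(node):
        """Play the initial tournament below node; store losers, return the winner."""
        if node >= k:  # positions k..2k-1 are the leaves
            return node - k
        left, right = build(2 * node), build(2 * node + 1)
        winner, loser = (left, right) if beats(left, right) else (right, left)
        tree[node] = loser
        return winner

    def replay(leaf):
        """Re-play only the matches on the path from a refreshed leaf to the root."""
        winner = leaf
        node = (leaf + k) // 2
        while node >= 1:
            if beats(tree[node], winner):
                winner, tree[node] = tree[node], winner
            node //= 2
        tree[0] = winner

    tree[0] = build(1)
    while heads[tree[0]] is not None:
        s = tree[0]
        kmer, freq = heads[s]
        if not rows or rows[-1][0] != kmer:  # a new winner k-mer starts a new row
            rows.append((kmer, [0.0] * k))
        rows[-1][1][s] = freq
        heads[s] = next(iters[s], None)  # replace the winner by the next k-mer of that sample
        replay(s)
    return rows


# Toy vectors loosely following the three-sample example; frequencies are made up.
vectors = [
    [("AAAAA", 0.2), ("AAAAT", 0.1)],
    [("AAAAA", 0.3), ("AAAAC", 0.4)],
    [("AAAAG", 0.5)],
]
for row in merge_to_matrix(vectors):
    print(row)
```
</preformat>
<p>Only one entry per sample is held in memory at any time, which is the property that keeps the memory footprint proportional to the number of samples rather than to the number of <italic>k</italic>-mers.</p>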
</sec>
<sec id="S1.SS2.SSS2.Px2">
<title>The complexity analysis of the loser tree structure</title>
<p>We assume that the number of <italic>k</italic>-mers in each sample is <italic>M</italic> and the number of samples is <italic>N</italic>. For <italic>k</italic> = 40, <italic>M</italic> is between 10<sup>8</sup> and 10<sup>9</sup>. (1) The loser tree structure has a significant advantage in space complexity. In each iteration, the loser tree reads one <italic>k</italic>-mer from each sample and stores the <italic>k</italic>-mers in a binary tree, so about 2<italic>N</italic> <italic>k</italic>-mers are held in memory in total. Therefore, the space complexity is <italic>S</italic>(<italic>M</italic>,<italic>N</italic>) = <italic>O</italic>(<italic>N</italic>), which is why the peak memory of KmerGO is only about 300 MB for terabyte-scale datasets. In comparison, GenomeTester4 obtains the <italic>k</italic>-mer union set over groups of samples using pair-by-pair union operations. To avoid frequent hard-disk read and write operations, the files needed for the following iterations should be kept in memory. Therefore, the space complexity of the pairwise union algorithm is <italic>S</italic>(<italic>M</italic>,<italic>N</italic>) = <italic>O</italic>(<italic>M</italic><italic>N</italic>), where the number of <italic>k</italic>-mers <italic>M</italic> increases exponentially with the <italic>k</italic>-mer length <italic>k</italic>. In addition, if we want to reduce the space complexity from <italic>O</italic>(<italic>M</italic><italic>N</italic>) to <italic>O</italic>(&#x03BC;<italic>M</italic><italic>N</italic>) (0 &#x003C; &#x03BC; &#x003C; 1), the hard-disk read-write time complexity will increase from <italic>O</italic>(<italic>M</italic><italic>N</italic>) to <italic>O</italic>(<italic>M</italic><italic>N</italic> + (log<sub>2</sub><italic>N</italic>&#x2212;1)(1&#x2212;&#x03BC;)<italic>M</italic><italic>N</italic>). 
(2) The time complexity of the loser tree in KmerGO is <italic>T</italic>(<italic>M</italic>,<italic>N</italic>) = <italic>O</italic>(<italic>M</italic><italic>N</italic>log<sub>2</sub><italic>N</italic>). In each iteration, when a new <italic>k</italic>-mer replaces the node popped out in the previous iteration, it only needs to be compared hierarchically with the nodes on its path to the root, so only log<sub>2</sub><italic>N</italic> comparisons are required. In contrast, if the new <italic>k</italic>-mer were compared directly with the remaining (<italic>N</italic>&#x2212;1) <italic>k</italic>-mers to find the smallest <italic>k</italic>-mer instead of using the loser tree, each step would cost <italic>O</italic>(<italic>N</italic>) comparisons. The overall time complexity would then be <italic>T</italic>(<italic>M</italic>,<italic>N</italic>) = <italic>O</italic>(<italic>M</italic><italic>N</italic><sup>2</sup>), which is larger than the <italic>T</italic>(<italic>M</italic>,<italic>N</italic>) = <italic>O</italic>(<italic>M</italic><italic>N</italic>log<sub>2</sub><italic>N</italic>) of the loser tree.</p>
<p>Therefore, the loser tree is better than the pairwise union strategy in space complexity, and better than direct comparison among all the samples in time complexity. Furthermore, as soon as the final winner <italic>k</italic>-mer differs from the winner of the previous iteration, the union row of the previous <italic>k</italic>-mer is complete, so there is no need to traverse all the samples. Once the loser tree has been built for the first <italic>k</italic>-mers from the <italic>N</italic> samples, only the nodes affected by each new incoming <italic>k</italic>-mer need to be updated.</p>
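<p>To make the difference in per-step cost concrete, the following small calculation compares the number of comparisons needed to find the next smallest <italic>k</italic>-mer with a loser tree and with a direct scan. The function name and the chosen sample sizes are hypothetical; the counts follow the log<sub>2</sub><italic>N</italic> and <italic>N</italic>&#x2212;1 figures used in the analysis above.</p>
<preformat>
```python
import math


def comparisons_per_kmer(n_samples):
    """Comparisons needed to find the next smallest k-mer among n_samples heads."""
    loser_tree = math.ceil(math.log2(n_samples))  # one match per tree level
    linear_scan = n_samples - 1                   # compare against every other head
    return loser_tree, linear_scan


for n in (4, 64, 256):
    tree_cost, scan_cost = comparisons_per_kmer(n)
    print(n, tree_cost, scan_cost)
```
</preformat>
<p>For 64 samples, for instance, this gives 6 comparisons per <italic>k</italic>-mer for the loser tree against 63 for the direct scan, and the gap widens as the number of samples grows.</p>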
</sec>
</sec>
<sec id="S1.SS2.SSS3">
<title>Mode III: Group-Specific <italic>k</italic>-mer Identification</title>
<p>In the group-specific <italic>k</italic>-mer module, <italic>k</italic>-mers that are absent from more than 80% of the control samples and from more than 80% of the case samples are removed. KmerGO uses the strategy from MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>) to identify group-specific <italic>k</italic>-mers, because the performance of this strategy was evaluated and validated in that study. The strategy is briefly described as follows. A <italic>k</italic>-mer is considered group-specific if it meets either of the following criteria: (1) the average of sensitivity and specificity (ASS) for classifying cases versus controls using the presence or absence of the <italic>k</italic>-mer in the sequencing data is higher than a preset threshold; (2) the difference in the <italic>k</italic>-mer&#x2019;s frequencies between the two groups is statistically significant, with a <italic>p</italic>-value below a preset threshold (e.g., 0.05) based on the <italic>Wilcoxon rank sum test</italic>, and the ASS obtained using logistic regression is higher than a preset threshold. A detailed description of the identification of group-specific <italic>k</italic>-mers can be found in Section 2 of MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>).</p>
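<p>Criterion (1) above can be sketched as follows. This is a simplified reading, not the MetaGO or KmerGO code: the function names, the rule of testing both directions (presence marking cases, or presence marking controls), the use of a greater-than-or-equal comparison, and the example frequencies are assumptions. The threshold of 0.8 is one of the ASS values used in the comparison experiments. Criterion (2), which relies on the Wilcoxon rank sum test and logistic regression, is not covered here.</p>
<preformat>
```python
def ass_presence(positive_freqs, negative_freqs):
    """Average of sensitivity and specificity when 'k-mer present' predicts the positive group.

    Assumes both groups are non-empty; a frequency of 0 means the k-mer is absent.
    """
    sensitivity = sum(f > 0 for f in positive_freqs) / len(positive_freqs)
    specificity = sum(f == 0 for f in negative_freqs) / len(negative_freqs)
    return (sensitivity + specificity) / 2


def is_group_specific(case_freqs, control_freqs, threshold=0.8):
    """Flag a k-mer if its presence marks either group with ASS at or above the threshold."""
    ass_case = ass_presence(case_freqs, control_freqs)
    ass_control = ass_presence(control_freqs, case_freqs)
    return max(ass_case, ass_control) >= threshold


# Hypothetical row of the k-mer frequency matrix, split by group.
case_freqs = [0.2, 0.1, 0.3, 0.0, 0.4]      # present in 4 of 5 cases
control_freqs = [0.0, 0.0, 0.0, 0.1, 0.0]   # absent in 4 of 5 controls
print(ass_presence(case_freqs, control_freqs))
print(is_group_specific(case_freqs, control_freqs))
```
</preformat>
<p>In this toy row the <italic>k</italic>-mer is present in four of five cases and absent from four of five controls, so it would be kept at a threshold of 0.8 even though it is not strictly unique to one group.</p>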
<p>Furthermore, KmerGO is extended to capture trait-associated sequences for datasets with a continuous trait. The processing strategy is again composed of two parts: (1) For the presence/absence of a <italic>k</italic>-mer, we compare the distribution of trait values of the individuals having the <italic>k</italic>-mer with that of the individuals not having the <italic>k</italic>-mer using the <italic>Wilcoxon rank sum test.</italic> The <italic>k</italic>-mer is considered trait-associated if the resulting <italic>p</italic>-value is less than a preset threshold; (2) For <italic>k</italic>-mer abundance, we calculate the <italic>Spearman&#x2019;s rank correlation coefficient</italic> between the <italic>k</italic>-mer&#x2019;s frequencies and the trait values of the samples. If the correlation coefficient is higher than a preset threshold, the <italic>k</italic>-mer is considered trait-associated.</p>
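<p>Part (2) of the continuous-trait strategy can be illustrated with a small, dependency-free computation of Spearman&#x2019;s rank correlation coefficient, here obtained as the Pearson correlation of average ranks. The function names, the example frequencies, and the trait values are hypothetical, and the sketch assumes that neither input is constant. In practice a statistics library would normally be used instead.</p>
<preformat>
```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while len(values) > start:
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while len(values) > end + 1 and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical k-mer frequencies and trait values (e.g., height in cm) for five samples.
freqs = [0.1, 0.4, 0.2, 0.8, 0.5]
trait = [150, 165, 158, 180, 170]
print(spearman(freqs, trait))
```
</preformat>
<p>In this toy example the frequencies and trait values have the same ordering, so the coefficient is 1 and the <italic>k</italic>-mer would pass any correlation threshold below that value.</p>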
</sec>
<sec id="S1.SS2.SSS4">
<title>Mode IV: Identifying Group-Specific Sequences Through Group-Specific <italic>k</italic>-mer Assembly</title>
<p>In the module of group-specific sequence assembly, KmerGO uses CAP3 (<xref ref-type="bibr" rid="B5">Huang and Madan, 1999</xref>) to assemble the identified group-specific <italic>k</italic>-mers into sequences, using only the overlap information between <italic>k</italic>-mers. The specific parameter settings for CAP3 are shown in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
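<p>To give an intuition for how overlap information alone can join <italic>k</italic>-mers into longer sequences, the following Python sketch greedily merges exact suffix/prefix overlaps. It is a deliberately simplified stand-in and does not reproduce the CAP3 algorithm or its parameters: it ignores reverse complements and mismatches, and the function names and the <italic>min_overlap</italic> parameter are hypothetical.</p>

```python
def overlap_size(left, right, min_overlap):
    """Length of the longest suffix of left that equals a prefix of right,
    or 0 if no exact overlap of at least min_overlap bases exists."""
    longest = min(len(left), len(right)) - 1
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(kmers, min_overlap):
    """Repeatedly merge the first overlapping pair until no pair is left."""
    if not min_overlap >= 1:
        raise ValueError("min_overlap must be at least 1")
    contigs = list(dict.fromkeys(kmers))  # drop duplicates, keep input order
    while True:
        pair = None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                size = overlap_size(left, right, min_overlap) if i != j else 0
                if size:
                    pair = (i, j, size)
                    break
            if pair:
                break
        if pair is None:
            return contigs
        i, j, size = pair
        # Append the non-overlapping tail of the right sequence to the left one.
        contigs[i] = contigs[i] + contigs[j][size:]
        del contigs[j]


# Three hypothetical 4-mers that overlap by 3 bases each.
print(greedy_assemble(["ACGT", "CGTA", "GTAC"], min_overlap=3))
```

<p>With the three example 4-mers, the sketch joins them into the single sequence ACGTAC. Sequences that share no sufficiently long overlap are returned unmerged.</p>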
</sec>
</sec>
</sec>
<sec id="S2">
<title>The Functions of KmerGO</title>
<p>KmerGO supports end-to-end running or mid-way input and output. Therefore, KmerGO can be used in the following three situations:</p>
<list list-type="simple">
<list-item>
<label>&#x2022;</label>
<p>Identify group-specific/trait-associated <italic>k</italic>-mers/sequences for categorical or continuous traits from long sequences or high-throughput sequencing datasets. The group-specific/trait-associated <italic>k</italic>-mers/sequences can be used for the follow-up discovery of biomarkers, such as genetic variants, species, or genes.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>Obtain the union matrix of <italic>k</italic>-mer frequency vectors from multiple high-throughput sequencing datasets or multiple files with long sequences, where the input can be sequencing files (.fasta, .fastq, .fasta.gz, .fastq.gz) or frequency vectors in text format from the KMC tool. Users can also run KmerGO to obtain the union matrix and then filter for group-specific <italic>k</italic>-mers with their own strategies.</p>
</list-item>
<list-item>
<label>&#x2022;</label>
<p>Output group-specific elements for a matrix composed of the features from two groups of samples. The features could be the abundances, frequencies or other quantified features.</p>
</list-item>
</list>
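<p>The union matrix mentioned in the second situation can be pictured with a small in-memory Python sketch. It assumes that each text frequency vector holds one <italic>k</italic>-mer and its count per line, separated by whitespace; this layout, the function names, and the sample names are assumptions made for illustration. The sketch also holds everything in memory, so it does not reflect how KmerGO keeps its memory footprint low on large datasets.</p>

```python
def parse_frequency_lines(lines):
    """Read 'kmer count' lines (whitespace separated) into a dictionary.
    Lines that do not have exactly two fields are skipped."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            counts[fields[0]] = int(fields[1])
    return counts


def union_matrix(sample_counts):
    """sample_counts maps a sample name to its k-mer -> count dictionary.
    Returns (sample_names, rows); each row is (kmer, [count per sample]),
    with 0 for samples in which the k-mer does not occur."""
    sample_names = sorted(sample_counts)
    all_kmers = sorted(set().union(*(sample_counts[s] for s in sample_names)))
    rows = []
    for kmer in all_kmers:
        rows.append((kmer, [sample_counts[s].get(kmer, 0) for s in sample_names]))
    return sample_names, rows


# Two hypothetical samples with 3-mers for brevity.
samples = {
    "s1": parse_frequency_lines(["ACG\t2", "CGT\t1"]),
    "s2": parse_frequency_lines(["CGT\t4", "GTA\t3"]),
}
names, rows = union_matrix(samples)
print(names)
for kmer, counts in rows:
    print(kmer, counts)
```

<p>Each row of the result is the frequency vector of one <italic>k</italic>-mer across all samples, which is the form that a user-defined filtering strategy would operate on.</p>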
</sec>
<sec id="S3">
<title>Comparison With Four Other Tools in Identifying Group-Specific <italic>k</italic>-mers</title>
<p>Although MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>), HAWK (<xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>), Kover (<xref ref-type="bibr" rid="B2">Drouin et al., 2016</xref>), and Kevlar (<xref ref-type="bibr" rid="B19">Standage et al., 2019</xref>) were designed for identifying different group-specific elements, all of them include the key step of identifying group-specific <italic>k</italic>-mers. Therefore, KmerGO and the four tools were installed and applied to a testing dataset for comparison. The testing dataset is taken from the testing experiment of HAWK (<xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>). We installed and ran the five tools on a stand-alone Linux workstation.</p>
<sec id="S3.SS1">
<title>Installations and Running Requirements</title>
<p>KmerGO requires no installation or environment configuration. It is run directly from the executable file. The other four tools have different prerequisites in terms of supporting environments, packages, and/or complicated deployments. KmerGO is the only one of the five tools to offer a GUI (graphical user interface). Details of the installation and running requirements are shown in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
</sec>
<sec id="S3.SS2">
<title>Experiments on a Testing Dataset</title>
<p>The testing dataset comprises high-throughput sequencing data of 241 <italic>Escherichia coli</italic> strains. The dataset was previously used to test the performance of HAWK (<xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>). The two groups are 189 <italic>E. coli</italic> strains resistant to ampicillin and 52 <italic>E. coli</italic> strains sensitive to ampicillin. The size of the dataset is 116 GB in .fasta format. The five tools were applied to the testing dataset to identify the group-specific <italic>31</italic>-mers on a workstation with a regular configuration: an <italic>Intel Xeon E5-2620 v4</italic> CPU (2.10 GHz, 8 cores, 16 threads) and 128 GB of memory. We set <italic>k</italic> = 31 because 31 is the default setting for most of the tools. When recording the running time, we excluded the <italic>k</italic>-mer counting step (because the tools integrate different third-party counters) and the steps after the identification of group-specific <italic>k</italic>-mers. On our testing workstation, only KmerGO, MetaGO, and HAWK finished running on the testing data and output group-specific <italic>k</italic>-mers. As shown in <xref ref-type="table" rid="T2">Table 2</xref>, KmerGO took 40 min with a peak memory of only 305 MB for the 116 GB dataset. By contrast, HAWK took 2 h with a peak memory of 3.91 GB. Although the peak memory of MetaGO was only 50 MB, its running time was 3 h, longer than that of KmerGO and HAWK. The running details of Kover and Kevlar are given in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
</sec>
<sec id="S3.SS3">
<title>Comparison of Group-Specific <italic>k</italic>-Mers Identified by KmerGO and HAWK</title>
<p>Because KmerGO adopts the same filtering strategy for group-specific <italic>k</italic>-mers as MetaGO, their output results are the same. We therefore compared the group-specific <italic>k</italic>-mers identified by KmerGO and HAWK. The dataset consists of two groups of <italic>E. coli</italic> strains, resistant and sensitive to ampicillin, respectively. As shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, when the ASS threshold is set to 0.8, KmerGO identified 1,087 resistant-specific <italic>31</italic>-mers, all of which are included in the 4,446 resistant-specific <italic>31</italic>-mers identified by HAWK. When the ASS threshold is relaxed to 0.7, KmerGO output 6,156 resistant-specific <italic>31</italic>-mers, 3,263 of which overlap with the results of HAWK. Neither KmerGO nor HAWK found any sensitive-specific <italic>31</italic>-mers. This result is consistent with the analysis in the original paper on the dataset (<xref ref-type="bibr" rid="B3">Earle et al., 2016</xref>), which reported that resistance is caused by the presence of the &#x03B2;-lactamase genes <italic>bla</italic><sub>TEM</sub>. Therefore, no control-associated (sensitive-specific) markers would be expected (<xref ref-type="bibr" rid="B3">Earle et al., 2016</xref>). The difference between the group-specific <italic>k</italic>-mers identified by KmerGO and HAWK arises because the two tools use different filtering strategies. The objective of HAWK is to find SNPs from a single genome that distinguish cases from controls. HAWK computes <italic>p</italic>-values using a likelihood ratio test assuming Poisson distributions for the numbers of occurrences of <italic>k</italic>-mers in both cases and controls, and then adjusts the <italic>p</italic>-values based on the first ten principal components of the numbers of occurrences of <italic>k</italic>-mers to correct for population stratification. KmerGO outputs group-specific <italic>k</italic>-mers, each of which has the power to distinguish the two groups. Therefore, the different objectives of the two tools lead to the differences in their results.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption><p>The Venn diagram of group-specific <italic>k</italic>-mers identified by HAWK and KmerGO with <bold>(A)</bold> ASS = 0.8 and <bold>(B)</bold> ASS = 0.7.</p></caption>
<graphic xlink:href="fmicb-11-02067-g004.tif"/>
</fig>
<p>Compared with MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>), HAWK (<xref ref-type="bibr" rid="B15">Rahman et al., 2018</xref>), Kover (<xref ref-type="bibr" rid="B2">Drouin et al., 2016</xref>), and Kevlar (<xref ref-type="bibr" rid="B19">Standage et al., 2019</xref>), KmerGO has several advantages: (1) KmerGO can be run with one-click installation under Windows and Linux operating systems, free from any environmental configuration and deployment. (2) KmerGO offers both a graphical user interface and a command-line mode, which support easy use by biologists and HPC job submission, respectively. (3) On the testing dataset, KmerGO was faster than the other tools that finished running, with a much lower memory requirement than HAWK. (4) KmerGO is applicable to both genomic and metagenomic data, and to both long sequences and high-throughput sequencing data.</p>
</sec>
</sec>
<sec id="S4">
<title>Comparison With KMC3 and GenomeTester4: Two Additional Tools to Identify Unique <italic>k</italic>-mers</title>
<p>KMC3 (<xref ref-type="bibr" rid="B10">Kokot et al., 2017</xref>) and GenomeTester4 (<xref ref-type="bibr" rid="B8">Kaplinski et al., 2015</xref>) integrated the <italic>k</italic>-mer counting and set operations. According to their available options, KMC3 and GenomeTester4 cannot obtain the <italic>k</italic>-mer frequency matrix for multiple <italic>k</italic>-mer vectors. Instead, they can only output the <italic>k</italic>-mer union set and the corresponding sum/min/max of their frequencies for multiple <italic>k</italic>-mer frequency vectors due to their processing data structures.</p>
<p>Furthermore, KMC3 and GenomeTester4 can only identify strictly defined unique <italic>k</italic>-mers. The difference between a unique and a group-specific <italic>k</italic>-mer is that a unique <italic>k</italic>-mer is required to be present in all the samples of one group but absent from all the samples of the other group. When the threshold ASS = 1, a group-specific <italic>k</italic>-mer is a unique <italic>k</italic>-mer. Therefore, the set of unique <italic>k</italic>-mers is a special case of the set of group-specific <italic>k</italic>-mers.</p>
<p>The basic idea used by KMC3 and GenomeTester4 to obtain unique <italic>k</italic>-mers can be described as follows. Let <italic>A</italic><sub><italic>i</italic></sub> and <italic>B</italic><sub><italic>j</italic></sub> be the numbers of occurrences of a certain <italic>k</italic>-mer in sample <italic>i</italic> of group A and sample <italic>j</italic> of group B, respectively. If <inline-formula><mml:math id="INEQ20"><mml:mrow><mml:mrow><mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>B</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>B</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula>, the current <italic>k</italic>-mer is unique to group A; if <inline-formula><mml:math id="INEQ21"><mml:mrow><mml:mrow><mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>A</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow><mml:mo>+</mml:mo><mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>B</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>B</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:munder><mml:mo movablelimits="false">&#x2211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x2208;</mml:mo><mml:mi>B</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mi>B</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mrow></mml:math></inline-formula>, the current <italic>k</italic>-mer is unique to group B. In KMC3, the idea is implemented by the combination of &#x201C;<italic>kmc_tools</italic>,&#x201D; &#x201C;<italic>union</italic>,&#x201D; &#x201C;<italic>intersection</italic>,&#x201D; &#x201C;<italic>counters_subtract</italic>,&#x201D; and &#x201C;<italic>kmers_subtract</italic>.&#x201D; In GenomeTester4, the idea is implemented by &#x201C;<italic>glistcompare</italic>,&#x201D; &#x201C;<italic>union</italic>,&#x201D; &#x201C;<italic>intersection</italic>,&#x201D; and &#x201C;<italic>diff_union</italic>.&#x201D; The running scripts for KMC3 and GenomeTester4 are available in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
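<p>The sum criterion above can be written as a short Python sketch. It takes the per-sample occurrence counts of one <italic>k</italic>-mer in the two groups and assumes that counts are non-negative integers; the function name is hypothetical, and the sketch does not call KMC3 or GenomeTester4.</p>

```python
def unique_group(counts_a, counts_b):
    """Apply the sum criterion to one k-mer.

    counts_a and counts_b hold the occurrences of the k-mer in each sample
    of group A and group B. Returns "A", "B", or None.
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    # sum(A) + sum(B) == sum(A) holds exactly when the k-mer never occurs in B.
    if total_a > 0 and total_a + total_b == total_a:
        return "A"
    # Symmetric case: the k-mer never occurs in group A.
    if total_b > 0 and total_a + total_b == total_b:
        return "B"
    return None


print(unique_group([2, 1, 4], [0, 0]))  # occurs only in group A
print(unique_group([2, 1, 4], [0, 1]))  # a single occurrence in B breaks it
```

<p>The second call shows why this filter is so strict: one occurrence in a single sample of the other group is enough for the <italic>k</italic>-mer to be rejected, whereas an ASS threshold below 1 would tolerate it.</p>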
<p>KMC3 and GenomeTester4 were also applied to the high-throughput sequencing data of the 241 <italic>Escherichia coli</italic> strains that were tested with the five tools in the previous section. In contrast to the 6,156 and 4,446 resistant-specific <italic>k</italic>-mers identified by KmerGO and HAWK, respectively, KMC3 and GenomeTester4 did not find any <italic>k</italic>-mer unique to the resistant group. The experiment demonstrates that KMC3 and GenomeTester4 can only implement highly inflexible filtering. However, biological individuals are highly diverse, so this strict limitation would miss potentially useful <italic>k</italic>-mers that have consistent characteristics in most cases rather than in all cases.</p>
</sec>
<sec id="S5">
<title>Application of KmerGO on a Large-Scale Metagenomic Sequencing Dataset</title>
<p>KmerGO was also applied to the large-scale metagenomic liver cirrhosis-associated dataset (<xref ref-type="bibr" rid="B14">Qin et al., 2014</xref>), which was previously tested with MetaGO (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>). The dataset includes 66 liver cirrhosis patients and 56 healthy controls sequenced on the Illumina HiSeq 2000 platform, with a total file size of 1.07 TB in .fasta format. On the regular stand-alone workstation with an <italic>Intel(R) Xeon(R) E5-2620 v4</italic> CPU and 128 GB of memory, with <italic>k</italic> = 40, it took 21.5 h to identify the group-specific sequences, including 4 h of <italic>k</italic>-mer counting by KMC3 and 17.5 h (16 processes) for obtaining the union matrix and identifying the group-specific <italic>k</italic>-mers. The peak memory was no more than 1 GB. The output of KmerGO is identical to that of MetaGO, whose effectiveness, performance, and biological implications were validated in the MetaGO paper (<xref ref-type="bibr" rid="B20">Wang et al., 2018</xref>).</p>
</sec>
<sec id="S6">
<title>Conclusion</title>
<p>Group-specific nucleotide sequences offer important information for understanding the differences between two groups of genomic/metagenomic samples. Free from reference sequences, assembly, and alignment, KmerGO identifies group-specific/trait-associated <italic>k</italic>-mers (<italic>k</italic> up to 40 bp) and returns the assembled group-specific/trait-associated sequences. The identified <italic>k</italic>-mers have discriminant power, paving the way for a new paradigm of biomarker discovery for different phenotypes.</p>
<p>Free from any pre-installed supporting environments, packages, and complex configurations, KmerGO offers a graphical user interface, launched by directly running the executable file on Linux and Windows, and a command-line mode for job submission in HPC environments, which makes it friendly to users from various backgrounds. Through multi-processing parallel computing, KmerGO is highly time-efficient with low requirements for computational resources (CPU, memory). On a regular stand-alone workstation, KmerGO took a total of 21.5 h to output group-specific <italic>k</italic>-mers for two groups of high-throughput sequencing data totaling 1.07 TB (.fasta).</p>
<p>KmerGO is suitable for both long sequences and high-throughput sequencing data. Because it supports end-to-end running as well as mid-way input and output, KmerGO can also be used to obtain the union matrix of the <italic>k</italic>-mer frequency vectors of a large number of samples, or to filter the group-specific elements from a feature matrix composed of two groups of samples with quantified values. The group-specific <italic>k</italic>-mers or sequences output by KmerGO can serve as the input of other tools for the subsequent discovery of biomarkers, such as genetic variants, species, or genes.</p>
</sec>
<sec id="S7">
<title>Data Availability Statement</title>
<p>All datasets presented in this study are included in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
</sec>
<sec id="S8">
<title>Author Contributions</title>
<p>YW and FS designed KmerGO. QC developed and implemented KmerGO. CD and YZ proposed the algorithms for KmerGO. All authors read and approved the final manuscript.</p>
</sec>
<sec id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<fn-group>
<fn fn-type="financial-disclosure">
<p><bold>Funding.</bold> This research was supported by the National Natural Science Foundation of China (61673324), National Key Research and Development Program of China (2018YFD0901401), Natural Science Foundation of Fujian (2018J01097), and Open Fund of Engineering Research Center for Medical Data Mining and Application of Fujian Province (MDM2018002).</p>
</fn>
</fn-group>
<sec id="S10" sec-type="supplementary material"><title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fmicb.2020.02067/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fmicb.2020.02067/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Table_1.DOCX" id="SM1" mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deorowicz</surname> <given-names>S.</given-names></name> <name><surname>Gudy&#x015B;</surname> <given-names>A.</given-names></name> <name><surname>D&#x0142;ugosz</surname> <given-names>M.</given-names></name> <name><surname>Kokot</surname> <given-names>M.</given-names></name> <name><surname>Danek</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>Kmer-db: instant evolutionary distance estimation.</article-title> <source><italic>Bioinformatics</italic></source> <volume>35</volume> <fpage>133</fpage>&#x2013;<lpage>136</lpage>.</citation></ref>
<ref id="B2"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Drouin</surname> <given-names>A.</given-names></name> <name><surname>Gigu&#x00E8;re</surname> <given-names>S.</given-names></name> <name><surname>D&#x00E9;raspe</surname> <given-names>M.</given-names></name> <name><surname>Marchand</surname> <given-names>M.</given-names></name> <name><surname>Tyers</surname> <given-names>M.</given-names></name> <name><surname>Loo</surname> <given-names>V. G.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Predictive computational phenotyping and biomarker discovery using reference-free genome comparisons.</article-title> <source><italic>BMC Genomics</italic></source> <volume>17</volume>:<issue>754</issue>. <pub-id pub-id-type="doi">10.1186/s12864-016-2889-6</pub-id></citation></ref>
<ref id="B3"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Earle</surname> <given-names>S. G.</given-names></name> <name><surname>Wu</surname> <given-names>C.-H.</given-names></name> <name><surname>Charlesworth</surname> <given-names>J.</given-names></name> <name><surname>Stoesser</surname> <given-names>N.</given-names></name> <name><surname>Gordon</surname> <given-names>N. C.</given-names></name> <name><surname>Walker</surname> <given-names>T. M.</given-names></name></person-group> (<year>2016</year>). <article-title>Identifying lineage effects when controlling for population structure improves power in bacterial association studies.</article-title> <source><italic>Nat. Microbiol.</italic></source> <volume>1</volume>:<issue>16041</issue>.</citation></ref>
<ref id="B4"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>M.</given-names></name> <name><surname>Ye</surname> <given-names>Y.</given-names></name></person-group> (<year>2017</year>). &#x201C;<article-title>A concurrent subtractive assembly approach for identification of disease associated sub-metagenomes</article-title>,&#x201D; in <source><italic>Proceedings of the International Conference on Research in Computational Molecular Biology</italic></source>, <publisher-loc>Hong Kong</publisher-loc>.</citation></ref>
<ref id="B5"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>X.</given-names></name> <name><surname>Madan</surname> <given-names>A.</given-names></name></person-group> (<year>1999</year>). <article-title>CAP3: a DNA sequence assembly program.</article-title> <source><italic>Genome Res.</italic></source> <volume>9</volume> <fpage>868</fpage>&#x2013;<lpage>877</lpage>. <pub-id pub-id-type="doi">10.1101/gr.9.9.868</pub-id> <pub-id pub-id-type="pmid">10508846</pub-id></citation></ref>
<ref id="B6"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jaillard</surname> <given-names>M.</given-names></name> <name><surname>Lima</surname> <given-names>L.</given-names></name> <name><surname>Tournoud</surname> <given-names>M.</given-names></name> <name><surname>Mah&#x00E9;</surname> <given-names>P.</given-names></name> <name><surname>van Belkum</surname> <given-names>A.</given-names></name> <name><surname>Lacroix</surname> <given-names>V.</given-names></name><etal/></person-group> (<year>2018</year>). <article-title>A fast and agnostic method for bacterial genome-wide association studies: bridging the gap between k-mers and genetic events.</article-title> <source><italic>PLoS Genet.</italic></source> <volume>14</volume>:<issue>e1007758</issue>. <pub-id pub-id-type="doi">10.1371/journal.pgen.1007758</pub-id> <pub-id pub-id-type="pmid">30419019</pub-id></citation></ref>
<ref id="B7"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>B.</given-names></name> <name><surname>Song</surname> <given-names>K.</given-names></name> <name><surname>Ren</surname> <given-names>J.</given-names></name> <name><surname>Deng</surname> <given-names>M.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name></person-group> (<year>2012</year>). <article-title>Comparison of metagenomic samples using sequence signatures.</article-title> <source><italic>BMC Genomics</italic></source> <volume>13</volume>:<issue>730</issue>. <pub-id pub-id-type="doi">10.1186/1471-2164-13-730</pub-id> <pub-id pub-id-type="pmid">23268604</pub-id></citation></ref>
<ref id="B8"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaplinski</surname> <given-names>L.</given-names></name> <name><surname>Lepamets</surname> <given-names>M.</given-names></name> <name><surname>Remm</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>GenomeTester4: a toolkit for performing basic set operations - union, intersection and complement on k-mer lists.</article-title> <source><italic>Gigascience</italic></source> <volume>4</volume>:<issue>58</issue>.</citation></ref>
<ref id="B9"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Karlsson</surname> <given-names>F. H.</given-names></name> <name><surname>Tremaroli</surname> <given-names>V.</given-names></name> <name><surname>Nookaew</surname> <given-names>I.</given-names></name> <name><surname>Bergstr&#x00F6;m</surname> <given-names>G.</given-names></name> <name><surname>Behre</surname> <given-names>C. J.</given-names></name> <name><surname>Fagerberg</surname> <given-names>B.</given-names></name><etal/></person-group> (<year>2013</year>). <article-title>Gut metagenome in European women with normal, impaired and diabetic glucose control.</article-title> <source><italic>Nature</italic></source> <volume>498</volume> <fpage>99</fpage>&#x2013;<lpage>103</lpage>. <pub-id pub-id-type="doi">10.1038/nature12198</pub-id> <pub-id pub-id-type="pmid">23719380</pub-id></citation></ref>
<ref id="B10"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kokot</surname> <given-names>M.</given-names></name> <name><surname>D&#x0142;ugosz</surname> <given-names>M.</given-names></name> <name><surname>Deorowicz</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>KMC 3: counting and manipulating k-mer statistics.</article-title> <source><italic>Bioinformatics</italic></source> <volume>33</volume> <fpage>2759</fpage>&#x2013;<lpage>2761</lpage>. <pub-id pub-id-type="doi">10.1093/bioinformatics/btx304</pub-id> <pub-id pub-id-type="pmid">28472236</pub-id></citation></ref>
<ref id="B11"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liao</surname> <given-names>W.</given-names></name> <name><surname>Ren</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>K.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Zeng</surname> <given-names>F.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Alignment-free transcriptomic and metatranscriptomic comparison using sequencing signatures with variable length markov chains.</article-title> <source><italic>Sci. Rep.</italic></source> <volume>6</volume>:<issue>37243</issue>.</citation></ref>
<ref id="B12"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ondov</surname> <given-names>B. D.</given-names></name> <name><surname>Treangen</surname> <given-names>T. J.</given-names></name> <name><surname>Melsted</surname> <given-names>P.</given-names></name> <name><surname>Mallonee</surname> <given-names>A. B.</given-names></name> <name><surname>Bergman</surname> <given-names>N. H.</given-names></name> <name><surname>Koren</surname> <given-names>S.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Mash: fast genome and metagenome distance estimation using MinHash.</article-title> <source><italic>Genome Biol.</italic></source> <volume>17</volume>:<issue>132</issue>.</citation></ref>
<ref id="B13"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Raes</surname> <given-names>J.</given-names></name> <name><surname>Arumugam</surname> <given-names>M.</given-names></name> <name><surname>Burgdorf</surname> <given-names>K. S.</given-names></name> <name><surname>Manichanh</surname> <given-names>C.</given-names></name><etal/></person-group> (<year>2010</year>). <article-title>A human gut microbial gene catalogue established by metagenomic sequencing.</article-title> <source><italic>Nature</italic></source> <volume>464</volume> <fpage>59</fpage>&#x2013;<lpage>65</lpage>.</citation></ref>
<ref id="B14"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Qin</surname> <given-names>N.</given-names></name> <name><surname>Yang</surname> <given-names>F.</given-names></name> <name><surname>Li</surname> <given-names>A.</given-names></name> <name><surname>Prifti</surname> <given-names>E.</given-names></name> <name><surname>Chen</surname> <given-names>Y.</given-names></name> <name><surname>Shao</surname> <given-names>L.</given-names></name><etal/></person-group> (<year>2014</year>). <article-title>Alterations of the human gut microbiome in liver cirrhosis.</article-title> <source><italic>Nature</italic></source> <volume>513</volume> <fpage>59</fpage>&#x2013;<lpage>64</lpage>.</citation></ref>
<ref id="B15"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rahman</surname> <given-names>A.</given-names></name> <name><surname>Hallgr&#x00ED;msd&#x00F3;ttir</surname> <given-names>I.</given-names></name> <name><surname>Eisen</surname> <given-names>M.</given-names></name> <name><surname>Pachter</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>Association mapping from sequencing reads using k-mers.</article-title> <source><italic>eLife</italic></source> <volume>7</volume>:<issue>e32920</issue>.</citation></ref>
<ref id="B16"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sahili</surname> <given-names>A. E.</given-names></name></person-group> (<year>2004</year>). <article-title>Trees in tournaments.</article-title> <source><italic>J. Combinator. Theor. Ser. B</italic></source> <volume>92</volume> <fpage>183</fpage>&#x2013;<lpage>187</lpage>. <pub-id pub-id-type="doi">10.1016/j.jctb.2004.04.002</pub-id></citation></ref>
<ref id="B17"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sarmashghi</surname> <given-names>S.</given-names></name> <name><surname>Bohmann</surname> <given-names>K.</given-names></name> <name><surname>Gilbert</surname> <given-names>M. T. P.</given-names></name> <name><surname>Bafna</surname> <given-names>V.</given-names></name> <name><surname>Mirarab</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>Skmer: assembly-free and alignment-free sample identification using genome skims.</article-title> <source><italic>Genome Biol.</italic></source> <volume>20</volume>:<issue>34</issue>.</citation></ref>
<ref id="B18"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>K.</given-names></name> <name><surname>Ren</surname> <given-names>J.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>Reads binning improves alignment-free metagenome comparison.</article-title> <source><italic>Front. Genet.</italic></source> <volume>10</volume>:<issue>1156</issue>. <pub-id pub-id-type="doi">10.3389/fgene.2019.01156</pub-id> <pub-id pub-id-type="pmid">31824565</pub-id></citation></ref>
<ref id="B19"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Standage</surname> <given-names>D. S.</given-names></name> <name><surname>Brown</surname> <given-names>C. T.</given-names></name> <name><surname>Hormozdiari</surname> <given-names>F.</given-names></name></person-group> (<year>2019</year>). <article-title>Kevlar: a mapping-free framework for accurate discovery of de novo variants.</article-title> <source><italic>iScience</italic></source> <volume>18</volume> <fpage>28</fpage>&#x2013;<lpage>36</lpage>. <pub-id pub-id-type="doi">10.1016/j.isci.2019.07.032</pub-id> <pub-id pub-id-type="pmid">31377530</pub-id></citation></ref>
<ref id="B20"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Fu</surname> <given-names>L.</given-names></name> <name><surname>Ren</surname> <given-names>J.</given-names></name> <name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Sun</surname> <given-names>F.</given-names></name></person-group> (<year>2018</year>). <article-title>Identifying group-specific sequences for microbial communities using long k-mer sequence signatures.</article-title> <source><italic>Front. Microbiol.</italic></source> <volume>9</volume>:<issue>872</issue>. <pub-id pub-id-type="doi">10.3389/fmicb.2018.00872</pub-id> <pub-id pub-id-type="pmid">29774017</pub-id></citation></ref>
<ref id="B21"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Lei</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Song</surname> <given-names>N.</given-names></name> <name><surname>Zeng</surname> <given-names>F.</given-names></name><etal/></person-group> (<year>2016</year>). <article-title>Effect of k-tuple length on sample-comparison with high-throughput sequencing data.</article-title> <source><italic>Biochem. Biophys. Res. Commun.</italic></source> <volume>469</volume> <fpage>1021</fpage>&#x2013;<lpage>1027</lpage>. <pub-id pub-id-type="doi">10.1016/j.bbrc.2015.11.094</pub-id> <pub-id pub-id-type="pmid">26721429</pub-id></citation></ref>
</ref-list>
</back>
</article>
