ORIGINAL RESEARCH article

Front. Bioinform., 03 March 2025

Sec. Genomic Analysis

Volume 5 - 2025 | https://doi.org/10.3389/fbinf.2025.1504728

MetaComBin: combining abundances and overlaps for binning metagenomics reads

  • Department of Information Engineering, University of Padova, Padua, Italy


Abstract

Introduction:

Metagenomics is the discipline that studies heterogeneous microbial samples extracted directly from their natural environment, for example, from soil, water, or the human body. The detection and quantification of species that populate microbial communities have been the subject of many recent studies based on classification and clustering, motivated by being the first step in more complex pipelines (e.g., for functional analysis, de novo assembly, or comparison of metagenomes). Metagenomics has an impact on both environmental studies and precision medicine; thus, it is crucial to improve the quality of species identification through computational tools.

Methods:

In this paper, we explore the idea of improving the overall quality of metagenomics binning at the read level by proposing a computational framework that sequentially combines two complementary read-binning approaches: one based on species abundance determination and another one relying on read overlap in order to cluster reads together. We called this approach MetaComBin (metagenomics combined binning).

Results and Discussion:

The results of our experiments with the MetaComBin approach showed that the combination of two tools, based on different approaches, can improve the clustering quality in realistic conditions where the number of species is not known beforehand.

1 Introduction

Microbes influence everyday life in countless ways, from helping modulate the atmosphere to keeping animals (including humans) and plants in a healthy status and helping detect environmental pollution and disease spread. Traditional genomic-based approaches require prior clone and laboratory culturing for further investigation (Felczykowska et al., 2012). However, not all bacteria can be cultured in a laboratory, as they might require habitat conditions that cannot easily be reproduced. Moreover, in a laboratory culture, the presence of multiple species, which is the norm in living environments, is usually considered contamination, thus preventing holistic study.

The advent of metagenomics has revolutionized the field of microbiology by shifting the focus from the individual microbe study to that of a complex microbial community. Metagenomics is the study of heterogeneous microbial communities by directly sampling the natural environment in which they live and sequencing the entire microbial community it contains (Kang et al., 2015). Samples can be taken from a variety of environments (e.g., soil, water, or the human gut or saliva, etc.) with the primary goal of determining the taxonomic identity of the microorganisms that are present in the samples (Staley and Konopka, 1985). Microbial studies play a prominent role in both environmental studies and precision medicine. In fact, among the advantages of metagenomics is the possibility of studying interactions among microbes living in the same environment (Shreiner et al., 2015) and comparing samples taken from similar environments or at different points in time for environmental monitoring or health screenings (e.g., Ondov et al., 2016; Pellegrina et al., 2020).

Alongside opening new research perspectives, metagenomics brings both experimental challenges for correct environmental sampling and computational challenges for quality control, assembly, and taxonomic and functional classification of large-scale complex communities (Bharti and Grimm, 2021). In particular, the detection and quantification of the species in a metagenomics sample is of paramount interest both as a challenging computational problem per se and as the first step in complex pipelines for functional analysis and sample comparisons (Mande et al., 2012). Despite extensive studies, accurate identification at the read level remains challenging (Sczyrba et al., 2017; Comin et al., 2021). Supervised methods can obtain high precision levels, but they rely on reference database completeness. Moreover, the construction of a k-mer database usually requires computing capabilities with large amounts of RAM and disk space. Another drawback is the inherent incompleteness of available databases: most bacteria found in environmental samples are unknown and cannot be cultured and separated in the laboratory; thus, their genome is not yet present in reference databases. For these reasons, the number of unassigned reads can be very high when using supervised methods (Lindgreen et al., 2013; Girotto et al., 2017a).

On the other hand, unsupervised classification tools, also known as metagenome binning algorithms, are based on the observation that the k-mer distributions of the DNA fragments from the same genome are more similar than those from different genomes. Thus, without using any reference genome, one can determine if two fragments are from genomes of similar species based on their k-mer distributions. In this study, we will focus on the unsupervised detection of species in a sample without the use of reference genomes and consider the short reads provided by the sequencing process as fragments.

One of the major problems when processing metagenomic data is the fact that the proportion of species in a sample, that is, the abundance rate, can vary greatly. Some tools, for example, AbundanceBin (Wu and Yuzhen, 2011), explicitly exploit this variability, clustering reads based on their abundance ratio. Although this approach works well if all the species in the sample have a different abundance, the approach struggles to distinguish among species with the same abundance. Approaches based on exploiting read overlaps and subsequently clustering them are capable of better distinguishing among single species, even if their abundance is similar. In recent years several such approaches have been proposed, mainly differing in the techniques used for feature extraction and the distance measure they use to define similarity (Wang et al., 2012; Vinh et al., 2015; Girotto et al., 2016; Andreace et al., 2021; Balvert et al., 2021).

In this paper, we explore the idea of improving the overall quality of metagenomics binning at the read level by proposing a metagenomic combined binning framework, MetaComBin, that sequentially combines two complementary read-binning approaches. Read binning is intrinsically more difficult than contig binning, especially when short reads are used, because of the limited length that can be exploited to compute statistics on the read itself. The authors of AbundanceBin tried a similar approach, combining their tool with MetaCluster. However, their experiments assumed that the exact number of species in the sample was known beforehand. To test this idea in a more realistic setting, we paired AbundanceBin with MetaProb, which can estimate the number of species and was shown in separate experiments (Girotto et al., 2016) to outperform MetaCluster.

We ran experiments on three datasets with several species, some of which, but not all, occur in the sample with the same abundance. The datasets we chose were among the most difficult to cluster by state-of-the-art methods for read binning, according to previous studies (Girotto et al., 2016). Our experiments suggest that our intuition is correct and that the combination of complementary tools can indeed be beneficial for metagenomic binning at the read level in more realistic, unsupervised settings.

2 Materials and methods

In this section, we will describe MetaComBin, the combined framework we used for our analysis, starting with the methodological details of the two tools we used: AbundanceBin (Wu and Yuzhen, 2011) and MetaProb (Girotto et al., 2016).

2.1 AbundanceBin

AbundanceBin is a tool for metagenomic binning based on abundance estimation. One of its strengths is its ability to produce satisfactory binning results even when the reads are very short (approximately 75 bp). The tool can work in an unsupervised manner, not requiring any information regarding the number of bins, similar to the data obtained in real and non-simulated situations when we do not know the composition of the samples. The working hypothesis of AbundanceBin is that the distribution of reads follows the Lander–Waterman model, whereby the coverage of the various nucleotide positions is modeled via a Poisson distribution. The metagenomics sequencing procedure can be viewed as a set of Poisson distributions, each of which represents a different species. In the presence of m different species, therefore, it is possible to identify m Poisson distributions. The mean of each of these distributions represents the abundance of the species and is, therefore, the element that must be calculated to obtain an estimate of their abundance.

AbundanceBin thus solves an optimization problem using an expectation-maximization (EM) algorithm. Once the EM algorithm has converged, it is possible to calculate the probability of assigning a read to a bin, although a read may also remain unassigned. The EM algorithm requires the number of bins as input. To solve this problem, AbundanceBin adopts a recursive approach that divides the dataset into two bins and then iterates the process until bins with very different abundances are obtained. AbundanceBin performs well when species abundances differ by a ratio of at least 1:2. When species have comparable abundances, AbundanceBin is no longer an optimal choice and shows very high error rates because it will most likely group different species with a similar abundance in the same bin.
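To make the idea concrete, the following is a minimal sketch of EM for a Poisson mixture over occurrence counts (for example, k-mer counts). It is not AbundanceBin's implementation: the function name `poisson_mixture_em`, the fixed number of bins, the initialization, and the toy counts are all assumptions made for illustration.

```python
import math

def poisson_logpmf(x, lam):
    # log P(X = x) for a Poisson with mean lam
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def poisson_mixture_em(counts, n_bins=2, n_iter=100):
    """Fit a Poisson mixture to non-negative integer counts.

    Returns (means, weights, responsibilities). Illustrative only.
    """
    lo, hi = min(counts), max(counts)
    # Hypothetical initialization: spread the means between min and max.
    means = [max(lo + (hi - lo) * (b + 0.5) / n_bins, 1e-3) for b in range(n_bins)]
    weights = [1.0 / n_bins] * n_bins
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each bin for each count
        resp = []
        for x in counts:
            logp = [math.log(weights[b]) + poisson_logpmf(x, means[b])
                    for b in range(n_bins)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate each bin's mean (its abundance) and weight
        for b in range(n_bins):
            mass = sum(r[b] for r in resp)
            if mass > 0:
                weighted = sum(r[b] * x for r, x in zip(resp, counts))
                means[b] = max(weighted / mass, 1e-3)
                weights[b] = max(mass / len(counts), 1e-12)
    return means, weights, resp

# Toy counts drawn around two abundance levels (about 10 and about 40).
toy_counts = [9, 10, 11, 10, 12, 8, 41, 39, 40, 42, 38, 40]
means, weights, resp = poisson_mixture_em(toy_counts)
labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
```

With well-separated abundance levels, the estimated means settle near the two levels and each count is assigned almost entirely to one bin. When the levels are close, the posteriors overlap, which is the situation in which an abundance-based method cannot separate species.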

2.2 MetaProb

MetaProb is a two-step approach for metagenomic read binning. The reads are first grouped together based on their overlap, measured in terms of the number of shared fixed-length substrings (the length is a tool parameter with a default value). The output of this phase is a relatively large number of small groups of highly connected reads that are, therefore, likely to belong to the same species. Next, within each group, a set of representative, non-overlapping (to avoid redundancy) reads is chosen, and from it, a profile of shorter fixed-length substrings (again with a default length) is extracted and normalized to obtain a group signature. Such signatures are finally given as input to the k-means clustering algorithm, which groups the signatures (and their corresponding groups of reads) to obtain the final clusters that represent the different species in the sample.

Similarly to EM, the k-means algorithm also requires prior knowledge of the number of clusters to obtain. MetaProb can either accept this parameter as input or estimate the number of clusters by exploiting the Kolmogorov–Smirnov test.
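The notion of a group signature can be sketched as a normalized substring-frequency vector. The example below is a simplification and not MetaProb's code: the function name `kmer_signature`, the default `k=4`, and the plain relative-frequency normalization are assumptions chosen for illustration only.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_signature(reads, k=4):
    """Return a relative-frequency vector over all 4**k DNA k-mers.

    `k` and the normalization are illustrative assumptions.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip substrings containing N or other ambiguous symbols.
            if set(kmer) <= set(ALPHABET):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]

# Signature of a tiny group of reads, using 2-mers to keep the vector short.
signature = kmer_signature(["ACGTACGT"], k=2)
```

Vectors of this kind, one per group of reads, are the sort of input a k-means step can cluster; groups from the same genome are expected to have similar vectors.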

2.3 Combined framework: MetaComBin

The idea of our framework is a two-step approach. First, in Step 1, we use the abundance-based AbundanceBin algorithm to partition the reads so that all the reads of species with the same abundance are clustered together.

Next, in Step 2, the MetaProb overlap-based approach is applied to each of the obtained clusters in order to separate the species within it. Figure 1 (top) shows the ideal pipeline of our approach. In reality, especially if the final number of expected species is not known and given, the combination of the two tools is not as smooth as in the ideal pipeline. In addition to the fact that none of the currently available read-binning algorithms is capable of perfect clustering, AbundanceBin, unlike most read-binning tools (including MetaProb), has not been designed to take into account paired-end reads. This means that it is possible that reads that are paired (and thus belong to the same species) are assigned by AbundanceBin to different clusters. This raises the problem of how to deal with paired-end reads that have been wrongly separated by AbundanceBin. In addition to being a conceptual error from AbundanceBin, this also creates a practical problem at the following step because MetaProb needs sets of paired-end reads as input. Possible options include deleting all unpaired reads or designing a reassignment strategy. However, determining the destination cluster for each read is complex, requiring a case-by-case evaluation based on the obtained results and the overall composition of the clusters. In our experiments, we explored two possible approaches: i) the reassignment of unpaired reads only from clusters with a very high percentage of unpaired reads; ii) the reassignment of unpaired reads starting from the cluster with the highest percentage of unpaired reads, and iteration of the reassignment until no more unpaired reads remain.

More formally, let $R$ be the input dataset of paired-end reads, and let $p_i$ be the paired-end couple $(r_i, r_i')$. Let $\mathcal{C} = \{C_1, \dots, C_h\}$ be the set of clusters obtained by AbundanceBin. Because AbundanceBin considers each read separately, it is possible for a pair $(r_i, r_i')$ that $r_i \in C_a$ and $r_i' \in C_b$ with $a \neq b$. Let $U_a$ be the subset of unpaired reads in $C_a$. The two approaches we tested are:

  • 1. Static reassignment: given a threshold $t$, for each cluster $C_a$ with $|U_a| / |C_a| > t$, move each read in $U_a$ to the cluster of its paired read; that is, if $r_i \in U_a$ and $r_i' \in C_b$, move $r_i$ to $C_b$.

  • 2. Iterative reassignment: sort the clusters in $\mathcal{C}$ by decreasing number of unpaired reads. Starting from $C_1$, the cluster with the highest number of unpaired reads, reassign its unpaired reads in $U_1$ to the clusters where their paired read is. Iterate the process by considering $C_2, C_3, \dots$ and continue until there are no more unpaired reads.
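The two strategies can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the data layout (a dict from read ID to cluster and a dict from read ID to its mate), the function names, and the toy reads are assumptions. The iterative variant here ranks clusters by their fraction of unpaired reads; ranking by raw count would only change the `key` function.

```python
from collections import defaultdict

def unpaired_by_cluster(assignment, mate):
    """Map each cluster to the reads whose mate sits in a different cluster."""
    unpaired = defaultdict(set)
    for read, cluster in assignment.items():
        if assignment[mate[read]] != cluster:
            unpaired[cluster].add(read)
    return unpaired

def cluster_sizes(assignment):
    sizes = defaultdict(int)
    for cluster in assignment.values():
        sizes[cluster] += 1
    return sizes

def static_reassignment(assignment, mate, threshold):
    """Empty the unpaired reads of clusters whose unpaired fraction exceeds threshold."""
    result = dict(assignment)
    sizes = cluster_sizes(assignment)
    for cluster, reads in unpaired_by_cluster(assignment, mate).items():
        if len(reads) / sizes[cluster] > threshold:
            for read in reads:
                # Destination is the mate's cluster in the original assignment.
                result[read] = assignment[mate[read]]
    return result

def iterative_reassignment(assignment, mate):
    """Repeatedly empty the cluster with the highest unpaired fraction."""
    result = dict(assignment)
    while True:
        unpaired = unpaired_by_cluster(result, mate)
        if not unpaired:
            return result
        sizes = cluster_sizes(result)
        worst = max(unpaired, key=lambda c: len(unpaired[c]) / sizes[c])
        for read in unpaired[worst]:
            result[read] = result[mate[read]]

# Toy example: four read pairs, two of them split across clusters.
mate = {"a1": "a2", "a2": "a1", "b1": "b2", "b2": "b1",
        "c1": "c2", "c2": "c1", "d1": "d2", "d2": "d1"}
assignment = {"a1": 0, "a2": 0, "b1": 0, "b2": 1,
              "c1": 1, "c2": 1, "d1": 2, "d2": 0}
static = static_reassignment(assignment, mate, threshold=0.8)
iterative = iterative_reassignment(assignment, mate)
```

In the toy example, the static variant re-pairs only the reads of the cluster that exceeds the threshold and leaves the pair `b1`/`b2` split, so those reads would still have to be removed before an overlap-based step. The iterative variant terminates with every pair in a single cluster, because each round strictly reduces the number of unpaired reads.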

FIGURE 1

In Figure 1 (bottom), we can see a more realistic pipeline with the added intermediate processing.

3 Results and discussion

In this section, we will describe the dataset composition and the clustering quality measures used to assess the methods. Then, we will discuss the results obtained for the analysis of each of the considered datasets. The two tools, AbundanceBin and MetaProb, were run with default parameters. We did all the experiments without giving the number of expected clusters in input to mimic a more realistic context. All the experiments were performed on a machine with Intel(R) Xeon(R) Gold 5220 CPUs @ 2.20/3.90 GHz and 2TB of RAM.

3.1 Datasets

For our analysis, we chose three datasets used in several previous papers on metagenomic read binning (Girotto et al., 2017a; Vinh et al., 2015; Girotto et al., 2016; Andreace et al., 2021). We chose the datasets S7, S9, and MIXK, which are suitable for verifying our hypothesis, that is, whether using multiple clustering phases leads to improved results compared to using a single read-binning tool.

The datasets S7 and S9 were produced by Vinh et al. (2015), and they were downloaded from the MetaProb repository. The S7 dataset is composed of short paired-end reads and includes five species with abundance ratios 1:1:1:4:4 and with phylogenetic distance at the order and genus levels. The dataset was simulated using MetaSim, a tool for generating metagenomic reads, using the Illumina error profile with an error rate of 1%. Details of the dataset composition are given in Table 1.

TABLE 1

| Species | Name | Coverage | Paired reads |
| --- | --- | --- | --- |
| Species 1 | Actinobacillus pleuropneumoniae serovar 5b str. L20 | 10 | 141,928 |
| Species 2 | Aliivibrio salmonicida LFI1238 | 10 | 75,183 |
| Species 3 | Haemophilus somnus 129PT | 10 | 126,183 |
| Species 4 | Pasteurella multocida 36950 | 40 | 588,088 |
| Species 5 | Vibrio cholerae M66-2 | 40 | 722,168 |

Details of the dataset S7.

The S9 dataset was also simulated using MetaSim and obtained according to the same error profile. However, it was reduced in size with respect to the original file because AbundanceBin was not capable of analyzing it. From the original S9 dataset, we selected 10 species with abundance ratios 1:1:2:2:2:2:3:3:3:3. Their phylogenetic distance was at either the phylum or the family level. We called this dataset S9red. Details of the dataset composition are given in Table 2.

The MIXK dataset was derived from a synthetic dataset originally produced by the authors of the popular Kraken metagenomic classifier (Wood and Salzberg, 2014). The reads in this dataset were not simulated but were obtained by combining real sequences from projects that sequenced isolated microbial genomes. To create these synthetic metagenomes, the Kraken authors used data sequenced on the Illumina HiSeq platform, which is available either from the GAGE-B project or from the NCBI Sequence Read Archive (details are in the Kraken paper). The MIXK dataset contains the reads of the species listed in Table 3 with an abundance ratio of 1:1:1:1:1:1:2:2:3:6.

TABLE 2

| Species | Name | Coverage | Paired reads |
| --- | --- | --- | --- |
| Species 1 | Ehrlichia canis str. Jake | 10 | 82,012 |
| Species 2 | Desulfovibrio vulgaris DP4 | 10 | 216,912 |
| Species 3 | Bartonella clarridgeiae 73 | 5 | 47,629 |
| Species 4 | Caldicellulosiruptor lactoaceticus 6A | 10 | 167,267 |
| Species 5 | Lactobacillus amylovorus GRL1118 | 15 | 177,691 |
| Species 6 | Streptococcus thermophilus JIM 8232 | 15 | 181,562 |
| Species 7 | Helicobacter cetorum MIT 00-7128 | 15 | 182,576 |
| Species 8 | Bifidobacterium animalis subsp. lactis B420 | 5 | 60,623 |
| Species 9 | Mesotoga prima MesG1.Ag.4.2 | 15 | 279,117 |
| Species 10 | Geobacter sulfurreducens PCA | 10 | 238,138 |

Details of the reduced dataset S9red.

TABLE 3

| Species | Name | Coverage | Paired reads |
| --- | --- | --- | --- |
| Species 1 | Streptococcus pneumoniae TIGR4 | 20 | 91,532 |
| Species 2 | Xanthomonas axonopodis pv. Manihotis UA323 | 60 | 308,021 |
| Species 3 | Bacillus cereus VD118 | 10 | 59,775 |
| Species 4 | Aeromonas hydrophila SSU | 10 | 51,766 |
| Species 5 | Mycobacterium abscessus 6G-0125-R | 10 | 54,252 |
| Species 6 | Rhodobacter sphaeroides 2.4.1 | 20 | 97,723 |
| Species 7 | Pelosinus fermentans A11 | 10 | 52,623 |
| Species 8 | Bacteroides fragilis HMW | 10 | 56,146 |
| Species 9 | Vibrio cholerae CP1032(5) | 30 | 167,707 |
| Species 10 | Staphylococcus aureus M0927 | 10 | 60,455 |

Details of synthetic dataset MIXK.

3.2 Evaluation metrics

To evaluate the quality of the results of the binning tools involved in this study, we used four popular performance evaluation metrics (namely, precision, recall, F-measure, and ARI) as defined in other read-binning papers (Girotto et al., 2017a; Vinh et al., 2015; Girotto et al., 2016; Andreace et al., 2021; Balvert et al., 2021; Girotto et al., 2017b) and displayed in Equations 1–4. Let $n$ be the number of species in the simulated dataset, and let $C$ be the number of clusters returned by the algorithm. $A_{ij}$ is the number of reads from species $j$ assigned to cluster $i$, $A_{i*}$ is the sum over all $j$ of the values $A_{ij}$ for each given $i$, and similarly $A_{*j}$ is the sum over all $i$ of the values $A_{ij}$ for each given $j$.
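A minimal sketch of how these four measures can be computed from a cluster-by-species count matrix is given below. It assumes the definitions commonly used in read-binning evaluations (precision sums the largest entry of each cluster, recall sums the largest entry of each species, and ARI is the standard adjusted Rand index); reads left unassigned by a tool are not modelled. Function names and the toy matrix are illustrative assumptions.

```python
from math import comb

# A[i][j] = number of reads from species j assigned to cluster i.

def precision(A):
    total = sum(map(sum, A))
    return sum(max(row) for row in A) / total

def recall(A):
    total = sum(map(sum, A))
    return sum(max(col) for col in zip(*A)) / total

def f_measure(A):
    p, r = precision(A), recall(A)
    return 2 * p * r / (p + r)

def adjusted_rand_index(A):
    """Standard adjusted Rand index from a contingency matrix.

    Undefined (division by zero) when there is a single cluster
    and a single species.
    """
    n = sum(map(sum, A))
    sum_cells = sum(comb(x, 2) for row in A for x in row)
    sum_rows = sum(comb(sum(row), 2) for row in A)        # uses the row sums
    sum_cols = sum(comb(sum(col), 2) for col in zip(*A))  # uses the column sums
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

# Toy matrix: two clusters (rows), two species (columns).
toy = [[8, 2],
       [1, 9]]
p, r, f = precision(toy), recall(toy), f_measure(toy)
ari_perfect = adjusted_rand_index([[5, 0], [0, 5]])
```

For the toy matrix, precision and recall are both 17/20 = 0.85, and a perfect partition yields an ARI of 1. Precision rewards pure clusters, whereas recall rewards keeping each species together, so a tool that merges species scores low precision and high recall.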

3.3 Experimental analysis of dataset S7

3.3.1 Step 1: clustering of reads based on abundances

We obtained three different clusters from AbundanceBin, one more than expected. From Table 4, we can see the actual partition, and, in particular, we can notice the presence of paired reads that have been assigned to different clusters and their abundance in the cluster. The three resulting clusters highlight an effective grouping of the species with the highest abundance, that is, Species 4 and 5 (as illustrated in Table 5), which are mainly located within Cluster 1. The three species with lower abundance are instead distributed between Cluster 2 and Cluster 3, the latter mainly characterized by the presence of unpaired reads.

TABLE 4

| Cluster | Total reads | Paired reads | Unpaired reads | % Unpaired reads |
| --- | --- | --- | --- | --- |
| Cluster 1 | 2,405,443 | 2,152,948 | 252,495 | 10.5% |
| Cluster 2 | 786,613 | 453,664 | 332,949 | 42.33% |
| Cluster 3 | 115,044 | 20,506 | 94,538 | 82.18% |

Cluster compositions generated by AbundanceBin on the dataset S7.

TABLE 5

| Cluster | Species 1 | Species 2 | Species 3 | Species 4 | Species 5 |
| --- | --- | --- | --- | --- | --- |
| Cluster 1 | 10,000 | 18,723 | 8,097 | 1,065,139 | 1,303,484 |
| Cluster 2 | 225,827 | 108,303 | 203,762 | 109,648 | 139,073 |
| Cluster 3 | 48,029 | 23,340 | 40,507 | 1,389 | 1,779 |

Number of reads per species in the clusters generated by AbundanceBin on the dataset S7.

Although the obtained number of clusters differs from the actual one, when looking in more detail at the composition of the clusters, we can state that the partitioning of the species within the clusters is congruent with the expected abundance, demonstrating the effectiveness of AbundanceBin in identifying and grouping species with the same abundance in dataset S7.

Tables 6–8 highlight, for each species within each cluster, how many of the associated reads are actually paired or unpaired. The % Reads column also displays the structure of the cluster, that is, the percentages of the constituent species.

TABLE 6

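| Species | Reads | % Reads | Paired | Unpaired |
| --- | --- | --- | --- | --- |
| Species 1 | 10,000 | 0.42% | 30.04% | 69.96% |
| Species 2 | 18,723 | 0.78% | 27.98% | 72.02% |
| Species 3 | 8,097 | 0.34% | 30.04% | 69.96% |
| Species 4 | 1,065,139 | 44.30% | 90.62% | 9.38% |
| Species 5 | 1,303,484 | 54.16% | 90.30% | 9.70% |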

Detailed composition of Cluster 1 obtained from AbundanceBin on S7.

TABLE 7

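| Species | Reads | % Reads | Paired | Unpaired |
| --- | --- | --- | --- | --- |
| Species 1 | 225,827 | 28.70% | 80.51% | 19.49% |
| Species 2 | 108,303 | 13.77% | 74.09% | 25.91% |
| Species 3 | 203,762 | 25.90% | 81.82% | 18.18% |
| Species 4 | 109,648 | 13.94% | 9.90% | 90.10% |
| Species 5 | 139,073 | 17.69% | 10.10% | 89.90% |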

Detailed composition of Cluster 2 obtained from AbundanceBin on S7.

TABLE 8

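| Species | Reads | % Reads | Paired | Unpaired |
| --- | --- | --- | --- | --- |
| Species 1 | 48,029 | 41.75% | 18.46% | 81.54% |
| Species 2 | 23,340 | 20.29% | 17.93% | 82.07% |
| Species 3 | 40,507 | 35.21% | 18.40% | 81.60% |
| Species 4 | 1,389 | 1.21% | 0% | 100% |
| Species 5 | 1,779 | 1.55% | 0% | 100% |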

Detailed composition of Cluster 3 obtained from AbundanceBin on S7.

As can be seen, more than 90% of the reads of the species with the highest abundance are included in Cluster 1. The remaining 10%, mainly composed of unpaired reads, are distributed between Cluster 2 and Cluster 3. The reads of the species with lower abundance are mainly distributed between Cluster 2 and Cluster 3, with some smaller quantities erroneously assigned to Cluster 1. For each of these three species, it is noteworthy that the vast majority of reads in Cluster 2 are assigned together with their mate as a complete pair, while the opposite trend is observed within Cluster 3, in which most reads of each species are unpaired.

The computed values of precision, recall, and F-measure for AbundanceBin are 0.47, 0.87, and 0.61, respectively, in line with those obtained by Girotto et al. (2016). The low value of precision is expected because the AbundanceBin principle is to cluster together reads with the same abundance that, in our case study, can belong to different species. Similarly, we expect a high recall value because most reads of the same species will be included in the same cluster.
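As a worked check, these values can be recomputed from the read counts in Table 5, assuming the usual read-binning definitions: precision sums the largest count in each cluster, recall sums the largest count for each species, and both are divided by the total number of reads (3,307,100).

```latex
\begin{align*}
\text{precision} &= \frac{1{,}303{,}484 + 225{,}827 + 48{,}029}{3{,}307{,}100}
                  = \frac{1{,}577{,}340}{3{,}307{,}100} \approx 0.477 \\
\text{recall}    &= \frac{225{,}827 + 108{,}303 + 203{,}762 + 1{,}065{,}139 + 1{,}303{,}484}{3{,}307{,}100}
                  = \frac{2{,}906{,}515}{3{,}307{,}100} \approx 0.879 \\
\text{F-measure} &= \frac{2 \cdot 0.477 \cdot 0.879}{0.477 + 0.879} \approx 0.618
\end{align*}
```

The largest entry in every cluster belongs to a different species (Species 5, Species 1, and Species 1 again for Cluster 3), but each cluster also holds many reads from other species, which is what pulls precision down while recall stays high.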

3.3.2 Step 2: partitioning of the abundance clusters

Before applying MetaProb to each of the clusters obtained with AbundanceBin, processing was needed to avoid the presence of unpaired reads. We tried two approaches:

  • 1. Reassign all unpaired reads of clusters whose percentage of unpaired reads is above a chosen threshold to the cluster of their paired read;

  • 2. Reassign all unpaired reads of the cluster with the highest percentage of unpaired reads; iterate the process until no more unpaired reads are left.

In practice, in our case study, with Approach 1, we simply reassigned the unpaired reads of Cluster 3 to their counterparts in Cluster 1 and Cluster 2 and then removed any other unpaired reads that were left. With Approach 2, we did not remove any reads, which resulted in more than 250,000 additional reads being re-paired with respect to Approach 1. We report here in detail only the results of the iterative Approach 2 because it appears to be the more effective one, as we will discuss later. The details of this processing are shown in Tables 9–11.

TABLE 9

| Cluster (paired reads) | Species 1 | Species 2 | Species 3 | Species 4 | Species 5 |
| --- | --- | --- | --- | --- | --- |
| Cluster 1.A | 4,370 | 15,285 | 3,211 | **541,144** | 22,428 |
| Cluster 1.B | 4,077 | 812 | 3,649 | 37,410 | **110,605** |
| Cluster 1.C | 51 | 7 | 21 | 3,967 | **581,932** |

Partitioning of Cluster 1 (from AbundanceBin on S7) obtained by running MetaProb on it after unpaired-read reassignment.

The reads of the dominant species in the cluster have been highlighted in bold.

TABLE 10

| Cluster (paired reads) | Species 1 | Species 2 | Species 3 | Species 4 | Species 5 |
| --- | --- | --- | --- | --- | --- |
| Cluster 2.A | 4,698 | **20,952** | 12,111 | 3,449 | 1,853 |
| Cluster 2.B | 6,508 | 64 | **6,628** | 757 | 72 |
| Cluster 2.C | 2,381 | 258 | **91,188** | 53 | 1 |
| Cluster 2.D | **3,178** | 1,246 | 2,235 | 347 | 86 |
| Cluster 2.E | 268 | 766 | **1,257** | 114 | 24 |
| Cluster 2.F | 554 | **30,002** | 0 | 1 | 0 |
| Cluster 2.G | 133 | 585 | 5 | 441 | **2,087** |
| Cluster 2.H | 1,608 | **3,111** | 299 | 345 | 3,039 |
| Cluster 2.I | **98,564** | 0 | 1,079 | 0 | 0 |
| Cluster 2.J | **11,104** | 2 | 774 | 60 | 41 |

Partitioning of Cluster 2 (from AbundanceBin on S7) obtained by running MetaProb on it after unpaired-read reassignment.

The reads of the dominant species in the cluster have been highlighted in bold.

TABLE 11

| Cluster (paired reads) | Species 1 | Species 2 | Species 3 | Species 4 | Species 5 |
| --- | --- | --- | --- | --- | --- |
| Cluster 3.A | **3,281** | 303 | 564 | 0 | 0 |
| Cluster 3.B | 708 | 519 | **1,502** | 0 | 0 |
| Cluster 3.C | 445 | 1,271 | **1,660** | 0 | 0 |

Partitioning of Cluster 3 (from AbundanceBin on S7) obtained by running MetaProb on it after unpaired-read reassignment.

The reads of the dominant species in the cluster have been highlighted in bold.

3.3.2.1 Analysis of Cluster 1

In ideal conditions, Cluster 1 generated by AbundanceBin should have contained only reads from the two most abundant species. However, as previously shown in Table 5, some noise from other species was also introduced. Nonetheless, the results confirm that MetaProb is able to effectively identify two sub-clusters covering about 90% of the reads of Cluster 1 and containing the majority of the reads from these two species: 92% of the reads in Cluster 1.A belong to Species 4, while more than 99% of the reads in Cluster 1.C belong to Species 5. The remaining Cluster 1.B covers only 11% of the total reads of Cluster 1. Of these, 70% belong to Species 5 and 24% belong to Species 4, with the remaining 6% consisting of reads from the minority species. The size and mixture of this cluster are, therefore, those of a “spurious” cluster possibly produced by the noise introduced by AbundanceBin, by a wrong estimate of the number of clusters from MetaProb, or by a combination of both.

It is important to highlight that MetaProb is naturally better suited to working with clusters that contain species with similar abundances. This is evident in the correct classification of Species 4 and 5. However, MetaProb encounters difficulties when the variation in species abundance is more marked, as in the case of Cluster 2. This aspect motivates the approach adopted in this experiment, which aimed to improve the performance of MetaProb by removing one of its weak points, a limitation similar to that found in other software based on DNA composition.

3.3.2.2 Analysis of Cluster 2

Giving the reads of Cluster 2 as input to MetaProb caused the generation of 10 different clusters. Considering that 95% of the reads in Cluster 2 belong to Species 1, Species 2, and Species 3, this result was somewhat unexpected. However, by carefully examining each cluster composition, we can observe that four of these clusters (2.A, 2.C, 2.F, and 2.I) contain more than 30,000 paired reads. Specifically, reads from Species 1 and Species 3 are mainly assigned to Cluster 2.I and Cluster 2.C, respectively, and in both cases, they represent about 98% of the total composition of these clusters. Species 2 instead has been mainly split between Clusters 2.A and 2.F, which together contain about 90% of the reads of this species.

The remaining six clusters have been assigned a smaller number of reads and can be seen as a relatively small erroneous splitting of a species (e.g., Cluster 2.J, which is almost entirely composed of reads of Species 1), as the correct identification of a species that should not have been in Cluster 2 (Cluster 2.G mainly contains reads from Species 5), or as a mixture generated by the intrinsic similarity in terms of k-mers that some species may share.

Overall, this in-depth analysis allows us to conclude that the output of MetaProb consists of four main clusters characterized by reads of the three species expected from this cluster, plus some noise. We will discuss later ideas on how to further improve this result.

3.3.2.3 Analysis of Cluster 3

Once the unpaired reads had been moved, Cluster 3 contained only 10,253 pairs of paired-end reads, which represents less than 1% of the total number of reads in the dataset, confirming the fact that it is a “spurious” cluster generated by AbundanceBin. Although it exclusively contained reads from the three species with lower abundance, given their non-representative nature, MetaProb could not fully exploit its distinguishing power based on overlap and composition signature. Nonetheless, as can be seen in Table 11, it was able to clearly distinguish reads of Species 1 (80% of Cluster 3.A is composed of reads from this species), while the distinction between Species 2 and Species 3 was more difficult. It is worth noting that within Cluster 3, Species 1 has basically twice as many reads as the other two species. We speculate that this could have helped MetaProb in finding the overlaps it needs to build its initial clusters.

3.3.3 Comparison with standalone tools

To summarize our results, we compared the quality measures of the final binning with those obtained by the standalone use of AbundanceBin and MetaProb. Moreover, we included the results of the currently available version of MetaCluster (5.0), which includes a low/high abundance partitioning phase before the final clustering.

Results shown in Table 12 support our intuition: among the datasets analyzed in previous studies, S7 was one of the most challenging, as AbundanceBin struggles to obtain good precision, while MetaProb shows lower performance in both precision and recall with respect to other datasets. MetaCluster 5.0, which already includes a high/low abundance partitioning, showed the highest precision but the worst recall. Our combined approach improves over MetaProb in terms of both precision (showing results comparable to those of MetaCluster) and recall, and it maintains a high recall while almost doubling the precision with respect to AbundanceBin. The values of F-measure and ARI confirm the overall better performance of the proposed combined complementary approach.

TABLE 12

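| Dataset S7 | MetaCluster | MetaProb | AbundanceBin | MetaComBin |
| --- | --- | --- | --- | --- |
| Precision | **0.925** | 0.818 | 0.477 | 0.912 |
| Recall | 0.671 | 0.745 | **0.879** | 0.812 |
| F-measure | 0.778 | 0.780 | 0.618 | **0.859** |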
ARI0.5200.3440.730

Standalone tools vs. MetaComBin combined framework on the dataset S7.

In bold the best performance for each metric.

3.4 Experimental analysis of dataset S9red

The analysis for S9red was performed similarly. We will present the results for this dataset in a more compact way.

3.4.1 Step 1: clustering of reads based on abundances

We do not make any assumptions about the number of clusters when running the tools. Given S9red as input, AbundanceBin detected two major clusters rather than the expected three. After read pairing, we have Cluster 1 composed of 1,567,881 reads (96% of the total) and Cluster 2 composed of 65,646 reads. Cluster 2 is mainly composed (73%) of the two low-abundance species (Species 3 and Species 8), while Cluster 1 is composed of the great majority of reads of the medium- and high-abundance species. Note that high-abundance species are 1.5 times as abundant as those with medium abundance. Because the abundance ratio between these two classes is smaller than 2, AbundanceBin is expected to struggle to distinguish between them, and, in fact, it puts them together in a single cluster.

3.4.2 Step 2: partitioning of the abundance clusters

When applying MetaProb to Cluster 1, again without specifying the number of expected clusters, we obtain many groups, specifically 21. However, eight such groups have a number of reads that is less than 1% of the size of the cluster and can be considered noise. If we focus on the nine groups that contain at least 5% of the reads in Cluster 1 (details in Table 13), we have that eight contain more than 99% of reads of the same species, and seven of those eight cover more than 70% of the reads of the same species (more than 88% for Species 1, 2, 5, 7, 9, and 10, and about 73% for Species 4). Species 6 is mainly split between two groups, covering about 60% and 26%, respectively. The remaining group, Cl7, shown in the table, covers most of the remaining reads of Species 4 (about 26%).

TABLE 13

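| Species | Cl1 | Cl5 | Cl7 | Cl8 | Cl10 | Cl11 | Cl15 | Cl19 | Cl21 |
|---|---|---|---|---|---|---|---|---|---|
| sp1 | 121 | **79,102** | 6 | 0 | 7 | 18 | 0 | 0 | 0 |
| sp2 | 1,541 | 0 | 0 | 0 | 102 | 8 | 0 | 1,291 | **198,707** |
| sp3 | 4,768 | 1,064 | 173 | 0 | 170 | 513 | 0 | 1 | 0 |
| sp4 | 3,425 | 1,219 | **121,953** | 44 | 2 | 30 | 0 | 0 | 0 |
| sp5 | 20,694 | 2,685 | 135 | 1 | **141,016** | 514 | 0 | 0 | 25 |
| sp6 | **46,579** | 832 | 0 | 1 | 121 | **109,026** | 0 | 0 | 0 |
| sp7 | 554 | 250 | 0 | **175,985** | 0 | 1 | 0 | 92 | 0 |
| sp8 | 508 | 0 | 0 | 0 | 0 | 9 | 0 | 325 | 437 |
| sp9 | 1,982 | 0 | 0 | 0 | 0 | 6 | **266,944** | 0 | 0 |
| sp10 | 1,661 | 521 | 101 | 0 | 0 | 0 | 0 | **211,589** | 1,557 |
| tot | 81,833 | 85,673 | 122,368 | 176,031 | 141,418 | 110,125 | 266,944 | 213,298 | 200,726 |
| %sp | 25.65 | 96.45 | 72.91 | 96.39 | 79.36 | 60.05 | 95.64 | 88.85 | 91.61 |
| %sc | 56.92 | 92.33 | 99.66 | 99.97 | 99.72 | 99.00 | 100.00 | 99.20 | 98.99 |
| id | Species 6 | Species 1 | Species 4 | Species 7 | Species 5 | Species 6 | Species 9 | Species 10 | Species 2 |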

Detailed partitioning of Cluster 1 (after unpaired-reads reassignment on the clusters obtained by AbundanceBin on S9red) with MetaProb. With %sp, we indicate the percentage of the reads of the majority species with respect to the total number of reads for that species in the dataset (i.e., how well the group covers its majority species); with %sc, we indicate the percentage of the reads of the majority species with respect to the total reads in that group (i.e., how well defined the cluster is).

The reads of the dominant species in the cluster have been highlighted in bold.
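The two percentages defined in the caption can be computed as in the minimal sketch below. The function names are assumptions; the Cl5 counts are the per-species read counts of that group in Table 13. Computing %sp also requires the total number of reads per species in the dataset, which the table does not list, so it is shown only as a function taking those totals as input.

```python
# Illustrative sketch; function names are assumptions.

def pct_sc(group_counts):
    """%sc: share of the group's reads that belong to its majority species."""
    return 100 * max(group_counts.values()) / sum(group_counts.values())


def pct_sp(group_counts, species_totals):
    """%sp: share of the majority species' reads (over the whole dataset)
    that fall in this group. species_totals maps species -> total reads."""
    majority = max(group_counts, key=group_counts.get)
    return 100 * group_counts[majority] / species_totals[majority]


# Column Cl5 of Table 13 (reads per species in that group).
cl5 = {"sp1": 79_102, "sp2": 0, "sp3": 1_064, "sp4": 1_219, "sp5": 2_685,
       "sp6": 832, "sp7": 250, "sp8": 0, "sp9": 0, "sp10": 521}
print(round(pct_sc(cl5), 2))  # 92.33, matching the %sc row for Cl5
```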

Among the remaining groups with a size between 1% and 5% of Cluster 1 (not shown in the table because of space constraints), we have one well-defined group (about 86% of its reads) covering 44% of Species 8, another well-defined group with 33% of the reads of Species 3, and a less well-defined group mostly composed of residual reads of Species 10.

When applying MetaProb to Cluster 2, we obtain five groups. Two of them are composed of 90% of reads from Species 8 and together cover 39% of the reads of that species. Two other well-defined groups cover about 42% of Species 3. The remaining group is less well defined: its majority species, Species 10, accounts for 44% of the group. However, considering that Cluster 2 is much smaller than Cluster 1, these reads can be considered noise, as they represent only 2% of the reads of that species, which is much better represented by group Cl19 obtained from Cluster 1.

To summarize our findings, after applying our pipeline, we were able to identify the high- and medium-abundance species with both very high precision and recall. The reads of the low-abundance species were split between the two clusters found by AbundanceBin (57%–43% for Species 3 and 54%–46% for Species 8). It is possible that the phylogenetic closeness of some species made their reads difficult to partition based on abundance alone. However, when applying the second step of our analysis, the reads of these species were well separated from the others.

3.4.3 Comparison with standalone tools

Table 14 compares the results of our pipeline with those of the two tools run separately. Unfortunately, we were not able to run MetaCluster on our computer system. Moreover, because we had to reduce S9 to be able to run AbundanceBin, we could not rely on the MetaCluster results reported in previous papers, as we did for the analysis of S7.

TABLE 14

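| Dataset S9red | MetaProb | AbundanceBin | MetaComBin |
|---|---|---|---|
| Precision | **0.921** | 0.191 | **0.921** |
| Recall | 0.785 | **0.901** | 0.824 |
| F-measure | 0.843 | 0.315 | **0.870** |
| ARI | 0.778 | 0.014 | **0.833** |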

Standalone tools vs. MetaComBin combined framework on the reduced dataset S9.

The best performance for each metric is shown in bold.
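For readers unfamiliar with the ARI row, the following is a minimal sketch of how an adjusted Rand index can be computed from a species-by-cluster contingency table such as Table 13. It assumes that the ARI reported here is the standard adjusted Rand index; the function name and the toy tables are illustrative and do not come from the paper.

```python
from math import comb

# Illustrative sketch of the standard adjusted Rand index.

def adjusted_rand_index(table):
    """ARI from a contingency table: rows = true species, columns = clusters."""
    n = sum(sum(row) for row in table)
    if n < 2:
        return 1.0  # no pairs of reads to compare
    sum_cells = sum(comb(x, 2) for row in table for x in row)
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g., one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy tables (made-up numbers, not data from the paper).
perfect = adjusted_rand_index([[5, 0], [0, 5]])  # each species in its own cluster
mixed = adjusted_rand_index([[1, 1], [1, 1]])    # species spread evenly
print(perfect, mixed)
```

A value of 1 indicates perfect agreement with the true species labels, values near 0 indicate agreement no better than chance, and negative values (as in the second toy table) indicate agreement worse than chance.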

3.5 Results on the synthetic dataset MIXK

The results of applying our pipeline to the synthetic dataset MIXK are shown in Tables 15 and 16, one for each of the two clusters (Cluster 1 and Cluster 2) detected by AbundanceBin and then partitioned by MetaProb. For ease of visualization, we report only the clusters with a size larger than 2% of the dataset.

TABLE 15

cl1.1cl1.2cl1.3cl1.4cl1.5cl1.6cl1.7cl1.8
sp11,5841,950325556000418
sp22000154264431
sp38,26016,29814,59916,913500552
sp493529626027635,5582,3681,879284
sp5183003,57028,0234,16827
sp63550701,7231,99159,940230
sp723,0718,2285,4979,807913642,280
sp86,3548,2841,8356,0148274117,831
sp91,59381222914425290317
sp105,5042,26627,8206,6000007
tot23,07116,29827,82016,91335,55828,02359,94017,831
sp43.8427.2746.0228.2968.6951.6561.3431.76
sc48.7242.6855.0141.9686.2985.4890.7781.25
idSpecies 7Species 3.bSpecies 10.aSpecies 3.aSpecies 4Species 5Species 6Species 8

Detailed partitioning of Cluster 1 (after unpaired-reads reassignment on the clusters obtained by AbundanceBin on MIXK) with MetaProb. With %sp, we indicate the percentage of the reads of the majority species with respect to the total number of reads for that species in the dataset (i.e., how well the group covers its majority species); with %sc, we indicate the percentage of the reads of the majority species with respect to the total reads in that group (i.e., how well defined the cluster is). Only clusters with a size greater than 2% of the total size of the dataset are reported.

TABLE 16

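| Species | cl2.1 | cl2.2 | cl2.3 | cl2.4 |
|---|---|---|---|---|
| sp1 | 19,324 | 60,938 | 5,663 | 0 |
| sp2 | 2,292 | 152 | 103 | 288,161 |
| sp3 | 265 | 1,179 | 1,241 | 1 |
| sp4 | 556 | 50 | 12 | 30 |
| sp5 | 55 | 4 | 0 | 39 |
| sp6 | 747 | 96 | 5 | 227 |
| sp7 | 289 | 437 | 376 | 0 |
| sp8 | 827 | 860 | 274 | 2 |
| sp9 | 126,308 | 21,444 | 7,717 | 16 |
| sp10 | 652 | 2,181 | 15,355 | 0 |
| tot | 151,315 | 87,341 | 30,746 | 288,476 |
| %sp | 75.31 | 66.58 | 25.40 | 93.55 |
| %sc | 83.47 | 69.77 | 49.94 | 99.89 |
| id | Species 9 | Species 1 | Species 10.b | Species 2 |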

Detailed partitioning of Cluster 2 (after unpaired-reads reassignment on the clusters obtained by AbundanceBin on MIXK) with MetaProb. With %sp, we indicate the percentage of the reads of the majority species with respect to the total number of reads for that species in the dataset (i.e., how well the group covers its majority species); with %sc, we indicate the percentage of the reads of the majority species with respect to the total reads in that group (i.e., how well defined the cluster is). Only clusters with a size greater than 2% of the total size of the dataset are reported.

It is interesting to note that, among the bigger clusters, it is possible to associate a distinct species with each cluster, except for Species 3 and Species 10, each of which was split between two clusters.
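This observation can be checked mechanically with a short sketch that groups the reported clusters by their majority species and lists the species that appear more than once. The dictionary below transcribes the "id" rows of Tables 15 and 16 (the ".a"/".b" suffixes are dropped); the function name is an assumption.

```python
from collections import defaultdict

# Majority species of each reported cluster, from the "id" rows of Tables 15 and 16.
majority = {
    "cl1.1": 7, "cl1.2": 3, "cl1.3": 10, "cl1.4": 3, "cl1.5": 4, "cl1.6": 5,
    "cl1.7": 6, "cl1.8": 8, "cl2.1": 9, "cl2.2": 1, "cl2.3": 10, "cl2.4": 2,
}


def clusters_per_species(majority_by_cluster):
    """Group cluster names by the species that dominates them."""
    groups = defaultdict(list)
    for cluster, species in majority_by_cluster.items():
        groups[species].append(cluster)
    return dict(groups)


by_species = clusters_per_species(majority)
split = {sp: cl for sp, cl in by_species.items() if len(cl) > 1}
print(split)  # {3: ['cl1.2', 'cl1.4'], 10: ['cl1.3', 'cl2.3']}
```

All ten species appear as the majority of at least one reported cluster, and only Species 3 and Species 10 are the majority of two.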

Finally, Table 17 compares AbundanceBin and MetaProb run separately with the two tools run as a pipeline (MetaComBin). For this dataset, too, the combined approach generally performed better than either tool run separately.

TABLE 17

| Dataset MIXK | MetaProb | AbundanceBin | MetaComBin |
|---|---|---|---|
| Precision | 0.657 | 0.377 | **0.787** |
| Recall | 0.654 | **0.928** | 0.684 |
| F-measure | 0.656 | 0.536 | **0.732** |
| ARI | 0.458 | 0.253 | **0.740** |

Standalone tools vs. the MetaComBin combined framework on the synthetic dataset MIXK.

The best performance for each metric is shown in bold.
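As a quick consistency check on the table above, the sketch below recomputes the F-measure from the reported precision and recall, assuming the F-measure is the usual harmonic mean of the two. The function name is an assumption; the input values are those of Table 17.

```python
# Illustrative sketch; assumes F-measure = harmonic mean of precision and recall.

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from Table 17 (MIXK).
scores = {
    "MetaProb": (0.657, 0.654),
    "AbundanceBin": (0.377, 0.928),
    "MetaComBin": (0.787, 0.684),
}
for tool, (p, r) in scores.items():
    print(tool, round(f_measure(p, r), 3))
```

The recomputed values agree with the F-measure row to within about 0.001; small differences are expected because the precision and recall in the table are already rounded. The example also shows why AbundanceBin's high recall does not translate into a high F-measure: its low precision dominates the harmonic mean.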

4 Conclusion and future work

This study explored the combined use of complementary tools for metagenomics read binning, with the aim of improving the overall quality of the binning process in a realistic setting where some species have the same abundance ratio and the actual number of species is unknown. Our results on three datasets that were difficult to analyze with other popular read-binning tools showed that a combined framework can exploit the strengths of different read-binning approaches to obtain better clustering quality metrics than a single tool. Moreover, although we tested only two specific metagenomic read-binning tools, one based on abundance and one on overlaps (AbundanceBin and MetaProb), our framework can, in principle, combine any two tools with these characteristics.

Our analysis also pointed out some aspects that can be the subject of future studies. For example, the total number of clusters produced by the current pipeline is larger than the actual number of species. Some overestimation is expected, because estimating the number of clusters is a challenge in itself and the problem is common to all tools when this number is not given as input; nevertheless, it would be interesting to see whether, and to what extent, this estimate can be improved. We plan to investigate two directions: i) reducing the noise introduced by AbundanceBin, which does not take paired-end reads into account, either by developing a strategy that clusters paired-end reads together or by adopting other strategies for read reassignment; and ii) further merging “minor” clusters by adding a post-processing step to our pipeline.

Statements

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

FT: formal analysis, software, visualization, writing–original draft, writing–review and editing, and investigation. CP: conceptualization, formal analysis, methodology, supervision, writing–original draft, writing–review and editing, validation, and visualization.

Funding

The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. CP is supported in part by the project funded under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.4 - Call for tender No. 3138 of 16 December 2021, rectified by Decree n.3175 of 18 December 2021 of the Italian Ministry of University and Research funded by the European Union–NextGenerationEU. Award Number: Project code CN_00000033, Concession Decree No. 1034 of 17 June 2022 adopted by the Italian Ministry of University and Research, CUP D33C22000960007, project title “National Biodiversity Future Center - NBFC.” Open Access funding provided by Universitá degli Studi di Padova | University of Padua, Open Science Committee.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. A preliminary version was presented at ICCABS2023.

2. https://bitbucket.org/samu661/metaprob/src/master/

References

  • 1

    Andreace, F., Pizzi, C., and Comin, M. (2021). Metaprob 2: metagenomic reads binning based on assembly using minimizers and k-mers statistics. J. Comput. Biol. 28, 1052–1062. doi: 10.1089/cmb.2021.0270

  • 2

    Balvert, M., Luo, X., Hauptfeld, E., Schonhuth, A., and Dutilh, B. (2021). Ogre: overlap graph-based metagenomic read clustering. Bioinformatics 17, 905–912. doi: 10.1093/bioinformatics/btaa760

  • 3

    Bharti, R., and Grimm, D. (2021). Current challenges and best-practice protocols for microbiome analysis. Brief. Bioinform. 18, 178–193. doi: 10.1093/bib/bbz155

  • 4

    Comin, M., Di Camillo, B., Pizzi, C., and Vandin, F. (2021). Comparison of microbiome samples: methods and computational challenges. Brief. Bioinform. 22, 88–95. doi: 10.1093/bib/bbaa121

  • 5

    Felczykowska, A., Bloch, S., Nejman-Faleńczyk, B., and Barańska, S. (2012). Metagenomic approach in the investigation of new bioactive compounds in the marine environment. Acta Biochim. Pol. 59, 501–505. doi: 10.18388/abp.2012_2084

  • 6

    Girotto, S., Comin, M., and Pizzi, C. (2017a). Higher recall in metagenomic sequence classification exploiting overlapping reads. BMC Genomics 18 (Suppl. 10), 917. doi: 10.1186/s12864-017-4273-6

  • 7

    Girotto, S., Comin, M., and Pizzi, C. (2017b). Metagenomic reads binning with spaced seeds. Theor. Comput. Sci. 698, 88–99. doi: 10.1016/j.tcs.2017.05.023

  • 8

    Girotto, S., Pizzi, C., and Comin, M. (2016). Metaprob: accurate metagenomic reads binning based on probabilistic sequence signatures. Bioinformatics 32, i567–i575. doi: 10.1093/bioinformatics/btw466

  • 9

    Kang, D., Froula, J., Egan, R., and Wang, Z. (2015). Metabat, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165. doi: 10.7717/peerj.1165

  • 10

    Lindgreen, S., Adair, K., and Gardner, P. (2013). An evaluation of the accuracy and speed of metagenome analysis tools. Sci. Rep. 18, 19233. doi: 10.1038/srep19233

  • 11

    Mande, S., Mohammed, M., and Ghosh, T. (2012). Classification of metagenomic sequences: methods and challenges. Brief. Bioinform. 13, 669–681. doi: 10.1093/bib/bbs054

  • 12

    Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B., Bergman, N. H., Koren, S., et al. (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132. doi: 10.1186/s13059-016-0997-x

  • 13

    Pellegrina, L., Pizzi, C., and Vandin, F. (2020). Fast approximation of frequent k-mers and applications to metagenomics. J. Comput. Biol. 27, 1534–1549. doi: 10.1089/cmb.2019.0314

  • 14

    Sczyrba, A., Hofmann, P., Belmann, P., Koslicki, D., Janssen, S., Dröge, J., et al. (2017). Critical assessment of metagenome interpretation - a benchmark of metagenomics software. Nat. Methods 14, 1063–1071. doi: 10.1038/nmeth.4458

  • 15

    Shreiner, A., Kao, J., and Young, V. (2015). The gut microbiome in health and in disease. Curr. Opin. Gastroenterol. 31, 69–75. doi: 10.1097/mog.0000000000000139

  • 16

    Staley, J., and Konopka, A. (1985). Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terrestrial habitats. Acta Biochim. Pol. 39, 321–346. doi: 10.1146/annurev.mi.39.100185.001541

  • 17

    Vinh, L. V., Lang, T. V., Binh, L. T., and Hoang, T. V. (2015). A two-phase binning algorithm using l-mer frequency on groups of non-overlapping reads. Algorithms Mol. Biol. 10, 2. doi: 10.1186/s13015-014-0030-4

  • 18

    Wang, Y., Leung, H., Yiu, S., and Chin, F. (2012). Metacluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics 28, i356–i362. doi: 10.1093/bioinformatics/bts397

  • 19

    Wood, D., and Salzberg, S. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46. doi: 10.1186/gb-2014-15-3-r46

  • 20

    Wu, Y., and Yuzhen, Y. (2011). A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J. Comput. Biol. 18, 523–534. doi: 10.1089/cmb.2010.0245

Summary

Keywords

metagenomics, reads binning, abundance, overlap, k-mers, clustering

Citation

Tomasella F and Pizzi C (2025) MetaComBin: combining abundances and overlaps for binning metagenomics reads. Front. Bioinform. 5:1504728. doi: 10.3389/fbinf.2025.1504728

Received

01 October 2024

Accepted

27 January 2025

Published

03 March 2025

Volume

5 - 2025

Edited by

David W. Ussery, University of Arkansas for Medical Sciences, United States

Reviewed by

Youtao Lu, University of Pennsylvania, United States

Kunihiko Sadakane, The University of Tokyo, Japan

Copyright

*Correspondence: Cinzia Pizzi,
