Extending Association Rule Mining to Microbiome Pattern Analysis: Tools and Guidelines to Support Real Applications

Boosted by the exponential growth of microbiome-based studies, the analysis of microbiome patterns is now a hot topic with applications in many fields. In particular, machine learning techniques are increasingly used in microbiome studies, providing deep insights into microbial community composition. In this context, to investigate microbial patterns from 16S rRNA metabarcoding data, we explored the effectiveness of Association Rule Mining (ARM), a rule-based machine learning technique, for extracting patterns (in this work, groups of species or taxa) from microbiome data. ARM can generate huge amounts of output, which makes removing spurious information and visualizing results challenging. Our work sheds light on the strengths and weaknesses of pattern mining in the study of microbial patterns, in particular from 16S rRNA microbiome datasets, by applying ARM to real case studies and providing guidelines for future use. Our results highlight issues related to the type of input and the use of metadata in microbial pattern extraction, and identify the key steps that must be considered to apply ARM knowledgeably to 16S rRNA microbiome data. To promote the use of ARM and the visualization of microbiome patterns, we developed microFIM (microbial Frequent Itemset Mining), a versatile Python tool that facilitates the use of ARM by integrating common microbiome outputs, such as taxa tables. microFIM implements interest measures to remove spurious information and merges the results of ARM analysis with common microbiome outputs, providing strategies similar to those already used in microbiome analysis, which help scientists integrate ARM into microbiome applications. With this work, we aim to build a bridge between microbial ecology researchers and ARM, making researchers aware of the strengths and weaknesses of the association rule mining approach.
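To make the notion of a pattern concrete, the following is a minimal sketch of frequent itemset mining on a presence/absence view of a taxa table. It is not the microFIM implementation: the function name `frequent_itemsets`, the toy samples, and the brute-force enumeration are illustrative assumptions. Only the idea of keeping taxa groups above a minimum support and within a length range is taken from the text.

```python
from itertools import combinations


def frequent_itemsets(samples, min_support, min_len, max_len):
    """Return {itemset: support} for groups of taxa found in enough samples.

    samples: list of sets, each holding the taxa detected in one sample.
    Support is the fraction of samples that contain every taxon of the itemset.
    Brute-force enumeration: acceptable for a toy table, too slow for real data.
    """
    n_samples = len(samples)
    taxa = sorted(set().union(*samples))
    result = {}
    for size in range(min_len, min(max_len, len(taxa)) + 1):
        for candidate in combinations(taxa, size):
            members = set(candidate)
            count = sum(1 for sample in samples if members <= sample)
            support = count / n_samples
            if support >= min_support:
                result[candidate] = support
    return result


# Hypothetical toy data: 4 samples, 3 taxa (not taken from the study).
toy_samples = [
    {"Bacteroides", "Bifidobacterium", "Escherichia"},
    {"Bacteroides", "Bifidobacterium"},
    {"Bacteroides", "Escherichia"},
    {"Bifidobacterium"},
]
patterns = frequent_itemsets(toy_samples, min_support=0.5, min_len=2, max_len=3)
for itemset, support in patterns.items():
    print(itemset, support)
```

In this toy table, only the two pairs that include Bacteroides reach a support of 0.5; the remaining pair and the full triplet occur in a single sample and are discarded. Practical tools typically replace the exhaustive enumeration with an algorithm such as Apriori or FP-growth.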


Supplementary Table 1
Table describing simulated dataset 1, composed of 5 taxa and 10 samples (CSV format).

Supplementary Table 2
Table describing simulated dataset 2, composed of 5 taxa and 10 samples (CSV format).

Supplementary Table 3
ECAM taxa table obtained directly from QIIME2 datasets (Bolyen et al., 2019), in which only taxa assigned to the genus level, with a relative abundance > 0.1% in more than 15% of samples, are considered (TSV format).
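The abundance and prevalence filter described for the ECAM taxa table can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the table: the function name `filter_taxa`, the nested-dictionary input layout, and the toy counts are hypothetical, and the restriction to genus-level assignments is not covered.

```python
def filter_taxa(counts, min_rel_abundance=0.001, min_prevalence=0.15):
    """Keep taxa whose relative abundance exceeds min_rel_abundance
    in more than min_prevalence of the samples.

    counts: {sample_id: {taxon: read_count}} (assumed layout).
    """
    n_samples = len(counts)
    hits = {}  # taxon -> number of samples in which it passes the abundance cutoff
    for taxa_counts in counts.values():
        total = sum(taxa_counts.values())
        if total == 0:
            continue  # an empty sample contributes no hits
        for taxon, count in taxa_counts.items():
            if count / total > min_rel_abundance:
                hits[taxon] = hits.get(taxon, 0) + 1
    return {taxon for taxon, n_hits in hits.items() if n_hits / n_samples > min_prevalence}


# Hypothetical toy counts for 10 samples (not taken from the ECAM dataset).
toy_counts = {
    f"s{i}": {
        "Bacteroides": 9990,                     # abundant everywhere
        "Bifidobacterium": 200 if i < 3 else 0,  # abundant in 3 of 10 samples
        "Veillonella": 500 if i == 0 else 0,     # abundant in 1 of 10 samples
        "Rothia": 5,                             # always below 0.1%
    }
    for i in range(10)
}
kept = filter_taxa(toy_counts)
print(sorted(kept))
```

With the default thresholds (0.1% relative abundance, 15% of samples), Veillonella is dropped because it passes the abundance cutoff in only one of ten samples, and Rothia is dropped because it never passes it.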

Supplementary Table 4
Family ECAM taxa table obtained by collapsing the ECAM dataset (Supplementary Table 3; Bolyen et al., 2019) to the family level via QIIME2 plugins (https://github.com/qiime2/q2-taxa) (TSV format).

Supplementary Table 5
Genus ECAM taxa table obtained by collapsing the ECAM dataset (Supplementary Table 3; Bolyen et al., 2019), consisting only of taxa with complete taxonomy at the genus level (TSV format).

Supplementary Table 6
Pattern table generated by running microFIM on simulated dataset 1 (Supplementary Table 1) with a minimum support of 0.3, a minimum length of 2 and a maximum length of 10 (CSV format).

Supplementary Table 7
Pattern table generated performing microFIM on simulated dataset 2 (Supplementary Table 2) with the minimum support of 0.3, a minimum length of 2 and a maximum length of 10 (CSV format).

Supplementary Table 8
Table generated performing microFIM on ECAM dataset (Supplementary Table 3) with a minimum support of 0.2, a minimum length of 3 and a maximum length of 15 (CSV format).

Supplementary Table 9
Pattern table generated by running microFIM on the ECAM dataset at the family level (Supplementary Table 4) with a minimum support of 0.2, a minimum length of 3 and a maximum length of 15 (CSV format).

Supplementary Table 10
Pattern table generated by running microFIM on the ECAM dataset at the genus level (Supplementary Table 5) with a minimum support of 0.2, a minimum length of 3 and a maximum length of 15 (CSV format).

Supplementary Figure 11
Heatmap representing the Jaccard distance matrix generated via the microFIM visualization phase on the ECAM dataset, considering Input 3 and samples belonging to the first sampling date.
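For readers unfamiliar with the Jaccard distance underlying such a heatmap, a minimal sketch follows. The caption does not state which objects are compared, so the sketch assumes that samples are compared by the sets of patterns they contain; the function names, sample identifiers, and pattern labels are hypothetical and do not reflect the microFIM code.

```python
def jaccard_distance(a, b):
    """Jaccard distance 1 - |a & b| / |a | b|; defined here as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(items_by_sample):
    """Return sorted sample ids and the square matrix of pairwise Jaccard distances."""
    ids = sorted(items_by_sample)
    matrix = [
        [jaccard_distance(items_by_sample[row], items_by_sample[col]) for col in ids]
        for row in ids
    ]
    return ids, matrix


# Hypothetical example: patterns P1..P3 detected in three samples.
toy_patterns = {
    "s1": {"P1", "P2"},
    "s2": {"P2", "P3"},
    "s3": {"P1", "P2"},
}
ids, matrix = distance_matrix(toy_patterns)
for sample_id, row in zip(ids, matrix):
    print(sample_id, [round(value, 2) for value in row])
```

Samples s1 and s3 share all of their patterns and are at distance 0, whereas s2 shares one of three patterns with each of them and is at distance 2/3. A matrix of this kind is what a heatmap would display.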

Supplementary Table 12
The

Supplementary File 13a
Table generated performing microFIM on genus ECAM dataset considering samples with antibiotic administration. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).

Supplementary File 13b
Table generated performing microFIM on genus ECAM dataset considering samples with no antibiotic administration. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).

Supplementary File 13d
Table generated performing microFIM on genus ECAM dataset considering samples with cesarean delivery metadata. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).

Supplementary Table 14
The

Supplementary File 15
Ravel (2011) taxa tables and datasets obtained by filtering with metadata (Nugent score equal to low and high) are available in Supplementary File 15 (zip). Below is a description of all the files included.

Supplementary File 15a
Taxa

Supplementary File 15b
Taxa

Supplementary File 15c
Taxa

Supplementary File 15d
Table generated performing microFIM on Supplementary File 15a. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).

Supplementary File 15e
Table generated performing microFIM on Supplementary File 15b. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).

Supplementary File 15f
Table generated performing microFIM on Supplementary File 15c. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).

Supplementary File 15g
Table generated by running microFIM on the genus Ravel dataset considering samples with Nugent score equal to low. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).

Supplementary File 15h
Table generated by running microFIM on the genus Ravel dataset considering samples with Nugent score equal to high. A minimum support of 0.2, a minimum length of 3 and a maximum length of 15 were used (CSV format).