# STATISTICAL AND COMPUTATIONAL METHODS FOR MICROBIOME MULTI-OMICS DATA

EDITED BY: Himel Mallick, Vanni Bucci and Lingling An

PUBLISHED IN: Frontiers in Genetics

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-091-9 DOI 10.3389/978-2-88966-091-9

# STATISTICAL AND COMPUTATIONAL METHODS FOR MICROBIOME MULTI-OMICS DATA

Topic Editors:

Himel Mallick, Merck (United States), United States

Vanni Bucci, University of Massachusetts Dartmouth, United States

Lingling An, University of Arizona, United States

Image: paulista/Shutterstock.com

Citation: Mallick, H., Bucci, V., An, L., eds. (2020). Statistical and Computational Methods for Microbiome Multi-Omics Data. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-091-9

# Table of Contents

*Editorial: Statistical and Computational Methods for Microbiome Multi-Omics Data*

Himel Mallick, Vanni Bucci and Lingling An


Zheng-Zheng Tang, Guanhua Chen, Qilin Hong, Shi Huang, Holly M. Smith, Rachana D. Shah, Matthew Scholz and Jane F. Ferguson


Yi-Hui Zhou and Paul Gallins


Antoine Bodein, Olivier Chapleur, Arnaud Droit and Kim-Anh Lê Cao

*Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities*

Duo Jiang, Courtney R. Armour, Chenxiao Hu, Meng Mei, Chuan Tian, Thomas J. Sharpton and Yuan Jiang


Kyle M. Carter, Meng Lu, Hongmei Jiang and Lingling An

# Editorial: Statistical and Computational Methods for Microbiome Multi-Omics Data

Himel Mallick<sup>1</sup>\*, Vanni Bucci<sup>2</sup> and Lingling An<sup>3,4,5</sup>

*<sup>1</sup> Biostatistics and Research Decision Sciences, Merck & Co., Inc., Rahway, NJ, United States, <sup>2</sup> Department of Microbiology and Physiological Systems, University of Massachusetts Medical School, Worcester, MA, United States, <sup>3</sup> Interdisciplinary Program in Statistics and Data Science, The University of Arizona, Tucson, AZ, United States, <sup>4</sup> Department of Epidemiology and Biostatistics, The University of Arizona, Tucson, AZ, United States, <sup>5</sup> Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States*

Keywords: microbiome, metagenomics, metabolomics, multi-omics, biostatistics, computational biology, statistical genomics, data science

**Editorial on the Research Topic**

#### **Statistical and Computational Methods for Microbiome Multi-Omics Data**

There has never been a more exciting time to do microbiome research thanks to the recent completion of several population-scale, longitudinal multi-omics studies including the NIH integrative human microbiome project (iHMP; iHMP Consortium, 2019) that have facilitated a multitude of new avenues of research for future investigations. These breakthroughs utilizing multiple 'omics technologies have paved the way toward investigating biological systems at an unprecedented level of detail, allowing a simultaneous assessment of community function, dynamics, and biochemical signatures across diverse disease states and environments. The field of microbiome multi-omics, however, has not yet reached the maturity attained in other established molecular epidemiology fields such as cancer biomarker discovery and genome-wide association studies (Mallick et al., 2017). As a result, it remains wide open to an in-depth exploration of new analytical methods in order to make the leap from bench to bedside.

This Research Topic is a timely endeavor toward this goal to expand our knowledge on systems biology approaches in understanding microbial communities. Due to the complexity of the associated data, the downstream analysis of microbiome multi-omics remains challenging. While most of the initial studies focused on analyzing single omics (e.g., taxonomic or functional profiles), there has been a shift in the field toward the concurrent investigation of the microbiome and host phenotypes (e.g., metabolomics and host transcriptomics). To this end, many of the articles in this Research Topic focus on new ways to analyze and integrate multi-table data using cutting-edge statistical and computational methods.

Sankaran and Holmes revisit the large body of literature and algorithms already available for multi-table data analysis. They review both the algorithmic foundations and practical applications of a wide range of analysis approaches and re-evaluate these paradigms with respect to heterogeneity, dimensionality, and sparsity in a fully reproducible setup. In a similar vein, Bodein et al. propose a computational framework, based on smoothing splines and multivariate dimension reduction methods, to integrate longitudinal microbiome data with other omics and clinical data generated on the same biological specimens. Both constitute critical contributions to the field, given the growing prevalence of multi-table datasets and the complexity of related study designs, including dietary, pharmaceutical, clinical, and environmental covariates, often with samples from multiple time points or tissues.

#### Edited and reviewed by:

*Simon Charles Heath, Center for Genomic Regulation (CRG), Spain*

> \*Correspondence: *Himel Mallick himel.mallick@merck.com*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *05 July 2020* Accepted: *24 July 2020* Published: *25 August 2020*

#### Citation:

*Mallick H, Bucci V and An L (2020) Editorial: Statistical and Computational Methods for Microbiome Multi-Omics Data. Front. Genet. 11:927. doi: 10.3389/fgene.2020.00927*

Many important questions on microbiome multi-omics data integration remain unaddressed, especially those relating to extracting disease-relevant mechanistic networks that can provide insight into the complex web of host-microbiome interactions. Jiang et al. extensively review statistical aspects of relevant microbiome multi-omics network analysis methods by demystifying each class of methods with respect to their practical applicability and biological interpretability. Zhou and Gallins present a tutorial overview of commonly used machine learning methods for microbiome host trait prediction, accompanied by validated R/Python implementations. The open-access source code from these publications not only provides an important resource for algorithm developers but also ensures widespread usage and impact of these methods, facilitating future methodological advances.

Moving beyond routine univariate analysis methods that ignore the correlations between features, Banerjee et al. take a multivariate approach to differential abundance analysis by jointly modeling all features in a set while maintaining the correct type I error and high power, which is not trivial for many existing per-feature methods (McMurdie and Holmes, 2014; Mandal et al., 2015; Jonsson et al., 2016, 2017; Thorsen et al., 2016; Mallick et al., 2017; Weiss et al., 2017; Hawinkel et al., 2019). Koh et al. introduce a distance-based kernel association test for family-based or longitudinal microbiome studies to associate microbial community composition with any type of host traits based on the generalized linear mixed model, vastly expanding the capability to incorporate non-Gaussian host traits as well as multiple kernels.

Quantitative methods for microbiome multi-omics are by no means limited to downstream analysis of targeted amplicon-based and metagenomic profiling. This Research Topic also contains papers addressing important questions in upstream data processing and quantitative microbiome profiling. For instance, Song et al. focus on the comparison of metagenomic samples using alignment-free methods with reads binning and conclude that alignment-free and alignment-based methods for metagenome comparison complement each other and should be used interactively to understand the dynamics of microbial communities. Yoon et al. estimate feature-feature correlations and partial correlations from robust measurements of microbial cell counts, in particular flow cytometry, and validate the results in a recent quantitative gut microbiome dataset, ensuring both statistical rigor and biological relevance.

Several articles in the Research Topic go beyond integrating multiple omics datasets to establishing causation and molecular mechanism, with an emphasis on methods that aim to detect microbiome-mediated signals through causal mediation analysis. While existing methods in this space make strong parametric assumptions, which can be quite detrimental when the assumptions are violated, Carter et al. turn to nonparametric entropy models to detect significant mediation effects in the presence of high-dimensional exposures and mediators. Tang et al. utilize state-of-the-art microbiome compositional mediation analysis procedures to investigate the diet-microbiome-metabolome interaction in cross-sectional multi-omics samples from healthy subjects. Both these analyses estimate the total mediation effects of microbiome composition, as well as feature-specific mediation effects, providing additional mechanistic insights above and beyond a direct causal relationship.

Taken together, the papers in this Research Topic represent both an incredible amount of progress and an enormous potential for further advances in the near future. As a result, we have launched a second edition of the Research Topic, to which we will continue to add methods, research, and review articles over the next year or so.

#### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

#### ACKNOWLEDGMENTS

We thank the Frontiers editorial staff for providing outstanding assistance in putting together this Research Topic collection.


**Conflict of Interest:** HM is employed by Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Kenilworth, NJ, USA.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Mallick, Bucci and An. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# An Adaptive Multivariate Two-Sample Test With Application to Microbiome Differential Abundance Analysis

Kalins Banerjee<sup>1</sup>, Ni Zhao<sup>2</sup>, Arun Srinivasan<sup>3</sup>, Lingzhou Xue<sup>3</sup>, Steven D. Hicks<sup>4</sup>, Frank A. Middleton<sup>5</sup>, Rongling Wu<sup>1</sup> and Xiang Zhan<sup>1</sup>\*

*<sup>1</sup> Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States, <sup>2</sup> Department of Biostatistics, Johns Hopkins University, Baltimore, MD, United States, <sup>3</sup> Department of Statistics, Pennsylvania State University, University Park, PA, United States, <sup>4</sup> Department of Pediatrics, Pennsylvania State University, Hershey, PA, United States, <sup>5</sup> Department of Neuroscience, State University of New York Upstate Medical University, Syracuse, NY, United States*

Differential abundance analysis is a crucial task in many microbiome studies, where the central goal is to identify microbiome taxa associated with certain biological or clinical conditions. There are two different modes of microbiome differential abundance analysis: the individual-based univariate differential abundance analysis and the group-based multivariate differential abundance analysis. The univariate analysis identifies differentially abundant microbiome taxa subject to multiple testing correction under a statistical error measure such as the false discovery rate, which is typically complicated by the high dimensionality of taxa and the complex correlation structure among taxa. The multivariate analysis evaluates the overall shift in the abundance of microbiome composition between two conditions, which provides useful preliminary information on whether follow-up validation studies are necessary. In this paper, we present a novel Adaptive multivariate two-sample test for Microbiome Differential Analysis (AMDA) to examine whether the composition of a taxa-set is different between two conditions. Our simulation studies and real data applications demonstrated that the AMDA test was often more powerful than several competing methods while preserving the correct type I error rate. A free implementation of our AMDA method in R software is available at https://github.com/xyz5074/AMDA.

Keywords: adaptive microbiome differential analysis (AMDA), maximum mean discrepancy (MMD), multivariate two-sample test, permutation, subset testing, taxa-set

# 1. INTRODUCTION

The human microbiome, defined as the aggregate of microorganisms that reside on or within human tissues and biofluids, has recently gained substantial scientific interest due to its vital role in many human health and disease conditions, including but not limited to obesity (Turnbaugh et al., 2009), type 2 diabetes (Qin et al., 2012), rheumatoid arthritis (Zhang et al., 2015), inflammatory bowel disease (Morgan et al., 2015), bacterial vaginosis (Mitchell et al., 2017), and colorectal cancer (Louis et al., 2014). High-throughput sequencing technologies have revolutionized microbiome research by allowing culture-free profiling of entire microbial communities. For the most part, 16S rRNA gene amplicon sequencing and metagenomic shotgun sequencing are routinely used for quantitative characterization of microbiome composition (Wang and Jia, 2016). Although data produced by high-throughput sequencing have proven extremely useful for quantifying microbiome composition, appropriate analysis of such data remains computationally and statistically challenging due to technical aspects of the data, including high dimensionality, count or compositional data structure, sparsity (zero-inflation), and over-dispersion, among others.

#### Edited by:

*Lingling An, University of Arizona, United States*

#### Reviewed by:

*Michael B. Sohn, University of Rochester, United States*

*Hongmei Jiang, Northwestern University, United States*

> \*Correspondence: *Xiang Zhan xyz5074@psu.edu*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *08 January 2019* Accepted: *01 April 2019* Published: *24 April 2019*

#### Citation:

*Banerjee K, Zhao N, Srinivasan A, Xue L, Hicks SD, Middleton FA, Wu R and Zhan X (2019) An Adaptive Multivariate Two-Sample Test With Application to Microbiome Differential Abundance Analysis. Front. Genet. 10:350. doi: 10.3389/fgene.2019.00350*

In many microbiome studies, investigators are interested in how the abundance of the microbiome is related to clinical characteristics of the samples, such as health/disease status, smoking status, or dietary habit (high-calorie or low-calorie). That is, many studies attempt to detect differentially abundant microbiome features (species/OTUs) between two predefined classes of samples, where a microbiome feature is considered differentially abundant if its mean proportion is significantly different between the two conditions. This type of analysis can improve understanding of the pathology of a disease from a microbiome perspective and potentially lead to preventive or therapeutic strategies (Virgin and Todd, 2011). Microbiome differential abundance analysis (MDA) is a direct analog of differential expression analysis for gene expression and RNA-seq data. However, the distinct nature of microbiome data renders classic differential expression analysis methods such as DESeq (Anders and Huber, 2010) and edgeR (Robinson et al., 2010) inappropriate for microbiome data (McMurdie and Holmes, 2014; Weiss et al., 2017). Thus, new statistical methods for microbiome differential abundance analysis are needed.

Similar to individual gene-based and pathway-based differential expression analysis, there are two types of microbiome differential analyses: individual taxon-based univariate analysis and taxa set-based multivariate analysis. Along with the recent surge of scientific interest in microbiome studies, many statistical methods for microbiome differential analysis have been proposed (Sohn et al., 2015; Zhao et al., 2015; Zhang et al., 2016; Chen et al., 2017). Most of them focus on examining whether a single taxon is differentially abundant between two conditions, followed by a multiple testing correction applied to the individual taxon p-values (e.g., the Benjamini-Hochberg/BH procedure, Benjamini and Hochberg, 1995). Control of the False Discovery Rate (FDR) is necessary, as an excess of false discoveries may lead to costly follow-up validation studies on false positive taxa, which are in fact not differentially abundant. Despite their potential usefulness in identifying differentially abundant taxa, these individual analyses may suffer from the following inherent limitations. First, the type I error of an individual microbiome differential analysis may not be correct (Hawinkel et al., 2017). The BH procedure and its variants can control FDR when individual tests are independent or satisfy positive dependence assumptions (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001), whereas negative correlation among taxa abundances is common in microbiome data, especially for compositional data. It is possible that these BH procedures (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001) fail to control FDR in the presence of negative correlations (Hawinkel et al., 2017). Second, the high dimensionality of microbiome data increases the multiple testing correction burden of individual analyses, which reduces the power to detect differentially abundant taxa. Third, as widely observed in the literature, the performance of most individual microbiome differential analysis methods relies heavily on the normalization and/or transformation used, leading to challenges in independent replication studies (McMurdie and Holmes, 2014; Sohn et al., 2015; Weiss et al., 2017).
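For readers less familiar with the BH step-up procedure referred to above, the following is a minimal Python sketch of it. The function name `benjamini_hochberg` and its arguments are illustrative choices made here, not part of any method in this paper, and the sketch assumes the per-taxon p-values have already been computed by some univariate test.

```python
import numpy as np


def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array marking which hypotheses BH rejects at level alpha.

    Illustrative helper (hypothetical name); input is one p-value per taxon.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * i / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: find the largest rank whose p-value is under its
        # threshold and reject every hypothesis up to and including that rank.
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

Because the thresholds shrink as the number of taxa `m` grows, a larger taxa panel makes each individual rejection harder, which is the multiple testing burden described in the second limitation.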

An alternative approach to taxon-level microbiome differential analysis is to compare the microbiome composition at the level of taxa-set. Examples of such a taxa set can be either a group of OTUs belonging to the same upper-level taxonomic rank (e.g., phylum, class, order, family, or genus) or even all OTUs in the microbiome community. The multivariate-type microbiome differential analysis usually gains power by reducing the multiple testing correction burden and aggregating modest effects across multiple taxa. Moreover, the multivariate analysis is typically less sensitive to normalization/transformation compared to individual analysis as it has a much larger analysis unit. Motivated by this, many statistical methods for microbiome community-level analysis have been recently proposed (McArdle and Anderson, 2001; Zhao et al., 2015; Tang et al., 2016, 2017; Plantinga et al., 2017; Zhan et al., 2017a).

Despite the potential power gain, a major critique of these existing multivariate microbiome analyses (e.g., differential analysis) is that the result of the test is global and cannot identify the specific taxa in the taxa-set that are differentially abundant. Besides limiting the interpretation of results, this may also reduce the power of the test when the taxa-set contains many taxa that are not differentially abundant (Cao et al., 2017). To enhance both the interpretation and the power of existing multivariate analysis in the framework of MDA, we propose a two-stage Adaptive Microbiome Differential Analysis (AMDA) procedure, which first selects putative taxa that are more likely to be differentially abundant between two conditions, and then examines the differential abundance of the selected taxa-set with a multivariate two-sample test using Maximum Mean Discrepancy (MMD) (Gretton et al., 2007, 2012). Since the test is applied to a subset of taxa that are more likely to be differentially abundant, permutations are used to establish statistical significance and avoid an inflated type I error. Although AMDA is a set-based multivariate test that does not aim to identify individual differentially abundant microbial taxa, its intermediate testing subset selection procedure can provide useful information regarding the importance of individual taxa in the taxa-set. Simulation studies and real data applications demonstrate the potential usefulness of the newly proposed AMDA method and show its superior performance over existing methods across a wide range of scenarios.

# 2. MATERIALS AND METHODS

## 2.1. Data and Normalization

Assume that we have measured the microbiome abundances of a community of $p$ taxa from $n (= n\_1 + n\_2)$ samples collected from two groups of sizes $n\_1$ and $n\_2$, respectively. Here, the term community refers to a taxa-set, which typically consists of taxa from the same taxonomic rank such as genus, family, phylum, or bacteria kingdom. Let $\mathbf{X}^{(k)} = (X\_1^{(k)}, \ldots, X\_{n\_k}^{(k)})^{T}$ be the observed $n\_k \times p$ OTU matrix for group $k$ ($k = 1, 2$), where $X\_i^{(k)}$ ($i = 1, \ldots, n\_k$; $k = 1, 2$) represents a $p \times 1$ microbiome composition vector (subject to appropriate normalization or transformation). Suppose that $X\_1^{(k)}, \ldots, X\_{n\_k}^{(k)}$ ($k = 1, 2$) are two independent samples from $p$-dimensional multivariate distributions with mean parameters $\mu^{(1)}$ and $\mu^{(2)}$, respectively. In many practical problems, the hypothesis of interest is whether microbiome abundances differ between the two conditions, that is,

$$H\_0: \mu^{(1)} = \mu^{(2)} \text{ vs. } H\_1: \mu^{(1)} \neq \mu^{(2)}.\tag{1}$$

For microbiome data, because the amount of DNA-yielding material varies across samples, the number of sequencing reads can vary greatly from sample to sample. Normalizing the raw sequencing read counts to relative abundances makes microbial abundances comparable across samples. Therefore, it is common practice to analyze high-dimensional microbiome compositional data with a unit sum (Li, 2015). As such, applying standard statistical methods developed for unconstrained data to microbiome compositional data is usually underpowered and can sometimes yield inappropriate results (Cao et al., 2017; Weiss et al., 2017).

A popular approach to relax the compositional constraint of microbiome data is to perform the statistical analysis through log-ratio transformations (Aitchison, 1982). In particular, the centered log-ratio transformation has been widely used among the various forms of log-ratio transformations (Cao et al., 2017; Zhao et al., 2018). Specifically, the centered log-ratio transformation $Z\_{ij}^{(k)}$ of the microbiome relative abundance $X\_{ij}^{(k)}$ is defined as

$$Z\_{ij}^{(k)} = \log \left( \frac{X\_{ij}^{(k)}}{(\prod\_{l=1}^{p} X\_{il}^{(k)})^{1/p}} \right), \quad i = 1, \ldots, n\_k, \; j = 1, \ldots, p, \; k = 1, 2. \tag{2}$$

To avoid a zero relative abundance in Equation (2), a zero count is commonly replaced by a pseudo count of 0.5 before the relative abundance normalization and centered log-ratio transformation (Li, 2015; Cao et al., 2017). For community-based multivariate differential abundance analysis, it has been shown that testing the equality of two compositional vectors is equivalent to testing $H\_0': \mu\_Z^{(1)} = \mu\_Z^{(2)}$ (Cao et al., 2017), where $\mu\_Z^{(k)}$ is the mean of the centered log-ratio transformed compositional vector $Z\_i^{(k)}$, $i = 1, \ldots, n\_k$ and $k = 1, 2$. We develop our AMDA method based on these centered log-ratio transformed relative abundances in the rest of this paper.
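The preprocessing described in this section (pseudo count, relative abundance normalization, centered log-ratio transformation) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the AMDA implementation: the function name `clr_transform` is hypothetical, and the input is assumed to be a samples-by-taxa matrix of raw read counts.

```python
import numpy as np


def clr_transform(counts, pseudo_count=0.5):
    """Centered log-ratio transform of a samples-by-taxa count matrix.

    Illustrative helper (hypothetical name). Zero counts are replaced by
    `pseudo_count` before normalizing each sample to unit sum.
    """
    counts = np.asarray(counts, dtype=float)
    counts = np.where(counts == 0, pseudo_count, counts)  # avoid log(0)
    rel = counts / counts.sum(axis=1, keepdims=True)      # relative abundances
    log_rel = np.log(rel)
    # Subtracting the per-sample mean of the logs is the same as dividing
    # by the geometric mean inside the logarithm, as in Equation (2).
    return log_rel - log_rel.mean(axis=1, keepdims=True)
```

Each transformed sample sums to zero across taxa, and the result does not depend on the sequencing depth of the sample once zeros have been replaced.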

#### 2.2. A Multivariate Two-Sample Test Using Maximum Mean Discrepancy

Two-sample testing of the equality of two high-dimensional means has been well studied in the statistical literature (Bai and Saranadasa, 1996; Chen et al., 2010; Cai et al., 2014). These methods are typically not applicable to MDA for two reasons. First, existing methods usually assume normal data, which is not the case for microbiome compositional data. It has been observed that classic statistical methods developed for multivariate Gaussian data may fail for microbiome compositional data (Li, 2015; Cao et al., 2017; Zhao et al., 2018). Second, most existing methods require estimating the covariance matrix. Given the small or modest sample size of a typical microbiome study, the relatively large estimation error of the covariance matrix is likely to degrade the performance of the two-sample test, as observed in microbiome association tests (Zhan et al., 2017b, 2018).

An alternative approach to testing the hypothesis in Equation (1) is to use a non-parametric test that does not require estimating the covariance matrix. One such test is the kernel-based maximum mean discrepancy (MMD) test (Gretton et al., 2007, 2012), originally proposed to examine whether the underlying distributions of two samples are identical. An MMD test first maps the two distributions into a reproducing kernel Hilbert space (RKHS); the maximum mean discrepancy between the two distributions is then defined as the distance between their corresponding images in the RKHS. A useful property of MMD is that it is zero if and only if the two distributions are identical, provided the RKHS is sufficiently rich (i.e., contains a large enough class of functions). Since the test can be used to examine the equality of two multivariate distributions, it suffices for testing Equation (1), that is, for examining the equality of the mean parameters of the two underlying distributions.

In particular, the MMD statistic between two independent samples $X\_1^{(1)}, \ldots, X\_{n\_1}^{(1)}$ and $X\_1^{(2)}, \ldots, X\_{n\_2}^{(2)}$ is defined as

$$\begin{split} \text{MMD}^2 &= \frac{1}{n\_1^2} \sum\_{i=1}^{n\_1} \sum\_{j=1}^{n\_1} k(X\_i^{(1)}, X\_j^{(1)}) + \frac{1}{n\_2^2} \sum\_{i=1}^{n\_2} \sum\_{j=1}^{n\_2} k(X\_i^{(2)}, X\_j^{(2)}) \\ &- \frac{2}{n\_1 n\_2} \sum\_{i=1}^{n\_1} \sum\_{j=1}^{n\_2} k(X\_i^{(1)}, X\_j^{(2)}), \end{split} \tag{3}$$

where $k(\cdot, \cdot)$ is a characteristic kernel (Gretton et al., 2007, 2012), that is, a kernel whose RKHS is sufficiently large that MMD is zero if and only if the two samples are from the same underlying distribution. Examples of characteristic kernels include the Gaussian kernel and the Laplace kernel. Under the null hypothesis of identical distributions, the population-level $\text{MMD}^2$ statistic is zero, and thus a larger $\text{MMD}^2$ statistic indicates a larger discrepancy between the two distributions. Asymptotically, $\text{MMD}^2$ follows a mixture of $\chi\_1^2$ distributions (Gretton et al., 2007, 2012). As observed in the literature, this asymptotic mixture of $\chi\_1^2$ distributions is typically not accurate for a statistic calculated from a small sample, as frequently encountered in microbiome studies (Chen et al., 2016; Zhan et al., 2017b, 2018). A more accurate approach to establishing significance is resampling (e.g., permuting the group label of each observation) (Wu et al., 2016).
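As an illustration of Equation (3) combined with permutation-based significance, the sketch below computes the biased $\text{MMD}^2$ statistic with a Gaussian kernel and estimates a p-value by permuting group labels. It is not the authors' R implementation. The function names, the default of 999 permutations, and the median heuristic used to choose the kernel bandwidth are all assumptions made here for the example.

```python
import numpy as np


def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq_norms = (Z ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off


def mmd2_from_kernel(K, idx1, idx2):
    """Biased MMD^2 of Equation (3) from a precomputed kernel matrix K."""
    k11 = K[np.ix_(idx1, idx1)].mean()  # (1/n1^2) * sum of k within group 1
    k22 = K[np.ix_(idx2, idx2)].mean()  # (1/n2^2) * sum of k within group 2
    k12 = K[np.ix_(idx1, idx2)].mean()  # (1/(n1*n2)) * sum of k across groups
    return k11 + k22 - 2.0 * k12


def mmd_permutation_test(X1, X2, n_perm=999, bandwidth=None, seed=0):
    """Return (observed MMD^2, permutation p-value) for two samples."""
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    Z = np.vstack([X1, X2])
    n1, n = X1.shape[0], Z.shape[0]
    d2 = squared_distances(Z)
    if bandwidth is None:
        # Median heuristic: an assumption of this sketch, not from the paper.
        positive = d2[d2 > 0]
        bandwidth = np.sqrt(np.median(positive)) if positive.size else 1.0
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))  # Gaussian kernel matrix
    observed = mmd2_from_kernel(K, np.arange(n1), np.arange(n1, n))
    rng = np.random.default_rng(seed)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)  # shuffle group labels
        if mmd2_from_kernel(K, perm[:n1], perm[n1:]) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return observed, (exceed + 1) / (n_perm + 1)
```

The kernel matrix is computed once on the pooled sample and only re-indexed inside the loop, so each permutation costs a few matrix averages rather than a fresh kernel evaluation.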

# 2.3. An Adaptive Two-Sample Test for Microbiome Differential Abundance Analysis

A limitation of the aforementioned MMD test is that it utilizes information in all dimensions equally. When the signal is sparse, the MMD test typically has low power because of the high degrees of freedom spent on many noise variables. The same phenomenon has been widely observed in set-based genetic association studies (Cai et al., 2012; Pan et al., 2014, 2015; Zhan et al., 2015) and community-based microbiome association studies (Wu et al., 2016; Koh et al., 2017). There are in general two types of two-sample tests of high-dimensional means. One is based on the sum of squares of the mean differences in each dimension (e.g., MiRKAT, proposed in Zhao et al., 2015), and the other is based on the largest componentwise mean difference (e.g., the max-type test proposed in Cao et al., 2017). For microbiome differential abundance analysis, the max-type test tends to be more powerful when only a few taxa are truly differentially abundant. On the other hand, the MiRKAT-type test can be more powerful than the max-type test under dense signals. In practice, the true underlying biological scenario is never known, and thus adaptive methods for microbiome differential abundance analysis are desired.
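To make the contrast between the two families concrete, the sketch below computes a sum-of-squares statistic and a max-type statistic from per-taxon standardized mean differences. It is a simplified illustration only: it is neither MiRKAT nor the test of Cao et al. (2017), and the function name `mean_difference_statistics` and the Welch-style standardization are assumptions made here.

```python
import numpy as np


def mean_difference_statistics(X1, X2):
    """Return (sum-of-squares statistic, max-type statistic).

    Illustrative helper (hypothetical name). Both statistics are built from
    the squared standardized mean difference of each taxon (column).
    """
    X1 = np.asarray(X1, dtype=float)
    X2 = np.asarray(X2, dtype=float)
    n1, n2 = X1.shape[0], X2.shape[0]
    diff = X1.mean(axis=0) - X2.mean(axis=0)
    # Welch-style variance of the mean difference, one value per taxon.
    se2 = X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2
    t2 = diff ** 2 / se2
    # The sum aggregates evidence over all taxa (suited to dense signals);
    # the maximum keeps only the strongest taxon (suited to sparse signals).
    return t2.sum(), t2.max()
```

When a single taxon carries all of the signal, the two statistics are driven by the same term, but the sum also accumulates noise from every null taxon, which is one way to see why it loses power as the taxa-set grows. In practice either statistic would still need a reference distribution, for example from permutations.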

A common adaptive approach in a multivariate association test or two-sample test is to assign different weights to variables so that important variables are up-weighted and non-informative variables are down-weighted (Cai et al., 2012; Pan et al., 2014, 2015; Wu et al., 2016; Koh et al., 2017). Yet it is often difficult to determine the optimal weights. Some authors propose another loop of permutations to combine multiple sets of weights, which may be computationally challenging since most adaptive tests already need permutations to establish significance (Pan et al., 2014, 2015). In this paper, we propose a different adaptive method, which tests the hypothesis on a selected subset of microbiome features. In other words, instead of applying the MMD test to all $p$ taxa $\mathbf{X} = (X_1, \ldots, X_p)$, we apply the test to a putative testing subset $\mathbf{X}_S$, where $S \subset \{1, \ldots, p\}$. Our method can also be viewed as a weighted approach in the sense that a zero weight is assigned to a feature that is not selected in the testing subset, and an equal weight is assigned to each feature in the testing subset. We defer details of selecting such a testing subset to the next section and present our adaptive microbiome differential analysis (AMDA) procedure in **Algorithm 1**.

**Algorithm 1:** An adaptive two-sample test for microbiome differential abundance analysis

**Input:** An $n \times p$ microbiome composition matrix $\mathbf{X} = (X^{(1)}_1, \ldots, X^{(1)}_{n_1}, X^{(2)}_1, \ldots, X^{(2)}_{n_2})^{T}$ and an $n \times 1$ group label vector $y = (1, \ldots, 1, 2, \ldots, 2)$ associated with the microbiome compositions.

**Output:** A p-value for $H_0: \mu^{(1)} = \mu^{(2)}$ vs. $H_1: \mu^{(1)} \neq \mu^{(2)}$.

**Procedure:**

# 2.4. A New Permutation-Based Testing Subset Selection Procedure

There is a vast statistical literature on high-dimensional variable selection. Some famous examples include the lasso (Tibshirani, 1996) and the knockoff filter (Barber and Candès, 2015; Candes et al., 2018). The lasso has proven to be a versatile tool with nice asymptotic estimation and prediction properties, yet its performance under small sample sizes is not guaranteed. On the other hand, knockoff is able to select variables under FDR control with finite samples, but it tends to select a smaller set of variables with fewer false positives to achieve FDR control (see **Table S1** in the online supplemental material). As a consequence, many signals are not selected by knockoff, typically leading to a less powerful test. Recall that our ultimate goal is to construct a differential test with relatively high power. For this reason, we prefer a procedure that selects a testing subset containing as many signals as possible. To achieve this goal, we propose the following permutation-based testing subset selection procedure.

We first randomly permute the row indices of matrix $\mathbf{X}$ (defined in **Algorithm 1**) and obtain a permuted microbiome composition matrix $\tilde{\mathbf{X}}$. By construction, $\tilde{\mathbf{X}}$ is not related to the outcome $y$. Next, a one-dimensional two-sample test (e.g., the Kolmogorov-Smirnov test) is applied to each dimension of $\mathbf{X}$ and $\tilde{\mathbf{X}}$, and we denote the corresponding p-values as $p_1, \ldots, p_p$ and $\tilde{p}_1, \ldots, \tilde{p}_p$, respectively. Because the dimension $p$ is typically much larger than the sample size in microbiome studies, we calculate marginal p-values rather than joint p-values for testing subset selection. For a truly differentially abundant variable $X_j$, as $\tilde{X}_j$ is not constructed to be outcome-related, it is expected that $p_j < \tilde{p}_j$. Hence, we select the testing subset as $S = \{j : p_j < \tilde{p}_j\}$ and conduct our MMD test based on the sub-design matrix $\mathbf{X}_S$. Finally, because we are testing $H_0: \mu^{(1)} = \mu^{(2)}$ using microbiome features that are more likely to be differentially abundant, resampling methods are required to establish significance and avoid an inflated type I error (see details in **Algorithm 1**).
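The selection rule described above can be sketched as follows. The sketch makes one simplifying assumption that is not stated in the text: for fixed group sizes, the Kolmogorov-Smirnov p-value decreases as the KS statistic grows, so the comparison $p_j < \tilde{p}_j$ is carried out as a comparison of KS statistics ($D_j > \tilde{D}_j$) without computing p-values. All function names are hypothetical.

```python
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] == v:  # step past ties in the first sample
            i += 1
        while j < len(b) and b[j] == v:  # step past ties in the second sample
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def select_testing_subset(X, y, seed=0):
    """Return S = {j : D_j > D~_j}, the statistic-based analogue of {j : p_j < p~_j}.

    X is a list of n rows with p features each; y holds group labels 1 or 2.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    order = list(range(n))
    rng.shuffle(order)
    X_tilde = [X[i] for i in order]  # rows permuted, labels y left in place

    def column_stat(M, j):
        g1 = [M[i][j] for i in range(n) if y[i] == 1]
        g2 = [M[i][j] for i in range(n) if y[i] == 2]
        return ks_statistic(g1, g2)

    return [j for j in range(p) if column_stat(X, j) > column_stat(X_tilde, j)]
```

When the selection is followed by a test on the selected features, the same selection step would have to be repeated inside every resampling replicate so that the null distribution reflects the selection; this is a reading of the paragraph above, not a statement of the authors' implementation.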

It should be noted that the aforementioned permutation-based procedure is one way to achieve testing subset selection but not the only way, and it is possible to select the testing subset $\mathbf{X}_S$ using other methods such as lasso and knockoff. We conducted comprehensive simulation studies to compare the power of the adaptive two-sample test under different testing subset selection procedures and report the results in the online **Supplementary Material**. As can be observed there, the adaptive test based on our permutation-based procedure is more powerful than both the lasso-based and knockoff-based tests, as both lasso and knockoff tend to miss more true signals for the sake of achieving sparsity (lasso) or FDR control (knockoff).

#### 3. RESULTS

#### 3.1. Simulation Settings

A comprehensive simulation study was conducted to compare the performance of AMDA to a wide range of existing microbiome association tests in the framework of microbiome differential abundance analysis. The five other tests evaluated in this simulation are MiRKAT (Zhao et al., 2015), the original MMD test without testing subset selection (Gretton et al., 2007, 2012), the Quasi-Conditional Association Test/QCAT (Tang et al., 2017), the maximum-type (MAX) test based on the largest sample mean difference (Cao et al., 2017), and the optimal microbiome-based association test/OMiAT (Koh et al., 2017). AMDA, MiRKAT, MMD, QCAT, and MAX are each a single test, while OMiAT takes advantage of two series of tests. One is the MiSPU tests (Wu et al., 2016) with different weighting schemes on each individual taxon in the taxa-set. The other is the MiRKAT tests with different kernel functions. The spirit of OMiAT can be easily implemented in AMDA, MiRKAT, and MMD by evaluating multiple kernels and taking the optimal kernel test with the minimum p-value. For ease of presentation, we do not incorporate this strategy and only evaluate the Gaussian kernel-based test for AMDA, MiRKAT, and MMD in this simulation. Correspondingly, we evaluate OMiAT as the optimum of a series of MiSPU tests (without MiRKAT tests of different kernels) for a fair comparison. With a slight abuse of notation, we still term this test OMiAT, though it does not contain the MiRKAT component of the original one (Koh et al., 2017). Moreover, the QCAT and MAX tests with asymptotic p-values were found to have inflated type I errors (data not shown). For this reason, we use permutations to calculate the MAX test p-value and the resampling option in the QCAT software (Tang et al., 2017) to calculate the QCAT p-value. Finally, the permutation-based procedure is used to select the testing subset in the intermediate stage of AMDA in this simulation. The performance of the AMDA test based on other subset selection methods such as lasso and knockoff was evaluated in additional simulation studies presented in the online **Supplementary Material**.

We closely followed the simulation design of the MAX test (Cao et al., 2017) to generate microbiome relative abundance data using the logistic normal distribution (Atchison and Shen, 1980). We first simulated $\mathbf{W}^{(k)}_i \sim N_p(\mu^{(k)}, \Sigma)$ for $i = 1, 2, \ldots, n_k$, $k = 1, 2$, and then calculated the microbiome relative abundances as $X^{(k)}_{ij} = \exp[W^{(k)}_{ij}] / \sum_{j'=1}^{p} \exp[W^{(k)}_{ij'}]$ and their centered log-ratio transformation $Z^{(k)}_{ij}$ according to Equation (2). Following the simulation design of MAX (Cao et al., 2017), the components of $\mu^{(1)}$ were drawn from a uniform distribution Unif(0, 10), and we considered the banded covariance structure $\Sigma = \mathbf{D}^{1/2}\mathbf{A}\mathbf{D}^{1/2}$, where $\mathbf{D}$ is a diagonal matrix with entries randomly drawn from Unif(1, 3) and $\mathbf{A}$ has nonzero entries $a_{jj} = 1$, $a_{j,j-1} = a_{j-1,j} = -0.5$. Under the null model, we set $\mu^{(2)} = \mu^{(1)}$. Under the alternative model, we randomly picked a subset $S \subset \{1, 2, \ldots, p\}$ such that $\mu^{(2)}_j = \mu^{(1)}_j + e_j$, where $e_j \sim \mathrm{Unif}(-0.5, 0.5)$ for all $j \in S$. For the size of the signal set $S$ (the number of taxa that are truly differentially abundant), we considered low, medium, and high signal density levels: $p^* = |S| = 10\%p$, $30\%p$, and $50\%p$, with the indices randomly chosen from $\{1, 2, \ldots, p\}$. Throughout this simulation, we varied $n = 50, 100, 200$ with $n_1 = n_2 = n/2$ to investigate the tests' performance under different sample sizes, and considered $p = 50, 100, 200, 500$, representing taxa-sets at different taxonomic ranks.
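As an illustration, the data-generating design above can be sketched in plain Python. The function names, the Cholesky-based sampler, and the use of the standard centered log-ratio (clr) definition for Equation (2) are assumptions made for this sketch; it is not the simulation code used in the paper.

```python
import math
import random


def banded_covariance(p, rng):
    """Sigma = D^{1/2} A D^{1/2}: D diagonal with Unif(1, 3) entries,
    A tridiagonal with 1 on the diagonal and -0.5 next to it."""
    d = [rng.uniform(1.0, 3.0) for _ in range(p)]
    sigma = [[0.0] * p for _ in range(p)]
    for j in range(p):
        sigma[j][j] = d[j]
        if j > 0:
            sigma[j][j - 1] = sigma[j - 1][j] = -0.5 * math.sqrt(d[j] * d[j - 1])
    return sigma


def cholesky(S):
    """Lower-triangular L with L L^T = S (S must be positive definite)."""
    p = len(S)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def draw_group(mu, L, n, rng):
    """Draw n samples; return relative abundances X and clr values Z."""
    p = len(mu)
    X, Z = [], []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(p)]
        w = [mu[j] + sum(L[j][k] * g[k] for k in range(j + 1)) for j in range(p)]
        top = max(w)  # subtracting the maximum stabilises the exponentials
        e = [math.exp(v - top) for v in w]
        total = sum(e)
        X.append([v / total for v in e])
        w_bar = sum(w) / p
        Z.append([v - w_bar for v in w])  # clr of X equals W minus its row mean
    return X, Z


def simulate(n=50, p=50, density=0.1, seed=0):
    """Simulate two groups; density = 0 gives the null model (mu2 equals mu1)."""
    rng = random.Random(seed)
    mu1 = [rng.uniform(0.0, 10.0) for _ in range(p)]
    signal = sorted(rng.sample(range(p), int(round(density * p))))
    mu2 = list(mu1)
    for j in signal:
        mu2[j] += rng.uniform(-0.5, 0.5)
    L = cholesky(banded_covariance(p, rng))
    X1, Z1 = draw_group(mu1, L, n // 2, rng)
    X2, Z2 = draw_group(mu2, L, n // 2, rng)
    y = [1] * (n // 2) + [2] * (n // 2)
    return X1 + X2, Z1 + Z2, y, signal
```

The tridiagonal matrix $\mathbf{A}$ is positive definite, so the Cholesky factor exists; for the larger dimensions considered in the paper a numerical library would be a more practical choice than this pure-Python factorization.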

After the data were simulated, we applied AMDA, MAX, OMiAT, MMD, MiRKAT, and QCAT to examine the two-sample differences. The first three tests (AMDA, MAX, and OMiAT) are adaptive in the sense that they either use a testing subset of the taxa (AMDA and MAX) or assign a different weight to each taxon in the set (OMiAT) to conduct the multivariate two-sample test. The Gaussian kernel $k(x, y) = \exp\{-\|x - y\|^2 / \rho\}$, where $x$ and $y$ are two microbiome compositional vectors, was used in AMDA, MMD, and MiRKAT, with the shape parameter $\rho$ selected as the median of the sample pairwise squared Euclidean distances $\|x - y\|^2$. The type I error was evaluated using 5,000 replicates generated under the null model, and the power of each test was assessed with 1,000 replicates under the alternative model. Without loss of generality, we set the nominal significance level $\alpha = 0.05$ throughout this simulation.
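The bandwidth choice can be made concrete with a short sketch. It assumes, following the notation above, that the median is taken over the squared distances $\|x - y\|^2$ of all distinct sample pairs; the helper names are hypothetical.

```python
import math
from statistics import median


def sq_dist(x, y):
    """Squared Euclidean distance ||x - y||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def median_heuristic_rho(samples):
    """Median of ||x - y||^2 over all distinct sample pairs."""
    n = len(samples)
    return median(sq_dist(samples[i], samples[j])
                  for i in range(n) for j in range(i + 1, n))


def gaussian_gram(samples, rho=None):
    """Gaussian Gram matrix; rho defaults to the median heuristic."""
    if rho is None:
        rho = median_heuristic_rho(samples)
    return [[math.exp(-sq_dist(x, y) / rho) for y in samples] for x in samples]
```

Tying $\rho$ to the scale of the observed distances keeps the kernel values away from the degenerate extremes in which every off-diagonal entry is close to 0 or close to 1.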

#### 3.2. Simulation Results

The type I error rates of the different tests are reported in **Table 1**, where one can see that all tests have the correct type I error across all $(n, p)$ configurations. The power of the different tests is reported in **Figure 1** ($p = 50$ and 100) and **Figure 2** ($p = 200$ and 500). Since the effect size was arbitrarily chosen to avoid power saturation, we care about the relative power among different methods rather than their absolute magnitudes. As can be seen from both figures, the adaptive tests (AMDA, MAX, and OMiAT) are consistently more powerful than the non-adaptive ones (MMD, MiRKAT, and QCAT). This is because the scenarios considered in our simulation studies are relatively sparse ($p^*/p \le 50\%$), and the adaptive tests can largely boost the power by treating variables (signals and noise) differently.
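One way to judge whether an empirical type I error is compatible with the nominal level is to compare it with the Monte Carlo error expected from the number of replicates. The sketch below uses a normal approximation to the binomial proportion, which is an assumption of this illustration and not a procedure described in the paper.

```python
import math


def type1_error_band(alpha=0.05, n_rep=5000, z=1.96):
    """Approximate 95% Monte Carlo band for an empirical rejection rate
    when the true rejection probability equals alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - z * se, alpha + z * se


def empirical_rejection_rate(p_values, alpha=0.05):
    """Fraction of replicates whose p-value falls below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


low, high = type1_error_band()
print(f"{low:.4f} to {high:.4f}")  # prints "0.0440 to 0.0560"
```

With 5,000 null replicates, empirical rates between roughly 0.044 and 0.056 are therefore within sampling error of $\alpha = 0.05$.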

Among the three non-adaptive tests, MMD and MiRKAT have similar power under each scenario. On the other hand, QCAT has the highest power when the dimension of the taxa-set is relatively low (**Figure 1**), especially when the sample size is relatively large ($n = 200$). As the dimension of the taxa-set increases, QCAT quickly loses power and becomes less powerful than both MMD and MiRKAT (**Figure 2**).

Among the three more powerful adaptive tests, MAX appears to be slightly more powerful than AMDA and OMiAT when the signal is sparse ($p^*/p = 10\%$) and the dimension is relatively low ($p = 50$, 100, and 200), as indicated in **Figure 1** and the top row of **Figure 2**. Compared to AMDA, MAX only utilizes the strongest signal, which can be beneficial when the signals are extremely sparse. When $p = 500$, there are $p^* = 50$ signals even under the sparse scenario, and AMDA can be more powerful than MAX by including more signals in the testing subset (bottom row of **Figure 2**). On the other hand, when the signal level is moderate ($p^*/p = 30\%$) or relatively dense ($p^*/p = 50\%$), AMDA is much more powerful than MAX under most scenarios in both **Figures 1**, **2**. Finally, as seen from both figures, AMDA is always more powerful than OMiAT across all scenarios. AMDA and OMiAT treat variables in different ways. AMDA selects some variables and excludes the rest for further subset testing, while OMiAT assigns different weights to different variables when calculating the multivariate score test statistic. Even though only a small nonzero weight may be assigned to a noise variable in OMiAT, the signal density is relatively sparse ($p^*/p \le 50\%$, which means that noise variables are at least as numerous as signals), so the accumulated adverse effects of the noise variables can still deteriorate the performance of OMiAT. By comparison, a zero weight is assigned to a noise variable (by excluding it from the testing subset) in AMDA, which explains the power gain of AMDA over OMiAT.

*Results are averaged over 5,000 replicates.*

To conclude, like the five other methods, the proposed AMDA method preserves the nominal type I error in microbiome differential abundance analysis. In terms of power, there is no uniformly most powerful test in our simulations. However, the proposed AMDA method is the most powerful of the six tests evaluated under most scenarios, and its power advantage over the other five methods can be large (**Figures 1**, **2**). Only under a few particular scenarios with extremely sparse signals ($p^*/p = 10\%$) and relatively low dimensions ($p = 50$, 100, and 200) can MAX be slightly more powerful than AMDA.

# 3.3. Application to Oral Microbiome Data Collected From Children With Autism Spectrum Disorder

We applied the proposed AMDA method to a study investigating how the oral microbiome differs across children with autistic behaviors (Hicks et al., 2018). The study enrolled 346 children (between 2 and 6 years old), who were divided into three groups according to the severity of disorder/developmental status: autism spectrum disorder (ASD, n = 180), non-autistic developmental delay (DD, n = 60), and typically developing (TD, n = 106). The ASD group was defined using criteria specified in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) by the American Psychiatric Association. The DD group included children who did not meet DSM-5 criteria for ASD but had developmental delay symptoms (e.g., expressive speech delay and intellectual disability). The TD group included children who had a negative ASD screening and met typical developmental milestones on standardized physician assessment. The oral microbiome composition of these children was quantified with next generation sequencing. The data, along with details of data processing, are available in the previous publication (Hicks et al., 2018).

Taxonomic reads were further filtered to include only the taxa with counts of more than 10 in more than 20% of samples, which resulted in an oral microbiome community of 753 taxa. Sequence alignment with the k-SLAM (Ainsworth et al., 2017) method was used for comprehensive taxonomic classification, and these 753 taxa were classified into 457 species, 266 genera, 142 families, 73 orders, 33 classes, and 16 phyla (each rank had an Unclassified group for taxonomic sequences not identified at that rank). Because the proposed AMDA method is an adaptive multivariate two-sample test, we focused our analysis on higher taxonomic ranks (family, order, class, phylum, and the community of all 753 taxa), as many lower taxonomic ranks contain only a single taxon (e.g., 410 of the 457 species are singletons). Similarly, for the taxonomic ranks being considered (family, order, class, and phylum), we further limited our analysis to taxa-sets that contain more than two taxa. As a result, 52 families, 34 orders, 18 classes, and 10 phyla were tested in our data analysis. We applied AMDA, MAX, OMiAT, MMD, MiRKAT, and QCAT to these data to examine the oral microbiome differences among the three developmental profile groups (particularly, ASD vs. DD and ASD vs. TD) at different taxonomic ranks. As 52 families/34 orders/18 classes/10 phyla were tested, we adjusted for multiple testing using the Bonferroni correction to control the family-wise error rate at $\alpha = 0.05$. Correspondingly, $B = 10{,}000$ permutations/resamplings were used in AMDA, MAX, OMiAT, MMD, and QCAT to increase the precision of the test p-values, while MiRKAT calculates the p-value analytically.
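The link between the Bonferroni correction and the number of permutations can be made explicit with a small calculation. The sketch assumes an add-one permutation p-value, whose smallest attainable value is $1/(B + 1)$; this convention is an assumption of the illustration and may differ from the software used in the analysis.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def min_permutation_p(n_perm):
    """Smallest attainable p-value with the add-one permutation estimate."""
    return 1.0 / (n_perm + 1)


# Numbers of taxa-sets tested at each rank, as reported in the text.
tests_per_rank = {"family": 52, "order": 34, "class": 18, "phylum": 10}

for rank, m in tests_per_rank.items():
    threshold = bonferroni_threshold(0.05, m)
    reachable = min_permutation_p(10_000) < threshold
    print(f"{rank:7s} threshold = {threshold:.6f}  reachable with B = 10,000: {reachable}")
```

At the family level the per-test threshold is about 0.00096. Under the add-one convention, 1,000 permutations could not produce a p-value below it, whereas 10,000 permutations give a resolution of about 0.0001.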

We first applied these tests to examine whether there is an overall shift in oral microbiome composition between different developmental groups by testing the differential abundance of all 753 taxa as a whole community. For the comparison of ASD vs. DD, the test p-values of AMDA, MAX, OMiAT, MMD, MiRKAT, and QCAT are 0.0113, 0.1409, 0.5244, 0.1321, 0.1377, and 0.9802, respectively. AMDA is the only method that is able to detect a significant (p-value < 0.05) difference in microbiome community profiles between ASD and DD. For the comparison of ASD vs. TD, the test p-values of AMDA, MAX, OMiAT, MMD, MiRKAT, and QCAT are 0.0021, 0.0017, 0.0323, 0.3039, 0.3099, and 0.1782, respectively. All three adaptive methods (AMDA, MAX, and OMiAT) are able to detect a significant difference between ASD and TD. In the original study (Hicks et al., 2018), a Mann-Whitney U-test based individual differential analysis was applied to each taxon, and only three/six taxa were differentially abundant between ASD vs. DD/ASD vs. TD under FDR = 0.05 [see **Table 2** of Hicks et al. (2018)]. According to the previous simulation results, when the number of signals is relatively small ($p^* = 3$ or 6, as suggested in the original analysis) compared to the number of variables ($p = 753$), the non-adaptive tests have low power. This explains why the MMD/MiRKAT/QCAT methods are not able to detect a significant difference in microbiome profiles between two conditions in these data. Finally, the AMDA/MAX/OMiAT p-value for the comparison ASD vs. TD is much smaller than that for the comparison ASD vs. DD, indicating a more significant overall oral microbiome composition difference between ASD and TD than between ASD and DD, which is consistent with the severity of disorder.

Next, we shift our analysis unit to ranks below the community level to comprehensively assess which taxa-sets (with multiple taxa) at each taxonomic rank are differentially abundant among the developmental status groups. The testing results are summarized in **Table 2**. Based on this table, one can observe that the proposed AMDA always declares more significant differences than the other two tests except for one scenario (class-level differential analysis between ASD and TD). The absolute differences among the three methods presented in **Table 2** may be small due to the conservativeness of the Bonferroni correction. To observe the relative trends of the different tests, the p-values of these tests at the family level are presented in **Figure 3** (p-values at other taxonomic ranks show a similar pattern and hence are not reported). The AMDA p-values tend to be the smallest among the p-values of all six tests. Therefore, our method has a clear advantage over the other methods in terms of detecting more significant differences in this differential abundance analysis of oral microbiome data.

# 4. DISCUSSION

With the ever-increasing availability of microbiome and metagenomics data generated by next generation sequencing technology, the need to develop and implement efficient statistical analysis for the data is important to ensure both statistical rigor and biological relevance. In this paper, we consider the problem of differential abundance analysis for microbiome data, which leads to a better understanding of the behavior of microbiome communities. Most existing methods tackle this problem using an individual taxon-based approach followed by multiple testing adjustment. However, as taxa living in the same community do not grow independently, the complicated interactions among taxa result in complicated correlation structures among taxa relative abundances, which may violate the correlation assumptions (among individual tests) of existing multiple testing correction methods (Hawinkel et al., 2017). On the other hand, the newly proposed AMDA examines the differential abundance of a taxa-set typically containing taxa from the same genus/family/order/class/phylum, which provides an invaluable complement to the individual taxon-based differential abundance analysis. Given evidence of an association of a taxa-set with the outcome, and assuming that at least one outcome-associated taxon within the set exists, applying AMDA to a high taxonomic rank can provide a useful preliminary screening of the whole microbiome (all species in the community) and facilitate more targeted downstream laboratory-based microbiome fine-mapping and functional studies (Wang and Jia, 2016).

The AMDA method has two main advantages compared to a traditional individual taxon-based approach. First, it can provide new biological and biomedical insights. The joint modeling of all taxa in the set is able to capture conditional effects of taxa that are missed in the traditional individual taxon-based approach, and thus new insights can be gained by shifting the analysis unit to a higher taxonomic rank. Second, it is statistically powerful by aggregating marginal signals of individual taxa and reducing the multiple testing burden. By adaptively choosing the subset being tested, AMDA further boosts the statistical testing power compared to existing taxa set-based differential abundance analyses (e.g., MiRKAT). Moreover, the adaptive strategy used in AMDA could be easily extended to other hypothesis testing frameworks (e.g., association testing) beyond the two-sample problem considered in this paper. We conducted comprehensive numerical simulation studies to show the superior performance of AMDA over existing approaches in terms of maintaining the correct type I error while having a higher power to detect a true difference. The potential usefulness of AMDA was further demonstrated via its application to oral microbiome data, where AMDA tends to detect more significant differences than its competitors.

TABLE 2 | Number of significant differentially abundant taxa-sets at each taxonomic rank detected by different methods under a family-wise error rate of 0.05.

*Number in parentheses denotes the total number of tests conducted at that rank.*

To illustrate our method, we applied the Gaussian kernel-based MMD test, which has been shown to be a consistent two-sample test (Gretton et al., 2007, 2012). The numerical performance of AMDA using other kernels, including UniFrac and Bray-Curtis (Zhao et al., 2015), is similar to that based on the Gaussian kernel (data not shown). As the field matures, more complex (such as family-based and longitudinal) study designs have become increasingly popular in the scientific community to study the association between the microbiome and various clinical and biological covariates. This is partially because these advanced designs can control potential confounders more efficiently than population-based studies with unrelated individuals. The current adaptive multivariate microbiome differential abundance analysis is developed for independent samples. It is of further interest to extend it to accommodate correlated microbiome samples collected from a study using such a complex design. The current permutation-based testing subset selection procedure has been shown to have better numerical performance than existing methods, in terms of selecting more signals into the testing subset, across a wide range of scenarios. Yet, theoretical guarantees for this permutation-based selection procedure are largely unknown. It is also of interest to further incorporate the phylogenetic tree information into AMDA to facilitate a comprehensive microbiome differential abundance analysis, rather than applying AMDA to one taxonomic rank of the tree at a time. We believe these issues are of importance and warrant further investigation.

#### ETHICS STATEMENT

This study involves only secondary analyses, where all the utilized data sets are published in a previous study.

#### AUTHOR CONTRIBUTIONS

KB and NZ analyzed the data, drafted the paper, and prepared figures and tables. AS and LX conducted the testing subset simulations. SH and FM provided and helped analyze the oral microbiome data. RW contributed substantial expertise to improve the paper and revised the paper. XZ conceived and designed the experiments, analyzed the data, and wrote the paper and the software. All authors read and approved the final manuscript.

#### FUNDING

This work was supported by Quadrant Biosciences Inc. (Research agreement with SH), the National Institutes of Health grants R41 MH111347 (FM), P50 DA039838 (LX) and National Science Foundation grant DMS-1811552 (LX).

#### ACKNOWLEDGMENTS

The authors would like to thank the Associate Editor and two reviewers for their insightful comments that improved the paper. Funding was provided by Quadrant Biosciences Inc. (Research agreement with SH) and NIH STAR (R41 MH111347).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00350/full#supplementary-material

#### REFERENCES

sequencing data. Bioinformatics 34, 643–651. doi: 10.1093/bioinformatics/btx650


**Conflict of Interest Statement:** The authors declare that this study received funding from a National Institutes of Mental Health STTR award (R41 MH111347) to Quadrant Biosciences, Inc. Quadrant Biosciences was involved with study design, and data collection for the RNA sequencing results employed in this study's secondary data analysis (autism microbiome data). SH and FM serve on the scientific and medical advisory boards of Quadrant Biosciences Inc., and SH is a paid consultant for Quadrant Biosciences Inc.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Banerjee, Zhao, Srinivasan, Xue, Hicks, Middleton, Wu and Zhan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Distance-Based Kernel Association Test Based on the Generalized Linear Mixed Model for Correlated Microbiome Studies

#### Hyunwook Koh<sup>1</sup> , Yutong Li <sup>2</sup> , Xiang Zhan<sup>3</sup> , Jun Chen<sup>4</sup> and Ni Zhao<sup>1</sup> \*

*<sup>1</sup> Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States, <sup>2</sup> School of Physics, Peking University, Beijing, China, <sup>3</sup> Department of Public Health Sciences, Pennsylvania State University, Hershey, PA, United States, <sup>4</sup> Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States*

Researchers have increasingly employed family-based or longitudinal study designs to survey the roles of the human microbiota on diverse host traits of interest (e.g., health/disease status, medical intervention, behavioral/environmental factor). Such study designs are useful to properly control for potential confounders or the sensitive changes in microbial composition and host traits. However, downstream data analysis is challenging because the measurements within clusters (e.g., families, subjects including repeated measures) tend to be correlated so that statistical methods based on the independence assumption cannot be used. For the correlated microbiome studies, a distance-based kernel association test based on the linear mixed model, namely, correlated sequence kernel association test (cSKAT), has recently been introduced. cSKAT models the microbial community using an ecological distance (e.g., Jaccard/Bray-Curtis dissimilarity, unique fraction distance), and then tests its association with a host trait. Similar to prior distance-based kernel association tests (e.g., microbiome regression-based kernel association test), the use of ecological distances gives cSKAT high power. However, cSKAT is limited to handling Gaussian traits [e.g., body mass index (BMI)] and a single chosen distance measure at a time. The power of cSKAT varies considerably depending on which distance measure is used, and choosing an optimal distance measure is challenging because the true nature of the association is unknown. Here, we introduce a distance-based kernel association test based on the generalized linear mixed model (GLMM), namely, GLMM-MiRKAT, to handle diverse types of traits, such as Gaussian (e.g., BMI), Binomial (e.g., disease status, treatment/placebo) or Poisson (e.g., number of tumors/treatments) traits. We further propose a data-driven adaptive test of GLMM-MiRKAT, namely, aGLMM-MiRKAT, so as to avoid the need to choose the optimal distance measure. Our extensive simulations demonstrate that aGLMM-MiRKAT is robustly powerful while correctly controlling type I error rates. We apply aGLMM-MiRKAT to real familial and longitudinal microbiome data, where we discover significant disparity in microbial community composition by BMI status and the frequency of antibiotic use. In summary, aGLMM-MiRKAT is a useful analytical tool with its broad applicability to diverse types of traits, robust power and valid statistical inference.

Keywords: microbiome association studies, correlated microbiome studies, longitudinal microbiome studies, community-level association analysis, distance-based association analysis, adaptive association analysis

#### Edited by:

*Himel Mallick, Merck (United States), United States*

#### Reviewed by:

*Christine Burns Peterson, University of Texas MD Anderson Cancer Center, United States Ryan Sun, Harvard University, United States Michael B. Sohn, University of Rochester, United States*

> \*Correspondence: *Ni Zhao nzhao10@jhu.edu*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *08 February 2019* Accepted: *30 April 2019* Published: *16 May 2019*

#### Citation:

*Koh H, Li Y, Zhan X, Chen J and Zhao N (2019) A Distance-Based Kernel Association Test Based on the Generalized Linear Mixed Model for Correlated Microbiome Studies. Front. Genet. 10:458. doi: 10.3389/fgene.2019.00458*

# INTRODUCTION

The recent surge in next-generation sequencing technologies has dramatically advanced human microbiome studies by enabling generic characterization of the microbes in the human body (Hamady and Knight, 2009; Caporaso et al., 2010; Thomas et al., 2012). As the sequencing technology evolves, researchers are able to obtain more accurate metagenomic information at lower cost and at a faster speed. Various types of metagenomic information can be obtained from the sequencing platforms, such as microbial abundances and functional/metabolic expressions (Mallick et al., 2017). In this study, we focus on the data for the microbial abundance and phylogenetic information of the surrogate microbial species, known as operational taxonomic units (OTUs). Furthermore, we focus on microbiome association studies, which test the disparity in microbial community (e.g., bacterial kingdom) composition by a host trait of interest (e.g., health/disease status, clinical intervention, behavioral/environmental factor) (Li, 2015). For example, recent studies have found disparity in microbial community composition for a variety of health/disease statuses [e.g., obesity (Arslan, 2014), type I diabetes (Zhang et al., 2018a), type II diabetes (Qin et al., 2012), human immunodeficiency virus (Bandera et al., 2018), inflammatory bowel disease (Knights et al., 2013; Borren et al., 2018), and cancers (Zitvogel et al., 2015)], medical interventions [e.g., administration of antibiotics (Zhang et al., 2018a)], and behavioral/environmental factors [e.g., diet, residence, smoking and birth mode (Charlson et al., 2010; Liu et al., 2017)].

Notably, researchers have increasingly employed family-based (Goodrich et al., 2014; Schloss et al., 2014) or longitudinal study designs (Yang et al., 2017; Zhang et al., 2018a). Such study designs are advantageous in properly controlling for potential confounders or the sensitive changes in microbial composition and host traits. That is, because family members share similar environmental/genetic factors (monozygotic twins even share the same genetic background), the use of family controls can efficiently rule out some potential confounding factors. Moreover, because microbial composition and host traits can vary over time, repeated measurements over a lengthy follow-up period can ensure more reliable analysis outcomes. Examples of such correlated microbiome studies include the familial (Goodrich et al., 2014) and longitudinal (Zhang et al., 2018a) studies, the data of which we use for our real data applications (see Real data applications). Briefly, Goodrich et al. (2014) collected stool samples from families with twins in the United Kingdom to assess the relationship between obesity and gut microbiota. Zhang et al. (2018a) longitudinally collected fecal, cecal, and ileal samples from non-obese diabetic mice to evaluate whether the intestinal microbiota altered by early-life antibiotic exposure affects maturation of innate immunity. The downstream data analysis for such studies is challenging because the measurements within clusters (e.g., families, subjects including repeated measures) tend to be correlated. We need to properly model the within-cluster correlation structure for valid statistical inferences. Besides, the unique features of the microbiome data (e.g., high-dimensionality, sparsity, and phylogenetic structure) need to be properly accounted for.

However, most of the current microbial community-level association tests [e.g., PERMANOVA (Anderson, 2001; McArdle and Anderson, 2001; Tang et al., 2016), MiRKAT (Zhao et al., 2015), MiSPU (Wu et al., 2016), OMiAT (Koh et al., 2017), aMiAD (Koh, 2018)] assume independent samples. Hence, they cannot be used for correlated microbiome studies. The zero-inflated Beta regression model (ZIBR) (Chen and Li, 2016) and the negative Binomial mixed model (NBMM) (Zhang et al., 2017, 2018b) have recently been proposed for correlated microbiome studies. However, ZIBR and NBMM test individual microbial biomarkers (e.g., OTUs, taxa), not the microbial community as a whole. Hence, they are subject to a substantial loss of power after the requisite multiple testing correction. To the best of our knowledge, a notable community-level association test for correlated microbiome studies is the correlated sequence kernel association test (cSKAT) (Zhan et al., 2018). cSKAT is based on the linear mixed model (Laird and Ware, 1982), where the inherent random effect captures the within-cluster correlation of a host trait, and it models the variance-covariance structure of the microbial community based on an ecological distance, such as Jaccard dissimilarity (Jaccard, 1912), Bray-Curtis dissimilarity (Bray and Curtis, 1957) or unique fraction (UniFrac) distances (Lozupone and Knight, 2005; Lozupone et al., 2007; Chen et al., 2012). The use of ecological distances, which has also been widely adopted in many prior community-level association tests (Anderson, 2001; McArdle and Anderson, 2001; Zhao et al., 2015; Tang et al., 2016; Koh et al., 2017, 2018; Plantinga et al., 2017; Zhan et al., 2017), gives cSKAT a higher power than tests based on non-ecological distances (Zhan et al., 2018). This is because ecological distances are well-informed: they properly model the microbial abundance and phylogenetic information (Jaccard, 1912; Bray and Curtis, 1957; Lozupone and Knight, 2005; Lozupone et al., 2007; Chen et al., 2012).

However, cSKAT has two major limitations. First, cSKAT is based on the linear mixed model (Laird and Ware, 1982). Hence, it is limited to handling Gaussian traits [e.g., body mass index (BMI)]. However, in practice, investigators can be interested in other trait types. Therefore, we introduce a distance-based kernel association test based on the generalized linear mixed model (GLMM), namely, GLMM-MiRKAT, to handle diverse types of traits, such as Gaussian (e.g., BMI), Binomial (e.g., disease status, treatment/placebo) or Poisson (e.g., number of tumors/treatments) traits. Second, cSKAT is limited to the item-by-item use of the ecological distances (i.e., the approach based on a single chosen ecological distance measure at a time). It is well-recognized in the microbiome research community that power differs substantially depending on which distance measure is used, and that it also depends strongly on the true underlying association pattern (Zhao et al., 2015; Koh et al., 2017, 2018). In practice, the true association pattern is usually unknown; hence, it is highly difficult to predict which distance measure performs best and to choose a single optimal distance measure. The approach of individually testing multiple distances also requires multiple testing correction, leading to a loss of power. Therefore, for a robustly high power without the need to choose the optimal distance measure, we propose a data-driven adaptive test of GLMM-MiRKAT, namely, aGLMM-MiRKAT. aGLMM-MiRKAT robustly adapts to diverse association patterns by jointly considering multiple candidate ecological distance measures. Jaccard dissimilarity (Jaccard, 1912), Bray-Curtis dissimilarity (Bray and Curtis, 1957), and UniFrac distances (Lozupone and Knight, 2005; Lozupone et al., 2007; Chen et al., 2012) are included as the candidate ecological distance measures because of their well-known features and distinguished performances (details are addressed later) (Zhao et al., 2015). Through extensive simulation experiments, we estimate robustly high power with well-controlled type I error for aGLMM-MiRKAT.

The rest of the paper is organized as follows. (1) In Materials and Methods, we address methodological details. (2) In Simulation, we address extensive simulation experiments. (3) In Real data applications, we apply aGLMM-MiRKAT to real familial and longitudinal microbiome data sets, where we test the association of the microbial community composition with BMI and the frequency of antibiotic use, while making interesting testing attempts and interpretations. (4) In Discussion, we finish with discussion and concluding remarks.

## MATERIALS AND METHODS

#### Notations and Models

We let $y_{ij}$ denote a host trait of interest (e.g., health/disease status, medical intervention, behavioral/environmental factor) for the $j$-th measurement in the $i$-th cluster ($i = 1, \ldots, n$; $j = 1, \ldots, m_i$), $z_{ijk}$ denote the abundance level of the $k$-th OTU among $p$ OTUs in the microbial community ($k = 1, \ldots, p$), and $x_{ijl}$ denote the $l$-th covariate among $q$ covariates (e.g., age, gender) that we want to adjust for ($l = 1, \ldots, q$). We also let $N$ denote the total number of measurements (i.e., $N = \sum_{i=1}^{n} m_i$), $\mathbf{I}_g$ denote the $g$-th order identity matrix and $\mathbf{1}_g$ denote the $g \times 1$ vector of ones. Throughout the paper, we use non-bold lowercase letters for scalars, bold lowercase letters for vectors, and bold uppercase letters for matrices.

To relate the microbial community composition with a host trait adjusting for covariates, we consider a generalized linear mixed model (Breslow and Clayton, 1993) (Equation 1).

$$g(\mu_{ij}) = \mathbf{x}_{ij}^T \boldsymbol{\alpha} + \mathbf{s}_{ij}^T \boldsymbol{\upsilon}_i + h(\mathbf{z}_{ij}), \tag{1}$$

where $g(\cdot)$ is a canonical link function (e.g., the identity function for Gaussian traits, the logit function for Binomial traits, the log function for Poisson traits) and $\mu_{ij} = E(y_{ij})$. $\boldsymbol{\alpha} = (\alpha_0, \ldots, \alpha_q)^T$ are fixed effects for the covariates $\mathbf{x}_{ij} = (1, x_{ij1}, \ldots, x_{ijq})^T$. $\boldsymbol{\upsilon}_i$ is the random effect for the pre-specified $\mathbf{s}_{ij}$ to account for the within-cluster correlation in responses (i.e., conditional on $\boldsymbol{\upsilon}_i$ and $h(\mathbf{z}_{ij})$, the $y_{ij}$ are independent with a diagonal variance-covariance matrix $\sigma_\varepsilon^2 \mathbf{I}_{m_i}$). For example, when $s_{ij} = 1$, $\upsilon_i$ is the random intercept, which is assumed to follow a normal distribution $N(0, \sigma_\gamma^2)$. When $\mathbf{s}_{ij} = (1, t_{ij})^T$, where $t_{ij}$ is the time point for the $j$-th measurement in the $i$-th cluster, $\boldsymbol{\upsilon}_i = (\upsilon_{i1}, \upsilon_{i2})^T$ contains the random intercept and slope, which are assumed to follow normal distributions $\upsilon_{i1} \sim N(0, \sigma_{\gamma 1}^2)$ and $\upsilon_{i2} \sim N(0, \sigma_{\gamma 2}^2)$. Then, $\boldsymbol{\gamma}_i \equiv (\mathbf{s}_{i1}^T \boldsymbol{\upsilon}_i, \ldots, \mathbf{s}_{im_i}^T \boldsymbol{\upsilon}_i)^T$ follows a normal distribution with mean zero and $m_i \times m_i$ variance-covariance matrix $\boldsymbol{\Sigma}_i$. The random effect $\boldsymbol{\upsilon}_i$ captures the within-cluster correlation in responses, while $h(\cdot)$ is a function that characterizes the microbiome effect.
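To make the within-cluster covariance of the random-effect term concrete, the following minimal Python sketch builds the $m_i \times m_i$ covariance matrix of the cluster-level random-effect vector for the two specifications described above. It assumes that the random intercept and slope are independent of each other (the text only states their marginal distributions); the function name `cluster_covariance` and its arguments are hypothetical and are not part of the authors' software.

```python
import numpy as np

def cluster_covariance(times, var_intercept, var_slope=None):
    """Covariance matrix of (s_i1^T v_i, ..., s_im^T v_i)^T for one cluster.

    times: time points t_i1, ..., t_im of the cluster's measurements.
    var_slope=None gives the random intercept model (s_ij = 1); otherwise
    s_ij = (1, t_ij)^T with an independent intercept and slope (assumption).
    """
    t = np.asarray(times, dtype=float)
    if var_slope is None:
        S = np.ones((t.size, 1))                    # rows s_ij^T = (1)
        G = np.array([[var_intercept]])             # Var(v_i)
    else:
        S = np.column_stack([np.ones_like(t), t])   # rows s_ij^T = (1, t_ij)
        G = np.diag([var_intercept, var_slope])     # Var(v_i), zero covariance
    return S @ G @ S.T                              # S_i G S_i^T

# Random intercept: every entry equals the intercept variance.
sigma_ri = cluster_covariance([1, 2, 3], var_intercept=0.5)
# Random intercept and slope: entry (j, j') equals 1 + t_j * t_j' here.
sigma_rs = cluster_covariance([1, 2, 3], var_intercept=1.0, var_slope=1.0)
```

Under the random intercept model all measurements in a cluster share the same covariance, whereas adding a random slope makes the covariance grow with the time points.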

Here, we are particularly interested in testing $H_0$: $h(\mathbf{z}_{ij}) = 0$ (i.e., no association between microbial composition and a host trait, adjusting for covariates). Notably, with different specifications for $h(\mathbf{z}_{ij})$, we can characterize different association patterns between microbial composition and a host trait. One may specify $h(\mathbf{z}_{ij})$ as a fixed effect using a linear or non-linear function of the OTUs. For example, we can specify $h(\mathbf{z}_{ij}) = \phi(\mathbf{z}_{ij})^T \boldsymbol{\beta}$, where $\phi(\cdot)$ is an element-wise transformation (e.g., identity or quadratic) function and $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ are regression coefficients for the $p$ OTUs, and then test $H_0$: $\boldsymbol{\beta} = \mathbf{0}$ using a $p$-degrees-of-freedom test. However, because of the high-dimensional nature of the data (i.e., $p \gg n$) and, for example, the resulting issue of low-rank matrices, testing $H_0$: $\boldsymbol{\beta} = \mathbf{0}$ with fixed effects might be challenging or even impossible. Therefore, we apply the kernel trick (Cristianini and Shawe-Taylor, 2000) and specify $\delta_{ij} \equiv h(\mathbf{z}_{ij}) = \sum_{i'=1}^{n} \sum_{j'=1}^{m_{i'}} \omega_{i'j'} \, \kappa(\mathbf{z}_{ij}, \mathbf{z}_{i'j'})$, where $\kappa(\cdot,\cdot)$ is a positive semi-definite kernel function which measures pairwise similarities in microbial composition, $\mathbf{z}_{ij} = (z_{ij1}, \ldots, z_{ijp})^T$ is the $p \times 1$ vector for the $p$ OTUs and the $\omega_{i'j'}$ are coefficients; as such, $h(\cdot)$ lies in a reproducing kernel Hilbert space spanned by $\kappa(\cdot,\cdot)$. Then, via the connection between kernel machine regression and mixed effect models (Liu et al., 2007), $\boldsymbol{\delta} = (\delta_{11}, \ldots, \delta_{1m_1}, \ldots, \delta_{n1}, \ldots, \delta_{nm_n})^T$ is assumed to follow a distribution with mean zero and variance-covariance matrix $\tau\mathbf{K}$, where $\boldsymbol{\delta}$ is an $N \times 1$ vector, $\tau$ is the unknown variance component and $\mathbf{K}$ is an $N \times N$ pairwise similarity matrix. Then, we can perform a variance component test for $H_0$: $\tau = 0$ vs. $H_1$: $\tau > 0$ (Lin, 1997).

To address details on the kernel matrix **K** and the test statistic for H0: τ = 0, we first re-write the model (Equation 1) with matrix forms for all the measurements across all the clusters (Equation 2).

$$g(\boldsymbol{\mu}) = \mathbf{X}\boldsymbol{\alpha} + \boldsymbol{\gamma} + \boldsymbol{\delta}, \tag{2}$$

where $\boldsymbol{\mu} = (\mu_{11}, \ldots, \mu_{1m_1}, \ldots, \mu_{n1}, \ldots, \mu_{nm_n})^T$ is an $N \times 1$ vector, $\boldsymbol{\alpha} = (\alpha_0, \ldots, \alpha_q)^T$ is a $(q+1) \times 1$ vector, $\mathbf{X} = (\mathbf{x}_{11}, \ldots, \mathbf{x}_{1m_1}, \ldots, \mathbf{x}_{n1}, \ldots, \mathbf{x}_{nm_n})^T$ is an $N \times (q+1)$ matrix, $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^T, \ldots, \boldsymbol{\gamma}_n^T)^T$ is an $N \times 1$ vector, and $\boldsymbol{\delta} = (\delta_{11}, \ldots, \delta_{1m_1}, \ldots, \delta_{n1}, \ldots, \delta_{nm_n})^T$ is an $N \times 1$ vector. Again, $\boldsymbol{\delta}$ is assumed to follow a distribution with mean zero and variance-covariance matrix $\tau\mathbf{K}$. We further assume that the two random effects $\boldsymbol{\gamma}$ and $\boldsymbol{\delta}$ are independent, as in Lin (1997). The kernel matrix $\mathbf{K}$ is an $N \times N$ pairwise similarity matrix which is converted from an ecological distance (Zhao et al., 2015), such as Jaccard dissimilarity (Jaccard, 1912), Bray-Curtis dissimilarity (Bray and Curtis, 1957) or UniFrac distances (Lozupone and Knight, 2005; Lozupone et al., 2007; Chen et al., 2012), via Equation 3.

$$\mathbf{K}_{(h)} = -\frac{1}{2} \left( \mathbf{I}_N - \frac{\mathbf{1}_N \mathbf{1}_N^T}{N} \right) \mathbf{D}_{(h)}^2 \left( \mathbf{I}_N - \frac{\mathbf{1}_N \mathbf{1}_N^T}{N} \right), \tag{3}$$

where $\mathbf{D}_{(h)}$ is the $N \times N$ pairwise distance matrix and $\mathbf{D}_{(h)}^2$ is its element-wise square, where $h$ is an index for a chosen measure among diverse ecological distances. This kernel matrix (Equation 3) externally models ecologically meaningful pairwise similarities (correlation) in microbial composition among all the measurements across all the clusters, where the block-diagonals (i.e., $\mathbf{K}_{(1:m_1),(1:m_1)}$, $\mathbf{K}_{(m_1+1:m_1+m_2),(m_1+1:m_1+m_2)}$, $\ldots$, $\mathbf{K}_{(N-m_n+1:N),(N-m_n+1:N)}$) model the within-cluster similarities while the off-diagonal blocks model the between-cluster similarities. The extent of OTU abundance and phylogenetic information is properly modulated by different ecological distance measures (Zhao et al., 2015).
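The distance-to-kernel conversion in Equation 3 can be sketched in a few lines of Python. The example below is illustrative only: it computes Bray-Curtis dissimilarities with plain NumPy and then applies the double centering of Equation 3. The function names `bray_curtis` and `distance_to_kernel` are hypothetical, and the sketch does not address the fact that a non-Euclidean dissimilarity can yield a kernel matrix with negative eigenvalues.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between the rows of a count matrix.

    Assumes every row has at least one non-zero count.
    """
    x = np.asarray(counts, dtype=float)
    diff = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    total = (x[:, None, :] + x[None, :, :]).sum(axis=2)
    return diff / total

def distance_to_kernel(D):
    """Equation 3: K = -1/2 * J D^2 J with J = I_N - 1_N 1_N^T / N."""
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    return -0.5 * J @ (D ** 2) @ J           # D ** 2 is the element-wise square

# Toy OTU table: 3 measurements (rows) by 3 OTUs (columns).
counts = np.array([[10, 0, 5],
                   [8, 2, 5],
                   [0, 12, 3]])
D = bray_curtis(counts)
K = distance_to_kernel(D)
```

For Euclidean distances this transformation recovers the centered inner-product matrix of the underlying points, which is why it is a natural way to turn a distance into a similarity.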

#### GLMM-MiRKAT

While we will soon address the issue that the testing performance differs according to the choice of distance measure, we first introduce the variance component score statistic for a single chosen distance measure (i.e., the item-by-item approach). Following Lin (1997), the variance component score statistic can be formulated as in Equation 4. Here, we construct the kernel matrix $\mathbf{K}_{(h)}$ based on an ecological distance; we refer to Lin (1997) for the detailed derivation.

$$\frac{\partial l(\boldsymbol{\alpha}, \boldsymbol{\gamma}, \tau)}{\partial \tau}\Big|_{\tau=0,\ \boldsymbol{\alpha}=\hat{\boldsymbol{\alpha}}_0,\ \boldsymbol{\gamma}=\hat{\boldsymbol{\gamma}}_0} = \frac{1}{2}\left(\mathbf{y}^* - \mathbf{X}\hat{\boldsymbol{\alpha}}_0\right)^T \hat{\mathbf{V}}_0^{-1} \mathbf{K}_{(h)} \hat{\mathbf{V}}_0^{-1} \left(\mathbf{y}^* - \mathbf{X}\hat{\boldsymbol{\alpha}}_0\right) + \mathrm{tr}\left(\hat{\mathbf{V}}_0^{-1} \mathbf{K}_{(h)}\right), \tag{4}$$

where $\mathbf{y}^* = \mathbf{X}\hat{\boldsymbol{\alpha}}_0 + \hat{\boldsymbol{\gamma}}_0 + \hat{\boldsymbol{\Delta}}_0(\mathbf{y} - \hat{\boldsymbol{\mu}}_0)$ is the working vector and $\hat{\mathbf{V}}_0^{-1} = (\hat{\boldsymbol{\Sigma}}_0 + \hat{\mathbf{W}}_0)^{-1}$. Here, $\hat{\boldsymbol{\Delta}}_0 = \mathrm{diag}(g'(\hat{\boldsymbol{\mu}}_0))$ (i.e., $\hat{\boldsymbol{\Delta}}_0 = \mathbf{I}_N$, $\hat{\boldsymbol{\Delta}}_0 = \mathrm{diag}((\hat{\boldsymbol{\mu}}_0(\mathbf{1} - \hat{\boldsymbol{\mu}}_0))^{-1})$ and $\hat{\boldsymbol{\Delta}}_0 = \mathrm{diag}(\hat{\boldsymbol{\mu}}_0^{-1})$ for Gaussian, Binomial and Poisson traits, respectively), $\hat{\boldsymbol{\Sigma}}_0 = \mathrm{diag}(\hat{\boldsymbol{\Sigma}}_{1,0}, \ldots, \hat{\boldsymbol{\Sigma}}_{n,0})$, and $\hat{\mathbf{W}}_0$ is the dispersion parameter for the errors, estimated as $\hat{\mathbf{W}}_0 = \mathrm{diag}(\mathrm{var}(\hat{\boldsymbol{\mu}}_0), \ldots, \mathrm{var}(\hat{\boldsymbol{\mu}}_0))$ for Gaussian traits and $\hat{\mathbf{W}}_0 = \mathbf{I}_N$ for Binomial and Poisson traits, where $\hat{\boldsymbol{\alpha}}_0$, $\hat{\boldsymbol{\gamma}}_0$, $\hat{\boldsymbol{\mu}}_0$ and $\hat{\boldsymbol{\Sigma}}_0$ are estimated under the null generalized linear mixed model by the restricted maximum likelihood (REML) method (Harville, 1977) and $\mathrm{var}(\cdot)$ is the variance function. This test statistic (Equation 4) is the penalized quasi-likelihood estimating equation in Breslow and Clayton (1993) and the variance component score statistic for testing random effects in Lin (1997) under the above model specifications. This is also the unadjusted variance component score statistic proposed for cSKAT, which is based on the linear mixed model for Gaussian traits (Zhan et al., 2018). Similar test statistics have also been widely used for various family-based and longitudinal studies in genetics and neuroscience (Schifano et al., 2012; Chen et al., 2013; Zhang et al., 2014; Wang et al., 2017), while assuming different variance-covariance structures and/or applying different weighting schemes. Since our p-value computation is based on a permutation approach, the scaling (i.e., $\frac{1}{2}$) and additive [i.e., $\mathrm{tr}(\hat{\mathbf{V}}_0^{-1} \mathbf{K}_{(h)})$] terms do not change the relative ranks of the observed and null (i.e., permuted) statistic values (see P-value calculation). Hence, we use a reduced-form statistic (Equation 5).

$$Q_{(h)} = \left(\mathbf{y}^* - \mathbf{X}\hat{\boldsymbol{\alpha}}_0\right)^T \hat{\mathbf{V}}_0^{-1} \mathbf{K}_{(h)} \hat{\mathbf{V}}_0^{-1} \left(\mathbf{y}^* - \mathbf{X}\hat{\boldsymbol{\alpha}}_0\right) \tag{5}$$
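Given the null-model quantities, the reduced-form statistic in Equation 5 is a single quadratic form. The sketch below evaluates it with NumPy on toy inputs. It assumes that the working vector, the fixed-effect estimates and the inverse covariance matrix have already been obtained from a fitted null model (that fitting step is not shown) and that the inverse covariance matrix is symmetric; the function name `reduced_score_statistic` is hypothetical.

```python
import numpy as np

def reduced_score_statistic(y_star, X, alpha0, V0_inv, K):
    """Equation 5: Q = r^T V0^{-1} K V0^{-1} r with r = y* - X alpha0.

    V0_inv is assumed symmetric, so u = V0_inv r gives Q = u^T K u.
    """
    r = np.asarray(y_star, dtype=float) - X @ alpha0   # working residual
    u = V0_inv @ r
    return float(u @ K @ u)

# Toy inputs (hypothetical values, not from any fitted model).
X = np.ones((3, 1))                 # intercept-only design
alpha0 = np.array([1.0])
y_star = np.array([2.0, 1.0, 0.0])
V0_inv = 0.5 * np.eye(3)
K = np.eye(3)
Q = reduced_score_statistic(y_star, X, alpha0, V0_inv, K)
```

Because only the ranks of the observed and permuted values matter for a permutation p-value, dropping the factor $\frac{1}{2}$ and the trace term of Equation 4 leaves the resulting p-value unchanged.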

#### aGLMM-MiRKAT

The testing performance depends on the choice of distance measure (Zhao et al., 2015). To explain, non-phylogeny-based distances, such as Jaccard (1912) and Bray and Curtis (1957) dissimilarities, measure the disparity only in abundance, while phylogeny-based distances, such as UniFrac distances (Lozupone and Knight, 2005; Lozupone et al., 2007; Chen et al., 2012), measure the disparity in both abundance and phylogeny. Hence, non-phylogeny-based distances are well-suited when associated OTUs differ in abundance, while phylogeny-based distances are well-suited when they differ in both abundance and phylogeny. Moreover, Jaccard dissimilarity and unweighted UniFrac distance are based on incidence information (i.e., presence/absence of OTUs), while Bray-Curtis dissimilarity and weighted UniFrac distance are based on full abundance information [note that the generalized UniFrac distance modulates the intensity of abundance information between the unweighted and weighted UniFrac distances through its parameter θ (Chen et al., 2012)]. Hence, Jaccard dissimilarity and unweighted UniFrac distance are well-suited when associated OTUs are rare, because prevalent OTUs are likely to be present in all samples, while Bray-Curtis dissimilarity and weighted UniFrac distance are well-suited when associated OTUs are abundant. However, prior knowledge about the true association pattern is usually absent in practice. Hence, it is highly challenging to choose a single optimal distance measure. For a robustly high performance throughout various (but unknown) association scenarios, we propose aGLMM-MiRKAT, which is based on the test statistic of the minimum p-value from multiple item-by-item GLMM-MiRKAT analyses (Equation 6).

$$T_{\text{aGLMM-MiRKAT}} = \min_{h \in \Gamma} P_{(h)}, \tag{6}$$

where $P_{(h)}$ is the p-value of the item-by-item GLMM-MiRKAT test based on distance $h$, and $h$ is an index for a distance in a set of candidate ecological distances ($\Gamma$), where $\Gamma$ = {Jaccard dissimilarity, Bray-Curtis dissimilarity, Unweighted UniFrac distance, Generalized UniFrac distance (θ = 0.5), Weighted UniFrac distance}. Note that we do not report the minimum p-value itself (i.e., $T_{\text{aGLMM-MiRKAT}}$) as the final p-value. Instead, $T_{\text{aGLMM-MiRKAT}}$ (Equation 6) is the test statistic of aGLMM-MiRKAT, and we estimate the p-value for aGLMM-MiRKAT ($P_{\text{aGLMM-MiRKAT}}$) using a permutation approach (see P-value calculation). Our extensive simulations reveal that aGLMM-MiRKAT maintains high power throughout all surveyed association scenarios, while the item-by-item GLMM-MiRKAT analyses are powerful only for some association scenarios. Further details are addressed in the Simulation section.
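One common way to turn a minimum p-value statistic into a valid p-value is to reuse the same permutations for all candidate distances, as sketched below. This is an illustrative assumption rather than the authors' exact algorithm (which is given in their supplementary material and may differ in detail, for example in how ties or the +1 correction are handled). The function name `adaptive_min_p` and its inputs are hypothetical: `Q_obs` holds the observed statistic for each of the H distances and `Q_null` holds the B permuted statistics for each distance.

```python
import numpy as np

def adaptive_min_p(Q_obs, Q_null):
    """Item-by-item and adaptive (minimum p-value) permutation p-values.

    Q_obs:  shape (H,), observed statistic per distance.
    Q_null: shape (B, H), permuted statistics; row b uses the same
            permutation for every distance.
    """
    Q_obs = np.asarray(Q_obs, dtype=float)
    Q_null = np.asarray(Q_null, dtype=float)
    B, H = Q_null.shape

    # Item-by-item p-values with the usual +1 correction.
    p_obs = (1 + (Q_null >= Q_obs).sum(axis=0)) / (B + 1)

    # P-value of each permuted statistic relative to the same null draws.
    p_null = np.empty((B, H))
    for h in range(H):
        col = np.sort(Q_null[:, h])
        n_ge = B - np.searchsorted(col, Q_null[:, h], side="left")
        p_null[:, h] = n_ge / B

    T_obs = p_obs.min()               # observed minimum p-value (Equation 6)
    T_null = p_null.min(axis=1)       # its permutation null distribution
    p_adaptive = (1 + (T_null <= T_obs).sum()) / (B + 1)
    return p_obs, T_obs, p_adaptive

# Toy example with B = 4 permutations and H = 2 distances.
Q_null = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]])
p_obs, T_obs, p_adaptive = adaptive_min_p([5.0, 0.5], Q_null)
```

Comparing the observed minimum with the minima obtained under permutation accounts for having searched over several distances, so no separate multiple testing correction is applied to the adaptive p-value.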

#### P-value Calculation

We calculate the p-values for the item-by-item GLMM-MiRKAT tests and aGLMM-MiRKAT using a permutation approach. Our permutation approach is semi-parametric, as we fit the null model $g(\hat{\boldsymbol{\mu}}_0) = \mathbf{X}\hat{\boldsymbol{\alpha}}_0 + \hat{\boldsymbol{\gamma}}_0$ (Equation 2, excluding the microbiome portion) parametrically, and then draw the empirical null distribution of the test statistic (Equations 5, 6) through permutations non-parametrically. In this way, we can estimate the p-values without making distributional assumptions for the microbiome portion. Moreover, we use block permutations to account for any potentially mis-specified within-cluster correlation structure, based on the procedures in Winkler et al. (2015). To be specific, for the random intercept model [i.e., $s_{ij} = 1$ (Equation 1)], we simultaneously permute (1) the whole clusters (only the exchangeable clusters which have the same number of measurements) and (2) the measurements within each cluster. For the random slope model [i.e., $\mathbf{s}_{ij} = (1, t_{ij})^T$ (Equation 1)], we permute only the whole clusters (the exchangeable clusters which have the same number of measurements and the same time points). The detailed procedures for our permutation approach can be found in **S1. Computational algorithm**.
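The block permutation scheme described above can be sketched as an index generator. The Python example below is a simplified illustration, not the authors' implementation: it treats clusters with the same number of measurements as exchangeable and, optionally, also shuffles the measurements within each cluster. For the random slope setting one would additionally require identical time points within an exchangeable group, which this sketch does not check. The function name `block_permutation` is hypothetical.

```python
import numpy as np

def block_permutation(cluster_ids, rng, within=True):
    """Return a permutation of row indices that respects the cluster blocks.

    Whole clusters are swapped only with clusters of the same size; if
    `within` is True, rows are also shuffled inside each cluster.
    """
    cluster_ids = np.asarray(cluster_ids)
    labels = list(dict.fromkeys(cluster_ids.tolist()))   # order of appearance
    rows = {c: np.flatnonzero(cluster_ids == c) for c in labels}

    # Group clusters by size: only same-size clusters are exchanged.
    by_size = {}
    for c in labels:
        by_size.setdefault(rows[c].size, []).append(c)

    perm = np.empty(cluster_ids.size, dtype=int)
    for group in by_size.values():
        shuffled = rng.permutation(len(group))
        for target, k in zip(group, shuffled):
            source_rows = rows[group[k]]
            if within:
                source_rows = rng.permutation(source_rows)
            perm[rows[target]] = source_rows
    return perm

ids = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])
perm_intercept = block_permutation(ids, np.random.default_rng(0))              # random intercept setting
perm_slope = block_permutation(ids, np.random.default_rng(0), within=False)    # whole clusters only
```

Applying such an index vector to the rows and columns of the kernel matrix (or, equivalently, to the residual vector) yields one permuted value of the statistic in Equation 5.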

# RESULTS

## Simulation

#### Simulation Designs

Our simulation designs are based on prior studies (Zhao et al., 2015; Koh et al., 2017; Zhan et al., 2018), but here we conduct more extensive simulation experiments for diverse trait types with different within-cluster correlation structures. In particular, we simulated the data for Gaussian, Binomial and Poisson traits, respectively, based on the following generalized linear mixed models.

$$\begin{aligned} y_{ij} &= 0.5 \times \mathrm{scale}(x_{i1} + x_{ij2}) + \beta \times \mathrm{scale}\Big(\sum_{a \in \mathcal{A}} z_{ija}\Big) + \mathbf{s}_{ij}^T \boldsymbol{\upsilon}_i + \varepsilon_{ij} \\ \mathrm{logit}(P(y_{ij} = 1)) &= 0.5 \times \mathrm{scale}(x_{i1} + x_{ij2}) + \beta \times \mathrm{scale}\Big(\sum_{a \in \mathcal{A}} z_{ija}\Big) + \mathbf{s}_{ij}^T \boldsymbol{\upsilon}_i \\ \log(E(y_{ij})) &= 0.5 \times \mathrm{scale}(x_{i1} + x_{ij2}) + \beta \times \mathrm{scale}\Big(\sum_{a \in \mathcal{A}} z_{ija}\Big) + \mathbf{s}_{ij}^T \boldsymbol{\upsilon}_i \end{aligned}$$

In these equations, $x_{i1}$ is a cluster-specific covariate (e.g., gender) generated from the Bernoulli distribution with success probability 0.5, and $x_{ij2}$ is a non-cluster-specific (e.g., time-varying) covariate generated from $0.5 \times \mathrm{scale}(\sum_{a \in \mathcal{A}} z_{ija}) + N(0, 1)$. Note that $x_{ij2}$ is a confounder, as it is associated with both the microbial composition and the host trait. $\mathcal{A}$ is a set of associated OTUs among the total $p$ OTUs in the community, and $z_{ija}$ is the $a$-th OTU in $\mathcal{A}$. $\beta$ is a regression coefficient for the OTUs in $\mathcal{A}$. scale is the standardization function that yields mean zero and standard deviation one. $\boldsymbol{\upsilon}_i$ is the random effect for the pre-specified $\mathbf{s}_{ij}$, and the $\varepsilon_{ij}$ are errors generated from $N(0, 1)$. We investigate small ($n = 20$) and moderate ($n = 50$) numbers of clusters, respectively, while assigning two, three and four measurements, respectively, to each one third of the clusters (i.e., when $n = 20$, $m_i = 2$ for $i = 1, \ldots, 7$, $m_i = 4$ for $i = 8, \ldots, 14$ and $m_i = 3$ for $i = 15, \ldots, 20$; when $n = 50$, $m_i = 2$ for $i = 1, \ldots, 17$, $m_i = 3$ for $i = 18, \ldots, 34$ and $m_i = 4$ for $i = 35, \ldots, 50$). This is to mimic (possibly) unbalanced numbers of measurements across clusters. As before, we let $i = 1, \ldots, n$, $j = 1, \ldots, m_i$, $k = 1, \ldots, p$ and $l = 1, \ldots, q$. For the random effect $\boldsymbol{\upsilon}_i$, we generate (1) random intercepts and (2) random intercepts and slopes, respectively, as follows. For the random intercepts (i.e., $s_{ij} = 1$), we generate $\upsilon_i$ from $N(0, \sigma_\gamma^2)$, while setting $\sigma_\gamma^2 = 1/2$, $1$ and $3/2$, respectively, to investigate different within-cluster correlations, that is, $\rho_{j \neq j'} = \sigma_\gamma^2 / (\sigma_\gamma^2 + \sigma_\varepsilon^2) = 1/3$, $1/2$ and $3/5$. For the random intercepts and slopes [i.e., $\mathbf{s}_{ij} = (1, j)^T$], we generate $\upsilon_{i1}$ and $\upsilon_{i2}$ from $N(0, \sigma_\gamma^2)$, while setting $\sigma_\gamma^2 = 1/2$, $1$ and $3/2$, respectively, and $t_{ij} = j$, to investigate different within-cluster correlations, that is, $\rho_{j \neq j'} = \sigma_\gamma^2 (1 + j^2) / (\sigma_\gamma^2 (1 + j^2) + \sigma_\varepsilon^2) = (1 + j^2)/(j^2 + 3)$, $(1 + j^2)/(j^2 + 2)$ and $(1 + j^2)/(j^2 + 5/3)$.

For the OTUs in the community, we first estimated proportional means and a dispersion parameter for 856 OTUs (i.e., $p = 856$) in the bacterial kingdom from the real respiratory-tract microbiome data (Charlson et al., 2010). Then, OTU counts for each measurement per cluster (i.e., $\mathbf{Z}_{ij}$ for $i = 1, \ldots, n$, $j = 1, \ldots, m_i$) were generated from the Dirichlet-multinomial distribution (Mosimann, 1962) with the pre-specified parameter values of the estimated proportional means and dispersion. The total reads for each measurement were set to 10,000. To reflect possible within-cluster relatedness among microbial communities, we updated the second and later measurements of the microbial community using a random perturbation function: $\mathbf{Z}_{ij} = \frac{1}{2}(\mathbf{Z}_{i(j-1)} + \mathbf{Z}_{ij})$ for $j = 2, \ldots, m_i$.
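The OTU-count generation step can be illustrated with a short NumPy sketch. Several details here are assumptions made for illustration only: the Dirichlet concentration is taken to be the proportional means multiplied by (1 − θ)/θ for a dispersion parameter θ (a common Dirichlet-multinomial parametrization), and the perturbation is applied sequentially so that each update uses the already-updated previous measurement. The function name `simulate_cluster_counts` and the toy parameter values are hypothetical.

```python
import numpy as np

def simulate_cluster_counts(pi, theta, m_i, depth, rng):
    """Simulate an (m_i x p) OTU table for one cluster.

    pi:    proportional means of the p OTUs (sums to one).
    theta: dispersion in (0, 1); alpha = pi * (1 - theta) / theta (assumed).
    depth: total reads per measurement.
    """
    alpha = np.asarray(pi, dtype=float) * (1.0 - theta) / theta
    Z = np.empty((m_i, alpha.size))
    for j in range(m_i):
        probs = rng.dirichlet(alpha)            # measurement-specific proportions
        Z[j] = rng.multinomial(depth, probs)    # Dirichlet-multinomial counts
    # Perturbation: average each measurement with the previous one,
    # using the already-updated previous row (an assumption).
    for j in range(1, m_i):
        Z[j] = 0.5 * (Z[j - 1] + Z[j])
    return Z

rng = np.random.default_rng(1)
Z_i = simulate_cluster_counts([0.5, 0.3, 0.15, 0.05], theta=0.05,
                              m_i=3, depth=10_000, rng=rng)
```

Averaging two count vectors that each sum to the read depth keeps the row total at the read depth, although the updated entries are no longer guaranteed to be integers.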

To estimate empirical type I error rates, we set β = 0. To estimate statistical powers, we set β = 1, while selecting a set of associated OTUs (A) under four different association scenarios, as in Koh et al. (2017, 2018) and Koh (2018): (1) 50 random OTUs among the OTUs in the lower half of abundance, (2) 50 random OTUs, (3) 50 random OTUs among the OTUs in the upper half of abundance, and (4) OTUs in a cluster among 10 clusters partitioned by the partition around medoids (PAM) algorithm (Reynolds et al., 2006) based on the OTUs' cophenetic distances (Sneath et al., 1975), respectively. The first three scenarios mimic the situations when associated OTUs are rare, medium and abundant, respectively, while the fourth scenario mimics the situation when they are close in phylogeny. For the fourth scenario, we randomized the selection of an associated cluster among the 10 clusters to avoid arbitrary cluster selection. To estimate empirical type I error rates, we conducted 30,000 replicates for each combination of the model, sample size and correlation structure. To estimate statistical powers, we conducted 10,000 replicates for each combination of the model, sample size, correlation structure and association scenario.
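As a rough illustration of how traits can be drawn from the three simulation models, the sketch below generates Gaussian, Binomial or Poisson responses under the random intercept setting. It is a simplified assumption-laden example: standardization is applied over all N measurements with the sample standard deviation, clusters are labelled 0 to n − 1, and the stand-in OTU table in the usage lines is drawn from a Poisson distribution purely for brevity. The names `scale` and `simulate_traits` are hypothetical.

```python
import numpy as np

def scale(v):
    """Standardize to mean zero and (sample) standard deviation one."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std(ddof=1)

def simulate_traits(Z, assoc, cluster_ids, beta, var_gamma, family, rng):
    """Draw traits for all measurements under a random intercept model."""
    cluster_ids = np.asarray(cluster_ids)
    n = int(cluster_ids.max()) + 1                        # clusters labelled 0..n-1
    otu_signal = scale(Z[:, assoc].sum(axis=1))           # scale(sum of associated OTUs)
    x1 = rng.binomial(1, 0.5, size=n)[cluster_ids]        # cluster-specific covariate
    x2 = 0.5 * otu_signal + rng.normal(size=Z.shape[0])   # confounder
    v = rng.normal(scale=np.sqrt(var_gamma), size=n)[cluster_ids]  # random intercepts
    eta = 0.5 * scale(x1 + x2) + beta * otu_signal + v
    if family == "gaussian":
        return eta + rng.normal(size=eta.size)            # errors from N(0, 1)
    if family == "binomial":
        return rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    if family == "poisson":
        return rng.poisson(np.exp(eta))
    raise ValueError("family must be 'gaussian', 'binomial' or 'poisson'")

rng = np.random.default_rng(0)
ids = np.repeat(np.arange(6), 3)                          # 6 clusters, 3 measurements each
Z = rng.poisson(5.0, size=(ids.size, 10))                 # stand-in OTU table
y_null = simulate_traits(Z, [0, 1, 2], ids, beta=0.0, var_gamma=0.5,
                         family="gaussian", rng=rng)      # type I error setting
y_alt = simulate_traits(Z, [0, 1, 2], ids, beta=1.0, var_gamma=0.5,
                        family="binomial", rng=rng)       # power setting
```

Setting `beta=0.0` removes the microbiome effect while keeping the confounder and the random intercepts, which corresponds to the type I error setting described above.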

#### **Model fitting**

We fit the random intercept model (i.e., $s_{ij} = 1$) when the random intercepts are generated, and we fit the random slope model [i.e., $\mathbf{s}_{ij} = (1, j)^T$] when the random intercepts and slopes are generated, while including the two covariates and all the 856 OTUs in the community.

#### Simulation Outcomes

#### **Type I error**

We estimate well-controlled empirical type I error rates at the significance level of 0.05 for every item-by-item GLMM-MiRKAT and aGLMM-MiRKAT test, for every trait type (i.e., Gaussian, Binomial and Poisson traits), for both small (n = 20) and moderate (n = 50) numbers of clusters, for every imposed within-cluster correlation, and for both the random intercept (**Table 1**) and random slope (**Table 2**) models. However, we estimate inflated empirical type I error rates (>0.05) for the prior microbial community-level association tests, OMiRKAT (Zhao et al., 2015), aMiSPU (Wu et al., 2016), OMiAT (Koh et al., 2017), and aMiAD (Koh, 2018) (**Table 3**). This is because these tests treat all the measurements across all the clusters as independent samples, which overstates the amount of independent information in the data. We also observe in general that the higher the within-cluster correlation, the greater the type I error inflation (**Table 3**), because a higher within-cluster correlation implies a smaller effective sample size.

*$K_J$: Jaccard dissimilarity; $K_{BC}$: Bray-Curtis dissimilarity; $K_U$: Unweighted UniFrac distance; $K_{0.5}$: Generalized UniFrac distance (θ = 0.5); $K_W$: Weighted UniFrac distance; adaptive: adaptive GLMM-MiRKAT (aGLMM-MiRKAT). L: low within-cluster correlation ($\rho_{j \neq j'} = 1/3$); M: medium within-cluster correlation ($\rho_{j \neq j'} = 1/2$); H: high within-cluster correlation ($\rho_{j \neq j'} = 3/5$).*
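The link between within-cluster correlation and effective sample size can be illustrated with a standard design-effect calculation. The sketch below is not taken from the paper: it assumes an exchangeable within-cluster correlation ρ, for which a cluster of size m contributes m / (1 + (m − 1)ρ) effectively independent measurements, and it uses the cluster sizes of the n = 50 simulation design. The function name `effective_sample_size` is hypothetical.

```python
def effective_sample_size(cluster_sizes, rho):
    """Sum over clusters of m / (1 + (m - 1) * rho) for exchangeable correlation rho."""
    return sum(m / (1.0 + (m - 1) * rho) for m in cluster_sizes)

# n = 50 design: 17 clusters of size 2, 17 of size 3 and 16 of size 4 (N = 149).
sizes = [2] * 17 + [3] * 17 + [4] * 16
ess_by_rho = {rho: effective_sample_size(sizes, rho)
              for rho in (0.0, 1 / 3, 1 / 2, 3 / 5)}
```

A test that treats all 149 measurements as independent therefore behaves as if it had more information than the data contain, and the gap widens as the correlation grows.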

#### **Power**

We estimate in general that power is higher with the moderate number of clusters (n = 50) (**Figures 1**, **2**) than with the small number of clusters (n = 20) (**Figures S1**, **S2**), yet we observe the same comparative powers among the different GLMM-MiRKAT analyses for the small (n = 20) and moderate (n = 50) numbers of clusters. Thus, to save space, the power outcomes for the small number of clusters (n = 20) are placed in **Figures S1**, **S2**.

We estimate in general that the Gaussian models (**Figures 1A–C**, **2A–C**) are more powerful than the Binomial (**Figures 1D–F**, **2D–F**) and Poisson (**Figures 1G–I**, **2G–I**) models, where the Binomial models are the least powerful. This is because continuous traits are more informative than discrete traits, not because our methods better suit the Gaussian models. We also observe in general that the higher the within-cluster correlation, the lower the power (i.e., **Figures 1A,D,G**, **2A,D,G** > **Figures 1B,E,H**, **2B,E,H** > **Figures 1C,F,I**, **2C,F,I**), because a higher within-cluster correlation implies a smaller effective sample size. We observe similar comparative powers among the different GLMM-MiRKAT analyses across the Gaussian, Binomial and Poisson models for both the random intercept (**Figure 1**) and random slope (**Figure 2**) models. We describe the comparative powers in detail below.

*$K_J$: Jaccard dissimilarity; $K_{BC}$: Bray-Curtis dissimilarity; $K_U$: Unweighted UniFrac distance; $K_{0.5}$: Generalized UniFrac distance (θ = 0.5); $K_W$: Weighted UniFrac distance; adaptive: adaptive GLMM-MiRKAT (aGLMM-MiRKAT). L: low within-cluster correlation [$\rho_{j \neq j'} = (1 + j^2)/(j^2 + 3)$]; M: medium within-cluster correlation [$\rho_{j \neq j'} = (1 + j^2)/(j^2 + 2)$]; H: high within-cluster correlation [$\rho_{j \neq j'} = (1 + j^2)/(j^2 + 5/3)$].*

GLMM-MiRKAT using Jaccard dissimilarity or unweighted UniFrac distance is more powerful in the first scenario when associated OTUs are rare in abundance (**Figures 1**, **2**: P1), while GLMM-MiRKAT using Bray-Curtis dissimilarity or weighted UniFrac distance is relatively more powerful in the second and third scenarios when associated OTUs are mid-abundant and abundant (**Figures 1**, **2**: P2-P3), as expected from their distinct weighting schemes. GLMM-MiRKAT using weighted UniFrac distance or generalized UniFrac distance is more powerful in the fourth scenario when associated OTUs are close in phylogeny (**Figures 1**, **2**: P4), where GLMM-MiRKAT using Jaccard dissimilarity or Bray-Curtis dissimilarity is less powerful (**Figures 1**, **2**: P4), as expected from their use or non-use of phylogenetic information. Notably, none of the item-by-item GLMM-MiRKAT analyses are consistently powerful throughout all different association scenarios (i.e., they are powerful for some scenarios to which they are well-suited, but they are underpowered for the other scenarios to which they are not well-suited) (**Figures 1**, **2**). On the contrary, we estimate that the adaptive test of GLMM-MiRKAT, aGLMM-MiRKAT, is robustly powerful (closely reaching the highest power among the item-by-item GLMM-MiRKAT analyses) throughout all different association scenarios (**Figures 1**, **2**).

TABLE 3 | Estimated type I error rates at the significance level of 5% for the prior microbial community-level association tests, OMiRKAT, aMiSPU, OMiAT, and aMiAD, for the clustered microbiome data (Unit: %).

*L: low within-cluster correlation [$\rho_{j \neq j'} = 1/3$ for the random intercepts, $\rho_{j \neq j'} = (1 + j^2)/(j^2 + 3)$ for the random intercepts and slopes]; M: medium within-cluster correlation [$\rho_{j \neq j'} = 1/2$ for the random intercepts, $\rho_{j \neq j'} = (1 + j^2)/(j^2 + 2)$ for the random intercepts and slopes]; H: high within-cluster correlation [$\rho_{j \neq j'} = 3/5$ for the random intercepts, $\rho_{j \neq j'} = (1 + j^2)/(j^2 + 5/3)$ for the random intercepts and slopes].*

We additionally compare aGLMM-MiRKAT with the item-by-item cSKAT analyses for the random intercept Gaussian models, as cSKAT can handle only Gaussian traits based on the random intercept model (Zhan et al., 2018). Similar to the previous item-by-item GLMM-MiRKAT analysis outcomes, none of the item-by-item cSKAT analyses are consistently powerful throughout all different association scenarios (i.e., they are powerful for some scenarios to which they are well-suited, but they are underpowered for the other scenarios to which they are not well-suited) (**Figure 3**). Here again, we observe that aGLMM-MiRKAT maintains a high power throughout all different scenarios (**Figure 3**).

## Real Data Applications

#### A Family-Based Study on the Association Between Obesity and Gut Microbiota

Goodrich et al. (2014) collected fecal samples from the United Kingdom twin population to study the roles of host genetics in the gut microbiome, while addressing a breadth of associations between obesity indices and gut microbiota. Here, we analyze a small portion of the original data to evaluate the association between BMI and microbial community composition. The raw sequence data are publicly available in the European Bioinformatics Institute (EBI) repository (accession codes: ERP006339 and ERP006342). We processed them using the QIIME pipeline (Caporaso et al., 2010) with open-reference OTU picking by targeting the V4 region of the 16S ribosomal RNA (rRNA) gene, quantified OTUs at the 97% sequence similarity level, and constructed a phylogenetic tree. Among the total of 1,024 measurements from 536 families, we focused on monozygotic twins. After excluding measurements with low sequencing depth (i.e., <10,000 total reads), 311 measurements from 145 families were included in our analysis. The data originally include 7,365 OTUs, but we removed OTUs with average relative abundance < $10^{-5}$, and then the data were rarefied to control for unequal library sizes (Weiss et al., 2017); as such, 2,128 OTUs were included in our analysis.
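The filtering and rarefaction steps described above can be sketched with NumPy as follows. This is an illustrative assumption, not the pipeline actually used for the data: the order of operations follows the description, rarefaction is done by subsampling reads without replacement, and the rarefaction depth is taken to be the smallest remaining library size (the text does not state the depth). The function name `filter_and_rarefy` and the toy table are hypothetical.

```python
import numpy as np

def filter_and_rarefy(counts, min_depth, min_mean_rel_abund, rng):
    """Drop shallow samples and rare OTUs, then rarefy to a common depth."""
    counts = np.asarray(counts, dtype=np.int64)
    keep_samples = counts.sum(axis=1) >= min_depth           # sequencing depth filter
    counts = counts[keep_samples]
    rel = counts / counts.sum(axis=1, keepdims=True)         # relative abundances
    keep_otus = rel.mean(axis=0) >= min_mean_rel_abund       # mean abundance filter
    counts = counts[:, keep_otus]
    depth = int(counts.sum(axis=1).min())                    # assumed rarefaction depth
    # Subsample `depth` reads per sample without replacement.
    rarefied = np.vstack([rng.multivariate_hypergeometric(row, depth)
                          for row in counts])
    return rarefied, keep_samples, keep_otus

# Toy table: 3 samples by 4 OTUs; thresholds chosen for the toy data only.
toy = np.array([[5000, 4000, 1000, 0],
                [200, 100, 50, 0],
                [6000, 3000, 2000, 1]])
rarefied, keep_samples, keep_otus = filter_and_rarefy(
    toy, min_depth=1000, min_mean_rel_abund=1e-3, rng=np.random.default_rng(0))
```

In the toy example the second sample is removed for low depth, the fourth OTU is removed for low mean relative abundance, and both remaining samples end up with the same total number of reads.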

We first visually check principal coordinate analysis (PCoA) plots based on each distance measure to see if there is any disparity in microbial composition by BMI category [i.e., underweight: BMI (kg/m<sup>2</sup>) < 18.5; normal: 18.5 ≤ BMI (kg/m<sup>2</sup>) < 25; overweight: 25 ≤ BMI (kg/m<sup>2</sup>) < 30; obese: 30 ≤ BMI (kg/m<sup>2</sup>)] (**Figure 4**). Visual inspection does not show a clear separation by BMI category, and we observe the smallest separation based on weighted UniFrac distance (**Figure 4**).

We fitted GLMM-MiRKAT with random intercepts for BMI on a continuous scale (Gaussian traits), adjusting for age. GLMM-MiRKAT using Jaccard dissimilarity (p-value: <0.001), Bray-Curtis dissimilarity (p-value: <0.001), unweighted UniFrac distance (p-value: <0.001) or generalized UniFrac distance (θ = 0.5) (p-value: 0.005) detects a significant association between BMI and microbial composition, while GLMM-MiRKAT using weighted UniFrac distance (p-value: 0.157) does not. This matches our visual inspection of the smallest separation for the weighted UniFrac distance (**Figure 4**). This also indicates that the item-by-item GLMM-MiRKAT analyses are considerably sensitive to the choice of distance measure. aGLMM-MiRKAT detects a significant association (p-value: <0.001).

FIGURE 1 | Estimated statistical powers for GLMM-MiRKAT/aGLMM-MiRKAT based on the random intercept model with Gaussian, Binomial or Poisson responses (*n* = 50) (Unit: %). L: low within-cluster correlation (ρ<sub>j≠j′</sub> = 1/3); M: medium within-cluster correlation (ρ<sub>j≠j′</sub> = 1/2); H: high within-cluster correlation (ρ<sub>j≠j′</sub> = 3/5). *K<sub>J</sub>*: Jaccard dissimilarity; *K<sub>BC</sub>*: Bray-Curtis dissimilarity; *K<sub>U</sub>*: Unweighted UniFrac distance; *K<sub>0.5</sub>*: Generalized UniFrac distance (θ = 0.5); *K<sub>W</sub>*: Weighted UniFrac distance; *adaptive*: adaptive GLMM-MiRKAT (aGLMM-MiRKAT). P1, P2, P3, and P4 represent the four different association scenarios: P1. A = {50 random OTUs in lower half of abundance}; P2. A = {50 random OTUs}; P3. A = {50 random OTUs in upper half of abundance}; P4. A = {A random cluster among 10 clusters partitioned by PAM}. (A) Gaussian (L); (B) Gaussian (M); (C) Gaussian (H); (D) Binomial (L); (E) Binomial (M); (F) Binomial (H); (G) Poisson (L); (H) Poisson (M); (I) Poisson (H).

For another demonstration, we fitted GLMM-MiRKAT with random intercepts for BMI on a binary scale (Binomial traits), adjusting for age and comparing the normal and obese populations (i.e., 140 measurements from 85 families in the normal vs. 63 measurements from 41 families in the obese). However, we could not find any significant association by any item-by-item [i.e., Jaccard dissimilarity (p-value: 0.354), Bray-Curtis dissimilarity (p-value: 0.107), unweighted UniFrac distance (p-value: 0.336), generalized UniFrac distance (θ = 0.5) (p-value: 0.231), weighted UniFrac distance (p-value: 0.333)] or adaptive [i.e., aGLMM-MiRKAT (p-value: 0.253)] analysis. This power loss is likely related to the reduced sample size in the selected comparison. It may also indicate that BMI on a continuous scale is more informative than BMI on a binary scale, which matches our simulation result, where the Gaussian models are more powerful than the Binomial models (**Figures 1**, **2**).

FIGURE 2 | Estimated statistical powers for GLMM-MiRKAT/aGLMM-MiRKAT based on the random slope model with Gaussian, Binomial or Poisson responses (*n* = 50) (Unit: %). L: low within-cluster correlation (ρ<sub>j≠j′</sub> = 1/3); M: medium within-cluster correlation (ρ<sub>j≠j′</sub> = 1/2); H: high within-cluster correlation (ρ<sub>j≠j′</sub> = 3/5). *K<sub>J</sub>*: Jaccard dissimilarity; *K<sub>BC</sub>*: Bray-Curtis dissimilarity; *K<sub>U</sub>*: Unweighted UniFrac distance; *K<sub>0.5</sub>*: Generalized UniFrac distance (θ = 0.5); *K<sub>W</sub>*: Weighted UniFrac distance; *adaptive*: adaptive GLMM-MiRKAT (aGLMM-MiRKAT). P1, P2, P3, and P4 represent the four different association scenarios: P1. A = {50 random OTUs in lower half of abundance}; P2. A = {50 random OTUs}; P3. A = {50 random OTUs in upper half of abundance}; P4. A = {A random cluster among 10 clusters partitioned by PAM}. (A) Gaussian (L); (B) Gaussian (M); (C) Gaussian (H); (D) Binomial (L); (E) Binomial (M); (F) Binomial (H); (G) Poisson (L); (H) Poisson (M); (I) Poisson (H).

(θ =0.5); *KW* : cSKAT for Weighted UniFrac distance; *adaptive*: adaptive GLMM-MiRKAT (aGLMM-MiRKAT). P1, P2, P3, and P4 represent the four different association scenarios: P1. A = {50 random OTUs in lower half of abundance}; P2. A = {50 random OTUs}; P3. A = {50 random OTUs in upper half of abundance}; P4. A = {A random cluster among 10 clusters partitioned by PAM}. (A) *n* = 20 (L); (B) *n* = 20 (M); (C) *n* = 20 (H); (D) *n* = 50 (L); (E) *n* = 50 (M); (F) *n* = 50 (H).

#### A Longitudinal Study on the Association Between the Frequency of Antibiotic Use and Gut Microbiota

Zhang et al. (2018a) collected fecal, cecal and ileal samples from non-obese diabetic mice for microbiome profiling studies based on a longitudinal study design, to evaluate whether the intestinal microbiota altered by early-life antibiotic exposure affects maturation of innate immunity. The raw sequence data are publicly available in the Qiita database (Identifier: 11242). We processed them using the QIIME pipeline (Caporaso et al., 2010) with open reference-based OTU picking, targeting the V4 region of the 16S rRNA gene; we quantified OTUs at the 97% sequence similarity level and constructed a phylogenetic tree. The original study (Zhang et al., 2018a) contains an enormous amount of data for a number of sub-studies, but, to demonstrate our proposed method, we analyze only a small portion of the data. To be specific, we focused on fecal samples to evaluate the disparity in microbial community composition by the frequency of antibiotic use (i.e., 0, 1, 2, and 3 course(s) of antibiotic use). After excluding measurements with low sequencing depth (i.e., <10,000 total reads), 229 measurements from 87 mice were included in our analysis. The study design is longitudinal and unbalanced in that mice have different numbers of repeated measurements: 61 mice have three measurements, 20 mice have two measurements and 6 mice have one measurement across different time points. Among the total of 229 measurements, 120 have had no antibiotic use, 43 have had one course of antibiotic use, 26 have had two courses of antibiotic use, and 40 have had three courses of antibiotic use.

Here, we first visually check the PCoA plots based on each distance measure to see if there is any disparity in microbial composition by the number of courses of antibiotic use (**Figure 5**). We observe a very clear visual separation for every distance measure, especially between the group with no antibiotic use and the groups with at least one course of antibiotic use (**Figure 5**).

We fitted GLMM-MiRKAT with random intercepts for the number of courses of antibiotic use (Poisson traits) (i.e., 0, 1, 2, and 3 course(s) of antibiotic use), adjusting for gender. We found a significant association between the number of courses of antibiotic use and microbial composition in all the item-by-item analyses [i.e., Jaccard dissimilarity (p-value: <0.001), Bray-Curtis dissimilarity (p-value: <0.001), unweighted UniFrac distance (p-value: <0.001), generalized UniFrac distance (θ = 0.5) (p-value: <0.001), weighted UniFrac distance (p-value: <0.001)]. We also found a significant association for aGLMM-MiRKAT (p-value: <0.001).

UniFrac distance (θ = 0.5); W. UniFrac: weighted UniFrac distance.

## DISCUSSION

In this paper, we introduced a distance-based kernel association test based on the generalized linear mixed model, GLMM-MiRKAT, for correlated (e.g., family-based or longitudinal) microbiome studies. GLMM-MiRKAT can relate microbial community composition to any type of host trait whose distribution belongs to the exponential family. Thus, GLMM-MiRKAT can be regarded as an extension of cSKAT (Zhan et al., 2018) to handle non-Gaussian host traits. Furthermore, we developed aGLMM-MiRKAT to incorporate multiple kernels for robustly high power. aGLMM-MiRKAT is especially useful in practice, where there are various types of host traits but our knowledge about the true association pattern is limited.

We calculate the p-values for the item-by-item GLMM-MiRKAT and aGLMM-MiRKAT using a permutation approach. The permutation approach makes no distributional assumptions and is therefore robust for both small and large sample sizes. GLMM-MiRKAT/aGLMM-MiRKAT can be implemented for either the random intercept model or the random slope model, while cSKAT is only for the random intercept model. For the random intercept model, we permute both the whole exchangeable clusters and the measurements within each cluster. We can do so because the random intercept model assumes an exchangeable (a.k.a. compound symmetry) within-cluster correlation structure. Therefore, for the random intercept model, our permutation approach works in any study design with either balanced or unbalanced numbers of measurements per cluster. However, for the random slope model, we permute only the whole exchangeable clusters. Therefore, for the random slope model, our permutation approach is limited to the balanced study design with a sufficient number of whole exchangeable clusters. In practice, the random intercept model has been more widely used for many prior tests (Min and Agresti, 2005; Schifano et al., 2012; Chen et al., 2013; Zhang et al., 2014; Chen and Li, 2016; Wang et al., 2017) because the random intercepts are usually sufficient to capture the within-cluster correlation structure in responses. The model selection procedures are beyond the scope of this study and we defer the details to popular longitudinal data analysis books.
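The permutation scheme for the random intercept model can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in the software package: the function name `permute_random_intercept` is hypothetical, clusters of equal size are treated here as the exchangeable units so that whole blocks can be swapped one-for-one, and how the resulting index vector is applied (for example to the rows and columns of a kernel matrix) is left open.

```python
import random
from collections import defaultdict


def permute_random_intercept(cluster_ids, rng):
    """Return a permuted index vector that keeps cluster blocks intact.

    Two moves are combined:
      1. clusters of equal size (assumed exchangeable) swap places as units;
      2. measurements are shuffled within each cluster.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    # Group clusters by size; only equal-sized clusters are swapped.
    by_size = defaultdict(list)
    for cid, idxs in members.items():
        by_size[len(idxs)].append(cid)

    perm = list(range(len(cluster_ids)))
    for cids in by_size.values():
        donors = cids[:]
        rng.shuffle(donors)  # whole-cluster move
        for target, donor in zip(cids, donors):
            source = members[donor][:]
            rng.shuffle(source)  # within-cluster move
            for pos, src in zip(members[target], source):
                perm[pos] = src
    return perm


# Unbalanced toy design (hypothetical labels): cluster sizes 3, 2, 3, 1.
rng = random.Random(1)
ids = ["a", "a", "a", "b", "b", "c", "c", "c", "d"]
perm = permute_random_intercept(ids, rng)
```

Dropping the within-cluster shuffle leaves only the whole-cluster move, which corresponds to the more restrictive scheme described for the random slope model.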

Throughout this paper, we have surveyed the bacterial kingdom as the microbial community of interest because it is usually of shared interest (bacteria make up most of the human microbiota). However, without loss of generality, the methods can be applied to any other microbial communities, such as the kingdom of yeasts, fungi or viruses, or the lower level microbial assemblages (e.g., phyla, classes) (Koh et al., 2017). We use OTUs as the sub-units constituting the microbial community because they are often used as surrogates for microbial species. However, any other sub-units (e.g., phyla, species, genera) can be used instead at the researcher's choice. We considered the ecological distance measures [i.e., Jaccard dissimilarity (Jaccard, 1912), Bray-Curtis dissimilarity (Bray and Curtis, 1957) or UniFrac distances (Lozupone and Knight, 2005; Lozupone et al., 2007; Chen et al., 2012)] due to their popularity in the microbiome research community. However, any other distance measures or kernel matrices can be used instead at the researcher's choice. We also make no distinction between 16S rRNA gene sequencing (Hamady and Knight, 2009; Caporaso et al., 2010) and shotgun metagenomic sequencing (Thomas et al., 2012) for the use of our proposed methods.

## AUTHOR CONTRIBUTIONS

HK, NZ, and YL developed the method. HK performed the simulation experiments and real data analyses, and developed the software package. NZ, XZ, and JC contributed to simulations and real data analyses. HK and NZ wrote the manuscript. All authors read and approved the final manuscript.

### FUNDING

This study was supported in part by NIH for the Environmental Influences of Child Health Outcomes (ECHO) Data Analysis Center (U24OD023382) and Johns Hopkins University Center for AIDS Research (1P30AI094189).


#### ACKNOWLEDGMENTS

The authors are grateful to the reviewers for their insightful observations and comments.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00458/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Koh, Li, Zhan, Chen and Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multi-Omic Analysis of the Microbiome and Metabolome in Healthy Subjects Reveals Microbiome-Dependent Relationships Between Diet and Metabolites

Zheng-Zheng Tang<sup>1,2†</sup>, Guanhua Chen<sup>1†</sup>, Qilin Hong<sup>3</sup>, Shi Huang<sup>4</sup>, Holly M. Smith<sup>5</sup>, Rachana D. Shah<sup>6</sup>, Matthew Scholz<sup>7</sup> and Jane F. Ferguson<sup>5,8</sup>\*

#### Edited by:

Lingling An, The University of Arizona, United States

#### Reviewed by:

Zhigang Li, University of Florida, United States Alexander Alekseyenko, Medical University of South Carolina, United States

#### \*Correspondence:

Jane F. Ferguson jane.f.ferguson@vumc.org †These authors have contributed equally to this work

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

> Received: 29 January 2019 Accepted: 30 April 2019 Published: 17 May 2019

#### Citation:

Tang Z-Z, Chen G, Hong Q, Huang S, Smith HM, Shah RD, Scholz M and Ferguson JF (2019) Multi-Omic Analysis of the Microbiome and Metabolome in Healthy Subjects Reveals Microbiome-Dependent Relationships Between Diet and Metabolites. Front. Genet. 10:454. doi: 10.3389/fgene.2019.00454

<sup>1</sup> Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison, WI, United States, <sup>2</sup> Wisconsin Institute for Discovery, Madison, WI, United States, <sup>3</sup> Department of Statistics, University of Wisconsin–Madison, Madison, WI, United States, <sup>4</sup> Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States, <sup>5</sup> Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN, United States, <sup>6</sup> Division of Pediatric Endocrinology, Children's Hospital of Philadelphia, Philadelphia, PA, United States, <sup>7</sup> Vanderbilt Technologies for Advanced Genomics (VANTAGE), Vanderbilt University Medical Center, Nashville, TN, United States, <sup>8</sup> Vanderbilt Translational and Clinical Cardiovascular Research Center (VTRACC), Vanderbilt University Medical Center, Nashville, TN, United States

The human microbiome has been associated with health status, and risk of disease development. While the etiology of microbiome-mediated disease remains to be fully elucidated, one mechanism may be through microbial metabolism. Metabolites produced by commensal organisms, including in response to host diet, may affect host metabolic processes, with potentially protective or pathogenic consequences. We conducted multi-omic phenotyping of healthy subjects (N = 136), in order to investigate the interaction between diet, the microbiome, and the metabolome in a cross-sectional sample. We analyzed the nutrient composition of self-reported diet (3-day food records and food frequency questionnaires). We profiled the gut and oral microbiome (16S rRNA) from stool and saliva, and applied metabolomic profiling to plasma and stool samples in a subset of individuals (N = 75). We analyzed these multi-omic data to investigate the relationship between diet, the microbiome, and the gut and circulating metabolome. On a global level, we observed significant relationships, particularly between long-term diet, the gut microbiome and the metabolome. Intake of plant-derived nutrients as well as consumption of artificial sweeteners were associated with significant differences in circulating metabolites, particularly bile acids, which were dependent on gut enterotype, indicating that microbiome composition mediates the effect of diet on host physiology. Our analysis identifies dietary compounds and phytochemicals that may modulate bacterial abundance within the gut and interact with microbiome composition to alter host metabolism.

Keywords: microbiome, diet, metabolome, multi-omics analysis, mediation, interaction

# INTRODUCTION

The human microbiome is a complex ecosystem of bacteria, viruses, fungi, and bacteriophages, which interact with each other and their host (Sears, 2005; Goodman and Gordon, 2010; Minot et al., 2011). Microbiome composition is unique to an individual, is established early in life, and plays a crucial role in lifelong health (Kau et al., 2011; Minot et al., 2011; Maynard et al., 2012; Koren et al., 2013; Mohammadkhah et al., 2018). Recent discoveries implicating the microbiome in disease have been paradigm-shifting. However, we do not yet understand the molecular mechanisms linking microbiota to health status.

There is considerable site-specificity in microbiome composition, with distinct populations residing within each body site of an individual (Faust et al., 2012; Ding and Schloss, 2014). The relative contributions of the microbiota at each body site to overall host health are not yet clearly defined, but are likely to depend on both the nature of the disease and the overall health of the host (Zhang et al., 2015). The microbiome composition of the gut is of particular interest, given its location at the crucial interface between exogenous dietary intake and internal nutrient metabolism. Translocation of microbes and microbial metabolites from the intestine to the bloodstream may occur in the absence of intestinal disease, for example during diet-induced post-prandial metabolic endotoxemia (Moreira et al., 2012; Pendyala et al., 2012; Piya et al., 2013). The gut microbiome, in combination with habitual diet, is likely to play a major role in determining gut mucosal membrane permeability and influencing systemic inflammation (Moreira et al., 2012; Pendyala et al., 2012).

Numerous factors determine the specific population of microbiota in humans, with diet being a key contributor (Zeevi et al., 2015; Ferguson et al., 2016). Specific dietary components act as substrates for microbial metabolism, shaping microbiome composition and function. Multiple macronutrient-microbiome associations have been reported, including carbohydrate intake and Prevotella abundance (Wu et al., 2011), saturated fat intake and Bacteroides and Faecalibacterium prausnitzii, and animal protein intake and Bacteroides and Alistipes (De Filippo et al., 2010; Cotillard et al., 2013; David et al., 2014). Microbiome composition has been linked to disease through modulation of specific metabolites and signaling pathways (Wang et al., 2011; Koeth et al., 2013; Marcobal et al., 2013; Tang et al., 2013). Gut microbial metabolism of animal-product-derived carnitine to the pro-atherogenic metabolite trimethylamine N-Oxide (TMAO) has been found to associate with increased atherosclerotic risk (Wang et al., 2011; Koeth et al., 2013). Many other dietary components may modulate disease risk through parallel mechanisms.

We hypothesized that habitual diet is associated with microbiome composition in healthy humans, and that microbiome composition is associated with gut and plasma metabolites. Using multi-omic sample analysis in up to 150 healthy subjects we profiled the microbiome (16S rRNA; stool and saliva) and the metabolome (stool and plasma) to examine the interaction between diet, the microbiome, and systemic metabolism. Our results identify global relationships and highlight novel associations between specific dietary components and circulating metabolites, that are modulated by gut bacteria, and may have consequences on health status and future disease risk.

# MATERIALS AND METHODS

# Study Population

The ABO Glycoproteomics in Platelets and Endothelial Cells (ABO) Study recruited healthy volunteers (N = 150; men and non-pregnant/lactating women age 18–50) to a protocol at the University of Pennsylvania from 2012–2014. Exclusion criteria included known illnesses, history of organ transplant, tobacco, and prescription medication use (except oral contraceptives). Participants were instructed to avoid over-the-counter medications, supplements, and vitamins for the 2-week period prior to the scheduled visit. Subjects provided a fasting blood sample (following a 12-h overnight fast). As part of a diet and microbiome-focused sub-study, reported here, subjects provided a stool and saliva sample for microbiome analysis (N = 136 with stool samples). All subjects completed validated 3-day food records prior to the study visit (Trabulsi and Schoeller, 2001), including on the day directly before the visit, and a weekend day. Nutrient composition was analyzed using Food Processor 8.1 (ESHA Research, Salem, OR). In addition, all subjects completed food frequency questionnaires (FFQ) to assess habitual dietary intake, including serving size, of 134 food items over the previous year [the National Cancer Institute's Diet History Questionnaire (DHQ I)] (Subar et al., 2001, 2010). Completed subject responses were analyzed using Diet\*Calc version 1.5.1. Diet data were converted to nutrient intake values of 191 long-term dietary variables and 139 short-term dietary variables. All subjects provided written informed consent. The study was approved by the Institutional Review Boards of the University of Pennsylvania and Vanderbilt University.

# Sample Processing, DNA Extraction and Sequencing

Subjects collected a stool sample within the 24 h prior to the study visit, using a stool collection kit (Commode Specimen Collection System, Fisher Scientific, Pittsburgh, PA, United States) provided to them. Samples were stored at 4°C and aliquots made within 36 h of sample collection. Processed samples were stored at −80°C prior to nucleic acid extraction. Subjects were instructed to brush their teeth and floss if desired, but not to use mouthwash, following their final meal on the day before the visit (>12 h before visit). Subjects were further instructed not to brush their teeth or use floss or mouthwash on the morning of their visit. Saliva samples were collected using the OMNIGene Discover OM505 DNA/RNA collection kit (DNA Genotek). Following collection, samples were divided into aliquots, and stored at −80°C prior to nucleic acid extraction. DNA was isolated from stool and saliva samples using the PSP Spin Stool DNA Plus Kit (Stratec, Germany). The 16S rRNA gene region was amplified using barcoded primers (Caporaso et al., 2012) (Eurofins Genomics, Louisville, KY, United States) and DNA libraries were cleaned (MinElute PCR Purification kit, Qiagen, Germantown, MD, United States) prior to quantification and pooling. Pooled DNA libraries were sequenced on the MiSeq platform, 300 bp paired-end reads, at an average depth of 158,000 reads/sample (Illumina Inc., San Diego, CA, United States). Stool samples were sequenced in two batches, at the University of Pennsylvania Next-Generation Sequencing Center (UPenn NGSC, N = 107) and the Vanderbilt University Technologies for Advanced Genomics (VANTAGE) Core (N = 29). All saliva samples (N = 85) were sequenced in one batch at VANTAGE. DNA sequences in Fastq files were de-multiplexed, assembled, clustered, and phylogenetically classified using the Mothur pipeline (Schloss et al., 2009). Phylogenetic classification was performed against the Silva V123 16S database. Mothur was run using standard cutoffs, creating OTU clusters at 97% identity.

# Metabolomics

Samples for a subset of individuals (N = 75 plasma and N = 75 stool, matched subjects) were profiled at Metabolon (Metabolon Inc., Morrisville, NC, United States) using their global metabolomics platform, which can identify and quantitate >1,000 metabolites through multiple mass spectrometry methods. In our study, 812 metabolites were detected in plasma, and 770 in stool samples. For each metabolite, the raw peak intensity was rescaled to set the median across all samples equal to 1, and values below the limit of detection were imputed with the lowest observed value in the dataset. Metabolite pathway enrichment analysis was conducted using MetaboAnalyst (Xia and Wishart, 2011).
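The rescaling and imputation step can be illustrated for a single metabolite with a minimal sketch. Several details are assumptions made here rather than stated in the text: the median is taken over detected values only, the minimum used for imputation is taken per metabolite, scaling is applied before imputation, and the function name is hypothetical.

```python
from statistics import median


def rescale_and_impute(intensities):
    """Median-scale one metabolite and fill values below detection.

    `intensities` holds raw peak intensities across samples, with None
    marking values below the limit of detection.
    """
    observed = [v for v in intensities if v is not None]
    med = median(observed)  # assumed: median over detected values only
    scaled = [None if v is None else v / med for v in intensities]
    floor = min(v for v in scaled if v is not None)  # lowest observed value
    return [floor if v is None else v for v in scaled]


# Toy example (hypothetical intensities); the second sample is undetected.
processed = rescale_and_impute([2.0, None, 4.0, 8.0])
```

Scaling each metabolite to a common median puts metabolites with very different raw intensity ranges on a comparable footing before downstream normalization.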

# Data Processing for Microbiome, Dietary and Metabolite Variables

Data processing and statistical analysis were performed in R. For the stool microbiome dataset, the OTUs were classified into 11 phyla, 20 classes, 21 orders, 32 families, and 130 genera. For the saliva microbiome dataset, the OTUs were classified into 13 phyla, 21 classes, 32 orders, 52 families, and 103 genera. We obtained two independent measures of dietary intake: 3-day food diaries (for short-term recent diet) and a food frequency questionnaire (FFQ, for long-term habitual diet). Dietary and metabolite variables were normalized using inverse normal transformation (INT), and transformed variables that did not follow a normal distribution (Shapiro–Wilk test p < 0.05) were removed (Maritz, 1995). These removed variables had very small variability and/or had many tied observations. The remaining dietary variables were further normalized using the residual method to adjust for total caloric intake and gender, and standardized to have mean of 0 and SD of 1. Since some dietary variables were almost identical, we chose one representative for each highly correlated cluster (Spearman correlation > 0.9), resulting in 91 long-term dietary variables and 82 short-term dietary variables in the final dataset for the downstream analysis. The complete list mapping dietary variables to the selected representative variables is available in **Supplementary Tables S1**, **S2**. In order to group metabolites that were highly correlated, we defined metabolic modules using weighted correlation network analysis (WGCNA) (Langfelder and Horvath, 2008). WGCNA has been shown to be an efficient and robust method for grouping metabolomic data (McHardy et al., 2013) and allows us to summarize each module by its module eigenvalue. Using WGCNA, the gut metabolites were organized into 8 modules with 40 un-clustered metabolites, and plasma metabolites were organized into 16 modules with 169 un-clustered metabolites. The complete list of metabolites and their module organization is available in **Supplementary Tables S3**, **S4**. The abundance values of the un-clustered metabolites were combined with standardized module eigenvalues in the downstream analysis.
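As an illustration of the inverse normal transformation applied to dietary and metabolite variables, the following is a minimal rank-based sketch. The exact variant used in the analysis is not specified in the text; the Blom offset (c = 3/8) and the use of average ranks for ties are assumptions made here, and the function name is hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based inverse normal transformation with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Map each rank to a standard normal quantile.
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Toy example (hypothetical values) with one strong outlier.
z = inverse_normal_transform([10.0, 3.0, 7.0, 3.5, 100.0])
```

Because only ranks enter the transformation, the outlier is pulled in to the largest normal quantile. Variables with many ties receive repeated quantiles, which is consistent with such variables failing a subsequent normality check.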

# Distance Correlation Analysis

To evaluate the global association between pairs of high-dimensional variables among diet, microbiome and metabolomics, we used the distance correlation t-test (Székely and Rizzo, 2013) implemented in the R package "energy" to test the dependence among each pair of these three data types. Compared to Pearson correlation, the distance correlation (Székely et al., 2007; Székely and Rizzo, 2009) is a nonparametric approach (without distributional assumptions) and has the power to detect general (non-linear) dependence between two sets of high-dimensional random variables. The distance correlation t-test allows the dimension of the random vectors to be larger than the sample size. The ability to detect general dependence and to handle high-dimensional data makes the distance correlation t-test suitable for analyzing this dataset.
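To make the underlying statistic concrete, the following is a minimal sketch of the plain sample distance correlation of Székely et al. (2007). It is not the bias-corrected statistic or the t-test used in the analysis (those are provided by the "energy" package), and the function names are hypothetical.

```python
import math


def _centered_distances(rows):
    """Pairwise Euclidean distances, double-centered by row and grand means."""
    n = len(rows)
    d = [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in d]  # equals column means by symmetry
    grand = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation between two sets of row vectors."""
    a, b = _centered_distances(x), _centered_distances(y)
    n = len(a)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(v * v for r in a for v in r) / n**2
    dvar_y = sum(v * v for r in b for v in r) / n**2
    if dvar_x * dvar_y == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))


# Toy data (hypothetical): x is 1-dimensional, y is 2-dimensional.
x = [(0.0,), (1.0,), (2.0,), (4.0,)]
y = [(v[0] ** 2, 1.0 - v[0]) for v in x]
r = distance_correlation(x, y)
```

The two inputs only need the same number of rows (subjects); their dimensions may differ, which is what allows whole data types such as diet and microbiome profiles to be compared directly.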

# Microbial Enterotypes Analysis

We conducted distance-based clustering using the Partitioning Around Medoids (PAM) method (Kaufman and Rousseeuw, 1987) with various distances, including Euclidean, Bray–Curtis and Jaccard, and identified two enterotypes. To evaluate if diet-metabolite associations are modulated by microbial enterotype, we tested diet-enterotype interaction through linear regression for each pair of diet-metabolite variables, with the metabolite as the outcome, using the individual metabolites rather than metabolite modules.
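One way to write the diet-enterotype interaction test for a single diet-metabolite pair is given below. The notation is introduced here for illustration and is not taken from the original analysis, and any additional covariates are omitted because the text does not specify them.

```latex
m_i = \beta_0 + \beta_1 d_i + \beta_2 e_i + \beta_3 \, d_i e_i + \varepsilon_i,
\qquad H_0 : \beta_3 = 0
```

Here $m_i$ is the normalized metabolite level for subject $i$, $d_i$ is the standardized dietary variable, and $e_i \in \{0, 1\}$ indicates which of the two enterotypes the subject belongs to. A non-zero $\beta_3$ means that the slope relating diet to the metabolite differs between enterotypes.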

# Sparse Linear Log-Contrast Model

To further narrow down the interplay between diet/metabolome and microbiome, we used the sparse linear log-contrast model (Lin et al., 2014) to pinpoint important genera that are associated with dietary or metabolite variables. In this model, a dietary or metabolite variable is the response and the top 50 most abundant genera are compositional covariates. For the diet-microbiome analysis, it makes intuitive sense to analyze microbiome variables as the dependent variables since we hypothesize that diet perturbs microbial compositions. Nevertheless, we selected the log-contrast model for several reasons. It is very challenging to find a suitable probabilistic distribution for the microbial composition due to its unique features, such as zero-inflation, over-dispersion, and complex correlation structure (Li, 2015; Tang and Chen, 2018). Further, it has been demonstrated in genetic association studies that such inverse regression (treating dependent variables as covariates) is advantageous if there are multiple dependent variables and the distribution is difficult to specify (Majumdar et al., 2016). Alternative methods that treat microbiome as dependent variables include the sparse Dirichlet-Multinomial (DM) method (Chen and Li, 2013) and the multivariate zero-inflated logistic-normal method (Li et al., 2018); however, we determined that the log-contrast model was the most suitable currently available model for our study. For the taxa that are unclassified at the genus level, their identities at higher levels were used. Because of the unit-sum constraint of the microbial relative abundance, the components of a composition cannot vary freely. The sparse linear log-contrast model respects the compositional nature of the microbiome data, in which the unit-sum constraint on the compositional vector is translated into the zero-sum constraint on the association coefficients across taxa in log-ratio scale (Lin et al., 2014). The zero-sum constraint is crucial for the resulting estimator to enjoy interpretive advantages over a standard lasso estimator (Tibshirani, 1994). In our analysis, we used 10-fold cross validation to choose the tuning parameter. To obtain stable selection results, we generated 100 bootstrap samples and used the same cross-validation procedure to select the genera. The genera that were selected over 70 times out of 100 were considered associated with the dietary or metabolite variable.
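For reference, the sparse linear log-contrast model of Lin et al. (2014) can be written as below. The notation is ours, and the intercept is omitted on the assumption that the response and the log-transformed covariates are centered.

```latex
\begin{aligned}
y_i &= \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i,
\qquad \sum_{j=1}^{p} \beta_j = 0, \\
\hat{\beta} &= \arg\min_{\beta}\;
  \frac{1}{2n} \lVert y - Z\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
\quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j = 0,
\qquad Z = (\log x_{ij})
\end{aligned}
```

Here $y_i$ is the dietary or metabolite variable for subject $i$, $x_{ij}$ is the relative abundance of genus $j$ (with $p = 50$ in this analysis), and $\lambda$ is the tuning parameter chosen by cross validation. Under the zero-sum constraint, multiplying all abundances of a subject by a common factor leaves the fitted value unchanged, so the fit does not depend on how the composition is scaled.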

## Microbiome Mediation Analysis

We considered how the effect of a dietary nutrient on a metabolite is transmitted through the microbial communities. Specifically, we were interested in identifying microbial taxa that mediate the diet-metabolite pathway. We focused on pairs of diet-metabolite variables linked to at least one common genus identified by the log-contrast model (see section "Sparse Linear Log-Contrast Model"), and applied mediation analysis to the diet-gut microbiome-metabolite triplet. The top 50 most abundant genera were used as candidate microbiome mediators. To handle the compositional and high-dimensional nature of microbiome mediators, we utilized a recently developed compositional mediation analysis for microbiome data (R package ccmm) (Sohn and Li, 2019). Certain assumptions are required to make a causal interpretation of the mediation effects (Imai et al., 2010; Sohn and Li, 2019). In particular, the key assumption is that there is no unmeasured confounding variable after controlling for covariates. The method enables us to estimate the total mediation effects of microbiome composition, as well as to select important microbial taxa mediating the diet-metabolite association and estimate taxon-specific mediation effects.

## RESULTS

We conducted multi-omic phenotyping of up to 150 healthy subjects to probe diet, microbiome, and metabolome relationships in a cross-sectional sample. The overall study design, sample availability and subject characteristics are shown in **Figure 1**. By design, participants were healthy with no overt disease, consuming diets broadly representative of a standard American diet. Dietary variables calculated from the short and long-term diet questionnaires were significantly correlated with each other, suggesting that subjects' diets immediately prior to microbiome sampling were broadly representative of their diets over the past year. Of 150 enrolled subjects who completed a dietary questionnaire, 136 subjects provided a stool sample for microbiome analysis. We conducted metabolomic profiling in matched stool and plasma samples in a subset of these individuals (N = 75) and collected saliva samples for microbiome analysis in a separate subset (N = 85). No global associations were detected between diet, the microbiome, or metabolome, and demographic variables (age, sex, race, and BMI; PERMANOVA p > 0.1). We observed a difference in gut microbiome composition by batch (p = 0.04, UPenn vs. VANTAGE, see section "Sample Processing, DNA Extraction and Sequencing"). There were no differences in metabolite or nutrient profiles between the batches (p > 0.1), or in enterotype distribution (chi-square test p = 0.86). To assess whether the batch effect had any effect on our results, we repeated all the relevant analyses using only batch 1 samples (N = 107) and confirmed the conclusions remained the same. As the overall results did not differ, we report here the results from the analyses of the entire sample.

# The Gut Microbiome Is Related to Diet and Metabolites on a Global Level

We ran a global analysis using distance correlation t-test to obtain an integrated view of the relationships and relative importance of dietary measures (short-term and long-term diet), microbiome body site samples (stool and saliva), and metabolites (stool and plasma). As shown in **Figure 2**, there were considerable inter-relationships, with particularly strong associations between the gut microbiome and the gut metabolome (p = 2.2 × 10<sup>−10</sup>), and between long-term diet and the gut microbiome (p = 7.8 × 10<sup>−4</sup>). Short-term diet was significantly associated with the gut and plasma metabolome (p < 1 × 10<sup>−3</sup>), but not the microbiome. We found no global associations between the saliva-derived oral microbiome and other data types. Within data types, there was very strong global correlation between short- and long-term diet (p < 1 × 10<sup>−15</sup>), and between stool and plasma metabolites (p = 2.1 × 10<sup>−8</sup>), but not between the gut and oral microbiome (p = 0.7). Based on the evidence in the global analysis, we decided to focus our remaining analyses on the gut microbiome and long-term diet, and to evaluate their interplay with gut and circulating metabolites.
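For readers unfamiliar with distance correlation, the following Python sketch computes the plain sample distance correlation between two sets of per-subject profiles. It is an illustration under stated assumptions: it uses Euclidean distances and the simple (biased) estimator, whereas the t-test reported above relies on a bias-corrected version that is not reproduced here. The function names and the toy profiles are hypothetical.

```python
import math


def _centered_distances(rows):
    """Pairwise Euclidean distance matrix, double-centered."""
    n = len(rows)
    d = [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in d]  # equals column means (d is symmetric)
    grand = sum(row_mean) / n
    return [[d[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x_rows, y_rows):
    """Sample distance correlation; rows are subjects in the same order."""
    a = _centered_distances(x_rows)
    b = _centered_distances(y_rows)
    n = len(a)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for r in a for v in r) / n ** 2
    dvar_y = sum(v * v for r in b for v in r) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant profile carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))


# Toy profiles (assumption): five subjects, two variables per data type.
profile_a = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 4.0)]
profile_b = [(2 * u, 2 * v) for u, v in profile_a]  # rescaled copy of profile_a
profile_c = [(0.3, 1.1), (2.5, 0.2), (1.4, 0.9), (0.1, 2.2), (1.9, 1.5)]

print(distance_correlation(profile_a, profile_b))
print(distance_correlation(profile_a, profile_c))
```

Because the statistic is built from between-subject distances rather than from individual variables, it can compare data types of different dimension, which is why it suits a global comparison of diet, microbiome, and metabolome profiles.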

# Dietary Nutrients Are Associated With Gut Microbes

We hypothesized that gut microbiome composition would vary based on the intake of specific nutrients. From the sparse log-contrast model, we identified 61 (67%) long-term dietary nutrients associated with at least one bacterial genus (**Figure 3**). Several nutrients were associated with three or more genera, as shown in **Table 1**. These dietary nutrients were predominantly found in plant-derived foods and dairy products, suggesting that inclusion or exclusion of these food groups in the diet may be particularly important in the modulation of gut microbiome composition.

# Circulating and Gut Metabolites Are Associated With Gut Microbes

We hypothesized that gut microbiome composition would associate with specific metabolites in the gut and circulation, reflecting taxon-specific metabolism. We identified 123 (66%) circulating metabolite variables and modules and 34 (71%) gut metabolite variables and modules that associated with at least one bacterial genus (**Figures 4**, **5**). Several metabolites were associated with multiple genera, as shown in **Table 2**. Of these highly bacterial-related metabolites, many have known functions in bile acid metabolism, lipid and amino acid metabolism, or metabolism of xenobiotics, highlighting the important role of microbes in modulating host metabolism in key pathways.

# Gut Bacterial Taxa Mediate the Association Between Dietary Nutrients and Metabolites

We were interested in whether gut bacterial taxa mediate the relationship between diet and metabolites. Mediation analysis revealed multiple taxa influencing the association between dietary intake and metabolites in plasma or stool. Given the interrelationships between metabolic variables, we were interested in which pathways were most affected by microbiome mediation. We identified metabolic pathways with evidence for strong diet-microbiome effects, defined as having 3 or more metabolites in a sub-pathway with significant diet associations mediated by the microbiome, or association with a metabolite module (**Table 3**). These included amino acid metabolism (histidine, phenylalanine, and tyrosine), lipid metabolism (fatty acids, bile acids, and steroids), and xenobiotics (benzoate, and food components). Of the dietary variables, plant-derived nutrients (vitamins and phytochemicals) and metals were strongly represented. Our data suggest that metabolic flux through these pathways is particularly susceptible to interaction between dietary intake and microbiome composition.

# Differences in Abundance of Metabolites by Gut Microbial Enterotype

We identified two gut microbiome enterotypes in our sample, with good separation of the sub-groups by Principal Coordinates Analysis (PCoA) using the Jaccard distance (see **Supplementary Figure S1**). There were 54 individuals categorized as Enterotype 1, and 82 individuals categorized as Enterotype 2. There was no difference in age or race distribution across enterotypes, or in sequencing batch, although there was a trend toward a higher proportion of women in enterotype 2 (52% vs. 69% female, chi-square test p = 0.054). Individuals in enterotype 2 had lower BMI (26.9 vs. 24.5, p = 0.01). The primary differentiating characteristic between the two gut enterotypes was in the abundance of family Ruminococcaceae, with significantly higher proportion of Ruminococcaceae in enterotype 2 (**Supplementary Figure S2**). Analysis of metabolites by enterotype revealed striking differences between the groups: 112 plasma metabolites and 122 stool metabolites were significantly different by enterotype (unadjusted p < 0.05, **Supplementary Tables S5**, **S6**). Unadjusted p-values are reported in the enterotype analysis because the analysis used individual metabolites rather than metabolite modules and many metabolites are highly correlated. While the enterotype-associated metabolites spanned many biological pathways, they were enriched in certain categories. We selected all nominally associated metabolites for pathway enrichment analysis. Plasma metabolites that differed by enterotype were significantly enriched for amino acid metabolism (p < 0.05), particularly the essential amino acids phenylalanine, tryptophan, and tyrosine, the essential branched-chain amino acids valine, leucine and isoleucine, as well as arginine and proline. Stool metabolites differing by enterotype were enriched in taurine and niacin (vitamin B3) metabolism (p < 0.05). Individuals in Enterotype 1 had slightly higher alcohol and cholesterol consumption than Enterotype 2 (p < 0.05), but there were otherwise limited differences in dietary intake by enterotype, suggesting that the metabolite differences were not solely attributable to differences in diet.

TABLE 1 | Long term intake of dietary nutrients associated with at least three gut microbial taxa.

<sup>∗</sup>Bacterium reported as genus, unless otherwise specified (f, family; c, class; o, order; p, phylum; k, kingdom).

## Gut Microbial Enterotype Modulates the Relationship Between Diet and Metabolites

As observed in the mediation analysis for individual taxa, microbiome composition mediates the association between dietary nutrients and metabolites. We hypothesized that gut enterotype, as a composite measure of microbiome differences, would modify the relationship between dietary nutrient intake and downstream metabolism. We found evidence for significant interaction between habitual dietary intake and gut enterotype on plasma and stool metabolites across many classes of nutrients and metabolites. Of diet-metabolite pairs that were enterotype-dependent, the most frequent dietary components, which associated with >100 metabolites each, included plant-derived nutrients (fiber, carotenoids, and isoflavones) and artificial sweeteners (saccharin, mannitol, aspartame, and xylitol), as well as animal protein, trans fatty acids, caffeine, and alcohol. The diet- and enterotype-dependent metabolites spanned many pathways, but the metabolites with the most frequent associations with dietary variables (>30 dietary variables each) were predominantly bile acids and xenobiotic metabolites in plasma, and xenobiotic and amino acid metabolites in stool.
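The idea of an enterotype-dependent diet-metabolite relationship can be sketched as a difference in slopes. The Python code below is a minimal illustration, assuming a binary enterotype label and a linear relationship; with a binary modifier, the difference between the two within-group slopes equals the diet-by-enterotype interaction coefficient of the full linear model. The data are invented for illustration and do not come from the study, and the sketch provides no significance test or covariate adjustment.

```python
def slope(x, y):
    """Slope of the least-squares line y ~ 1 + x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def slopes_by_group(diet, metabolite, enterotype):
    """Diet-metabolite slope fitted separately within each enterotype."""
    result = {}
    for group in sorted(set(enterotype)):
        xs = [d for d, e in zip(diet, enterotype) if e == group]
        ys = [m for m, e in zip(metabolite, enterotype) if e == group]
        result[group] = slope(xs, ys)
    return result


# Invented toy data: the metabolite tracks intake in group 1 but not group 2.
diet = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
metabolite = [2.0, 4.0, 6.0, 8.0, 5.0, 5.0, 5.0, 5.0]
enterotype = [1, 1, 1, 1, 2, 2, 2, 2]

slopes = slopes_by_group(diet, metabolite, enterotype)
interaction = slopes[2] - slopes[1]  # diet x enterotype coefficient
print(slopes, interaction)
```

A nonzero difference in slopes is what "interaction" refers to in this context: the same change in intake corresponds to a different change in the metabolite depending on enterotype.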

Given the importance of bile acids in both gut metabolism and cardiometabolic disease risk, we were particularly interested in the observed microbiome-mediated effects of diet on bile acid signaling. As shown in **Figure 6**, habitual intake of dietary fiber was associated with higher plasma ursodeoxycholate in individuals with enterotype 1, but there was no relationship between diet and ursodeoxycholate in enterotype 2. Conversely, high dietary fiber was associated with decreased plasma taurodeoxycholate in individuals with enterotype 1, and slightly increased levels in enterotype 2. Many of the circulating bile acids were highly correlated with each other, and as such the results for taurodeoxycholate represent similar significant associations for dietary fiber with taurocholate, taurolithocholate 3-sulfate, glycolithocholate, glycolithocholate sulfate, taurochenodeoxycholate, glycodeoxycholate, glycocholate, and glycodeoxycholate sulfate (Spearman correlation > 0.5 for metabolite pair, and p < 0.05 for enterotype-mediated association with diet). Of note, dietary choline was highly correlated with dietary fiber (Spearman correlation 0.7), reflecting some overlapping food sources and dietary patterns, and similar patterns of association with bile acids were also observed for choline. Interestingly, there was a modest positive relationship between plasma ursodeoxycholate (p < 0.05), but not plasma taurodeoxycholate, and plasma C-Reactive Protein (CRP) and BMI in individuals with enterotype 2, but not in enterotype 1 (**Figure 7**). These data suggest that individuals with enterotype 1 have bile acid metabolism that is highly diet-responsive, whereas individuals with enterotype 2 have bile acid production which is less sensitive to differences in dietary intake, but may be more likely to relate to poor metabolic health.

## DISCUSSION

The gut microbiome is recognized as a key intermediate between environmental inputs and host metabolism; however, the specific relationship between dietary nutrients, microbiome composition, and host metabolism remains poorly understood. We conducted multi-omic profiling to probe the relationship between diet, the microbiome, and metabolism in healthy adults. We identified associations between diet, the gut microbiome and the gut and plasma metabolome at a global level and identified specific microbiome-mediated associations between diet and metabolites. Our data suggest that gut microbiome composition, both at the taxon and the enterotype level, modulates how dietary nutrients are metabolized, impacting systemic host metabolism with potential downstream consequences on metabolic health.

TABLE 2 | Plasma and stool metabolites associated with three or more gut microbial taxa.

<sup>∗</sup>Bacterium reported as genus, unless otherwise specified (f, family; c, class; o, order; p, phylum; k, kingdom).

Diet, the microbiome, and the metabolome are complex, composed of multiple inter-dependent variables, which have independent and combinatorial effects. We first examined these multi-omic datasets on a global level, to understand the interrelationships on a broad scale. Consistent with our hypothesis, diet, the gut microbiome, and the metabolome were all related to each other. We found minimal evidence of an association between the gut and oral microbiota in the same individuals, consistent with previous studies that have also reported limited overlap between different body sites (Caporaso et al., 2011; Ding and Schloss, 2014). The salivary microbiome in our sample was also not strongly related to diet, or to metabolites. This may reflect both the smaller sample size for the oral microbiome, and distal relationships between the mouth and intestinal or whole-body metabolism.

We assessed subjects' diet using two independent methods, to identify the nutrients consumed shortly before microbiome sampling, and to identify habitual long-term food consumption. There was relatively high correlation between analogous dietary variables from short and long-term estimates within subjects, suggesting that participants' diets at the time of sampling were consistent with their longer-term dietary patterns. We were interested in the relative importance of day-to-day fluctuations in dietary intake compared with longer-term patterns. We found that long-term diet as assessed by FFQ was more strongly associated with the gut microbiome than the diet consumed immediately prior to sampling (generally the 3 days prior to stool elimination). This suggests a core gut microbial population, shaped by habitual diet, that remains relatively constant despite short-term dietary fluctuations. This is supported by findings from others, who have observed relative stability in gut microbiome profiles over time, particularly in adults (Yatsunenko et al., 2012; Ding and Schloss, 2014; Dubois et al., 2017; Ruggles et al., 2018). Although large shifts in diet acutely alter microbiome composition (David et al., 2014), dietary habits over time appear to be more influential in shaping the gut microbial community. Short-term diet was more strongly associated with the gut and plasma metabolome than long-term diet, independent of the microbiome. This is consistent with a model where recently-consumed nutrients are rapidly metabolized by the host, influencing what is present in the gut and circulation at any given time. However, whether these short-term dynamic changes impact longer-term health outcomes is unknown. It is likely that repeated exposures to diet and microbiome derived metabolites over longer time frames have greater impact on lifelong health status.

Of dietary variables associated with microbiome composition and exhibiting microbiome-mediated relationships with metabolites in our sample, a large proportion are derived from plant-based foods. This is consistent with our knowledge of microbiome-mediated digestion. Plants are complex food sources, and contain many diverse nutrients, some of which are already known to interact with the microbiome. Fiber is metabolized by bacteria for production of short-chain fatty acids, which not only provide energy and selective advantages to microbes, but can affect host metabolism and immunity (Furusawa et al., 2013; Vital et al., 2014; Koh et al., 2016; Maier et al., 2017). Individuals consuming diets high in plant-derived fiber have greater microbiome diversity (Schnorr et al., 2014), while diets low in fiber lead to reduced bacterial diversity (Sonnenburg et al., 2016). Many phytochemicals are selectively metabolized by gut microbiota including isoflavones (Rowland et al., 2000; Fernandez-Raudales et al., 2012), while plants are rich sources of many vitamins, including those with known microbial interaction such as Vitamin B3/Niacin (Singh et al., 2014). Symbiotic relationships between the host and the microbiome, and optimal functioning of the holobiont, are dependent on environment, with diet being the archetypal environmental variable (Postler and Ghosh, 2017). In addition to plant foods, which have long been consumed by humans, we observed inter-relationships with artificial sweeteners, which have entered the human diet in relatively recent time. Our data do not resolve whether these have positive or negative consequences on health, but indicate that shifts toward higher consumption of processed foods and lower consumption of complex plant-based foods, common to the Western diet, have potential consequences on the gut microbiota and metabolite production.

We identified many metabolites in plasma and stool that differed by microbiome composition; indeed the majority of metabolites appeared to be influenced by diet, the microbiome, or both. These spanned many biological pathways, but the metabolites that were particularly microbiome-sensitive belonged to pathways related to bile acid metabolism, amino acid metabolism, lipid and steroid metabolism, and metabolism of xenobiotics. While direct effects on diet or microbe-derived metabolites (e.g., xenobiotics) are to be expected, our data highlight that the microbiome also modulates key host metabolic pathways of importance not only for energy metabolism, but overall host health status including immune function. The consequences of alterations in these circulating metabolites are not fully known. Microbiome metabolites have been shown to affect inflammation and immune regulation (Levy et al., 2017; Haase et al., 2018), and we observed some association between enterotype-mediated metabolism and plasma CRP. However, further studies are needed to establish consequences of chronic alterations in metabolite signaling.

TABLE 3 | … mediated metabolites.

Because different bacteria can have overlapping functionality, it can be helpful to collapse the taxonomic composition into related clusters, or enterotypes, to identify individuals within subgroups of similar composition. We observed many enterotype-mediated associations, among them a significant effect of gut enterotype on the relationship between dietary fiber and plasma bile acids. Bile acids are key regulators of hepatic and intestinal lipid metabolism, and have been linked to inflammation and metabolic disease (Joyce and Gahan, 2016; Chávez-Talavera et al., 2017). Microbiota contribute to bile acid metabolism, transforming host-synthesized primary bile acids to secondary bile acids, while microbiome composition may itself be shaped by bile acids (Wahlström et al., 2016; Long et al., 2017). While we present data for dietary fiber, very similar results were found for dietary choline, with fiber and choline intake strongly correlated in our sample. Thus, it is not clear whether the effect is specific to fiber, choline, or another phytonutrient common to the same food source. Both fiber and choline can act as substrates or inhibitors of bile acid metabolism (Corbin and Zeisel, 2012; Dziedzic et al., 2015; Wang et al., 2017), and both have been linked to microbial metabolism (LeBlanc et al., 1998; Tang et al., 2013; Mayengbam et al., 2018; Tuncil et al., 2018), suggesting that either or both plausibly lie in a causal pathway linking diet to bile acid metabolism through microbiota.

Our study had several key strengths, but also some limitations. We recruited healthy adults, and conducted deep multi-omic phenotyping, with the goal of identifying relationships between diet, the microbiome and the metabolome independent of a disease background. While this allowed for metabolic analysis independent of disease confounding or reverse causation, it did not allow us to directly assess relationships with cardiometabolic disease. However, at least half of the participants in our study are likely to develop cardiometabolic disease in later life (Benjamin et al., 2018), suggesting that even mildly elevated risk factors may predict future disease. We measured plasma CRP, as a clinically-relevant marker of inflammation, which predicts future disease risk (Ridker, 2003), and used BMI as a proxy for obesity and future metabolic risk (Van Gaal et al., 2006). Despite our modest sample size, this is one of the largest studies of diet, the microbiome, and the metabolome conducted in humans. A pervasive limitation in nutritional studies is the difficulty in precise quantification of dietary intake in free-living humans. We used two independent validated dietary assessment methods, which were broadly consistent with each other, while allowing us to assess diet over different time frames. Because food is complex, and individual nutrients often co-occur in the same foods, in many cases we cannot determine which food component is "causal" in a diet-microbiome-metabolite relationship. Future detailed studies to isolate individual nutrients will be required, while recognizing that nutrients exist within a complex food structure, and that an isolated nutrient (e.g., in a single supplement) may not behave the same way as a nutrient derived in conjunction with other nutrients in a food source. An important limitation of our study is the use of a single time point for data collection. While we were able to identify diet-microbiome-metabolite associations in our cross-sectional analysis, we are unable to infer causality. Future interventional studies with longitudinal sampling are required to assess relationships over time, and to determine whether changes in diet associate with microbiome-mediated changes in metabolism.

FIGURE 6 | Dietary Fiber has a gut enterotype-dependent association with plasma secondary bile acids including ursodeoxycholate and taurodeoxycholate.

### CONCLUSION

Through multi-omic analysis in a deeply-phenotyped human sample, we identified microbiome-mediated relationships between diet and circulating metabolites. Both individual microbial taxa, and microbial enterotype may relate to how dietary precursors are metabolized within the gut, and in the circulation. The potential mechanisms involved, and any long-term consequences on health status remain to be determined.

#### ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the University of Pennsylvania's clinical research standards that meet regulations relating to Good Clinical Practice (GCP). All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the Institutional Review Boards of the University of Pennsylvania and Vanderbilt University.

## AUTHOR CONTRIBUTIONS

JF designed the study. HS and JF performed laboratory analysis. Z-ZT, GC, QH, SH, MS, and JF performed statistical analysis. Z-ZT, GC, RS, and JF contributed to writing the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

## FUNDING

The ABO Study was supported by U01-HL108636 and K24-HL10763 (PI: Reilly). The project was supported by an AHA Scientist Development Grant (15SDG24890015, PI: JF) and a P&F Award to JF from the Vanderbilt University Medical Center's Digestive Disease Research Center supported by NIH grant P30DK058404. The project was also supported by the Data Science Initiative Award (PI: Z-ZT) provided by the University of Wisconsin–Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00454/full#supplementary-material

#### REFERENCES

microbial community structures in human microbiome datasets. PLoS Comput. Biol. 9:e1002863. doi: 10.1371/journal.pcbi.1002863



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer AA declared a past co-authorship with several of the authors Z-ZT and GC.

Copyright © 2019 Tang, Chen, Hong, Huang, Smith, Shah, Scholz and Ferguson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Microbial Networks in SPRING - Semi-parametric Rank-Based Correlation and Partial Correlation Estimation for Quantitative Microbiome Data

#### Grace Yoon<sup>1</sup> , Irina Gaynanova<sup>1</sup> and Christian L. Müller <sup>2</sup> \*

<sup>1</sup> Department of Statistics, Texas A&M University, College Station, TX, United States, <sup>2</sup> Center for Computational Mathematics, Flatiron Institute, New York, NY, United States

#### Edited by:

Lingling An, University of Arizona, United States

#### Reviewed by:

Yuan Jiang, Oregon State University, United States Michelle Lacey, Tulane University, United States

> \*Correspondence: Christian L. Müller cmueller@flatironinstitute.org

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

> Received: 18 January 2019 Accepted: 13 May 2019 Published: 06 June 2019

#### Citation:

Yoon G, Gaynanova I and Müller CL (2019) Microbial Networks in SPRING - Semi-parametric Rank-Based Correlation and Partial Correlation Estimation for Quantitative Microbiome Data. Front. Genet. 10:516. doi: 10.3389/fgene.2019.00516

High-throughput microbial sequencing techniques, such as targeted amplicon-based and metagenomic profiling, provide low-cost genomic survey data of microbial communities in their natural environment, ranging from marine ecosystems to host-associated habitats. While standard microbiome profiling data can provide sparse relative abundances of operational taxonomic units or genes, recent advances in experimental protocols give a more quantitative picture of microbial communities by pairing sequencing-based techniques with orthogonal measurements of microbial cell counts from the same sample. These tandem measurements provide absolute microbial count data albeit with a large excess of zeros due to limited sequencing depth. In this contribution we consider the fundamental statistical problem of estimating correlations and partial correlations from such quantitative microbiome data. To this end, we propose a semi-parametric rank-based approach to correlation estimation that can naturally deal with the excess zeros in the data. Combining this estimator with sparse graphical modeling techniques leads to the Semi-Parametric Rank-based approach for INference in Graphical model (SPRING). SPRING enables inference of statistical microbial association networks from quantitative microbiome data which can serve as a high-level statistical summary of the underlying microbial ecosystem and can provide testable hypotheses for functional species-species interactions. Due to the absence of verified microbial associations we also introduce a novel quantitative microbiome data generation mechanism which mimics empirical marginal distributions of measured count data while simultaneously allowing user-specified dependencies among the variables. SPRING shows superior network recovery performance on a wide range of realistic benchmark problems with varying network topologies and is robust to misspecifications of the total cell count estimate. To highlight SPRING's broad applicability we infer taxon-taxon associations from the American Gut Project data and genus-genus associations from a recent quantitative gut microbiome dataset. We believe that, as quantitative microbiome profiling data become increasingly available, the semi-parametric estimators for correlation and partial correlation estimation introduced here provide an important tool for reliable statistical analysis of quantitative microbiome data.

Keywords: absolute abundance, amplicon sequencing, association network, copula, graphical model, gut microbiome, zero inflation

# 1. INTRODUCTION

High-throughput sequencing techniques, including targeted amplicon-based sequencing (TAS) and metagenomic profiling, provide large-scale genomic survey data of microbial communities in their natural habitats. Collaborative efforts, such as the Human Microbiome Project (HMP) (Huttenhower et al., 2012), the Earth Microbiome Project (EMP) (Bahram et al., 2018), the TARA Ocean project (Sunagawa et al., 2015), and the American Gut Project (AGP) (McDonald et al., 2018) give an increasingly detailed picture of relative abundances of operational taxonomic units, their phylogenetic relationships, and gene abundances across diverse ecosystems, ranging from marine, soil, and fresh-water to human-associated habitats albeit at different scales and resolutions. Following the seminal work in Woese and Fox (1977), TAS protocols extract and amplify specific regions in marker genes, such as the 16S rRNA gene for bacteria and archaea, the 18S rRNA gene for eukaryotes, and Internal Transcribed Spacer (ITS) regions for fungi, via universal primers followed by next-generation sequencing. These profiling efforts, together with elaborate bioinformatics processing and normalization workflows (Schloss et al., 2009; Caporaso et al., 2010; Edgar, 2013; Callahan et al., 2016; Lagkouvardos et al., 2017) allow low-cost determination of highly sparse relative counts of hundreds to thousands of operational taxonomic units (OTUs) or amplicon sequence variants (ASVs) (Edgar, 2016; Callahan et al., 2017) per sample across a large number of sample sites or participants. Metagenomic profiling (Handelsman, 2004), on the other hand, provides unbiased samples of the majority of genes of the sampled habitat by high-throughput shotgun sequencing. Sophisticated reference-guided as well as reference-free metagenomic read assembly, binning, and taxonomic profiling pipelines (Alneberg et al., 2014; Sczyrba et al., 2017; Sedlar et al., 2017) can, under suitable conditions on read coverage, disentangle the complex mixture of sequencing reads into entire genomes of the underlying microbes and estimate, as a high-level by-product, relative microbial abundances.

Microbiome community-level analysis tasks, such as quantifying community composition shifts across conditions or associating high-dimensional species compositions and their taxonomic profiles to each other and to environmental or host-associated covariates, require statistical estimation procedures that can handle the restrictive nature of such sparse proportional (or compositional) microbiome datasets (Li, 2015). Important examples include differential abundance techniques (McMurdie and Holmes, 2014; Mandal et al., 2015), proportionality estimation (Quinn et al., 2017), regression models with compositional covariates (Holmes et al., 2012; Lin et al., 2014), composition-adjusted correlation estimation techniques (Friedman and Alm, 2012; Cao et al., 2018), and sparse graphical models for microbial association networks (Kurtz et al., 2015; Tipton et al., 2018).

Recent advancements in microbiome profiling protocols, however, promise to alleviate the experimental shortcomings of standard TAS or metagenomic experiments by enabling a more quantitative picture of microbial communities. The experimental protocols in Gifford et al. (2011) and Satinsky et al. (2013), originally introduced for marine microbiome profiling, establish quantitative count measurements of environmental metatranscriptomic or metagenomic data by adding orthogonal internal genomic mRNA or DNA standards (of known quantity) to the environmental sample prior to sequencing. A similar spike-in approach has been proposed for gut microbiome studies in Stämmler et al. (2016). Recent quantitative approaches combine TAS techniques with robust measurements of microbial cell counts, in particular flow cytometry (Props et al., 2017; Vandeputte et al., 2017). These tandem measurements provide absolute microbial count data albeit with a large number of zero measurements due to limited sequencing depth (see **Figure 2** for an overview). Thus far, however, statistical analysis methods for these novel quantitative microbiome data remain largely elusive.

In this contribution, we consider the statistical problem of correlation and partial correlation estimation for sparse quantitative microbiome count data. To this end, we first revisit a novel semi-parametric rank-based (SPR) approach to correlation estimation that can naturally deal with the large number of zeros in the data. The SPR estimator is easy to compute and can readily replace the naïve Pearson or rank-based sample correlation estimators, which are often used as a first step in downstream statistical analysis tasks, including principal component analysis, principal coordinate analysis, discriminant analysis, or canonical correlation analysis (Yoon et al., 2018). Here we use the semi-parametric rank-based estimator as a starting point for sparse partial correlation estimation and introduce the Semi-Parametric Rank-based approach for INference in Graphical model (SPRING). SPRING follows the neighborhood selection methodology outlined in Meinshausen and Bühlmann (2006) to infer the conditional dependency graph and uses stability-based model selection (Liu et al., 2010; Müller et al., 2016) to identify a sparse set of stable partial correlation estimates from quantitative microbiome data (section 2). These partial correlations can be interpreted as direct (i.e., conditionally independent) statistical microbe-microbe associations and can serve as an initial community-level description of the underlying microbial ecosystem (Fuhrman et al., 2015; Sunagawa et al., 2015; Ruiz et al., 2017).

To evaluate our new methodology, we introduce a data generation mechanism that produces synthetic amplicon samples which exactly follow the empirical marginal cumulative distributions of measured amplicon count data while simultaneously obeying user-specified (partial) correlation dependencies among the variables and closely following user-defined total cell counts (see **Figure 2** for a summary). As ground-truth data for microbial associations remain largely elusive in the current literature, our data generation mechanism might be of independent interest for testing other statistical inference schemes. We highlight SPRING's superior performance compared to standard sparse partial correlation estimation methods on a wide range of quantitative microbiome benchmark problems with varying prescribed network topologies. We also quantify, in the context of association network inference, the potential gains of quantitative over purely relative data even under misspecified totals. To showcase SPRING's broad applicability (see section 4), we first infer taxon-taxon associations from relative abundance data collected in the AGP using a pseudo-count-free log-ratio transform that can handle zero counts. Our key application is a genus-level analysis of the quantitative gut microbiome dataset put forward in Vandeputte et al. (2017). We discuss the inferred quantitative association network structure, compare it to published results, and assess, for the first time, the differences between inferred associations from measured absolute and relative abundance data in a consistent statistical framework. While we focus here on TAS-related applications, our methodology is broadly applicable to other data types with excess zeros, including quantitative metagenomics, single-cell RNA-seq, and mass spectrometry data, and thus provides a promising route toward a coherent statistical framework for correlation and partial correlation analysis of multi-omics biological data.

# 2. SEMI-PARAMETRIC RANK-BASED CORRELATION AND PARTIAL CORRELATION ESTIMATION

## 2.1. Rank-Based Estimation of Correlation Matrix for Zero-Inflated Data

A great number of multivariate statistical methods, such as principal component analysis, discriminant analysis, and canonical and partial correlation analysis, to name a few, require an estimate of the covariance or correlation matrix of the variables as one of the inputs. The overwhelming majority of these methods are based on the Pearson sample covariance matrix, which works well at capturing dependencies between variables that are normally distributed. One of the key challenges in analyzing TAS-based microbial abundance data is that they are far from normal: TAS-based measurements are inherently proportional, extremely right skewed, overdispersed, and comprise a large number of zero values. Furthermore, the zeros are not always indicative of the absence of the species, but can rather be a result of limited sequencing depth or primer bias. For these reasons, the sample covariance matrix is not appropriate for capturing dependencies present in microbiome data. Several methods use techniques from compositional data analysis (Aitchison, 1983), including log-ratio transforms, to adjust the data prior to any estimation, and enforce different structural constraints on the correlation or inverse correlation matrix (Friedman and Alm, 2012; Kurtz et al., 2015; Cao et al., 2018). The problem of excess zeros is typically dealt with by adding a small pseudo-count or, more recently, by estimating pseudo-counts from multiple samples (Cao et al., 2017). For quantitative microbiome data, however, correlation and inverse correlation estimators are not yet available. In this work we propose to take a different approach relying on the recently proposed truncated Gaussian copula framework (Yoon et al., 2018).

First, we review the Gaussian copula model, which is sometimes referred to as non-paranormal (NPN) model (Liu et al., 2009).

**Definition 1.** A random vector $\mathbf{x} = (x_1, \ldots, x_p)^\top$ satisfies the Gaussian copula model if there exists a set of monotonically increasing transformations $f = (f_j)_{j=1}^p$ satisfying $f(\mathbf{x}) = \{f_1(x_1), \ldots, f_p(x_p)\}^\top \sim N(\mathbf{0}, \Sigma)$ with $\sigma_{jj} = 1$. We denote $\mathbf{x} \sim \mathrm{NPN}(\mathbf{0}, \Sigma, f)$.

The Gaussian copula model is commonly used in undirected graphical models (Liu et al., 2012; Fan et al., 2017) because it models the dependency between variables through the correlation matrix $\Sigma$, and thus enjoys the mathematical simplicity of the Gaussian multivariate distribution while relaxing the normality assumption. While the original model is only appropriate for modeling continuous variables, it has also been generalized to binary variables by adding an extra dichotomization step (Fan et al., 2017). The estimation of graphical models only requires the knowledge of the correlation matrix $\Sigma$, and it has been shown (Fan et al., 2017) that consistent estimates of $\Sigma$ can be easily obtained from the sample Kendall's τ without the need to estimate the unknown transformations $f_j$.

The Gaussian copula model is, however, not appropriate for quantitative microbiome data as (i) it does not take into account zero inflation, and (ii) it models continuous rather than count variables. To address (i), we take advantage of the model proposed in Yoon et al. (2018).

**Definition 2** (Truncated Gaussian copula model of Yoon et al. (2018))**.** A random vector $\mathbf{x} = (x_1, \ldots, x_p)^\top$ satisfies the truncated Gaussian copula model if there exists a $p$-dimensional random vector $\mathbf{u} = (u_1, \ldots, u_p)^\top \sim \mathrm{NPN}(\mathbf{0}, \Sigma, f)$ such that

$$x_j = I(u_j > c_j)\,u_j \quad (j = 1, \ldots, p),$$

where $I(\cdot)$ is the indicator function and $\mathbf{c} = (c_1, \ldots, c_p)$ is a vector of positive constants.

In other words, the model truncates a Gaussian copula variable so it is either zero or positive continuous. This model does not take into account that quantitative microbiome data have zeros or positive counts, but we found the continuous approximation to positive counts to work well in our simulation results (section 3).
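To make Definition 2 concrete, the following minimal sketch draws pairs from a bivariate truncated Gaussian copula model. The choice $f_j = \log$ (so that $u_j = \exp(z_j)$), the thresholds, the latent correlation, and the function name `simulate_truncated_copula` are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def simulate_truncated_copula(n, sigma, c_j, c_k, seed=0):
    """Draw n pairs (x_j, x_k) from a bivariate truncated Gaussian copula model."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        # Latent bivariate normal with unit variances and correlation sigma.
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z_j = e1
        z_k = sigma * e1 + math.sqrt(1.0 - sigma ** 2) * e2
        # Assumed transformation f = log, hence u = exp(z) is positive and right skewed.
        u_j, u_k = math.exp(z_j), math.exp(z_k)
        # Truncation: x = I(u > c) * u, so values at or below c are observed as zeros.
        x_j = u_j if u_j > c_j else 0.0
        x_k = u_k if u_k > c_k else 0.0
        pairs.append((x_j, x_k))
    return pairs


pairs = simulate_truncated_copula(n=2000, sigma=0.6, c_j=1.0, c_k=2.0)
zero_fraction_j = sum(1 for x_j, _ in pairs if x_j == 0.0) / len(pairs)
zero_fraction_k = sum(1 for _, x_k in pairs if x_k == 0.0) / len(pairs)
print(zero_fraction_j, zero_fraction_k)
```

Under these assumptions the expected zero fractions are $\Phi(\log 1) = 0.5$ and $\Phi(\log 2) \approx 0.76$, which illustrates how the thresholds control the amount of zero inflation in each variable.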

To construct graphical models for the truncated Gaussian copula model, the estimation of the latent correlation matrix $\Sigma$ is required. Yoon et al. (2018) develop a rank-based estimator for $\Sigma$ by deriving the explicit form of the so-called bridge function $F$ that connects the sample Kendall's τ estimates to the elements of $\Sigma$. Given observed data $(x_{j1}, x_{k1}), \ldots, (x_{jn}, x_{kn})$ for variables $j$ and $k$, the sample Kendall's τ estimate is defined as

$$\widehat{\tau}_{jk} = \frac{2}{n(n-1)} \sum_{1 \le i < i' \le n} \text{sign}(x_{ji} - x_{ji'})\, \text{sign}(x_{ki} - x_{ki'}).$$

The bridge function $F$ is defined so that $E(\widehat{\tau}_{jk}) = F(\sigma_{jk})$, where $\sigma_{jk}$ is the corresponding latent correlation between variables $j$ and $k$. The explicit form of $F$ for the truncated Gaussian copula model is given below.
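The sample Kendall's τ defined above can be computed directly from its definition. The sketch below is a plain $O(n^2)$ illustration with made-up data; tied pairs (for example, two zeros) contribute zero to the sum, exactly as the sign function dictates. The function names are hypothetical.

```python
from itertools import combinations


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def kendall_tau(x_j, x_k):
    """Sample Kendall's tau as the normalized sum of products of signs over all pairs."""
    n = len(x_j)
    if n != len(x_k) or n < 2:
        raise ValueError("need two samples of equal length n >= 2")
    total = sum(
        sign(x_j[a] - x_j[b]) * sign(x_k[a] - x_k[b])
        for a, b in combinations(range(n), 2)
    )
    return 2.0 * total / (n * (n - 1))


# Illustrative zero-inflated counts for two taxa across five samples.
taxon_j = [0, 0, 3, 7, 12]
taxon_k = [0, 5, 0, 9, 20]
print(kendall_tau(taxon_j, taxon_k))
```

For this toy example, seven pairs are concordant, one is discordant, and two are tied, giving $\widehat{\tau}_{jk} = 2 \cdot 6 / 20 = 0.6$.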

**Theorem 1** (Yoon et al. (2018))**.** Let random variables $x_j$, $x_k$ follow the truncated Gaussian copula model with corresponding latent correlation $\sigma_{jk}$. Then $E(\widehat{\tau}_{jk}) = F(\sigma_{jk})$, where

$$F(\sigma_{jk}) = F(\sigma_{jk}; \delta_j, \delta_k) = -2\Phi_4(-\delta_j, -\delta_k, 0, 0; \Sigma_{4a}) + 2\Phi_4(-\delta_j, -\delta_k, 0, 0; \Sigma_{4b}),$$

$\delta_j = f_j(c_j)$, $\delta_k = f_k(c_k)$, $\Phi_4(\ldots; \Sigma_4)$ is the cumulative distribution function (cdf) of the four-dimensional standard normal distribution with correlation matrix $\Sigma_4$,

$$
\Sigma_{4a} = \begin{pmatrix}
1 & 0 & 1/\sqrt{2} & -\sigma_{jk}/\sqrt{2} \\
0 & 1 & -\sigma_{jk}/\sqrt{2} & 1/\sqrt{2} \\
1/\sqrt{2} & -\sigma_{jk}/\sqrt{2} & 1 & -\sigma_{jk} \\
-\sigma_{jk}/\sqrt{2} & 1/\sqrt{2} & -\sigma_{jk} & 1
\end{pmatrix}
$$

and

$$
\Sigma_{4b} = \begin{pmatrix}
1 & \sigma_{jk} & 1/\sqrt{2} & \sigma_{jk}/\sqrt{2} \\
\sigma_{jk} & 1 & \sigma_{jk}/\sqrt{2} & 1/\sqrt{2} \\
1/\sqrt{2} & \sigma_{jk}/\sqrt{2} & 1 & \sigma_{jk} \\
\sigma_{jk}/\sqrt{2} & 1/\sqrt{2} & \sigma_{jk} & 1
\end{pmatrix}.
$$

Moreover, $F(\sigma_{jk})$ is strictly increasing in $\sigma_{jk}$, so the inverse function $F^{-1}$ exists.

**Remark 1.** To give more intuition for the form of the bridge function, we provide a brief summary of the underlying derivations here. The central part is the calculation of $E\{\text{sign}(x_{ji} - x_{ji'})\,\text{sign}(x_{ki} - x_{ki'})\}$. Due to the effect of truncation, this calculation requires separation of events leading to zero or continuous realization of $x_j$ before the equivalence $\text{sign}\{x_{ji} - x_{ji'}\} = \text{sign}\{f_j(x_{ji}) - f_j(x_{ji'})\}$ can be applied. This separation leads to the intersection of four events concerning normal variables (two events for continuous realization of $x_j$ and $x_k$, and two events corresponding to each of the sign terms), thus explaining the appearance of the four-dimensional normal cdf in the form of the bridge function.

Theorem 1 provides a closed-form expression of the bridge function $F$ up to the values of the thresholds $\delta_j$, which we replace with moment-based estimators $\widehat{\delta}_j$. Let $n_{0j}$ be the observed number of exact zeros across $n$ realizations of variable $x_j$. By Definitions 1 and 2,

$$E(n_{0j}/n) = P(x_j = 0) = P(u_j \le c_j) = P\{f_j(u_j) \le \delta_j\} = \Phi(\delta_j).$$

We use $\widehat{\delta}_j = \Phi^{-1}(n_{0j}/n)$ instead of $\delta_j$ and can thus calculate $\widehat{\sigma}_{jk} = F^{-1}(\widehat{\tau}_{jk})$. In practice, the inverse of the bridge function $F^{-1}(\widehat{\tau}_{jk})$ is determined numerically by finding the minimizer of the quadratic function $\{F(\sigma_{jk}) - \widehat{\tau}_{jk}\}^2$, which is unique due to the strict monotonicity of the function $F(\sigma_{jk})$.
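The two numerical steps above, threshold estimation and inversion of a strictly increasing bridge function, can be sketched as follows. This is not the mixedCCA implementation. Two substitutions are made and should be read as assumptions: (i) because the four-dimensional normal cdf is not available in the Python standard library, the classical bridge for the untruncated Gaussian copula, $F(\sigma) = (2/\pi)\arcsin(\sigma)$, serves as a stand-in monotone function; and (ii) bisection replaces the minimization of the squared difference, which yields the same solution for a strictly increasing $F$.

```python
import math
from statistics import NormalDist


def estimate_threshold(values):
    """Moment-based estimate delta_hat = Phi^{-1}(n_0 / n) from the fraction of exact zeros."""
    n = len(values)
    n_zero = sum(1 for v in values if v == 0)
    if n_zero == 0 or n_zero == n:
        raise ValueError("Phi^{-1} is undefined for zero fractions of exactly 0 or 1")
    return NormalDist().inv_cdf(n_zero / n)


def untruncated_bridge(sigma):
    """Stand-in bridge: the Gaussian copula bridge without truncation (an assumption here)."""
    return 2.0 / math.pi * math.asin(sigma)


def invert_bridge(bridge, tau_hat, lower=-0.999, upper=0.999, tol=1e-10):
    """Solve bridge(sigma) = tau_hat by bisection for a strictly increasing bridge."""
    if tau_hat <= bridge(lower):
        return lower
    if tau_hat >= bridge(upper):
        return upper
    while upper - lower > tol:
        middle = 0.5 * (lower + upper)
        if bridge(middle) < tau_hat:
            lower = middle
        else:
            upper = middle
    return 0.5 * (lower + upper)


delta_hat = estimate_threshold([0, 0, 3, 7])  # half of the entries are zero
sigma_hat = invert_bridge(untruncated_bridge, tau_hat=0.5)
print(delta_hat, sigma_hat)
```

Replacing `untruncated_bridge` by a function that evaluates the expression in Theorem 1 with $\widehat{\delta}_j$ and $\widehat{\delta}_k$ plugged in would give the estimator described in the text; the inversion routine itself only relies on monotonicity.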

The resulting $\widehat{\sigma}_{jk}$ are used to construct an element-wise estimator $\widehat{\Sigma}$. Since element-wise estimation does not guarantee positive semidefiniteness of $\widehat{\Sigma}$, we follow the suggestion of Fan et al. (2017) and replace $\widehat{\Sigma}$ with its projection onto the cone of positive semidefinite matrices. We use the nearPD function in the R package Matrix to perform this projection. For numerical stability, we also include an additional shrinkage step of the form $\widetilde{\Sigma} = (1 - \rho)\widehat{\Sigma} + \rho I$ with $\rho = 0.01$, which guarantees strict positive definiteness of the final estimate. In simulations, we found that the method performs well across a wide range of small $\rho$ values (see **Supplementary Material** for a sensitivity analysis of the parameter $\rho$). The described estimation procedure for $\Sigma$ is implemented within the R package mixedCCA (Yoon and Gaynanova, 2018), and we refer the reader to Yoon et al. (2018) for more detailed derivations.

We refer to the proposed estimator $\widetilde{\Sigma}$ of the correlation matrix $\Sigma$ of truncated Gaussian copula variables as the Semi-Parametric Rank-based (SPR) correlation estimator. The SPR estimator forms the basis for the undirected graphical model framework outlined below.
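The effect of the shrinkage step can be illustrated with a small example. The sketch below applies $(1 - \rho)\widehat{\Sigma} + \rho I$ to a singular, positive semidefinite matrix and checks strict positive definiteness through a Cholesky factorization. The projection step (nearPD) is not reproduced here, and the helper names are hypothetical.

```python
import math


def shrink_toward_identity(corr, rho=0.01):
    """Return (1 - rho) * corr + rho * I for a square matrix given as nested lists."""
    p = len(corr)
    return [
        [(1.0 - rho) * corr[i][j] + (rho if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]


def is_positive_definite(matrix, tol=1e-12):
    """Attempt a Cholesky factorization; all pivots must be strictly positive."""
    p = len(matrix)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            partial = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= tol:
                    return False
                chol[i][i] = math.sqrt(pivot)
            else:
                chol[i][j] = (matrix[i][j] - partial) / chol[j][j]
    return True


# A rank-one correlation matrix: positive semidefinite but not invertible.
singular = [[1.0, 1.0], [1.0, 1.0]]
shrunk = shrink_toward_identity(singular, rho=0.01)
print(is_positive_definite(singular), is_positive_definite(shrunk))
```

Because the eigenvalues of the shrunk matrix are $(1 - \rho)$ times the original eigenvalues plus $\rho$, any positive semidefinite input has all eigenvalues at least $\rho$ after shrinkage, and the unit diagonal is preserved.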

#### 2.2. Sparse Graphical Models and SPRING

We next introduce the Semi-Parametric Rank-based approach for INference in Graphical model (SPRING). SPRING relies on the estimation of an undirected graphical model from data. Undirected graphical models are typically used to represent the conditional independence relationships between the variables of a random vector $\mathbf{x} \in \mathbb{R}^p$, so that

> no edge between $x_j$ and $x_k$ $\iff$ $x_j \perp x_k \mid \mathbf{x}_{-j,-k}$,

where $\mathbf{x}_{-j,-k}$ denotes all components of $\mathbf{x}$ except components $j$ and $k$. If the vector $\mathbf{x}$ follows a normal distribution, then conditional independence between $x_j$ and $x_k$ is equivalent to zero partial correlation between variables $j$ and $k$. Therefore, sparse estimates of partial correlations lead to sparse conditional independence graphs. There is a rich literature on sparse estimation of partial correlations, with perhaps the most popular methods being the neighborhood selection of Meinshausen and Bühlmann (2006) (denoted by MB from here on) and the graphical lasso (Friedman et al., 2008). While the SPR estimator of the correlation matrix proposed in section 2.1 can be used in both approaches, we found the MB method to perform better than the graphical lasso in numerical simulations and therefore focus on the MB method in the remainder of the paper.

The MB method takes advantage of the connection between partial correlations and regression coefficients and performs sparse estimation of partial correlations by regressing each of the $p$ variables on the rest, thus finding each node's immediate neighbors by solving a lasso problem (Tibshirani, 1996). Given the column-centered and scaled data matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$ with columns $\mathbf{x}^j$, the MB method solves for each variable $j$

$$\boldsymbol{\beta}^j = \underset{\boldsymbol{\beta} \in \mathbb{R}^p,\, \beta_j = 0}{\operatorname{argmin}} \left\{ n^{-1} \|\mathbf{x}^j - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}.$$

Expanding the squared norm and dropping the term $n^{-1}\|\mathbf{x}^j\|_2^2$, which does not depend on $\boldsymbol{\beta}$, leads to

$$\begin{aligned} \boldsymbol{\beta}^j &= \underset{\boldsymbol{\beta} \in \mathbb{R}^p,\, \beta_j = 0}{\operatorname{argmin}} \left\{ \boldsymbol{\beta}^\top n^{-1} \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} - 2 n^{-1} \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{x}^j + \lambda \|\boldsymbol{\beta}\|_1 \right\} \\ &= \underset{\boldsymbol{\beta} \in \mathbb{R}^p,\, \beta_j = 0}{\operatorname{argmin}} \left\{ \boldsymbol{\beta}^\top \mathbf{S} \boldsymbol{\beta} - 2 \boldsymbol{\beta}^\top \mathbf{s}^j + \lambda \|\boldsymbol{\beta}\|_1 \right\}, \end{aligned}$$

where, given the centering and scaling of $\mathbf{X}$, $\mathbf{S} = n^{-1}\mathbf{X}^\top\mathbf{X}$ is the sample correlation matrix with columns $\mathbf{s}^j$. Since the standard sample correlation matrix is not suited for capturing dependencies in sparse quantitative microbiome data, SPRING replaces the sample correlation $\mathbf{S}$ in the MB method with the SPR estimator $\widetilde{\Sigma}$ from section 2.1. The MB method involves the regularization parameter $\lambda$, which balances the trade-off between sparsity of the neighborhood and goodness of fit, and thus requires data-driven tuning. We here consider a stability-based model selection method, the Stability Approach to Regularization Selection (StARS) (Liu et al., 2010), which has previously been shown to be suitable for graphical model selection on microbiome data (Kurtz et al., 2015; Müller et al., 2016). The StARS method selects the optimal tuning parameter by repeatedly taking subsamples of the original data, estimating the graphical model for each subsample at each $\lambda$ value along a prescribed regularization path, and then calculating empirical edge selection probabilities from the subsamples. The StARS edge stability criterion uses these probabilities to assess the sum of edge variabilities for each graph along the regularization path. The optimal $\lambda$ is selected based on the supplied threshold $t_S$, with standard values being $t_S = 0.05$ and $t_S = 0.1$ (Liu et al., 2010; Kurtz et al., 2015). The threshold value represents a bound on the allowed overall edge variability over the entire graph. Lower thresholds lead to sparser, more robust graphs. Using the selected $\lambda$ value, the final graphical model is refitted on the full dataset.
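Because the rewritten objective depends on the data only through a correlation matrix, the lasso problem for node $j$ can be solved from any correlation estimate. The following minimal sketch uses cyclic coordinate descent with soft-thresholding; it is an illustration under that choice of solver and not the algorithm used in the R package huge. For a unit-diagonal matrix, each coordinate update is $\beta_k \leftarrow \text{soft}(s_{kj} - \sum_{l \ne k} s_{kl}\beta_l,\, \lambda/2)$. The function names and the example matrix are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def mb_neighborhood(corr, j, lam, max_sweeps=200, tol=1e-10):
    """Minimize b'Sb - 2 b's_j + lam * ||b||_1 subject to b_j = 0 by coordinate descent."""
    p = len(corr)
    beta = [0.0] * p
    for _ in range(max_sweeps):
        largest_change = 0.0
        for k in range(p):
            if k == j:
                continue  # the target node is never its own predictor
            # Contribution of all other active predictors to coordinate k.
            rest = sum(corr[k][l] * beta[l] for l in range(p) if l != k and l != j)
            updated = soft_threshold(corr[k][j] - rest, lam / 2.0) / corr[k][k]
            largest_change = max(largest_change, abs(updated - beta[k]))
            beta[k] = updated
        if largest_change < tol:
            break
    return beta


# Chain-like correlation: nodes 0 and 2 are correlated only through node 1.
chain = [
    [1.00, 0.50, 0.25],
    [0.50, 1.00, 0.50],
    [0.25, 0.50, 1.00],
]
beta_0 = mb_neighborhood(chain, j=0, lam=0.1)
print(beta_0)
```

In this example, node 2 has a marginal correlation of 0.25 with node 0 but receives a zero coefficient, so only node 1 is selected as a neighbor of node 0. Passing the SPR estimate in place of `chain` corresponds to the substitution described in the text.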

In summary, SPRING comprises three major components: (i) a semi-parametric rank-based correlation estimator for zero-inflated count data, (ii) the MB method to infer sparse conditional dependencies from the estimated correlation, and (iii) a stability-based approach (StARS) for sparse and robust neighborhood selection.
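The stability-based selection in component (iii) can be sketched as follows, assuming the edge selection frequencies across subsamples have already been computed for every $\lambda$ on the path. The instability measure $2\theta(1 - \theta)$ averaged over candidate edges and the running maximum follow our reading of Liu et al. (2010); the input layout and function names are hypothetical, and the R package pulsar should be consulted for the implementation used in practice.

```python
def edge_instability(frequencies):
    """Average of 2 * theta * (1 - theta) over all candidate edges."""
    return sum(2.0 * theta * (1.0 - theta) for theta in frequencies) / len(frequencies)


def stars_select(path, threshold=0.1):
    """Pick the smallest lambda whose monotonized instability stays within the threshold.

    path is a list of (lam, edge_frequencies) pairs in any order, where
    edge_frequencies holds one selection frequency per candidate edge.
    """
    ordered = sorted(path, key=lambda item: item[0], reverse=True)  # sparse to dense
    selected = ordered[0][0]  # fall back to the sparsest graph
    running_max = 0.0
    for lam, frequencies in ordered:
        running_max = max(running_max, edge_instability(frequencies))
        if running_max > threshold:
            break
        selected = lam
    return selected


# Hypothetical selection frequencies for three candidate edges.
path = [
    (0.5, [0.0, 0.0, 0.0]),
    (0.3, [1.0, 0.0, 0.1]),
    (0.1, [1.0, 0.5, 0.4]),
]
print(stars_select(path, threshold=0.1), stars_select(path, threshold=0.05))
```

With these frequencies the instabilities along the path are 0, 0.06, and about 0.33, so a threshold of 0.1 selects $\lambda = 0.3$, whereas the stricter threshold of 0.05 selects the sparser graph at $\lambda = 0.5$.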

#### 2.3. Extensions to Compositional Data

An important prerequisite for SPRING to be applicable to zero-inflated data is that individual count values across samples are comparable. For TAS-based microbial abundance data this condition is not satisfied because the total read count of a sample is not related to the total number of bacteria in the sample (Vandeputte et al., 2017), thus making the counts inherently proportional quantities. While this drawback is alleviated by the novel experimental techniques for quantitative microbiome data, as discussed earlier, a large number of available datasets, including the HMP and the AGP data, are only available as proportional (or compositional) data. To make SPRING amenable to statistical association inference from relative abundance data, we rely on a novel data transformation.

One of the key challenges in working with compositional data is the presence of the unit-sum constraint. For correlation estimation, a common approach (see e.g., Aitchison, 1983; Kurtz et al., 2015; Cao et al., 2018) is to first apply the centered log-ratio transform (clr) to the compositional vector of each sample $\mathbf{x}_i \in \mathbb{S}^p$

$$\mathbf{z}_i = \text{clr}(\mathbf{x}_i) = [\log\{x_{i1}/g(\mathbf{x}_i)\}, \log\{x_{i2}/g(\mathbf{x}_i)\}, \ldots, \log\{x_{ip}/g(\mathbf{x}_i)\}], \tag{1}$$

where $g(\mathbf{x}_i) = (\prod_{j=1}^{p} x_{ij})^{1/p}$ is the geometric mean of $\mathbf{x}_i$. A correlation matrix is then estimated based on the transformed $\mathbf{z}_i$, $i = 1, \ldots, n$, rather than directly on $\mathbf{x}_i$ (Aitchison, 1983). Since TAS-based microbiome profiling data have a large number of zeros, the addition of a large number of pseudo-counts is required to modify the vector of compositions to only have non-zero proportions. Adding such pseudo-counts changes the measured non-zero proportions and masks the zeros in the data, leading to zeros and non-zeros being treated equally in subsequent analysis. In addition, the choice of the actual value of the pseudo-count can influence downstream analysis results, and mere addition of extra zero components to the compositional vector would also change the transformation.

To avoid these drawbacks and to play on the strengths of SPRING in handling excess zeros, we propose a modified clr transform (mclr) that does not require the use of pseudo-counts. The key steps of the mclr transform are described below and visualized in **Figure 1**.

Contrary to recent efforts in data-driven inference of pseudo-counts (see e.g., Cao et al., 2017; de la Cruz and Kreft, 2018 and references therein), we compute the geometric mean of each sample from positive proportions only, normalize and log-transform all non-zero proportions by using that geometric mean, and apply an identical shift operation to all non-zero components in the dataset. Specifically, let $\mathbf{x}_i \in \mathbb{S}^p$ be the vector of compositions for sample $i$, and for simplicity of illustration, assume that the first $q$ elements of $\mathbf{x}_i$ are zero, and the other elements are non-zero. Then we propose to apply

$$\mathbf{z}_i = \text{mclr}_\varepsilon(\mathbf{x}_i) = [0, \ldots, 0, \log\{x_{i(q+1)}/\tilde{g}(\mathbf{x}_i)\} + \varepsilon, \ldots, \log\{x_{ip}/\tilde{g}(\mathbf{x}_i)\} + \varepsilon], \tag{2}$$

where $\tilde{g}(\mathbf{x}_i) = (\prod_{j=q+1}^{p} x_{ij})^{1/(p-q)}$ is the geometric mean of the non-zero elements of $\mathbf{x}_i$. When $\varepsilon = 0$, $\text{mclr}_0$ corresponds to the clr transform applied to non-zero proportions only (**Figure 1**, middle panel). When $\varepsilon > 0$, $\text{mclr}_\varepsilon$ applies a positive shift to all non-zero compositions. To make all non-zero values strictly positive, we use the data-driven shift $\varepsilon = |z_{\min}| + c$, where $z_{\min} = \min_{ij} \log\{x_{ij}/\tilde{g}(\mathbf{x}_i)\}$, with the minimum taken over the non-zero entries, and $c$ is a positive constant with the default value $c = 1$. Alternative choices are discussed in the **Supplementary Material**. The ultimate rationale for the shift is to preserve the original ordering of the entries of the compositional vector $\mathbf{x}_i$ (with zeros being the smallest) in the transformed vector $\mathbf{z}_i$. The constraint $\varepsilon > |z_{\min}|$ ensures that $z_{i(q+1)}, \ldots, z_{ip}$ are strictly positive for all $i$. The modified clr transform is invariant to the addition of extra zero components, preserves the original zero measurements, and is overall rank-preserving.
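The mclr transform in Equation (2) can be sketched in a few lines. The code below is a minimal illustration under the default shift $\varepsilon = |z_{\min}| + c$ with $c = 1$; the function name `mclr` and the toy compositions are assumptions made for this example and do not reproduce the authors' implementation.

```python
import math


def mclr(samples, c=1.0):
    """Modified clr: zeros stay zero, non-zero entries are clr-transformed and shifted."""
    log_ratios = []
    for x in samples:
        positive = [value for value in x if value > 0]
        if not positive:
            raise ValueError("each sample needs at least one non-zero entry")
        # Log of the geometric mean computed from the non-zero entries only.
        log_geometric_mean = sum(math.log(value) for value in positive) / len(positive)
        log_ratios.append(
            [math.log(value) - log_geometric_mean if value > 0 else None for value in x]
        )
    # Data-driven shift: epsilon = |z_min| + c over all non-zero entries of the dataset.
    z_min = min(z for row in log_ratios for z in row if z is not None)
    epsilon = abs(z_min) + c
    return [[0.0 if z is None else z + epsilon for z in row] for row in log_ratios]


compositions = [
    [0.0, 0.2, 0.8],
    [0.5, 0.0, 0.5],
]
for row in mclr(compositions):
    print(row)
```

In this toy example $z_{\min} = -\log 2$, so the entry 0.2 is mapped to exactly $c = 1$, all non-zero entries become strictly positive, and zeros remain the smallest values in each transformed sample. Appending an extra all-zero taxon to both samples leaves the transformed non-zero values unchanged.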

If a practitioner intends to infer microbial associations from relative abundance data using SPRING, we suggest first applying the $\text{mclr}_\varepsilon$ transform to the relative abundance data and then applying SPRING to the transformed data. While SPRING is completely invariant to the choice of $\varepsilon$ in $\text{mclr}_\varepsilon$ for any value of $\varepsilon$ within the constraint, due to the rank-based estimation of correlation, it does not take into account the compositional nature of the data. Alternative ways of measuring associations between compositional components include Aitchison's variation (Aitchison, 2003), linear compositional associations (Egozcue et al., 2018), and proportionality (Quinn et al., 2017), which take the compositional constraints directly into account. Here, we will focus on correlation-based approaches and present an application of SPRING to the compositional AGP data in section 4.1.

# 3. SIMULATION STUDIES

# 3.1. Generation of Synthetic Quantitative Microbial Abundance Data

We first describe generating mechanisms for synthetic microbial abundance data with prescribed correlation or inverse correlation matrices that emulate quantitative microbial abundance data as closely as possible. We closely follow ideas presented in Kurtz et al. (2015) for synthetic data generation, with several important differences. The workflow of our data generation mechanism is summarized in **Figure 2**.

We propose two constructions for correlation matrices. The first construction takes directly into account the covariance of measured quantitative microbial abundance data. Given a set of $n$ quantitative abundance samples on $p$ taxa $\mathbf{X} \in \mathbb{R}^{n \times p}$, we compute the SPR estimator $\widetilde{\Sigma}$ proposed in section 2.1 from the data and consider the resulting correlation matrix as the ground truth correlation matrix $\Sigma$. The generation of synthetic samples given this correlation matrix estimate is then outlined below. Note that we do not impose any particular properties on the correlation matrix estimate, such as bounded condition number or sparsity. This construction is thus only useful for benchmarking different correlation estimation techniques.

An alternative way of generating a correlation matrix $\Sigma$ is through explicitly controlling certain properties of the inverse correlation matrix. Let $p$ be the number of nodes, i.e., the number of taxa or OTUs, and let $\Theta$ be the $p \times p$ symmetric adjacency matrix such that $\theta_{ij} = 1$ if there is an edge between nodes $i$ and $j$, $i \neq j$, and $\theta_{ij} = 0$ otherwise. We assume that the induced graph has no self-loops, i.e., $\theta_{ii} = 0$. We control the topology of the graph by considering three types of graph topologies: band graphs, cluster graphs, and scale-free graphs. The number of edges in the graph is denoted by $e$. The default value considered here is equal to twice the number of nodes ($e = 2p$), resulting in sparse graphs. Given this fixed sparsity level and the graph type, we use the R package SpiecEasi (Kurtz et al., 2017) to generate a precision matrix $\Omega$ with the pattern of zeros corresponding to $\Theta$. The non-zero entries of the lower triangular elements of $\Omega$, $\omega_{ij}$ with $i > j$, are sampled uniformly at random from the intervals $[-3, -2]$ and $[2, 3]$, and the upper triangular elements are set to $\omega_{ji} = \omega_{ij}$. The diagonal elements are set to a constant such that the final precision matrix has a default condition number $\kappa = 100$. Using $\Omega$, we generate the correlation matrix $\Sigma$ by taking the inverse of the precision matrix, followed by scaling to unit diagonal. This construction thus allows benchmarking of different sparse inverse or partial correlation estimation techniques.

Given a correlation matrix $\Sigma$ from either of the two constructions, we follow Kurtz et al. (2015) and use the "Normal to Anything" (NorTA) approach to generate synthetic abundance data. The NorTA method allows the generation of variables with arbitrary marginal distributions from multivariate normal variables with a given correlation structure. Specifically, we first generate an $n \times p$ matrix $\mathbf{Z}$ with independent normal rows $\mathbf{z}_i \sim N(\mathbf{0}, \Sigma)$ with given correlation matrix $\Sigma$, then get uniform random vectors by applying the standard normal cdf transformation to each column of $\mathbf{Z}$, $\mathbf{u}^j = \Phi(\mathbf{z}^j)$ element-wise, and then apply the quantile functions of the target marginal distributions to each $\mathbf{u}^j$. In Kurtz et al. (2015), the zero-inflated negative binomial distribution (zinegbin) from the VGAM package (Yee, 2010) is used, where the marginal distributional parameters are estimated from measured amplicon data. However, we found that the zinegbin distribution does not emulate well the overdispersion and skewness present in real data. This is evident by comparing the summary statistics between, e.g., the AGP data and corresponding synthetic data generated using the zinegbin, as shown in **Table 1**. To better match real amplicon data, we propose to take a different approach by using the inverse of the empirical cumulative distribution function (ecdf) of each OTU. This inverse can be calculated numerically by using the uniroot.all function in the rootSolve package in R (Soetaert, 2009). As is evident from **Table 1**, the ecdf approach works well in mimicking the summary statistics of real TAS-based data. The match across all counts is considerably better than the match across sample abundances since the ecdf transformation is applied separately to each OTU. Although the within-sample counts are affected by the imposed correlation structure $\Sigma$, the values of the sample total abundance of synthetic data with the ecdf are much closer to the measured ones than those with zinegbin. In terms of count summary statistics, the synthetic data is nearly indistinguishable from the measured data.
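The NorTA construction with empirical marginals can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the ecdf is inverted through order statistics rather than through root finding with uniroot.all, and the function names, the correlation matrix, and the observed counts are made up for the example.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive definite matrix (nested lists)."""
    p = len(matrix)
    chol = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            partial = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][i] = math.sqrt(pivot)
            else:
                chol[i][j] = (matrix[i][j] - partial) / chol[j][j]
    return chol


def empirical_quantile(sorted_values, u):
    """Generalized inverse of the ecdf: smallest observation v with ecdf(v) >= u."""
    n = len(sorted_values)
    index = min(n - 1, max(0, math.ceil(u * n) - 1))
    return sorted_values[index]


def norta_ecdf(corr, observed_by_taxon, n_samples, seed=0):
    """Generate samples with the given latent correlation and empirical marginals."""
    rng = random.Random(seed)
    chol = cholesky(corr)
    p = len(corr)
    sorted_observed = [sorted(column) for column in observed_by_taxon]
    standard_normal = NormalDist()
    synthetic = []
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Correlated normal vector z = L * noise.
        z = [sum(chol[j][k] * noise[k] for k in range(j + 1)) for j in range(p)]
        u = [standard_normal.cdf(value) for value in z]
        synthetic.append([empirical_quantile(sorted_observed[j], u[j]) for j in range(p)])
    return synthetic


corr = [[1.0, 0.7], [0.7, 1.0]]
observed = [[0, 0, 0, 4, 15, 120], [0, 2, 2, 9, 30, 45]]  # one list per taxon
synthetic = norta_ecdf(corr, observed, n_samples=500)
print(synthetic[:5])
```

Every synthetic value is drawn from the set of observed values of the corresponding taxon, so zeros, skewness, and overdispersion of the marginals carry over directly, while the dependence between taxa is induced by the latent correlation matrix.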

#### 3.2. Estimation of Pairwise Correlations

#### 3.2.1. Synthetic Data Generation and Methods for Comparison

We first benchmark estimation of pairwise correlations from synthetic quantitative microbial abundance data. For this purpose, we generate synthetic count data based on the quantitative microbiome profiling data, put forward in Vandeputte et al. (2017) and referred to as QMP data, and consider genus-level correlations. As the processed data used in Vandeputte et al. (2017) are not publicly available, we apply the workflow outlined in **Figure 2**. We reprocessed the available amplicon sequencing data using the standard QIIME protocol with closed-reference OTU picking (Caporaso et al., 2010), adjusted for copy number variations of the 16S rRNA gene using PICRUSt (Langille et al., 2013), and filtered the data using the following three steps: (i) exclude samples whose sequencing depths (total read abundances) are ≤ 10000; (ii) exclude all taxa present in <30% of samples; and (iii) exclude samples whose abundance is less than the first percentile of all sequencing depths. We then combined the resulting samples with the corresponding measured total cell counts (Vandeputte et al., 2017). We next pooled n = 106 healthy subjects from the two available cohorts and merged all OTUs on the genus level, resulting in p = 91 genera. To generate synthetic data based on the QMP data with realistic correlation structure, we use the first construction method of the correlation matrix, outlined in section 3.1, thus considering the SPR correlation estimate on the QMP data as the ground-truth correlation matrix $\Sigma$. We then generate n = 91 synthetic genus-level quantitative microbial abundance data that mimic the original QMP data both in terms of marginal genus distributions and correlation structure.


TABLE 1 | Comparison of summary statistics for all the counts and sample total abundance values between AGP data and two synthetic data generators.

The sample size is n = 2000, the number of OTUs is p = 200, and the synthetic data is based on scale-free graph type.

In addition to the SPR correlation estimation (section 2.1) on the quantitative data, we consider three compositional correlation estimation approaches: (i) Pearson sample correlation on clr-transformed data with pseudo-count addition [as used in SPIEC-EASI (Kurtz et al., 2015)], (ii) SparCC estimation from log-transformed compositions with pseudo-count addition (Friedman and Alm, 2012), and (iii) SPR estimation on $\text{mclr}_\varepsilon$-transformed data (as described in section 2.3).

#### 3.2.2. Results

We measure the performance of the different estimators in terms of absolute differences $|\sigma_{jk} - \widehat{\sigma}_{jk}|$, where $\sigma_{jk}$ is the ground-truth correlation between genera $j$ and $k$, and $\widehat{\sigma}_{jk}$ is the estimated correlation for each of the four methods. **Figure 3** shows box plots of absolute differences for the different methods. We observe that the SPR correlation estimates from the synthetic quantitative data outperform all other estimates, closely followed by the SPR estimates from $\text{mclr}_\varepsilon$-transformed data. SparCC and Pearson correlation on clr-transformed compositions are considerably outperformed by the SPR-type methods. The superiority of SPR-type methods is likely due to the preservation of the zero counts as zeros, thus avoiding distortions through the use of pseudo-counts, and the effective handling of the non-normality of the samples (as visible in the histogram of $\text{mclr}_\varepsilon$-transformed data in **Figure 1**). **Figure 4** shows the corresponding scatter plots of estimated and true pairwise correlations. We observe that SPR estimates on quantitative data are unbiased and have the smallest variance among all methods. SPR estimates on $\text{mclr}_\varepsilon$-transformed data have a slight downward bias and higher variance. SparCC and Pearson correlation on clr-transformed data have the worst performance both in terms of bias and variance.

#### 3.3. Estimation of Microbial Association Networks

#### 3.3.1. Synthetic Data Generation and Methods for Comparison

We next consider the estimation of microbial association networks. For this purpose, we generate synthetic counts from a large subset of the American Gut Project (AGP) data (McDonald et al., 2018), which comprises p = 27116 taxa across n = 8440 samples. The high dimensionality and the large sample size of the AGP data enable a more comprehensive and realistic investigation of the effects of dimensionality and sample size on the estimation of microbial associations than the QMP data. We consider the same data filtering steps as used in section 3.2.1: we (i) exclude samples whose sequencing depths (total read abundances) are ≤ 10000; (ii) exclude all taxa present in <30% of samples; and (iii) exclude samples whose total abundance is less than the first percentile of all sequencing depths. This leads to a reduced dataset with p = 481 taxa across n = 6482 samples. We consider two scenarios for the simulation studies: a large and a small sample size setting. For the large sample size setting, we randomly pick n = 2000 samples with total abundance of at least 10,000, and then select the p = 100 OTUs with the largest abundances, leading to a 2000 × 100 matrix of synthetic counts. For the small sample size setting, we use the same strategy with n = 500 and p = 200. In the synthetic benchmarks, we treat the total observed read abundances as quantitative microbiome profiling abundances and impose sparse conditional dependencies on these counts by using the second correlation construction method, outlined in section 3.1. We refer to these samples as "True data" in the simulations. To investigate the robustness of SPRING to misspecifications of the assumed total, we also generate "Distorted data" by multiplying the counts in every sample by an individual scale factor chosen uniformly at random from the interval [0.5, 3]. The scale factor does not affect a sample's composition but does distort the total abundances. The scale factor interval [0.5, 3] represents a realistic distortion scenario in gut microbiome samples (see e.g., in Vandeputte et al., 2017, **Figure 2**) and is on the same order as typical fold changes of observed image-based total species counts in marine ecosystems (Ducklow, 2000). We study the performance of SPRING both on the "True" and "Distorted" synthetic data in order to assess how strongly a misspecification of the total affects association network inference.
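The distortion step described above can be illustrated with a minimal Python sketch. The function names `distort_counts` and `composition` and the toy count matrix are illustrative assumptions and are not taken from the original (R-based) simulation code; the sketch also assumes that the scaled counts are not rounded. It only demonstrates that a per-sample scale factor drawn uniformly from [0.5, 3] changes the sample totals while leaving the relative abundances unchanged.

```python
import random

def distort_counts(counts, low=0.5, high=3.0, seed=0):
    """Multiply every sample (row) by its own factor drawn from U[low, high]."""
    rng = random.Random(seed)
    factors = [rng.uniform(low, high) for _ in counts]
    distorted = [[f * x for x in row] for f, row in zip(factors, counts)]
    return distorted, factors

def composition(row):
    """Relative abundances of one sample (row sums to one)."""
    total = sum(row)
    return [x / total for x in row]

# Toy "True data": 3 samples (rows) by 4 taxa (columns); values are made up.
true_counts = [[120, 0, 30, 50], [10, 40, 0, 150], [0, 5, 5, 90]]
distorted, factors = distort_counts(true_counts)

for original, scaled in zip(true_counts, distorted):
    # Totals differ, compositions agree up to floating-point error.
    print(sum(original), round(sum(scaled), 2), composition(scaled))
```

Compositional methods only see the output of `composition`, which is why the distortion cannot affect them, whereas any method that uses the totals sees a different input.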

FIGURE 3 | Absolute differences between true and estimated correlation coefficients, $|\sigma_{jk} - \hat{\sigma}_{jk}|$, for four methods: SPR correlation estimation on quantitative data (green), SPR correlation estimation on mclr<sub>ε</sub>-transformed compositional data (brown), SparCC estimation (Friedman and Alm, 2012) (purple), and Pearson sample correlation on clr-transformed data (red) [as used in SPIEC-EASI (Kurtz et al., 2015)].

Along with SPRING, we consider three methods for comparison. To study the influence of the sample correlation estimation, we consider the standard MB method using the Pearson sample correlation (Meinshausen and Bühlmann, 2006) [implemented in the R package huge (Zhao et al., 2012)]. We also consider two popular methods for microbial association inference from relative abundance data: SPIEC-EASI in the MB mode (Kurtz et al., 2015) and SparCC (Friedman and Alm, 2012) (both implemented in the R package SpiecEasi). The original SparCC method, however, is used for inferring marginal rather than conditional dependencies. For fair comparison with the other methods, we therefore introduce a modification of SparCC, termed invSparCC. The invSparCC method estimates the correlation matrix using the default SparCC method (as implemented in the R package SpiecEasi), and then uses the SparCC correlation estimator as input to the MB method, described in section 2.2. All considered methods use the neighborhood selection principle to derive a sparse graphical model; see **Table 2** for a summary of all methods. The inferred adjacency and coefficient matrices are thus not guaranteed to be symmetric. We use the "or" rule and the "maxabs" rule to symmetrize the estimated adjacency and coefficient matrices, respectively. The "or" rule assigns an edge between nodes i and j if either node i is selected as a neighbor of j or node j is selected as a neighbor of i. The "maxabs" rule symmetrizes the coefficient matrix by taking, for each pair of nodes, the coefficient with the maximum absolute value. For the selection of the tuning parameter λ, we use the R package pulsar with the "StARS" edge stability criterion and 50 subsamples, with the subsampling ratio fixed at 10√n/n, where n is the sample size.
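The two symmetrization rules and the subsampling ratio can be made concrete with a short Python sketch. All function names and the toy coefficient matrix are hypothetical and are not part of pulsar, huge, or SpiecEasi; ties in absolute value under the "maxabs" rule are resolved here by keeping the entry from the upper triangle, which is an assumption of this sketch.

```python
import math

def symmetrize_or(adj):
    """'or' rule: keep edge i-j if i selects j or j selects i."""
    p = len(adj)
    out = [[0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            if adj[i][j] or adj[j][i]:
                out[i][j] = out[j][i] = 1
    return out

def symmetrize_maxabs(coef):
    """'maxabs' rule: for each pair keep the coefficient of larger absolute value."""
    p = len(coef)
    out = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            a, b = coef[i][j], coef[j][i]
            out[i][j] = out[j][i] = a if abs(a) >= abs(b) else b
    return out

def subsample_ratio(n):
    """Subsampling ratio 10 * sqrt(n) / n used for the StARS subsamples."""
    return 10 * math.sqrt(n) / n

# Toy neighborhood-selection output: row i holds the coefficients of node i's regression.
coef = [[0.0, 0.3, 0.0], [-0.5, 0.0, 0.0], [0.0, 0.2, 0.0]]
adj = [[1 if c != 0 else 0 for c in row] for row in coef]

print(symmetrize_or(adj))
print(symmetrize_maxabs(coef))
print(round(subsample_ratio(2000), 4), round(subsample_ratio(500), 4))
```

For the two simulation settings, the ratio corresponds to subsamples of roughly 447 out of 2000 samples and 224 out of 500 samples.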

#### 3.3.2. Results

We first compare the methods in terms of the Hamming distance between the true and the estimated graph. The Hamming distance is calculated as the number of edges that disagree with the true graph at each value of the tuning parameter λ. The comparison of Hamming distance curves across the values of λ allows us to check the best achievable Hamming distance value, which is agnostic to the tuning parameter selection scheme. We consider 50 values of λ for all methods, equally spaced on a logarithmic scale, with λmax corresponding to no edges in the estimated graph, and λmin = 0.01λmax. For more accurate comparison, we consider 50 replications of the data generating process for each specified combination of n and p. The mean Hamming distance values over 50 replications as functions of λ are plotted in **Figure 5**, with bands corresponding to ± two standard errors. The MB method is uniformly outperformed by all other methods, confirming that standard sample correlation is not suitable for capturing dependencies in sparse quantitative microbiome data. SPIEC-EASI and invSparCC have comparable performance, with SPIEC-EASI achieving smaller mean values. SPRING performs best in all cases considered here. The most challenging scenario is the scale-free graph with low sample size, with SPRING, SPIEC-EASI, and invSparCC having comparable performance. As expected, the distortion of total abundances has no effect on the compositional methods SPIEC-EASI and invSparCC, but decreases the performance of MB and SPRING. Nevertheless, the minimum Hamming distance achieved by SPRING on distorted data is still comparable to or better than the minimum distances achieved by the other methods, thus suggesting that SPRING is robust to misspecification of total abundance values.
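As a minimal illustration of the two ingredients of this comparison, the following Python sketch builds a regularization path of 50 values that are equally spaced on a logarithmic scale between λmax and 0.01λmax, and counts the disagreeing edges between two undirected graphs. The function names and the toy adjacency matrices are assumptions made for illustration only.

```python
import math

def lambda_path(lam_max, n_lambda=50, ratio=0.01):
    """Values from lam_max down to ratio * lam_max, equally spaced on a log scale."""
    log_hi = math.log(lam_max)
    log_lo = math.log(ratio * lam_max)
    step = (log_hi - log_lo) / (n_lambda - 1)
    return [math.exp(log_hi - k * step) for k in range(n_lambda)]

def hamming_distance(true_adj, est_adj):
    """Number of node pairs (i < j) whose edge status differs between two graphs."""
    p = len(true_adj)
    return sum(
        true_adj[i][j] != est_adj[i][j]
        for i in range(p)
        for j in range(i + 1, p)
    )

path = lambda_path(0.8)  # 0.8 is an arbitrary toy value for lambda_max

# Toy graphs on three nodes: one false positive (0-2) and one false negative (1-2).
true_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
est_adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
print(len(path), path[0], path[-1], hamming_distance(true_adj, est_adj))
```

Evaluating `hamming_distance` at every value of the path and taking the minimum gives the best achievable distance that the curves in **Figure 5** are meant to reveal.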

To gain further insights into the edge selection performance of the different methods, we analyze the overlapping sets of selected edges for all methods. We here focus on the cluster graph type in the low sample size regime (n = 500, p = 200). For each method we select the tuning parameter λ using StARS at t<sup>S</sup> = 0.1 and repeat the experiment over 50 replications. **Figure 6** shows the average number of edges that overlap across all methods as well as average proportions of true edges among the selected ones. Among all sets uniquely identified by an individual method, SPRING shows the highest true positive rate (0.72), followed by SPIEC-EASI (0.42), invSparCC (0.12), and MB (0.01). The edge set that is jointly selected by SPRING, SPIEC-EASI, and invSparCC shows the highest true positive rate (0.95) and highest number of selected edges (≈ 246), followed by the edge set jointly selected by all four methods (true positive rate 0.94 and ≈ 54 edges). This suggests that a promising strategy for a practitioner screening for true statistical associations is to apply SPRING, SPIEC-EASI, and invSparCC independently and select the overlapping edge set.
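The overlap analysis can be sketched as follows: every selected edge is assigned to the exact combination of methods that selected it, and for each combination we report the number of edges and the proportion of true edges among them. The function name `exclusive_overlaps`, the method labels A, B, C, and the toy edge lists are hypothetical and serve only to illustrate the bookkeeping; they do not reproduce the numbers in **Figure 6**.

```python
def exclusive_overlaps(edge_sets, true_edges):
    """Group edges by the exact set of methods that selected them."""
    membership = {}
    for method, edges in edge_sets.items():
        for edge in edges:
            # frozenset makes (i, j) and (j, i) the same undirected edge
            membership.setdefault(frozenset(edge), set()).add(method)
    truth = {frozenset(edge) for edge in true_edges}
    regions = {}
    for edge, methods in membership.items():
        key = tuple(sorted(methods))
        n_edges, n_true = regions.get(key, (0, 0))
        regions[key] = (n_edges + 1, n_true + (edge in truth))
    return {
        key: {"n_edges": n_edges, "true_fraction": n_true / n_edges}
        for key, (n_edges, n_true) in regions.items()
    }

# Toy edge sets for three hypothetical methods and a toy ground truth.
edge_sets = {
    "A": [(1, 2), (2, 3), (3, 4)],
    "B": [(2, 1), (3, 4), (4, 5)],
    "C": [(3, 4)],
}
true_edges = [(1, 2), (3, 4)]
overlaps = exclusive_overlaps(edge_sets, true_edges)
for key in sorted(overlaps):
    print(key, overlaps[key])
```

Averaging these per-combination counts and proportions over replications gives the kind of summary reported for the four methods above.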

Next, we consider one data replication and compare the Hamming distances achieved by selecting the tuning parameter λ using StARS. The results are shown in **Figure 7** with two StARS thresholds considered (stars indicating 0.1 and circles indicating 0.05). As expected, a smaller threshold corresponds to a larger tuning parameter, leading to a sparser graph. At the same time, based on the numerical results, the threshold of 0.1 tends to reach smaller Hamming distances for all methods except MB. In general, both thresholds lead to reasonable values of λ in terms of Hamming distance. As in the previous comparison, SPRING leads to smaller Hamming distance values for "True" data and is robust to misspecified total abundance values.
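To show why a smaller threshold yields a sparser graph, the following Python sketch implements the StARS selection rule as we understand it from Liu et al. (2010): the instability at each λ is the average of 2θ(1 − θ) over node pairs, where θ is the fraction of subsamples in which the pair is selected; the instability is made monotone along the path, and the densest graph whose monotonized instability does not exceed the threshold is chosen. The function name, the toy frequencies, and the fallback to the sparsest graph are assumptions of this sketch, and the pulsar implementation may differ in detail.

```python
def stars_select(lambdas, selection_freqs, threshold=0.1):
    """Return (index of the selected lambda, list of instabilities).

    lambdas must be sorted in decreasing order (sparse to dense graphs);
    selection_freqs[k] holds, for each node pair, the fraction of subsamples
    in which that pair was selected as an edge at lambdas[k].
    """
    instability = [
        sum(2 * theta * (1 - theta) for theta in freqs) / len(freqs)
        for freqs in selection_freqs
    ]
    chosen = 0  # fall back to the sparsest graph if no value passes the threshold
    running_max = 0.0
    for k, value in enumerate(instability):
        running_max = max(running_max, value)  # monotonize along the path
        if running_max <= threshold:
            chosen = k
    return chosen, instability

# Toy path with four lambda values and three node pairs (made-up frequencies).
lambdas = [0.8, 0.4, 0.2, 0.1]
freqs = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.9, 0.0],
    [1.0, 0.5, 0.5],
]
idx_010, _ = stars_select(lambdas, freqs, threshold=0.1)
idx_005, _ = stars_select(lambdas, freqs, threshold=0.05)
print(lambdas[idx_010], lambdas[idx_005])
```

In this toy example the threshold 0.05 selects a larger λ than the threshold 0.1, in line with the behavior described above.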

FIGURE 5 | Hamming distance as a function of tuning parameter λ. The lines correspond to mean values across 50 replications, and the bands show ± two standard errors. True abundance and distorted abundance data are distinguished by the transparency level and the line type: true data are less transparent and have solid lines; distorted data are more transparent and have dotted lines.

over 50 replications with n = 500, p = 200, and cluster-type graph. Corresponding standard deviations are given in parentheses.

Finally, we compare the estimated graphs from all methods in terms of precision and recall curves, where

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}},$$

and TP, FP, and FN indicate the number of True Positives, False Positives, and False Negatives, respectively. To construct the curves, we extract the edge selection probabilities based on 50 subsamples from pulsar corresponding to tuning parameter with t<sup>S</sup> = 0.1. We calculate precision and recall values by changing the threshold for edge selection probability from 1 to 0, interpolating the precision-recall values at the edges for no selection (recall = 0, precision = 1) and complete selection [recall = 1, precision = 4/(p − 1)]. Here 4/(p − 1) is the probability of choosing true edges (e = 2p) at random among all possible edges (p(p − 1)/2). The resulting curves are shown in **Figure 8**. For True data, SPRING achieves the highest precision-recall curves across all scenarios. The Area Under the Precision-Recall curve (AUPR) values are reported in **Table 3**. For the distorted data, SPRING is still best or among the best methods for band and cluster graph types, and is outperformed by the compositional methods for scale-free graph type in the low sample size regime.
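The construction of the precision-recall curve from edge selection probabilities can be sketched in a few lines of Python. The function names, the toy probabilities, and the representation of edges as tuples (i, j) with i < j are illustrative assumptions; the sketch sweeps the threshold from high to low and also reproduces the chance-level precision 4/(p − 1) for e = 2p true edges.

```python
def precision_recall(selected, truth):
    """Precision and recall of a selected edge set against the true edge set."""
    tp = len(selected & truth)
    fp = len(selected - truth)
    fn = len(truth - selected)
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0  # convention for no selection
    recall = tp / (tp + fn)
    return precision, recall

def pr_curve(edge_probs, truth):
    """(recall, precision) points obtained by lowering the probability threshold."""
    points = [(0.0, 1.0)]  # no selection: recall = 0, precision = 1
    for threshold in sorted(set(edge_probs.values()), reverse=True):
        selected = {e for e, prob in edge_probs.items() if prob >= threshold}
        precision, recall = precision_recall(selected, truth)
        points.append((recall, precision))
    return points

def random_precision(p, n_true_edges):
    """Precision of complete selection: true edges among all p(p - 1)/2 pairs."""
    return n_true_edges / (p * (p - 1) / 2)

# Toy selection probabilities on four candidate edges (made-up values).
truth = {(0, 1), (1, 2)}
edge_probs = {(0, 1): 1.0, (0, 2): 0.8, (1, 2): 0.6, (2, 3): 0.2}
points = pr_curve(edge_probs, truth)
print(points)
print(random_precision(200, 2 * 200), 4 / (200 - 1))
```

The area under the resulting piecewise curve, after adding the complete-selection endpoint, corresponds to the AUPR summary used for comparison.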

In conclusion, SPRING exhibits considerably better graph recovery performance than existing methods, and is robust to misspecification of total sample abundance. This suggests that incorporating quantitative abundance information in the analysis leads to more reliable graphical model inference.

# 4. STATISTICAL MICROBIAL ASSOCIATIONS IN GUT MICROBIOME DATA

We provide two applications of SPRING to TAS-based microbial abundance data: a subset of the relative abundance data from the American Gut Project (AGP) (McDonald et al., 2018) and the QMP data from Vandeputte et al. (2017).

## 4.1. Taxon-Taxon Associations From the American Gut Project Data

We first use SPRING to infer taxon-taxon associations from the relative abundance AGP data. After the pruning and filtering steps described in section 3.3.1, we arrive at p = 481 OTUs from n = 6482 samples. Prior to applying SPRING, we transform the compositions **X** ∈ S<sup>n×p</sup> using the mclr<sub>ε</sub> transform introduced in Equation (2). The minimum value of the mclr<sub>0</sub>-transformed data across all samples is zmin = −4.8142. To make all nonzero values strictly positive, we add an arbitrary constant c = 1 to |zmin| and use the shift ε = |zmin| + c = 5.8142 in the final mclr<sub>ε</sub> transform. We also consider SPIEC-EASI, MB, and invSparCC (see **Table 2**) for comparison. All four methods use the same parameterization for the regularization path and StARS model selection: 50 subsamples with the same seed number, the same subsampling ratio (10√n/n = 0.1242), and 50 tuning parameter values with the same ratio of the smallest to largest λ value (λmin/λmax = 0.01). For each method, λmax is set to the maximum value of the off-diagonal elements of the respective correlation matrix. All computations were performed in R using the R packages pulsar, SpiecEasi, huge, and mixedCCA.
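The shift construction described above can be illustrated with a small Python sketch of a modified clr transform. It follows the verbal description in this article (geometric mean over the positive entries of each sample, log-ratio for nonzero entries, a common shift ε = |zmin| + c for all nonzero entries, zeros left at zero); the function name `mclr`, the toy compositions, and the requirement that every sample has at least one positive entry are assumptions of this sketch and not the SPRING package code.

```python
import math

def mclr(X, c=1.0):
    """Modified clr sketch: returns (transformed data, shift eps)."""
    Z0 = []
    for row in X:
        positive = [x for x in row if x > 0]
        # log of the geometric mean over the positive entries only
        log_gm = sum(math.log(x) for x in positive) / len(positive)
        Z0.append([math.log(x) - log_gm if x > 0 else 0.0 for x in row])
    # smallest transformed value among the nonzero entries
    z_min = min(
        z for z_row, row in zip(Z0, X) for z, x in zip(z_row, row) if x > 0
    )
    eps = abs(z_min) + c
    # shift all nonzero entries by eps; zeros stay exactly zero
    Z = [
        [z + eps if x > 0 else 0.0 for z, x in zip(z_row, row)]
        for z_row, row in zip(Z0, X)
    ]
    return Z, eps

# Toy compositions: 2 samples by 4 taxa, each row summing to one.
X = [[0.5, 0.0, 0.25, 0.25], [0.1, 0.9, 0.0, 0.0]]
Z, eps = mclr(X)
print(eps)
print(Z)
```

Because only ratios to the per-sample geometric mean enter the transform, applying `mclr` to counts or to the corresponding proportions gives the same result, and within each sample the ordering of the nonzero entries is preserved.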

We report summary statistics of the estimated association networks for two StARS stability thresholds, 0.05 (the standard setting in SpiecEasi) and 0.1 (the standard setting in Liu et al., 2010), in **Table 4**. For both stability thresholds, the MB method estimates the sparsest networks with the highest percentage of positive edges (PEP), while invSparCC estimates the densest networks with the lowest percentage of positive edges. The association networks of SPRING and SPIEC-EASI have similar edge densities, while SPRING has a considerably higher percentage of positive partial correlation edges.

To get a bird's eye view of the topologies of the different association networks, we visualize the four networks at StARS threshold 0.05 in **Figure 9A**. The force-directed layout of all networks follows the optimal layout of the SPRING network. At the selected StARS threshold, all networks have one connected component. The overall network structure suggests a dense core with two peripheral network modules, similar to a previous analysis (Müller et al., 2016). The networks of the compositionally-adjusted methods SPIEC-EASI and invSparCC connect the core and one of the modules by a large number of positive (shown in green) and negative (shown in red) associations. SPRING considerably sparsifies these connections, leaving only a few positive and negative edges between the modules, and MB does not infer any negative associations. We assess the similarity among the estimated networks by analyzing their edge set overlap in **Figure 9B**. All methods share a common core of 601 edges. As expected, SPIEC-EASI and invSparCC share the largest unique two-set overlap with 637 edges. SPRING's network takes an intermediate role between MB and the compositionally-adjusted methods. It shares 833 edges with SPIEC-EASI and invSparCC, and 112 edges exclusively with MB. Each method by itself also comprises a considerable set of exclusive edges, ranging from 418 for SPIEC-EASI to 767 for SPRING.

## 4.2. Genus-Genus Associations From Quantitative Gut Microbiome Profiling Data

We next analyze the quantitative gut microbiome data put forward in Vandeputte et al. (2017). We focus on estimating genus-genus associations both from the quantitative and the relative microbiome profiles, referred to as QMP and RMP, and analyze the consistency among the inferred networks. We follow the processing steps outlined in section 3.2.1, leading to n = 106 subjects and p = 91 genera. To infer statistical genus-genus associations we use SPRING for the QMP data (without transformation), and SPIEC-EASI for the corresponding RMP data (using the standard clr transformation) with the same computational protocol as detailed in the previous section.

For all methods, the final graphical model is estimated by combining the neighborhood selection approach with pulsar tuning parameter selection. <sup>∗</sup>When absolute abundance data are not available, SPRING can be applied to relative abundance data following the mclr transform described in section 2.3.

TABLE 3 | Area under the Precision-Recall curves (AUPR) of Figure 8.

We first show the agreement of signed edges between the two association networks at StARS stability level 0.1 in **Table 5**. Overall, out of the 4095 possible genus-genus associations, SPRING infers a set of 237 stable edges with a PEP of 98%. SPIEC-EASI infers 220 edges with a PEP of 66%. From the quantitative data, SPRING is able to detect considerably more positive associations, 140 of which are missed by SPIEC-EASI from the relative abundance data. SPRING detects only four negative associations, three of which are missed by SPIEC-EASI despite its considerably larger set of negative edges (74 overall). However, both methods do agree on a set of 93 edges, 92 positive and one negative edge. Importantly, we do not observe any sign flips among the different inferred edge sets. Missed positive or negative edges are simply absent in the other method.

We next focus on the induced genus-genus sub-network which only includes genera that have an assigned taxonomy and have at least one strong association (absolute value ≥ 0.2) in either the SPRING-inferred or the SPIEC-EASI-inferred association network. The weighted adjacency of this sub-network includes 32 genera and is shown in **Figure 10**. Among the 14 genera with the highest total abundance across all samples (Bacteroides to Odoribacter), we observe 50% agreement between the two estimated networks (six edges are the same across all networks, three edges are different in SPIEC-EASI, four are different in SPRING). Both networks include a strong negative association between Phascolarctobacterium and Dialister and exactly four positive associations of Bacteroides with Parabacteroides, Holdemania, Bilophila, and Odoribacter (first row and column in **Figure 10**). We also observe the absence of a negative association between the Bacteroides and Prevotella genera in the quantitative data, an association which is often reported in the literature and is also present in the SPIEC-EASI network (see also Vandeputte et al., 2017 for a discussion).

# 5. DISCUSSION

TABLE 4 | AGP data: total number of partial correlation edges and percentage of positive partial correlation edges (PEP) (Faust et al., 2015) as estimated by MB, SPRING, SPIEC-EASI, and invSparCC for StARS stability thresholds t<sup>S</sup> = 0.05 and 0.1.

In each cell, the AUPR values of the True data and the Distorted data (given in parentheses) are reported. The AUPR value is based on edge selection probabilities using StARS with t<sup>S</sup> = 0.1.

TABLE 5 | QMP data: summary of agreement of signed genus-genus partial correlations, inferred by SPRING and SPIEC-EASI at StARS stability threshold t<sup>S</sup> = 0.1.

Advances in experimental microbiome profiling protocols have combined high-throughput environmental sequencing techniques with robust measurements of microbial cell counts (Gifford et al., 2011; Satinsky et al., 2013; Stämmler et al., 2016; Props et al., 2017; Vandeputte et al., 2017; Tkacz et al., 2018), providing, for the first time, a more quantitative picture of the underlying microbial ecosystems in their natural habitat. To facilitate a high-level summary of the complex interplay between the constituents of the ecosystem, an important first exploratory analysis step is the estimation of statistical association networks between the identified operational taxonomic units or gene sets (Faust and Raes, 2012; Fuhrman et al., 2015; Sunagawa et al., 2015; Ruiz et al., 2017). In order to learn such association networks from sparse quantitative microbiome data, we have introduced the Semi-Parametric Rank-based approach for INference in Graphical model (SPRING). SPRING combines neighborhood selection (Meinshausen and Bühlmann, 2006) to infer the conditional dependency graph with stability-based model selection (Liu et al., 2010; Müller et al., 2016) to identify a sparse set of partial correlation estimates. The resulting network of partial correlations represents direct (i.e., conditionally dependent) microbe-microbe associations and provides a statistical community-level description of the underlying microbial ecosystem. As ground truth microbial association networks are largely elusive in the literature, we have based our numerical simulation benchmarks on a novel synthetic quantitative microbiome data generation mechanism which might be of independent interest to researchers who want to test novel statistical techniques on such data.

Our benchmark test cases revealed a number of interesting observations. Firstly, we showed that, on synthetic quantitative microbiome data with prescribed ground-truth correlation structure, the SPR-type correlation estimates are considerably more accurate than SparCC and naive Pearson sample correlation on clr-transformed compositional data. Secondly, we showed that Pearson sample correlation estimation cannot be used to identify sparse partial correlations in quantitative microbiome data. Thirdly, SPRING outperformed sparse graphical modeling techniques that were designed with compositional data in mind, namely SPIEC-EASI (Kurtz et al., 2015) and the invSparCC estimator introduced here, which uses neighborhood selection with SparCC correlation estimation (Friedman and Alm, 2012). SPRING compared favorably to the other methods both in terms of achievability, that is, in terms of minimum Hamming distance to the true underlying network achieved across the regularization path (see **Figure 7**), and in combination with stability-based model selection in terms of Precision-Recall (see **Figure 8**). We also quantified the robustness of SPRING to misspecification of the total by randomly distorting the counts of each sample by up to a 6-fold change, which represents a realistic distortion scenario in gut microbiome samples (see e.g., in Vandeputte et al., 2017, **Figure 2**) and is on the same order as typical fold changes of observed image-based total species counts in marine ecosystems (Ducklow, 2000). Even under these distortions, SPRING's performance was on par with or superior to that of SPIEC-EASI and invSparCC (which are scale-invariant by design). SPRING's robustness to total count misspecifications thus motivated us to include an application of association inference from relative microbiome profiling data. In order to apply SPRING to relative abundance data, we introduced a modified centered log-ratio (clr) transform that can seamlessly handle excess zeros without pseudo-count addition. Contrary to recent efforts in data-driven pseudo-count inference (see de la Cruz and Kreft, 2018 and references therein), we computed the geometric mean of each sample from positive proportions only, normalized and log-transformed all non-zero proportions by using that geometric mean, and applied an identical shift operation to all non-zero variables in the dataset. This transformation is rank-preserving while leaving the original zero proportions unchanged, thus enabling the application of the SPRING methodology without further modification to relative abundance data.

We applied SPRING to two prominent gut microbial datasets, the relative abundance data collected in the American Gut Project (AGP) (McDonald et al., 2018) and the quantitative gut microbiome profiling (QMP) data from Vandeputte et al. (2017). As the processed data from Vandeputte et al. (2017) was not publicly available, a reprocessing of the amplicon sequencing reads was necessary.

From the AGP data, we inferred taxon-taxon association networks across p = 481 taxa from n = 6482 samples using neighborhood selection (MB), SPIEC-EASI, invSparCC, and SPRING. In line with previous findings (Faust et al., 2015), the percentage of positive edges in the networks is > 75%, with MB and SPRING having even higher percentages than SPIEC-EASI and invSparCC. At both StARS stability levels 0.05 and 0.1 reported here, SPRING and MB tended to infer slightly sparser association networks than SPIEC-EASI and invSparCC. At StARS stability level 0.05, we analyzed the overlap of edge sets among the different methods (**Figure 9**). All methods share a common core of 601 edges. In addition, SPRING, SPIEC-EASI, and invSparCC shared the largest common edge set of size 833 among all three-set overlaps. As expected, the two compositionally-adjusted methods SPIEC-EASI and invSparCC shared the largest common two-set overlap of 637 edges. In the absence of verified taxon-taxon associations, our analysis suggests that a practitioner screening for coherent statistical associations among taxa can apply SPRING, SPIEC-EASI, and invSparCC independently and select the set of strongest edges out of the edge set that these three methods jointly inferred. This strategy is also supported by our synthetic benchmark results, where the joint edge set of the three methods achieved a true positive rate of 0.95 for cluster graphs. For the analysis on the AGP data, this strategy would result in an edge set of size 1434 (the 601 edges shared by all four methods plus the 833 edges shared by the three methods only), an average of about three associations per taxon. This core network can then be further studied in terms of modularity, network stability, and node centrality measures, as shown, e.g., in Ruiz et al. (2017) and Tipton et al. (2018).

For the QMP data, we used SPRING and SPIEC-EASI to estimate the genus-genus associations from the quantitative and the relative microbiome profiles, respectively. Our analysis revealed considerable differences from the published results in Vandeputte et al. (2017). The original study described dramatic differences between significant marginal genus-genus correlations from 66 healthy control samples in the QMP disease cohort when applying Spearman's ρ correlation to the relative and quantitative microbiome profiling data (see e.g., **Figure 3** in Vandeputte et al., 2017). Our results here showed more coherence of the statistical associations inferred from relative and absolute abundance data. Overall, 92 positive, 1 negative, as well as 3731 zero associations were in common among both association networks, while both networks differed in 280 associations (**Table 5**). Our analysis on the genus sub-network that comprised all genera with at least one strong association (absolute value ≥ 0.2), shown in **Figure 10**, verified a strong negative association between Phascolarctobacterium and Dialister inferred from both data types, as well as the absence of a negative association between Bacteroides and Prevotella genera in the quantitative data, both in agreement with published results. However, we recovered, for both data types, exactly four positive associations for Bacteroides, namely with Parabacteroides, Holdemania, Bilophila, and Odoribacter (first row and column in **Figure 10**). The latter two associations were previously reported to be present only in the quantitative data. Overall, more than 30% of the edges in the sub-network agreed, which is in marked contrast to the results reported in Vandeputte et al. (2017). The higher network consistency reported here can be attributed to several factors. Firstly, our amplicon data processing framework may result in slight differences in terms of OTU picking and avoids a rarefaction step which was included previously. Secondly, we considered partial rather than marginal correlations among the genera to avoid any influence of indirect associations. Thirdly, we analyzed both data types within the same coherent statistical learning framework: sparse learning of partial correlations via neighborhood selection followed by stability-based model selection with the identical stability threshold (here 0.1). Finally, we considered a larger sample size of n = 106 representing healthy subjects from two different cohorts available in the QMP data as opposed to the n = 66 samples used in the original study. We conclude that differences in association networks from relative and absolute abundance data are not only attributable to the data themselves but also highly method-dependent.

In summary, we believe that, as quantitative microbiome profiling becomes increasingly available, the semi-parametric rank-based estimators for correlation and partial correlation estimation discussed here provide an important tool for reliable statistical analysis of quantitative microbiome data. While we have focused here on targeted amplicon-based sequencing datasets, our methodology is broadly applicable to other biological high-throughput data with a large excess of zero counts, including quantitative metagenomics (Satinsky et al., 2013), single-cell RNA-Seq data (see Risso et al., 2018 for a recent statistical analysis framework), and mass spectrometry proteomics data (Drew et al., 2017). Moreover, the concept of SPR-type correlation employed in SPRING can naturally generalize to the joint analysis of multi-omics datasets when, on the same sample, several zero-inflated data types are measured in tandem. The approach in Yoon et al. (2018) already exploits this idea for RNA-seq and micro-RNA data in the context of canonical correlation analysis. Extending SPRING in a similar way to joint graphical modeling of mixed data types is a promising next step toward a consistent and coherent statistical analysis framework for sparse high-throughput biological datasets.

#### AUTHOR CONTRIBUTIONS

GY, IG, and CM developed the methodology. GY led the numerical analysis. IG assisted GY in simulation studies. CM assisted GY in data analyses. GY, IG, and CM prepared the manuscript.

#### FUNDING

GY was supported by the National Cancer Institute grant T32-CA090301. IG was supported by the National Science Foundation grant DMS-1712943. CM's work was supported by the Simons Foundation.

#### ACKNOWLEDGMENTS

We are grateful to Dr. Zachary D. Kurtz, Lodo Therapeutics, for providing the processed American Gut data and the QMP data.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00516/full#supplementary-material


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Yoon, Gaynanova and Müller. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction

Yi-Hui Zhou<sup>1</sup> \* and Paul Gallins <sup>2</sup>

*<sup>1</sup> Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States, <sup>2</sup> Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States*

With the growing importance of microbiome research, there is increasing evidence that host variation in microbial communities is associated with overall host health. Advancement in genetic sequencing methods for microbiomes has coincided with improvements in machine learning, with important implications for disease risk prediction in humans. One aspect specific to microbiome prediction is the use of taxonomy-informed feature selection. In this review for non-experts, we explore the most commonly used machine learning methods, and evaluate their prediction accuracy as applied to microbiome host trait prediction. Methods are described at an introductory level, and R/Python code for the analyses is provided.

Keywords: disease, phenotype, modeling, machine learning, prediction

#### Edited by:

*Lingling An, University of Arizona, United States*

#### Reviewed by:

*Himel Mallick, Merck, United States Jun Chen, Mayo Clinic, United States*

> \*Correspondence: *Yi-Hui Zhou yihui\_zhou@ncsu.edu*

#### Specialty section:

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> Received: *13 January 2019* Accepted: *04 June 2019* Published: *25 June 2019*

#### Citation:

*Zhou Y-H and Gallins P (2019) A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front. Genet. 10:579. doi: 10.3389/fgene.2019.00579*

# 1. INTRODUCTION

The microbiome is the collection of all microbes living in or on a host, including bacteria, viruses, and fungi (Robinson and Pfeiffer, 2014). The risk or severity of numerous diseases and disorders in a host is associated with the microbiome (Kinross et al., 2011), and accurate trait prediction based on microbiome characteristics is an important problem (Rothschild et al., 2018). The application of modern machine learning algorithms is proving to be valuable in this effort (Gilbert et al., 2018). This review/tutorial focuses on the bacterial component of the microbiome, although in principle many of the elements apply more generally.

With modern high-throughput sequencing, entire microbial communities can be profiled, revealing an extensive diversity of genes and organisms (Turnbaugh et al., 2007). A common strategy is to sequence only a highly specific region, such as 16S ribosomal RNA (rRNA), although the methods described below can also be applied to metagenomic shotgun methods (Mande et al., 2012). Due to the graded nature of sequence similarity, the data are often organized into operational taxonomic units (OTUs) (Schmitt et al., 2012), i.e., clusters of similar sequences, intended to represent the abundance of a particular bacterial taxon while avoiding excessive sparsity that would result if only identical sequences were grouped. Typical choices of similarity limits (e.g., grouping sequences with no more than 3% dissimilarity) produce taxa that are specific to bacterial species, or represent a further subdivision within species. Informatic methods for taxonomic classification use databases (McDonald et al., 2012), such as SILVA (Quast et al., 2012), and are beyond our scope, but we assume that such classification is available. The result after OTU grouping is a matrix (OTU table) of OTU features by the number of samples, where the number of features can vary dramatically across datasets due to stringency of grouping. Although methods that avoid OTU grouping have been described (Callahan et al., 2016), OTU tables remain common and are a practical starting point for most machine learning prediction methods. For additional discussion of levels of taxonomy, with intriguing thoughts about the interplay and use of molecular function descriptors vs. taxonomic descriptors, the reader is referred to Knights et al. (2011b) and Xu et al. (2014). However, many of the principles discussed here apply regardless of the feature type.

Several features of OTU tables present challenges. First, OTU tables are sparse, with a large proportion of zero counts (Hu et al., 2018). Investigators have often removed OTUs that were present in too few samples to be useful, or collapsed OTUs into the genus level, which is a simple form of "feature engineering" that we will explore further below. Second, the role of taxonomy in prediction is often unclear – similar sequences are often correlated across samples, which is a property that can be readily assessed directly without taxonomic knowledge. Third, as with many omics technologies, library sizes (essentially column sums of the OTU table) vary considerably, and normalization methods must be used to account for this variation (Weiss et al., 2017).

A number of excellent reviews have been published, covering experimental design and targeted amplicon vs. metagenomics profiling (Mallick et al., 2017), and a comprehensive overview of different experimental and interrogation methods and analyses (Knight et al., 2018). Other reviews have covered the remarkable recent advances in understanding the connections between, e.g., human gut microbiome populations and human health (Cani, 2018).

Recently, studies have begun to explore the power of machine learning to use microbiome patterns to predict host characteristics (Knights et al., 2011a; Moitinho-Silva et al., 2017). Existing studies often report disease-associated dysbiosis, a microbial imbalance inside the host, but such associations can have a wide range of interpretations. Individual studies have also suffered from small sample sizes, inconsistent findings, and a lack of standard processing and analysis methods (Duvallet et al., 2017). Prediction models have sometimes been difficult to generalize across studies (Pasolli et al., 2016). One approach to resolving these issues is to perform a meta-analysis, combining microbiome studies across common traits. Duvallet et al. (2017) performed a cross-disease meta-analysis of published case-control gut microbiome studies spanning 10 diseases. They found consistent patterns characterizing disease-associated microbiome changes and concluded that many associations found in case-control studies are likely not disease-specific but rather part of a non-specific, shared response to health and disease. Pasolli et al. (2016) also performed a meta-analysis of a collection of 2,424 publicly available samples from eight large-scale studies. The authors remarked that adding healthy (control) samples from other studies to training sets improved disease prediction capabilities. Nonetheless, any meta- or pooled analysis should rely on a solid foundation of effective per-study prediction. The use of multiple studies enabled Pasolli et al. (2016) to explore external validation of models across truly separate datasets. Such external validation can in principle result in more robust and generalizable prediction models than those that are validated internally only.

Sophisticated machine learning methods for microbiome analysis have been proposed with increasing frequency in recent years, including the use of deep neural networks (Ananthakrishnan et al., 2017) and the leveraging of methods for genomes and metagenomes (Rahman et al., 2018). However, the content knowledge required to implement these methods is high, presenting a barrier to data scientists looking to get started in microbiome analysis and prediction. Moreover, there are few resources for biologists with intermediate statistical and computing background to "jump in" to analysis of the important trait prediction problem. The target audience of this paper is those seeking a brief review and tutorial for trait prediction, and who will benefit from accessible code. After digesting these basic building blocks of analysis, the reader may move to more advanced topics, such as dynamic systems modeling (Brooks et al., 2017).

The remainder of this paper is written in several sections. Section 2 reviews the steps of data preparation before machine learning implementation. Section 3 provides a quick overview of the most commonly-used machine learning (ML) methods, as well as the most commonly used performance criteria. Experienced modelers can skip this section. Section 4 summarizes the scope of the relevant literature and describes several real datasets and the trait of interest. Section 5 provides results, and the underlying code forms a tutorial of machine learning methods applied in this context.

# 2. DATA PREPARATION

Many machine learning methods have difficulty with missing features, and so we assume the OTU table is complete. A minor fraction of missing data can often be effectively handled using simple imputation procedures, such as kNNimpute (Crookston and Finley, 2008), or even simpler methods, such as feature-median imputation. The methods described in this section, including imputation and normalization, must be performed without using the host trait information, because otherwise they might be biased by this information. Feature selection methods that use host trait information belong in the next section, as they must be included inside a cross-validation procedure.

# 2.1. Notation and Sampling Considerations

Let X be an m × n matrix of microbiome count data, where m is the number of OTU features and n is the number of samples. Let y be a vector of length n with the microbiome host trait. Commonly a trait will be a binary outcome (e.g., case/control status, coded 1/0), or a continuous trait, such as body mass index (BMI). Here our use of microbiome features as predictive of a trait does not imply or assume causality. We note that case/control study designs often involve oversampling of one type (often cases) relative to the general population. A prediction rule might explicitly use this information, for example by a simple application of Bayes' rule (Tibshirani et al., 2003), with prior probabilities reflecting those in the general population. Such sampling considerations are beyond our scope, and we refer the reader to Chawla (2009). Here we consider our sample dataset to be representative of the population of its intended downstream use.

# 2.2. Transformation and Normalization

Normalization is an essential process to ensure comparability of data across samples (Weiss et al., 2017), largely to account for the large variability in library sizes (total number of sequencing reads across different samples). The basic issues are similar to those encountered in expression sequence normalization (de Kok et al., 2005), but less is currently known about sources of potential bias to inform microbiome normalization. Normalization methods assessed by Weiss et al. (2017) included cumulative sum scaling, variance stabilization, and trimmed mean of M-values. Randolph et al. (2018) utilized the centered log-ratio (CLR) transform of the relative abundance vectors, based on a method developed by Aitchison (1982), replacing zeros with a small positive value. As part of their motivation, Randolph et al. (2018) pointed out that standard cumulative sum scaling places the normalized data vectors in a simplex, with potential consequences for kernel-based discovery methods.
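As a concrete illustration of the CLR transform mentioned above, the following minimal sketch converts one sample's counts to relative abundances and then centers their logarithms. The function name `clr_transform`, the toy counts, and the pseudocount of 0.5 are assumptions made for illustration; they are not taken from Randolph et al. (2018).

```python
import math

def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio (CLR) transform of one sample's OTU counts."""
    # Replace zeros with a small positive value so every log is defined.
    # The value 0.5 is an assumption for illustration only.
    adjusted = [c if c > 0 else pseudocount for c in counts]
    total = sum(adjusted)
    # Relative abundances: non-negative and summing to 1 (a point on the simplex).
    rel = [a / total for a in adjusted]
    logs = [math.log(r) for r in rel]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    # Subtracting the mean log centers the vector, so the CLR values sum to 0.
    return [v - mean_log for v in logs]

sample = [120, 0, 30, 850]  # hypothetical counts for four OTUs
clr = clr_transform(sample)
# Doubling the library size (and the pseudocount) leaves the CLR values unchanged.
deeper = clr_transform([c * 2 for c in sample], pseudocount=1.0)
```

Because the transform depends only on ratios, a sample sequenced twice as deeply yields the same CLR vector, which is the sense in which it accounts for library size.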

# 2.3. Taxonomy as Annotation

Taxonomy is the science of defining and naming groups of biological organisms on the basis of shared characteristics. In our context, taxonomy refers to the evolutionary relationship among the microbes represented by each OTU, from general to specific: kingdom, phylum, class, order, family, genus, species, and OTU (Oudah and Henschel, 2018). For example, Kostic et al. (2012) summarized their findings in the study of microbiota in colorectal cancer using genus- and phylum-level summaries, illustrating the importance of taxonomy in interpretation. Here we are highlighting the use of taxonomy in post-hoc interpretation of findings, providing important biological context. However, if the taxonomy is used in a supervised manner to improve prediction, it then becomes part of the formal machine learning procedure, as described in the next section.

## 3. REVIEW OF MACHINE LEARNING METHODS FOR PREDICTION

Machine learning deals with the creation and evaluation of algorithms to recognize, classify, and predict patterns from data (Tarca et al., 2007). Unsupervised methods identify patterns apparent in the data, but without the use of pre-defined labels (traits, in our context). These methods include (i) hierarchical clustering, which builds a hierarchy of clusters using a dendrogram, combining or splitting clusters based on a measure of dissimilarity between vectors of X; and (ii) k-means clustering, which involves partitioning the n vectors of X into k clusters in which each observation is classified to a cluster mean according to a distance metric. Unsupervised methods are important exploratory tools to examine the data and to determine important data structures and correlation patterns.

For the host trait prediction problem, we focus on supervised methods, in which labels (traits) of a dataset are known, and we wish to train a model to recognize feature characteristics associated with the trait. A primary difficulty in the problem is that the number of features (m rows) in the OTU table may greatly exceed the sample size n, so that over-fitting of complex models to the data is a concern.

# 3.1. Training and Cross-Validation

Training a model in supervised learning amounts to finding a parameter vector β that represents a rule for predicting a trait y from an m-vector x. This rule may take the form of a regression equation or other prediction rule. Prediction rules that use only a few features (n or fewer) are referred to as "sparse." A good prediction rule has high accuracy, as measured by quantities such as the area under the receiver operating characteristic curve, or the prediction correlation R, both described below. Many prediction methods proceed by minimizing an objective function $\mathrm{obj}(\beta) = L(\beta) + \Omega(\beta)$, which contains two parts: the raw training loss $L$ and a regularization term $\Omega$. The training loss measures how predictive the model is with respect to the data used to train the model, and the regularization term penalizes for the complexity of the model, which helps to avoid overfitting.

An essential component of machine learning is the use of cross-validation to evaluate prediction performance, and often to select tuning parameters that govern the complexity of the model. One round of k-fold cross-validation involves partitioning the n samples into k subsets of roughly equal size, using each subset in turn as the validation data for testing the algorithm, with the remaining samples as the training set. After a single round of cross-validation, each sample i has an associated predicted trait value $\hat{y}_i$, where the prediction rule was developed without any knowledge of the data from sample i (or at least without knowledge of $y_i$). The performance measure is computed by comparing the length-n $\hat{y}$ vector to the true y. To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds to give an estimate of the predictive performance. Although the term "cross-validation" formally refers to the use of each sample i as both part of the training set and as testing set (i.e., crossing) during a single round, the term is often used more generically. For example, researchers sometimes use a simple holdout method in which a fraction 1/k of the data are randomly selected as a test set, the remainder as training, and repeat the process randomly with enough rounds to provide a stable estimate of accuracy.
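The single round of k-fold cross-validation described above can be sketched as follows. This is a minimal illustration rather than the code used for the analyses in this paper: the helper names `kfold_indices` and `cross_validated_predictions`, the toy trait values, and the deliberately trivial "model" (the training-set mean) are all assumptions.

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition sample indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_validated_predictions(y, k, fit, predict, seed=0):
    """One round of k-fold CV: returns one held-out prediction per sample."""
    n = len(y)
    y_hat = [None] * n
    for fold in kfold_indices(n, k, seed):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        model = fit(train)  # the rule is built from training samples only
        for i in fold:
            y_hat[i] = predict(model, i)  # sample i played no part in fitting
    return y_hat

# Hypothetical continuous trait (e.g., BMI) for ten samples.
y = [21.0, 25.5, 30.1, 27.3, 22.8, 35.0, 19.4, 28.6, 24.2, 31.7]

# Toy "model": predict the mean trait of the training samples.
def fit_mean(train):
    return sum(y[i] for i in train) / len(train)

def predict_mean(model, i):
    return model

y_hat = cross_validated_predictions(y, k=5, fit=fit_mean, predict=predict_mean)
```

Repeating the call with different values of `seed` and averaging the resulting vectors corresponds to the multiple rounds described in the text. Any step that uses the trait, including supervised feature selection, would belong inside `fit`.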

# 3.2. Taxonomy and Structural Feature Extraction

Our Results section shows the results of prediction methods using all OTUs, as well as reduced-OTU selected or aggregated features. Several methods have been proposed to reduce the number of OTU features using correlation and taxonomy information, including Fizzy (Ditzler et al., 2015a), MetAML (Pasolli et al., 2016), and HFE (Oudah and Henschel, 2018). Aspects of the approaches are supervised and thus must be handled inside a cross-validation procedure.

For simplicity, here we focus on the hierarchical feature engineering (HFE) algorithm created by Oudah and Henschel (2018), which uses correlation and taxonomy information in X to exploit the underlying hierarchical structure of the feature space.

The HFE algorithm consists of four steps: (1) feature engineering: consider the relative abundances of higher taxonomic units as potential features by summing up the relative abundances of their respective children in a bottom-up tree traversal; (2) correlation-based filtering: calculate the correlation of values for each parent-child pair in the taxonomy hierarchy, and if the result is greater than a predefined threshold, then the child node is discarded; (3) information gain (IG) based filtering, reflecting association of features to the trait: construct all paths from the leaves (OTUs) to the root and for each path, calculate the IG of each node with respect to the trait values, and then calculate and use the average IG as a threshold to discard any node with a lower IG score; (4) IG-based leaf filtering: for OTUs with incomplete taxonomic information, discard any leaf with an IG score less than the global average IG score of the remaining nodes from the third phase. Steps (3) and (4) must be cross-validated, as they use the trait values. The Python code for implementation is on our site (https://sites.google.com/ncsu.edu/zhouslab/home/software?).
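To make the first two (unsupervised) steps concrete, the sketch below aggregates leaf abundances up a toy taxonomy and then applies the parent-child correlation filter. It is a simplified illustration, not the HFE implementation of Oudah and Henschel (2018): the taxonomy, the abundance values, the threshold of 0.7, and the function names are hypothetical, and the trait-dependent steps (3) and (4) are omitted.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def aggregate(leaf_abund, parent):
    """Step 1: each internal node receives the summed abundance of its descendants."""
    values = {name: list(vec) for name, vec in leaf_abund.items()}
    for leaf, vec in leaf_abund.items():
        node = parent.get(leaf)
        while node is not None:  # walk from the leaf up to the root
            acc = values.setdefault(node, [0.0] * len(vec))
            for s, x in enumerate(vec):
                acc[s] += x
            node = parent.get(node)
    return values

def correlation_filter(values, parent, threshold=0.7):
    """Step 2: discard a child that is highly correlated with its parent."""
    kept = set(values)
    for child, par in parent.items():
        if pearson(values[child], values[par]) > threshold:
            kept.discard(child)
    return kept

# Hypothetical taxonomy (child -> parent) and relative abundances in four samples.
parent = {"otu1": "genusA", "otu2": "genusA", "otu3": "genusB",
          "genusA": "familyF", "genusB": "familyF"}
leaf_abund = {"otu1": [0.10, 0.20, 0.30, 0.40],
              "otu2": [0.30, 0.05, 0.20, 0.02],
              "otu3": [0.40, 0.10, 0.30, 0.20]}

values = aggregate(leaf_abund, parent)
kept = correlation_filter(values, parent)
```

In this toy example `otu3` is the only child of `genusB`, so the two are perfectly correlated and the child is redundant; this is the kind of duplication the correlation filter is meant to remove.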

The result is a set of informative features, perhaps including original OTUs along with higher-level aggregations of taxonomic features, that can be utilized for downstream machine learning (Oudah and Henschel, 2018). Standard feature selection algorithms, Fizzy and MetAML, which do not capitalize on the hierarchical structure of features, were also tested by Oudah and Henschel (2018) using several machine learning methods on real datasets. Since HFE was reported to outperform other methods (Oudah and Henschel, 2018) and resulted in higher prediction performance overall, we apply it in the real data analysis section to extract OTU features before applying machine learning methods of trait prediction. Note that feature selection can in principle be performed inside a grand cross-validation and prediction loop, or performed prior to prediction, as we have done for convenience here.

# 3.3. Supervised Learning Methods Commonly Used in Trait Prediction

Here we list the learning methods most commonly used in microbiome host trait prediction. The list is not exhaustive, but reflects our review of the methods in common use. In particular, neural networks have received considerable recent attention, but it is difficult to find quantitative evidence for the additional predictive ability in comparison to other methods. For several of the methods, it is common to center and row-scale X prior to application of the method, so each feature is given similar "weight" in the analysis.

#### 3.3.1. Regression

The use of linear models enables simple fitting of continuous traits y as a function of feature vectors. However, if m ≥ n then structural overfitting occurs, and even if m < n accuracy is often improved by using penalized (regularized) models. For the model $y = X\beta + \epsilon$, the training loss is $\sum_i (y_i - \hat{y}_i)^2$. The most commonly used regularization methods are ridge regression (Hoerl and Kennard, 1970) and Lasso regression (Tibshirani, 1996), which respectively add the penalties $\lambda \sum_i \beta_i^2$ and $\lambda \sum_i |\beta_i|$ (not including the intercept) to the training loss. For binary class prediction, the approach is essentially the same, applying a generalized linear (logit) model, with the negative log-likelihood as the training loss. Here λ is a tuning parameter that can be optimized as part of cross-validation. Both methods provide "shrunken" coefficients, i.e., closer to zero than an ordinary least-squares approach. The results for Lasso are also sparse, with no more than n non-zero coefficients after optimization, and thus Lasso is also a feature-selection method. Another variant is the elastic net (Zou and Hastie, 2005), an intermediate version that linearly combines both penalties.
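The difference between the two penalties is easiest to see with a single centered feature and no intercept, where both minimizers have closed forms. The sketch below is a worked example under those simplifying assumptions; the function name and toy data are hypothetical, and practical fitting with many features would rely on a dedicated package.

```python
def ols_ridge_lasso(x, y, lam):
    """Minimizers of sum_i (y_i - x_i * b)^2 + penalty for a single feature.

    Penalties: none (OLS), lam * b^2 (ridge), lam * |b| (Lasso).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    ols = sxy / sxx
    # Ridge: setting the derivative to zero gives b = sxy / (sxx + lam).
    ridge = sxy / (sxx + lam)
    # Lasso: soft-thresholding; the coefficient is exactly 0 once lam/2 >= |sxy|.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    lasso = (1.0 if sxy > 0 else -1.0) * shrunk / sxx
    return ols, ridge, lasso

x = [-1.0, 0.0, 1.0]   # hypothetical centered feature values
y = [-0.9, 0.1, 1.1]   # hypothetical trait values

mild = ols_ridge_lasso(x, y, lam=1.0)
strong = ols_ridge_lasso(x, y, lam=5.0)
```

With the larger penalty the ridge coefficient is small but non-zero, whereas the Lasso coefficient is exactly zero, which is why Lasso also acts as a feature selector.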

#### 3.3.2. Linear Discriminant Analysis (LDA)

For binary traits, this approach finds a linear combination of OTUs in the training data that models the multivariate mean differences between classes (Lachenbruch and Goldstein, 1979). Classical LDA assumes that feature data arise from two different multivariate normal densities according to y = 0 and y = 1, i.e., $\mathrm{MVN}(\mu_0, \Sigma)$ and $\mathrm{MVN}(\mu_1, \Sigma)$ (**Figure 1A**). The prediction value is the estimate of the posterior mean $E(Y \mid x) = \Pr(Y = 1 \mid x)$, used because it minimizes mean-squared error.
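For intuition, the posterior probability under the LDA model can be written out explicitly when there is a single feature, so that the shared covariance reduces to a single pooled variance. The following sketch makes that one-dimensional simplification; the function names and toy data are assumptions for illustration only.

```python
import math

def fit_lda_1d(x, y):
    """Estimate class means, pooled variance, and prior Pr(Y = 1) from training data."""
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    mu0, mu1 = sum(x0) / len(x0), sum(x1) / len(x1)
    ss = sum((v - mu0) ** 2 for v in x0) + sum((v - mu1) ** 2 for v in x1)
    pooled_var = ss / (len(x) - 2)  # shared variance across both classes
    prior1 = len(x1) / len(x)
    return mu0, mu1, pooled_var, prior1

def lda_posterior_1d(x_new, mu0, mu1, var, prior1):
    """Pr(Y = 1 | x) for two normal classes with a shared variance."""
    # The log posterior odds are linear in x, which makes the rule a linear classifier.
    slope = (mu1 - mu0) / var
    intercept = -(mu1 ** 2 - mu0 ** 2) / (2.0 * var) + math.log(prior1 / (1.0 - prior1))
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x_new)))

# Hypothetical relative abundance of one OTU in controls (0) and cases (1).
x = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
y = [0, 0, 0, 1, 1, 1]

params = fit_lda_1d(x, y)
p_mid = lda_posterior_1d(0.45, *params)   # halfway between the class means
p_case = lda_posterior_1d(0.7, *params)
p_ctrl = lda_posterior_1d(0.2, *params)
```

With equal class sizes, a new sample halfway between the two class means receives a posterior probability of one half, and the probability moves toward 0 or 1 as the sample approaches either class mean.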

#### 3.3.3. Support Vector Machines (SVM)

This is another approach in the linear classifier category (**Figure 1A**), but in contrast to LDA may be considered nonparametric. In SVM, the goal is to find the hyperplane in a high-dimensional space that represents the largest margin between any two instances (support vectors) of two classes of training data points, or that maximizes a related function if they cannot be separated. Non-linear versions of SVM are devised using a so-called kernel similarity function (Cortes and Vapnik, 1995).

#### 3.3.4. Similarity Matrices and Related Kernel Methods

Some applications of microbiome association testing have compared similarity matrices across features to similarity of traits (Zhao and Shojaie, 2016). A closely related approach is to first compute principal component (PC) scores, which may be obtained from OTU sample-sample correlation matrices (Zhou et al., 2018), and to use these PC scores as trait predictors. Kernel-penalized regression, an extension of PCA, was utilized by Randolph et al. (2018) in their microbiome data analysis. They applied a significance test for their graph-constrained estimation method, called Grace (Zhao and Shojaie, 2016), to test for association between microbiome species and their trait. However, trait prediction is not available in their software.

#### 3.3.5. k-Nearest Neighbors (k-NN)

Training samples are vectors in a multi-dimensional space, each with a class label or continuous trait value. For discrete traits, a test sample is assigned the label which is most frequent among the k training samples nearest to that point (**Figure 1B**). Euclidean distance or correlation coefficients are the most commonly used distance metrics. For continuous traits, an average of the trait values of the k nearest neighbors is used, sometimes weighted (e.g., by the inverse of their distance from the new point).
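A minimal sketch of the k-NN rule with Euclidean distance and equal neighbor weights is shown below. The function name `knn_predict` and the two-feature toy data are assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=5, classify=True):
    """Predict from the k training samples closest to `query` (Euclidean distance)."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = [train_y[i] for i in order[:k]]
    if classify:
        # Discrete trait: most frequent label among the k neighbors.
        return Counter(nearest).most_common(1)[0][0]
    # Continuous trait: unweighted average of the neighbors' trait values.
    return sum(nearest) / len(nearest)

# Hypothetical relative abundances of two OTUs in six training samples.
train_x = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
           (0.80, 0.90), (0.90, 0.80), (0.85, 0.75)]
labels = [0, 0, 0, 1, 1, 1]                      # case/control status
bmi = [21.0, 22.0, 23.0, 30.0, 31.0, 32.0]       # continuous trait

label_a = knn_predict(train_x, labels, (0.2, 0.2), k=3)
label_b = knn_predict(train_x, labels, (0.8, 0.8), k=3)
bmi_a = knn_predict(train_x, bmi, (0.2, 0.2), k=3, classify=False)
```

An odd k avoids tied votes for binary traits. Inverse-distance weighting would replace the plain average in the final line of the function.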

#### 3.3.6. Random Forests

Random forests (Breiman, 2001) are an increasingly used method, extensively applied in many different fields, including computational biology and genomics (Statnikov et al., 2013).

The building block of a "forest" is a decision tree, which uses features and associated threshold values to successively split the samples into groups that have similar y values. This process is repeated until the total number of specified nodes is reached. An ensemble of decision trees (or regression trees for continuous y) is built by performing bootstrapping on the dataset and averaging or taking the modal prediction from trees (a process known as "bagging") (**Figure 1C**), with subsampling of features used to reduce generalization error (Ho, 1995). An ancillary outcome of the bootstrapping procedure is that the data not sampled in each bootstrap (called "out of bag") can be used to estimate generalization error, as an alternative to cross-validation.
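The "out of bag" idea can be illustrated without fitting any trees. The sketch below draws bootstrap samples of the sample indices and records which samples were never drawn; the helper name `bootstrap_split`, the sample size, and the number of repetitions are assumptions for illustration.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; return the in-bag draws and out-of-bag indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

rng = random.Random(1)
n = 200  # hypothetical number of samples
oob_fractions = []
for _ in range(500):  # one bootstrap per tree in a hypothetical forest
    in_bag, oob = bootstrap_split(n, rng)
    oob_fractions.append(len(oob) / n)

mean_oob = sum(oob_fractions) / len(oob_fractions)
```

On average a little over one third of the samples (approximately 1/e) are left out of each bootstrap, so each tree can be evaluated on samples it never saw during training.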

#### 3.3.7. Gradient Boosting

Gradient boosting for decision trees refers to a process of ensemble modeling by averaging predictions over decision trees (learners) of fixed size (Friedman, 2001). As with other forms of boosting, the process successively computes weights for the individual learners in order to improve performance for the poorly-predicted samples. Following observations that boosting can be interpreted as a form of gradient descent on a loss function (such as $\sum_i (y_i - \hat{y}_i)^2$), gradient tree boosting successively fits decision trees on quantities known as "pseudo-residuals" (Friedman, 2002) for the loss function (**Figure 1C**).
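For squared-error loss the pseudo-residuals are proportional to the ordinary residuals, so gradient tree boosting can be sketched with very little code. The example below uses depth-one trees ("stumps") on a single feature; the function names, learning rate, and toy data are assumptions for illustration and do not correspond to the settings of any particular package.

```python
def fit_stump(x, r):
    """Least-squares regression stump: one threshold on x, one mean on each side."""
    best = None
    for t in sorted(set(x))[:-1]:  # candidate thresholds leave both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi <= t else mr

def gradient_boost(x, y, rounds=10, rate=0.5):
    """Fit stumps successively to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        # Pseudo-residuals for squared-error loss: current residuals y - prediction.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + rate * sum(s(xi) for s in stumps)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Hypothetical relative abundance of one OTU and a continuous trait.
x = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30]
y = [20.0, 21.0, 24.0, 28.0, 31.0, 33.0]

mean_y = sum(y) / len(y)
baseline = lambda xi: mean_y
one_round = gradient_boost(x, y, rounds=1)
ten_rounds = gradient_boost(x, y, rounds=10)
```

Each additional round reduces the training error here, which is also why the number of rounds and the learning rate are usually tuned by cross-validation rather than on the training data.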

#### 3.3.8. Neural Networks

Neural networks refer to an interconnected feed-forward network of nodes ("neurons") with weights attached to each edge in the network, which allows the network to form a mapping between the inputs X and the outcomes y (Ditzler et al., 2015a). Each neuron j receiving an input $p_j(t)$ from predecessor neurons consists of the following components: an activation $a_j(t)$, a threshold $\theta_j$, an activation function $f$ that computes the new activation at a given time $t + 1$, and an output function $f_{\mathrm{out}}$ computing the output from the activation. These networks contain either one or many hidden layers, depending on the network type (**Figure 1D**). For microbiome data, the input layer is the set of OTUs, with separate neurons for each OTU. Hidden layers use backpropagation to optimize the weights of the input variables in order to improve the predictive power of the model. The total number of hidden layers and number of neurons within each hidden layer are specified by the user. All neurons from the input layer are connected to all neurons in the first hidden layer, with weights representing each connection. This process continues until the last hidden layer is connected to the output layer. A bias term is also added in each step, which can be thought of as analogous to the intercept of a linear model. The output layer consists of predictions based on the data from the input and hidden layers. In most cases, having just one hidden layer with one neuron is reasonable to fit the model.

TABLE 1 | Review of published prediction accuracy comparisons.
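The simplest network described in Section 3.3.8, a single hidden layer containing one neuron, has a forward pass that fits in a few lines. The sketch below assumes logistic activation functions and uses hypothetical weights and inputs; in practice the weights would be learned by backpropagation, which is not shown.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: OTU inputs -> one hidden neuron -> one output probability."""
    # Hidden neuron: weighted sum of all inputs plus a bias, then activation.
    hidden = sigmoid(b_hidden + sum(w * xi for w, xi in zip(w_hidden, x)))
    # Output neuron: weighted hidden activation plus a bias, then activation.
    return sigmoid(b_out + w_out * hidden)

x = [0.2, 0.0, 0.5]            # hypothetical relative abundances of three OTUs
w_hidden = [1.5, -2.0, 0.8]    # hypothetical input-to-hidden weights
p = forward(x, w_hidden, b_hidden=-0.3, w_out=2.0, b_out=-1.0)
p_more = forward([0.9, 0.0, 0.5], w_hidden, b_hidden=-0.3, w_out=2.0, b_out=-1.0)
p_null = forward(x, [0.0, 0.0, 0.0], b_hidden=0.0, w_out=0.0, b_out=0.0)
```

The bias terms play the role of intercepts. With all weights and biases set to zero the output is 0.5 regardless of the input, the uninformative starting point from which training moves away.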

# 3.4. Measures of Prediction Accuracy: The AUC and Prediction R

For predictions $\hat{y}$ of binary traits, the receiver operating characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, or a probability of detection. The area under the ROC curve (AUC) is the most common measure of prediction accuracy for binary traits, and ranges from 0.5 (no better than chance) to 1.0 (perfect discrimination). In practice, the empirical AUC can be <0.5, in which case we conclude that the prediction procedure has no value. Note that the AUC is invariant to monotone transformations of $\hat{y}$.
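The empirical AUC equals the proportion of case-control pairs in which the case receives the higher predicted value (ties counted as one half), which gives a direct way to compute it. The sketch below uses that pairwise form; the function name and toy predictions are assumptions for illustration.

```python
def auc(y_true, scores):
    """Empirical AUC: fraction of (case, control) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]   # hypothetical predicted probabilities

auc_raw = auc(y_true, scores)
# A monotone increasing transformation of the predictions preserves every ranking.
auc_cubed = auc(y_true, [s ** 3 for s in scores])
# Reversing the predictions mirrors the AUC around 0.5.
auc_flipped = auc(y_true, [-s for s in scores])
```

Eight of the nine case-control pairs are ordered correctly in this example, and cubing the scores leaves the result unchanged because the AUC depends only on ranks.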

The prediction Pearson correlation (R) between cross-validated predicted and actual y values is a commonly used standard of accuracy for continuous traits, although many procedures are designed to minimize the mean-squared prediction error $\sum_i (y_i - \hat{y}_i)^2$. R ≤ 0 corresponds to no predictive value, and R = 1 to perfect prediction. We advocate R as a criterion because it is simple and applicable to many prediction procedures. Some prediction procedures may have an offset or proportional bias in prediction that may harm the mean-squared error, even if R is favorable. A post-hoc linear rescaling of the prediction to "fix" any such bias is straightforward, and we find it simplest to directly use R for comparison.
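The point about offset and proportional bias can be checked numerically: a linear distortion of the predictions with positive slope leaves R unchanged but can inflate the mean-squared error substantially. The following sketch uses hypothetical trait values and predictions.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

y = [20.0, 25.0, 30.0, 35.0]         # hypothetical true trait values
y_hat = [22.0, 24.0, 29.0, 33.0]     # hypothetical cross-validated predictions
# The same predictions with a proportional bias (x2) and an offset (+10).
biased = [2.0 * v + 10.0 for v in y_hat]

r_plain = pearson_r(y, y_hat)
r_biased = pearson_r(y, biased)
mse_plain = mean_squared_error(y, y_hat)
mse_biased = mean_squared_error(y, biased)
```

Both sets of predictions rank and space the samples identically, so R agrees, whereas the mean-squared error of the biased predictions is far larger.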

In the real data analyses below, the predicted $\hat{y}$ represent average predictions over all cross-validation rounds, so the AUC and R values were computed directly on the resulting predictions. Importantly, the use of cross-validation provides for each dataset a measure of actual performance of a prediction method, without relying on theoretical considerations, simulations, or restrictive assumptions that may not be applicable with real data.

# 4. DATA USED FOR COMPARISONS

# 4.1. A Literature Review

We conducted a literature review of published host-trait microbiome prediction studies that used cross-validation and reported a measure of prediction accuracy. A full table appears in the **Supplement**, including links to each of the 18 studies with 54 reported datasets represented. As different studies used vastly different protocols for OTU generation and preprocessing, for this main paper we focused on the 17 reported datasets that compared at least two competing measures of prediction accuracy. All of the datasets used human hosts, except for Rousk et al. (2010) (where pH in soil samples was the "trait") and Moitinho-Silva et al. (2017), where microbial abundance in sponges was the trait.

# 4.2. Analyses of Data Using Competing Methods

In addition, we evaluated the supervised learning methods ourselves using datasets from MicrobiomeHD (https://github.com/cduvallet/microbiomeHD), a standardized database of human gut microbiome studies in health and disease. This database includes publicly available 16S rRNA data from published case-control and other studies and their associated patient metadata. The MicrobiomeHD database and original publications for each of these datasets are described in Duvallet et al. (2017). Raw sequencing data for each study was downloaded and processed through a standardized pipeline.

For our analyses, we analyzed four traits (three binary and one continuous) from three datasets with varying sample sizes and initial numbers of OTUs: (1) The Singh et al. (2015) dataset, containing 201 EDD (enteric diarrheal disease) cases vs. 82 healthy controls with 1,325 OTUs. (2) The Vincent et al. (2013) dataset, with 25 CDI (Clostridium difficile infection) cases vs. 25 healthy controls and 763 OTUs. (3a) The Goodrich et al. (2014) dataset, which categorized the hosts into 135 obese cases vs. 279 controls, based on body mass index (BMI), with a total of 11,225 OTUs. In this dataset, individuals came from the TwinsUK population, so we included only one individual from each twin-pair. (3b) The same Goodrich et al. (2014) dataset, but using BMI directly as a continuous phenotype for the same 414 individuals. The microbiome samples for each dataset were obtained from stool, and we analyzed one sample per individual throughout.

Following the filtering recommendations applied by Duvallet et al. (2017), we removed samples with fewer than 100 reads and OTUs with fewer than 10 reads. We also removed OTUs which were present in <1% of samples from the Vincent et al. (2013), Ross et al. (2015), and Singh et al. (2015) datasets, and in <5% of samples from the Goodrich et al. (2014) dataset, since it contained many more OTUs. Then we scaled the datasets by calculating the relative abundance of each OTU, dividing its value by the total reads per sample.
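The filtering and scaling steps above can be sketched as follows. This is an illustration rather than the pipeline used for the analyses: the function name, the toy table, the order in which the filters are applied, and the choice to compute relative abundances from the reads remaining after filtering are all assumptions.

```python
def filter_and_scale(table, min_sample_reads=100, min_otu_reads=10,
                     min_prevalence=0.01):
    """table maps each OTU name to a list of counts, one per sample."""
    n = len(next(iter(table.values())))
    depth = [sum(row[j] for row in table.values()) for j in range(n)]
    # Remove samples with fewer than min_sample_reads reads.
    keep_samples = [j for j in range(n) if depth[j] >= min_sample_reads]
    filtered = {}
    for otu, row in table.items():
        counts = [row[j] for j in keep_samples]
        prevalence = sum(c > 0 for c in counts) / len(counts)
        # Remove OTUs with too few reads or present in too few samples.
        if sum(counts) >= min_otu_reads and prevalence >= min_prevalence:
            filtered[otu] = counts
    # Relative abundance: divide by the per-sample total of the retained reads
    # (assumes every retained sample still has at least one read).
    totals = [sum(row[j] for row in filtered.values())
              for j in range(len(keep_samples))]
    rel = {otu: [c / t for c, t in zip(row, totals)]
           for otu, row in filtered.items()}
    return rel, keep_samples

# Hypothetical OTU table with four OTUs and four samples.
table = {"otuA": [60, 5, 300, 40],
         "otuB": [50, 3, 100, 70],
         "otuC": [0, 1, 4, 0],
         "otuD": [10, 0, 96, 90]}

rel, kept_samples = filter_and_scale(table)
column_sums = [sum(rel[o][j] for o in rel) for j in range(len(kept_samples))]
```

In this toy table the second sample is dropped for low depth and `otuC` for having too few reads, after which each remaining column sums to 1. None of these steps uses the host trait, so they can be performed once before cross-validation.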

In our primary analysis, we tested the relative abundances of the microbiome data at the OTU level. We also ran analyses in which OTUs were collapsed to the genus level by summing their respective relative abundances, discarding any OTUs which were un-annotated at the genus level. Finally, we ran the hierarchical feature engineering (HFE) algorithm introduced by Oudah and Henschel (2018), which results in a smaller set of informative features, including individual OTUs and aggregated elements of the taxonomy.

We performed 100 rounds of 5-fold cross-validation for each supervised method, using different random splits for each round. For binary traits, the estimated group probability $\hat{P}(Y = 1 \mid X)$ was used to estimate the group assignment. These estimates were further averaged over the cross-validation rounds. Performance was evaluated using the AUC. For continuous traits, the direct estimate $\hat{y}$ was used, averaged over cross-validations, with performance criterion R.

R code for the comparisons is available at https://sites.google.com/ncsu.edu/zhouslab/home/software?, and here we list the packages and settings used. Five-fold cross-validation was used throughout, and we additionally checked for plausibility. For example, the out-of-bag accuracy estimates from the random forest procedure were compared to our cross-validated estimates and shown to match closely. All machine learning methods were used for each dataset as applicable (for example, LDA was applicable only for the discrete trait datasets). All predictions used probability estimates for the discrete traits. The random forest method used randomForest with ntree=500, mtry=sqrt(ncol(X)). The gradient boosting (Gboost) decision-tree approach used xgboost, with nrounds=10 and objective="binary:logistic" for the discrete trait. For the decision tree method, aspects such as tree depth used default values. The Lasso, Ridge, and Elastic Net approaches used the package and method glmnet, with lambda=seq(0,1,by=0.1). The k-NN approach used caret with k = 5 and default (equal) neighbor weighting. The neural net used neuralnet with hidden=1, linear.output=F. Linear discriminant analysis used the lda package with tol=0.

## 5. RESULTS

**Table 1** shows the comparative results of 17 datasets analyzed with numerous prediction methods. The results for discrete traits were presented as AUC, accuracy, or balanced accuracy, but in all instances higher values reflect better performance. Although not all methods were represented in each study, some general conclusions can be made. When random forests were applied, they were either the most accurate or competitive (with the exception of Nakano et al., 2018). Various forms of neural networks often performed well, although there is some question whether the tuning complexity is warranted. An exception is Rousk et al. (2010) as analyzed by Ditzler et al. (2015b), in which some neural networks (perceptrons) performed especially well, but the sample size was small (n = 22). In the datasets analyzed by Ditzler et al. (2015b), the complexity and number of nodes in neural networks showed little consistent relationship to performance. Most of the studies used some form of higher-level OTU aggregation, sometimes as high as the phylum level.

For the three discrete traits, we plotted one ROC curve from each machine learning method (**Figures 2A–C**). The size of each dataset (number of cases/controls × number of OTUs) is shown in the title. Random forest (RTF) and gradient boosted trees (Gboost) performed well (AUC >0.85) in predicting cases and controls in the Singh and Vincent datasets. Lasso, ridge, elastic net (Enet), k-nearest neighbors (k-NN), and neural networks (Neural) performed well in the Singh dataset only. Generally, linear SVM and LDA performed less well, and SVM demonstrated close to chance performance in the Vincent dataset.

Summarizing the results after using BMI as a continuous trait in the Goodrich dataset, the bar graph (**Figure 2D**) shows the average Pearson correlation between the predicted and actual BMI after 100 iterations of each method. Here again the two decision tree models performed best, although all correlations R were <0.4.

Performance was generally poor for the Goodrich dataset, which also included a large number of OTUs, which presents a challenge in feature selection. We computed the ROC curves for each dataset after collapsing the OTUs to the genus level (**Figure 3**) and after applying the HFE method to select a subset of informative features (**Figure 4**). Then we compared the AUCs between the datasets which used all OTUs and those that used only HFE-informative features (**Figure 5**).

As an overall summary, collapsing to the genus level brought some improvement to the poorer-performing prediction methods in the Singh et al. (2015) dataset, and few other broad patterns were apparent. In contrast, the use of cross-validated HFE produced a great improvement in AUC in most instances (**Figure 4**). For the Goodrich et al. (2014) and Singh et al. (2015) datasets, most methods were improved and brought to similar AUC values. For the Vincent dataset, again most prediction methods were improved by HFE feature reduction, but the results were less uniform. Another pattern that is apparent in the scatterplots, perhaps expected, is that HFE brought diminishing returns for methods that already perform well. The one prediction method that was not improved demonstrably by HFE was k-NN (with k = 5).

# 6. DISCUSSION

We have presented a tutorial overview of the most commonly used machine learning prediction methods in microbiome host trait prediction. Although a large number of approaches have been used in the literature, some relatively simple and clear conclusions can be made. Decision tree methods tended to perform well, and in the published literature similar results were achieved by neural networks and their variants. In our analysis, the HFE OTU feature reduction method brought a substantial performance improvement for nearly all methods. In addition, after such feature reduction most methods performed more similarly. We conclude that this finding accords with the fact that the distinction between sparse and non-sparse methods is less dramatic after feature reduction. We hope that the tutorial, review, and available code are useful to practitioners for host trait prediction.


For more advanced topics, we point the reader to analysis of microbiome time series data, using techniques such as MDSINE (Bucci et al., 2016), which uses dynamical systems inference to estimate and forecast trajectories of microbiome subpopulations. Other uses of dynamical systems have concentrated mainly on observable phenotypes/experimental conditions, rather than using microbiome status for prediction (Brooks et al., 2017). In addition, the use of co-measured features, such as metabolites (Franzosa et al., 2019), offers potentially useful information for integrative analyses. As another example of the use of ancillary information, an intriguing approach has also been used to predict biotransformation of specific drugs and xenobiotics by gut bacterial enzymes (Sharma et al., 2017). We also note that our review/tutorial has for clarity placed feature engineering, which may be viewed as a form of statistical regularization, as a separately handled issue from the penalized prediction modeling. Some modern sparse regression and kernel modeling methods seek additional predictive ability by combining feature regularization and prediction in a single step, e.g., Xiao et al. (2018).

# AUTHOR CONTRIBUTIONS

Y-HZ is the leader of this review study. Her contribution includes writing the manuscript, designing the data analysis, summarizing the result, and software management. PG is responsible for the manuscript writing, implementation of analysis, results summary, and code summary.

## FUNDING

This work was supported by the NC State Game-changing Research Initiative Program and CFF KNOWLE18XX0.

## ACKNOWLEDGMENTS

Thanks to Mr. Chris Smith for the IT support in Bioinformatics Research Center.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00579/full#supplementary-material

Supplementary Table 1 | Full table of published prediction accuracies.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor and reviewer HM declared their involvement as co-editors in the Research Topic, and confirm the absence of any other collaboration.

Copyright © 2019 Zhou and Gallins. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Multitable Methods for Microbiome Data Integration

*Kris Sankaran1\* and Susan P. Holmes2*

*1 Mila, Université de Montréal, Montréal, QC, Canada, 2 Department of Statistics, Stanford University, Stanford, CA, United States*

The simultaneous study of multiple measurement types is a frequently encountered problem in practical data analysis. It is especially common in microbiome research, where several sources of data (for example, 16s-rRNA, metagenomic, metabolomic, or transcriptomic data) can be collected on the same physical samples. There has been a proliferation of proposals for analyzing such multitable microbiome data, as is often the case when new data sources become more readily available, facilitating inquiry into new types of scientific questions. However, stepping back from the rush for new methods for multitable analysis in the microbiome literature, it is worthwhile to recognize the broader landscape of multitable methods, as they have been relevant in problem domains ranging across economics, robotics, genomics, chemometrics, and neuroscience. In different contexts, these techniques are called data integration, multi-omic, and multitask methods, for example. Of course, there is no unique optimal algorithm to use across domains: different instances of the multitable problem possess specific structure or variation that is worth incorporating in methodology. Our purpose here is not to develop new algorithms, but rather to 1) distill relevant themes across different analysis approaches and 2) provide concrete workflows for approaching analysis, as a function of ultimate analysis goals and data characteristics (heterogeneity, dimensionality, sparsity). Towards the second goal, we have made code for all analysis and figures available online at https://github.com/krisrs1128/multitable\_review.

#### *Edited by:*

*Lingling An, University of Arizona, United States*

#### *Reviewed by:*

*Kui Zhang, Michigan Technological University, United States Jing Ma, Fred Hutchinson Cancer Research Center, United States*

#### *\*Correspondence:*

*Kris Sankaran kris.sankaran@umontreal.ca*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

> *Received: 14 October 2018 Accepted: 17 June 2019 Published: 28 August 2019*

#### *Citation:*

*Sankaran K and Holmes SP (2019) Multitable Methods for Microbiome Data Integration. Front. Genet. 10:627. doi: 10.3389/fgene.2019.00627*

Keywords: microbiome, data integration, multiomics, dimensionality reduction, heterogeneity

Most methods in statistics expect data to be available as a single table. To a researcher confronted with multiple sources of data, it might therefore seem most natural to either analyze each source separately, one at a time, or else combine all data into a single, unified table. However, neither of these approaches is entirely satisfactory. First, many scientific problems can only be answered by collecting several complementary measurement types. Indeed, the situation is analogous to using many types of sensors to study a single system from many perspectives. Further, while in certain supervised problems, it is enough to predict a single measurement of interest, with other sources collected primarily to provide better features, there are often additional relational components to the analysis: how do different types of measurements co-vary with one another? Here, it is of interest to provide a representation of the data that facilitates comparisons across tables, rather than just comparing each table with a single response of interest. This richer scientific question motivates the development of methods distinct from those used to analyze a single measurement type at a time.

For more concrete motivation, we consider data from the WELL-China study, which is focused on the relationships between various indicators of wellness (Min et al., 2019). In this study, 1,969 individuals<sup>1</sup> underwent clinical examinations, filled out wellness surveys (covering topics such as exercise, sleep, diet, and mental health), and provided stool samples, used for 16s-rRNA sequencing and metabolomic analysis. To date, 16s-rRNA sequencing data are available for 221 of these participants. Evidently, various interesting relational questions can be investigated using this data source.

For the purpose of illustration, we focus on one relatively narrow question that can be addressed using these data: How is the distribution of lean and fat mass across the body related to patterns of microbial abundance? The measurement types most relevant in this analysis are DEXA scans and 16s-rRNA sequencing abundances. DEXA scans use relative X-ray absorption to gauge the amount of lean and fat body mass within a region of the body being scanned. We have access to these lean and fat body mass measurements at several body sites—arms, legs, trunk, etc.—along with related body type variables, like height, age, and android and gynoid fat measurements. In total, there are 36 of these variables. 16s-rRNA sequencing is a technology for gauging the abundance of different bacterial species in the gut by counting the alignments of reads to the 16s-rRNA gene, a component of all bacterial genomes with enough variation to allow discrimination between different individual species. We have counts associated with 2,565 species across 181 genera, though the vast majority are present in low abundances.

This question of the relationship between lean and fat mass distribution (informally, "body type") and the microbiome is motivated by findings that certain taxonomic groups are over- or underrepresented as a function of an individual's body mass index (BMI) (Ley et al., 2005; Ley et al., 2006; Turnbaugh et al., 2009; Ley, 2010). Further, since the distribution of fat is often more related to underlying biological mechanisms than overall body mass (Matsuzawa, 2008), and since this distribution is mediated by specific metabolic pathways, there is reason to suspect that a joint analysis of DEXA and 16s-rRNA microbial abundance data might yield a more complete view of the relationship between the microbiome and body type.

We use this motivating dataset in the examples that follow. Additional numerical examples, for methods only discussed abstractly in this review, are available in the github repository associated with this paper.

#### CLASSICAL MULTIVARIATE METHODS

Methods from classical multivariate statistics are a mainstay of single-table microbiome data analysis, so it is natural to revisit them before surveying extensions to the multitable setting. Here, we explore a few of the classically studied multitable methods that fit nicely into the modern microbiome data analysis toolbox. We first describe a naive approach based on Principal Components Analysis (PCA)—naive because it lifts a single-table method to the multiple table setting without any special considerations—before studying approaches that directly characterize covariation across several tables: Canonical Correlation Analysis (CCA), Multiple Factor Analysis (MFA), and Principal Component Analysis with Instrumental Variables (PCA-IV).

The earliest multitable method (CCA) was published in 1936, motivated by the problem of relating prices of groups of commodities (Hotelling, 1936). There are two notable aspects of data analysis in this classical paradigm that no longer hold in modern statistics,


These changes have driven the development of high-dimensional methods and facilitated the adoption of iterative, more computationally intensive approaches.

Nonetheless, it is worth reviewing these original approaches, both to understand the context for many modern techniques and to have an easy starting point for practical data analysis. Indeed, these more established methods tend to be the most readily available through statistical computing packages and can provide a benchmark with which to compare more elaborate, modern methods.

#### PCA

The simplest approach to dealing with multiple tables is to combine them into one and apply a single-table method, for example, PCA. That is, write

$$X = \left[ X^{(1)} \,|\, \dots \,|\, X^{(L)} \right] \in \mathbb{R}^{n \times p},$$

where $p = \sum_{l=1}^{L} p_l$, and compute the SVD $X = UDV^T$. The $K$ principal component directions are the first $K$ columns $v_1, \dots, v_K$ of $V$, while the associated scores are the reweighted columns $d_1 u_1, \dots, d_K u_K$ of $U$. We call this method concatenated PCA.
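As a minimal sketch of the concatenated PCA just described (not the code from the repository accompanying this paper), the steps can be written with NumPy. The function name `concatenated_pca`, the column centering, and the random stand-in tables `body` and `microbe` are assumptions made for illustration.

```python
import numpy as np

def concatenated_pca(tables, K=2):
    """PCA on the column-wise concatenation of tables that share the same rows."""
    X = np.hstack(tables)                 # n x p, with p the total number of columns
    X = X - X.mean(axis=0)                # center each column (an assumption of this sketch)
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    directions = Vt[:K].T                 # first K columns of V
    scores = U[:, :K] * d[:K]             # columns d_k * u_k
    var_explained = d[:K] ** 2 / np.sum(d ** 2)
    return directions, scores, var_explained

# Hypothetical stand-ins for two tables measured on the same 50 samples.
rng = np.random.default_rng(0)
body = rng.normal(size=(50, 4))
microbe = rng.normal(size=(50, 7))
directions, scores, var_explained = concatenated_pca([body, microbe], K=2)
```

The scores returned here equal the projections of the centered, concatenated table onto the directions, which is why either quantity can be used for plotting samples.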

While this does not account for the multitable structure of the data, it does accomplish two goals:


However, two drawbacks of this approach are worth noting:

• It does not provide a summary of the relationship between the sets of variables defining the tables; it can only relate pairs of variables.

• If some tables have many more variables than others, they can dominate the resulting ordination.

<sup>1</sup>Though sampling is still ongoing.

These limitations are addressed by CCA and MFA, discussed in sections CCA and MFA, respectively.

We provide one geometric and one statistical motivation for PCA. The geometric motivation is that, if each row *xi* of *X* is viewed as a point in *p*-dimensional space, then the principal component directions provide the best *K*-dimensional approximation to the data. The second interpretation is that PCA finds a low-dimensional representation of the *xi* such that the resulting points have maximal variance. Qualitatively, this is a desirable property, because it means that the simpler representation preserves most of the variation present in the original data.

PCA is a very widely used technique, and some standard references include Mardia et al. (1980), Friedman et al. (2001), and Pagés (2014). Nonetheless, it is not ideal in the multitable setting.

#### Example

**Figure 1** illustrates this approach on body composition and bacterial abundance data from the WELL-China study. Note that we have subsetted to only women, since men and women have very different body compositions, and we have slightly more data for women. Further, the 16s-rRNA data have been variance stabilized according to the methodology proposed in Anders and Huber (2010) and filtered to only those species that have count ≥5 in at least 7% of samples.
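The prevalence filter mentioned above can be sketched as follows. This is an illustrative reading only: it assumes the threshold is applied to a samples-by-species count matrix, and the function name `prevalence_filter` and the toy counts are hypothetical. The variance-stabilizing transformation is not shown.

```python
import numpy as np

def prevalence_filter(counts, min_count=5, min_prevalence=0.07):
    """Boolean mask of species (columns) with count >= min_count
    in at least a fraction min_prevalence of samples (rows)."""
    counts = np.asarray(counts)
    prevalence = (counts >= min_count).mean(axis=0)
    return prevalence >= min_prevalence

# Toy example: 4 samples by 4 species, with a 50% prevalence threshold.
counts = np.array([
    [0, 12, 5, 0],
    [0,  7, 0, 0],
    [1,  9, 6, 0],
    [0,  0, 0, 4],
])
keep = prevalence_filter(counts, min_count=5, min_prevalence=0.5)
filtered = counts[:, keep]
```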

The left panel of **Figure 1** displays the loadings associated with this concatenated PCA approach, where body composition (36 columns) and 16s-rRNA abundances (372 columns) were combined into one dataset (408 columns). Columns associated with bacterial species are displayed as points, shaded by taxonomic family, while columns associated with body composition variables are labeled with text. Note that the fraction of variance explained by each axis is on the order of a few percent. This is to be expected, considering that the baseline proportion would be $\frac{1}{408} \approx 0.25\%$ in the orthogonal case.

Most body composition variables lie close to the vertical axis, in a direction approximately orthogonal to the main direction of variation among species. Columns that are highly correlated, e.g., right (R) and left (L) leg fat mass (FM), have loadings nearly equal to one another. Among species, the most notable pattern is the concentration of Ruminococcaceae on the right.

To identify relationships between species and body composition variables, it would be of interest to isolate those species with large contributions along the axis defined by linking the center of the variables and the origin. Relatively few such species stand out, though note that there is nothing in this algorithm's objective that would seek covariation across tables directly, so the fact that such associations seem weak with respect to the top two principal components does not mean such relationships do not exist.

We can study individual samples with respect to these loadings, by plotting their projections onto the top two principal components. This is the content of the right panel of **Figure 1**, which displays samples in the same positions, but shaded by android (i.e., abdominal) fat mass. This shading confirms the observations from the loadings directly using observed data. Indeed, the increasing android fat mass among samples in the top of the scores in that panel exactly corresponds to the fact that related variables lie at the top in the left panel.

In this approach, the loadings provide a description of the relationship between variables across datasets. Further, scores summarize variation in samples across multiple datasets. Hence, this heuristic is a natural first step in analyzing multiple table data. However, the difficulty in directly interpreting the covariation across datasets, as well as the method's failure to use any sense of covariation in the dimensionality reduction strategy, suggests that this method should not be the last step of an analysis workflow. Nevertheless, we now have a baseline with which to compare the more elaborate methods of subsequent sections.

#### CCA

CCA is a close relative of PCA, designed to compare sets of features across tables. Like PCA, it provides low-dimensional representations of observations, but it also allows comparisons at the table level. Suppose for now that there are only two tables of interest, $X \in \mathbb{R}^{n \times p_1}$ and $Y \in \mathbb{R}^{n \times p_2}$. Let $\hat{\Sigma}_{XX}$, $\hat{\Sigma}_{YY}$, and $\hat{\Sigma}_{XY}$ be the associated covariance estimates. Take the SVD, $\hat{\Sigma}_{XX}^{-\frac{1}{2}} \hat{\Sigma}_{XY} \hat{\Sigma}_{YY}^{-\frac{1}{2}} = UDV^T$. The canonical correlation directions associated with the two tables are $u_k = \hat{\Sigma}_{XX}^{-\frac{1}{2}} \bar{u}_k \in \mathbb{R}^{p_1}$ and $v_k = \hat{\Sigma}_{YY}^{-\frac{1}{2}} \bar{v}_k \in \mathbb{R}^{p_2}$, where $\bar{u}_k$ and $\bar{v}_k$ denote the $k$th columns of $U$ and $V$. These directions give two sets of low-dimensional representations for each sample, one for each table: $z_k^{(1)} = X u_k \in \mathbb{R}^n$ and $z_k^{(2)} = Y v_k \in \mathbb{R}^n$. If the two tables are closely related, then the $z_k^{(1)}$ and $z_k^{(2)}$ will be highly correlated. The singular values $d_k$ are called the canonical correlation coefficients. Like the eigenvalues in PCA, they characterize the amount of covariation across tables that can be captured by each additional pair of directions.
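The algorithm above translates into a few lines of NumPy. The sketch below is illustrative rather than the authors' implementation: the helper names `inv_sqrt` and `cca`, the use of sample covariances with an $n - 1$ denominator, and the simulated tables are all assumptions, and it requires both within-table covariance estimates to be invertible.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q / np.sqrt(w)) @ Q.T

def cca(X, Y, K=2):
    """Canonical directions, correlations, and scores via whitening and an SVD."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, d, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is, full_matrices=False)
    u = Sxx_is @ U[:, :K]          # canonical directions for X
    v = Syy_is @ Vt[:K].T          # canonical directions for Y
    return u, v, d[:K], Xc @ u, Yc @ v

# Simulated tables sharing one latent signal (hypothetical data).
rng = np.random.default_rng(1)
n = 200
shared = rng.normal(size=(n, 1))
X = shared @ rng.normal(size=(1, 5)) + rng.normal(size=(n, 5))
Y = shared @ rng.normal(size=(1, 8)) + rng.normal(size=(n, 8))
u, v, d, zx, zy = cca(X, Y, K=2)
```

With this construction, the sample correlation between the $k$th pair of score columns equals the $k$th singular value, which gives a quick sanity check on any implementation.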

As with PCA, there are many ways to view this procedure; here we discuss geometric, statistical, and probabilistic interpretations. Unlike the geometric interpretation of PCA, the geometric interpretation for CCA identifies point locations with features, not samples. Specifically, the columns of $X$ and $Y$ are thought of as points in $\mathbb{R}^n$. Consider the two subspaces spanned by the columns of $X$ and $Y$, respectively. These subspaces correspond to the linear combinations of features within each table. Place two ellipses on the respective subspaces, centered at the origin and with size and shape depending on the within-table covariances $\hat{\Sigma}_{XX}$ and $\hat{\Sigma}_{YY}$. The first canonical correlation directions are the pair of points, one lying on each ellipse, such that the angle between them, as seen from the origin, is smallest. In this sense, it finds a pair of variance-constrained linear combinations of features within the two tables such that the two combinations appear "close" to one another. The second pair of canonical correlation directions identifies a pair of points with a similar interpretation, except they are required to be orthogonal to the first pair, with respect to the inner product induced by the covariances in each table.

For a statistical interpretation, the idea of CCA is to find the low-dimensional representations of the two tables with maximal covariance; this is analogous to the maximum variance interpretation of PCA. Formally, rows of the two tables are imagined to be i.i.d. draws from $\mathbb{P}_{XY}$, which has marginals $\mathbb{P}_X$ and $\mathbb{P}_Y$. Consider arbitrary linear combinations $z_i^{(1)}(u) = u^T x_i$ and $z_i^{(2)}(v) = v^T y_i$ of samples from the two tables. The first pair of CCA directions $u_1^{*}$ and $v_1^{*}$ are chosen to optimize

$$\begin{aligned} \underset{u \in \mathbb{R}^{p_1},\, v \in \mathbb{R}^{p_2}}{\text{maximize}} \quad & \text{Cov}_{\mathbb{P}_{XY}} \left[ z_i^{(1)}(u), z_i^{(2)}(v) \right] \\ \text{subject to} \quad & \text{Var}_{\mathbb{P}_X} \left( z_i^{(1)}(u) \right) = 1 \\ & \text{Var}_{\mathbb{P}_Y} \left( z_i^{(2)}(v) \right) = 1. \end{aligned} \tag{1}$$

To produce subsequent directions, the same optimization is performed, but with the additional constraint that the directions must be orthogonal to all the previous directions identified for that table. Of course, in actual applications, we estimate these covariances and variances empirically.

This perspective makes it easy to derive the algorithm given at the start of this section. The empirical version of the optimization problem (1) is

$$\begin{array}{ll} \underset{u \in \mathbb{R}^{p_1},\, v \in \mathbb{R}^{p_2}}{\text{maximize}} & u^T \hat{\Sigma}_{XY} v \\ \text{subject to} & u^T \hat{\Sigma}_{XX} u = 1 \\ & v^T \hat{\Sigma}_{YY} v = 1. \end{array} \tag{2}$$

Consider the transformed variables $\bar{u} = \hat{\Sigma}_{XX}^{\frac{1}{2}} u$ and $\bar{v} = \hat{\Sigma}_{YY}^{\frac{1}{2}} v$. The optimization can now be expressed as

$$\begin{aligned} \underset{\bar{u} \in \mathbb{R}^{p_1},\, \bar{v} \in \mathbb{R}^{p_2}}{\text{maximize}} \quad & \bar{u}^T \hat{\Sigma}_{XX}^{-\frac{1}{2}} \hat{\Sigma}_{XY} \hat{\Sigma}_{YY}^{-\frac{1}{2}} \bar{v} \\ \text{such that} \quad & \|\bar{u}\|_2^2 = 1 \\ & \|\bar{v}\|_2^2 = 1. \end{aligned} \tag{3}$$

The optimal $\bar{u}_1$ and $\bar{v}_1$ for this problem are well known: they are exactly the first left and right singular vectors of $\hat{\Sigma}_{XX}^{-\frac{1}{2}} \hat{\Sigma}_{XY} \hat{\Sigma}_{YY}^{-\frac{1}{2}} = UDV^T$, respectively.
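For completeness, the substitution behind this change of variables can be spelled out. Assuming both within-table covariance estimates are invertible, write $u = \hat{\Sigma}_{XX}^{-\frac{1}{2}} \bar{u}$ and $v = \hat{\Sigma}_{YY}^{-\frac{1}{2}} \bar{v}$. The objective and constraints of (2) then become

$$\begin{aligned} u^T \hat{\Sigma}_{XY} v &= \bar{u}^T \hat{\Sigma}_{XX}^{-\frac{1}{2}} \hat{\Sigma}_{XY} \hat{\Sigma}_{YY}^{-\frac{1}{2}} \bar{v}, \\ u^T \hat{\Sigma}_{XX} u &= \bar{u}^T \hat{\Sigma}_{XX}^{-\frac{1}{2}} \hat{\Sigma}_{XX} \hat{\Sigma}_{XX}^{-\frac{1}{2}} \bar{u} = \|\bar{u}\|_2^2, \\ v^T \hat{\Sigma}_{YY} v &= \bar{v}^T \hat{\Sigma}_{YY}^{-\frac{1}{2}} \hat{\Sigma}_{YY} \hat{\Sigma}_{YY}^{-\frac{1}{2}} \bar{v} = \|\bar{v}\|_2^2. \end{aligned}$$

Maximizing a bilinear form $\bar{u}^T M \bar{v}$ over unit vectors is attained at the leading pair of singular vectors of $M$, with optimal value equal to the largest singular value. Mapping back through $u_1 = \hat{\Sigma}_{XX}^{-\frac{1}{2}} \bar{u}_1$ and $v_1 = \hat{\Sigma}_{YY}^{-\frac{1}{2}} \bar{v}_1$ recovers the canonical correlation directions.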

A probabilistic interpretation of this procedure views it as estimating the factors in an implicit latent variable model. In particular, Bach and Jordan (2005) suppose that $x_i$ and $y_i$ are drawn i.i.d. from the model,

$$\begin{aligned} \xi_i := (\xi_i^{s}, \xi_i^{X}, \xi_i^{Y}) &\sim \mathcal{N}(0, I_d) \\ x_i \mid \xi_i &\sim \mathcal{N}(\mu_X + W_X \xi_i^{s} + B_X \xi_i^{X}, I_d) \\ y_i \mid \xi_i &\sim \mathcal{N}(\mu_Y + W_Y \xi_i^{s} + B_Y \xi_i^{Y}, I_d). \end{aligned}$$

That is, each sample is associated with a $d$-dimensional latent variable $\xi_i$, drawn from a spherical normal prior. A few of the coordinates of these latent variables, $\xi_i^{s}$, contribute to shared structure, through $W_X$ and $W_Y$. The remaining coordinates model table-specific structure, through $B_X$ and $B_Y$. It can be shown that the posterior expectations of the latent $\xi_i^{s}$ given the observed tables must lie on the subspace defined by the CCA directions.
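To make the generative story concrete, the following sketch draws two tables from a model of this form. Everything specific here is an assumption for illustration: the function name `simulate_tables`, zero means $\mu_X = \mu_Y = 0$, randomly drawn loading matrices, and identity noise covariance of matching dimension for each table.

```python
import numpy as np

def simulate_tables(n, p1, p2, d_s=2, d_x=1, d_y=1, seed=0):
    """Draw (X, Y) from a shared-plus-specific latent variable model."""
    rng = np.random.default_rng(seed)
    xi_s = rng.normal(size=(n, d_s))   # shared latent coordinates
    xi_x = rng.normal(size=(n, d_x))   # coordinates specific to X
    xi_y = rng.normal(size=(n, d_y))   # coordinates specific to Y
    W_X, W_Y = rng.normal(size=(p1, d_s)), rng.normal(size=(p2, d_s))
    B_X, B_Y = rng.normal(size=(p1, d_x)), rng.normal(size=(p2, d_y))
    # Identity-covariance Gaussian noise is added to each table.
    X = xi_s @ W_X.T + xi_x @ B_X.T + rng.normal(size=(n, p1))
    Y = xi_s @ W_Y.T + xi_y @ B_Y.T + rng.normal(size=(n, p2))
    return X, Y, xi_s

X, Y, xi_s = simulate_tables(n=100, p1=5, p2=8)
```

Simulated tables of this kind are a convenient test bed: only the shared coordinates induce covariation between $X$ and $Y$, so a multitable method should recover them while ignoring the table-specific ones.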

#### Example

We next apply CCA to the WELL-China body composition and microbiome data, with particular interest in how the results compare with those of the concatenated PCA example. We provide analogous loadings and scores plots in **Figure 2**. However, note that the data are not quite the same between the two analyses: we have applied a stricter filter, which reduces the number of species from 2,565 to 66. This very aggressive filtering is necessary because CCA requires estimation of the covariance matrices $\Sigma_{XX}$, $\Sigma_{XY}$, and $\Sigma_{YY}$, which is impossible for $p > n$ and highly unstable when $p$ is a large fraction of $n$. Besides this stronger filtering, all preprocessing steps remain the same as in the concatenated PCA example.

The left panel of **Figure 2** provides the analog of CCA loadings. To be precise, let $X \in \mathbb{R}^{102 \times 36}$ be the matrix of body composition measurements and $Y \in \mathbb{R}^{102 \times 66}$ be the variance-stabilized microbial abundances. As before, write $u_k \in \mathbb{R}^{36}$, $v_k \in \mathbb{R}^{66}$ for the $k$th canonical correlation directions. Text labels from column $j$ of the body composition variables are displayed at position $(u_{1j}, u_{2j})_{j=1}^{36}$ and shaded points for the $j$th species at position $(v_{1j}, v_{2j})_{j=1}^{66}$.

As in the concatenated PCA, we find that the groups of variables occupy separate spaces. Our interpretation is that sequences further to the left are correlated with the body variables further to the left, which are all in some way variants of body mass. Note that age is negatively correlated with total fat mass, which is why it appears on the opposite end. Among the abundant species that remain, there is limited clustering according to taxonomic group, though the Bacteroidaceae and Ruminococcus do appear restricted to the bottom right and left, respectively.

In the right panel of **Figure 2**, we plot the corresponding scores. Note that in CCA, there are two sets of scores for each $k$, the $Xu_k$ and $Yv_k$. Indeed, the CCA objective finds directions that maximize the correlation between these scores. We use a different color legend for the two panels, each of which represents one set of scores: the scores from species abundances are colored by family, while those from body composition are shaded by android fat mass. The pairs of scores for each individual sample are joined by small links. Since most links are relatively short, linear combinations of the two tables could be found that optimized the objective; indeed, the top two canonical correlations are 0.968 and 0.957. However, some caution is necessary here, and a more honest evaluation would be based on scores obtained by projecting new samples onto the original CCA directions. This is especially important in this nearly high-dimensional setting, where covariance estimation may be unreliable.
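The caution about in-sample canonical correlations can be illustrated with a small simulation. The sketch below is not based on the WELL-China data: it fits CCA directions on pure noise tables (so the true canonical correlations are zero) and then projects held-out samples onto the fitted directions. The function names and the sample sizes are hypothetical choices.

```python
import numpy as np

def fit_cca_directions(X, Y, K=1):
    """Fit the first K canonical directions; also return the training means."""
    def inv_sqrt(S):
        w, Q = np.linalg.eigh(S)
        return (Q / np.sqrt(w)) @ Q.T
    n = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    A = inv_sqrt(Xc.T @ Xc / (n - 1))
    B = inv_sqrt(Yc.T @ Yc / (n - 1))
    U, d, Vt = np.linalg.svd(A @ (Xc.T @ Yc / (n - 1)) @ B, full_matrices=False)
    return A @ U[:, :K], B @ Vt[:K].T, mx, my

def score_correlation(X, Y, u, v, mx, my):
    """Correlation between the first pair of scores for the given samples."""
    zx, zy = (X - mx) @ u[:, 0], (Y - my) @ v[:, 0]
    return np.corrcoef(zx, zy)[0, 1]

# Independent noise tables: any apparent association is an artifact of fitting.
rng = np.random.default_rng(2)
p1, p2 = 10, 15
X_train, Y_train = rng.normal(size=(60, p1)), rng.normal(size=(60, p2))
X_test, Y_test = rng.normal(size=(500, p1)), rng.normal(size=(500, p2))
u, v, mx, my = fit_cca_directions(X_train, Y_train)
in_sample = score_correlation(X_train, Y_train, u, v, mx, my)
held_out = score_correlation(X_test, Y_test, u, v, mx, my)
```

Comparing `in_sample` with `held_out` shows how much of a fitted canonical correlation can be attributed to overfitting when the number of features is a sizable fraction of the number of samples.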

Aside from the fact that samples appear as pairs, interpretation proceeds as in a PCA scores plot, as in **Figure 1**. The association between these variables and the sample positions is not as strong as when performing PCA on the combined table. This is to be expected, however, as PCA maximizes variance without any thought to covariance, and the body composition table alone has a large portion of its variance related to android fat mass.

#### Co-Inertia Analysis

Co-inertia Analysis (CoIA) emerged in ecology to facilitate analysis of variation in species abundance as a function of environmental conditions (Dolédec and Chessel, 1994). It can be viewed as a slight modification of CCA. Again, we seek sets of orthonormal directions $(u_k)_{k=1}^{K}$ and $(v_k)_{k=1}^{K}$ such that the associated projections $Xu_k$ and $Yv_k$ explain most of the covariation between the tables. Unlike CCA, CoIA finds its first directions by maximizing the covariance, not the correlation, between scores,

$$\begin{aligned} \underset{u \in \mathbb{R}^{p_1},\, v \in \mathbb{R}^{p_2}}{\text{maximize}} \quad & u^T X^T Y v \\ \text{such that} \quad & \|u\|_2 = 1 \\ & \|v\|_2 = 1, \end{aligned}$$

with subsequent directions found by the same optimization, after adding the constraint that they are orthogonal to the previously derived directions.

The only difference from the objective in equation (2) is that the norm constraint is imposed on $u$ and $v$ directly, rather than on their transformations $\hat{\Sigma}_{XX}^{\frac{1}{2}} u$ and $\hat{\Sigma}_{YY}^{\frac{1}{2}} v$. It is in this sense that the CCA objective maximizes the correlation between scores, while CoIA maximizes the covariance.

The solution $(u_k)_{k=1}^{K}$ and $(v_k)_{k=1}^{K}$ can be obtained as the first $K$ left and right singular vectors from the SVD of $X^T Y$, as opposed to the first $K$ generalized eigenvectors, as in CCA. The proof of this fact is almost identical to the derivation given for CCA in section CCA.
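A minimal sketch of this computation follows. It assumes column-centered tables and uniform row and column weights, which is a simplification relative to ecological software that allows general weightings; the function name `coia` and the simulated tables are hypothetical.

```python
import numpy as np

def coia(X, Y, K=2):
    """Co-inertia directions from the SVD of the cross-product matrix X^T Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    u, v = U[:, :K], Vt[:K].T          # orthonormal directions for each table
    return u, v, d[:K], Xc @ u, Yc @ v

rng = np.random.default_rng(3)
n = 80
shared = rng.normal(size=(n, 1))
X = shared @ rng.normal(size=(1, 6)) + rng.normal(size=(n, 6))
Y = shared @ rng.normal(size=(1, 9)) + rng.normal(size=(n, 9))
u, v, d, zx, zy = coia(X, Y, K=2)
```

Because no whitening by the within-table covariances is involved, this version does not require inverting any matrix, in contrast with the CCA computation.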

#### Example

We apply CoIA to the same data as used in the CCA example, as CoIA also needs to estimate the covariance between tables, which is difficult when the number of species is large. We find that the associated scores are quite different from those found using CCA. Compare **Figure 3**, which shades samples by android fat mass, with **Figure 2** for CCA. The CoIA scores are not nearly as closely aligned across tables as they are for CCA, but they exhibit a clearer gradient across android fat mass, as in the concatenated PCA result of **Figure 1**. It is not clear whether this phenomenon (the CoIA scores being more similar to those from PCA than CCA) holds in general, or what it is about the change in inner products between CoIA and CCA that is responsible for this difference.

#### MFA

MFA gives an alternative approach to producing scores and relating features across multiple tables (Pagés, 2014). It can be understood as a refined version of the concatenated PCA described in section PCA that reweights tables in a way that prevents any one table from dominating the resulting ordination. Specifically, MFA is a concatenated PCA on the matrix

$$X := \left[ \frac{1}{\lambda\_1(X^{(1)})} X^{(1)} \, | \, \dots | \, \frac{1}{\lambda\_1(X^{(L)})} X^{(L)} \right],$$

which scales each table $X^{(l)}$ by the inverse of its largest eigenvalue, $\lambda_1(X^{(l)})$. This procedure is the multitable analog of the common practice of standardizing variables before performing PCA.
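The reweighting step can be sketched as follows. One interpretive assumption is flagged in the code: $\lambda_1(X^{(l)})$ is taken to be the leading singular value of the centered table, so that every reweighted table has a leading singular value of one. The function names `reweight_tables` and `mfa` and the two random tables are hypothetical.

```python
import numpy as np

def reweight_tables(tables):
    """Center each table and divide it by its leading singular value."""
    reweighted = []
    for T in tables:
        Tc = T - T.mean(axis=0)
        # Assumption: lambda_1 is read as the leading singular value of the table.
        lead = np.linalg.svd(Tc, compute_uv=False)[0]
        reweighted.append(Tc / lead)
    return reweighted

def mfa(tables, K=2):
    """Concatenated PCA on the reweighted tables."""
    X = np.hstack(reweight_tables(tables))
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:K].T, U[:, :K] * d[:K]

# Two tables on very different scales and with different numbers of columns.
rng = np.random.default_rng(4)
small = rng.normal(size=(40, 3))
large = 100 * rng.normal(size=(40, 30))
reweighted = reweight_tables([small, large])
directions, scores = mfa([small, large], K=2)
```

After reweighting, neither table can dominate the first axis purely through its scale or its number of columns, which is the motivation given above.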

The resulting MFA directions and scores can be interpreted in the same way as those from PCA: the MFA directions still specify the relationship between measured features, and the position of each sample's projection describes the relative value of each feature for that sample. Moreover, MFA gives a way of comparing entire tables to each other, called a "canonical analysis" (Pagés et al., 2004). A $K$-dimensional representation of the $l$th table is given by

$$\left[\mathcal{L}(z_1, X^{(l)}), \dots, \mathcal{L}(z_K, X^{(l)})\right],$$

where $z_k = d_k u_k \in \mathbb{R}^n$ is the $k$th column of principal component scores and

$$\mathcal{L}\left(z_k, X^{(l)}\right) = \frac{\lambda_k(X)}{\lambda_1(X^{(l)})} \text{tr}\left(X^{(l)} X^{(l)T} z_k z_k^T\right) = \frac{\lambda_k(X)}{\lambda_1(X^{(l)})} \left\| X^{(l)T} z_k \right\|_2^2$$

is a measure of aggregate similarity between the coordinates in the $l$th table and the $k$th column of scores. According to this definition, if the samples, as represented by the $l$th table, have high correlation with the $k$th dimension of scores, then the canonical analysis positions the $l$th table far along the $k$th direction. Plotting these table-level coordinates helps resolve which tables measure similar underlying variation.

#### PCA-IV

PCA-IV adapts the dimensionality reduction ideas of PCA to the multivariate regression setting (Rao, 1964). It can also be viewed as a version of PCA that chooses a dimension reduction of *X* based on its ability to predict *Y*. In this sense, it anticipates methods like Partial Least Squares, Canonical Correspondence Analysis, the Curds & Whey procedure, and the Graph-Fused Lasso, which are described in sections Partial Least Squares, CCpnA, Curds & Whey, and Graph-Fused Lasso.

Formally, suppose we are predicting *yi* ∈ ℝ<sup>*p*1</sup> from *xi* ∈ ℝ<sup>*p*2</sup>. Since *p*2 may be large, it might be useful to work with a lower-dimensional representation *zi* = *V<sup>T</sup>xi* ∈ ℝ<sup>*K*</sup>, which is potentially more interpretable but still as (or more) predictive of *yi*. As in PCA, we require that *V* be orthonormal.

The criterion that PCA-IV uses to identify the loadings *V* and scores *Z* mirrors the maximum variance criterion for PCA. Instead of choosing *V* to maximize the variance of the *zi*, we choose it to minimize the residual covariance of *yi* given *zi*. That is, suppose that *yi* and *xi* are jointly normal with mean 0 and covariance

$$
\mathbf{Var}\_{\mathbb{P}} \begin{pmatrix} y\_{i} \\ x\_{i} \end{pmatrix} = \begin{pmatrix} \Sigma\_{YY} & \Sigma\_{YX} \\ \Sigma\_{XY} & \Sigma\_{XX} \end{pmatrix}.
$$

If *zi* = *V<sup>T</sup>xi*, then the joint covariance of *yi* and *zi* is

$$
\mathbf{Var}\_{\mathbb{P}} \begin{pmatrix} y\_{i} \\ z\_{i} \end{pmatrix} = \begin{pmatrix} \Sigma\_{YY} & \Sigma\_{YX}V \\ V^{T}\Sigma\_{XY} & V^{T}\Sigma\_{XX}V \end{pmatrix},
$$

so the residual covariance of *yi* given *zi* is

$$\Sigma\_{YY} - \Sigma\_{YX} V \left(V^{T} \Sigma\_{XX} V\right)^{-1} V^{T} \Sigma\_{XY}. \tag{4}$$

Rao (1964) uses the trace to measure the "size" of this matrix. The true population covariances are unknown to us, so we replace them by their empirical estimates. The formal optimization for PCA-IV then becomes

$$\underset{V \in \mathbb{R}^{p\_2 \times K}}{\text{minimize}} \; \text{tr} \left( \hat{\Sigma}\_{YY} - \hat{\Sigma}\_{YX} V (V^T \hat{\Sigma}\_{XX} V)^{-1} V^T \hat{\Sigma}\_{XY} \right) \tag{5}$$

The optimal *V* are the top *K* generalized eigenvectors of Σ̂*XY*Σ̂*YX* with respect to Σ̂*XX*, that is, the orthonormal set of (*vk*) satisfying

$$
\hat{\Sigma}\_{XY}\hat{\Sigma}\_{YX}V = \left(\lambda\_1\hat{\Sigma}\_{XX}v\_1 \mid \dots \mid \lambda\_K\hat{\Sigma}\_{XX}v\_K\right) = \hat{\Sigma}\_{XX}V\Lambda,
$$

where Λ = diag (*λk*) ∈ ℝ*K×K*. A derivation for why this choice is optimal is provided in section *Derivation Details for PCA-IV*.
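The generalized eigenproblem above can be sketched numerically. The following is a minimal illustration on synthetic data, not the implementation used for the case study. The function names (`covariances`, `pca_iv`, `residual_trace`) are hypothetical, and the sketch assumes that the empirical covariance of *X* is full rank. Because the trace criterion depends on *V* only through its column span, the generalized eigenvectors are re-orthonormalized with a QR step.

```python
import numpy as np

def covariances(X, Y):
    """Empirical covariance blocks after column-centering."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    return Xc.T @ Xc / n, Xc.T @ Yc / n, Yc.T @ Yc / n  # S_xx, S_xy, S_yy

def pca_iv(X, Y, K):
    """Top-K generalized eigenvectors of S_xy S_yx with respect to S_xx."""
    S_xx, S_xy, _ = covariances(X, Y)
    # S_xx^{-1/2} from a symmetric eigendecomposition (assumes full rank).
    evals, evecs = np.linalg.eigh(S_xx)
    S_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # Ordinary symmetric eigenproblem in whitened coordinates.
    lam, W = np.linalg.eigh(S_inv_half @ S_xy @ S_xy.T @ S_inv_half)
    order = np.argsort(lam)[::-1][:K]
    V_gen = S_inv_half @ W[:, order]   # satisfies S_xy S_yx v = lam S_xx v
    V, _ = np.linalg.qr(V_gen)         # same span, orthonormal columns
    return V, V_gen, lam[order]

def residual_trace(X, Y, V):
    """Trace of the empirical residual covariance of y given z = V^T x."""
    S_xx, S_xy, S_yy = covariances(X, Y)
    fitted = S_xy.T @ V @ np.linalg.solve(V.T @ S_xx @ V, V.T @ S_xy)
    return np.trace(S_yy - fitted)

# Synthetic example: the response depends on two columns of X only.
rng = np.random.default_rng(0)
n, p2, p1, K = 200, 6, 3, 2
X = rng.normal(size=(n, p2))
Y = X[:, :2] @ rng.normal(size=(2, p1)) + 0.3 * rng.normal(size=(n, p1))
V, V_gen, lam = pca_iv(X, Y, K)
print(residual_trace(X, Y, V))
```

Comparing `residual_trace` at the returned `V` against a few random orthonormal matrices of the same shape is a quick sanity check that the criterion is being minimized.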

For a geometric interpretation of PCA-IV, view each column *yj* in *Y* and *xj* in *X* as a point in ℝ*<sup>n</sup>*. Assuming *X* and *Y* are full rank, the collections (*yj*) and (*xj*) span *p*1- and *p*2-dimensional subspaces. A set of independent regressions of *yj* on *X* projects each individual *yj* onto the span of the (*xj*), and the squared residuals are the distance to this subspace. The PCA-IV procedure is an attempt to find a further *K*-dimensional subspace within the span of the (*xj*) such that the residuals of the regressions of the *yj* onto this further subspace are not much worse. This is displayed in **Figure 4**.

#### Example

Continuing our WELL-China case study, we now illustrate results from PCA-IV. The idea of scores and loadings in this context requires some clarification. By PCA-IV scores, we mean the coordinates of projections *zi* of samples onto the subspace defined by *V*, and by loadings, we mean the correlation between columns<sup>2</sup> of *X* and *Y* with the PCA-IV axes defining *V*.

The scores and loadings are given in **Figure 5**. Interpretation of the species loadings is simple, since species seem well separated by taxa. Interpretation of the body composition variables is less clear—pairs of variables that would be expected to be near to one another are not, in many cases. Indeed, leg fat mass (leg\_fm) and left leg fat mass (l\_leg\_fm) should have a small angle between one another, but they do not. It is possible that by approximating the covariation across tables, the quality of within-table approximations deteriorates.

We find that the scores, displayed in **Figure 5**, are similar to those found by the concatenated PCA of section PCA. One possible explanation for this behavior is that the PCA-IV generalized SVD of *X* is similar to an ordinary PCA of *X*, and that in the concatenated PCA of (*Y X*), the fact that *X* has many more columns than *Y* means that the result is similar to a PCA on *X* alone.

#### Partial Triadic Analysis

Partial Triadic Analysis (PTA) gives an approach for working with multitable data when each table has the same dimension, *p*1 = *p*2 (Kroonenberg, 2008; Thioulouse, 2011). Specifically, it gives a way of analyzing data of the form (*X*..*l*), *l* = 1,…, *L*, where each *X*..*l* ∈ ℝ<sup>*n*×*p*</sup>. This is called a data cube because it can also be written as a three-dimensional array *X* ∈ ℝ<sup>*n*×*p*×*L*</sup>. We denote the *j* th feature measured on the *i* th sample in the *l* th table by *xijl*, and the slices over fixed *i*, *j*, and *l* by *Xi*.., *X*.*j*., and *X*..*l*. This type of data arises frequently in longitudinal data analysis, where the same features are collected for the same samples over a series of *L* times. However, the actual ordering of the *L* tables is never used by this method: if we scrambled the time ordering of the *L* tables, the algorithm's result would not change.

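The main idea in PTA is to divide the analysis into two steps:

1. Combine the *L* tables into a single compromise table.
2. Apply a standard single-table method, such as PCA, to the compromise table.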

A naive approach to constructing the compromise table would be to average each entry across the *L* tables. Instead, PTA upweights tables that are more similar to the average table, as these are considered more representative. Formally, the compromise is defined as

$$X\_c = X\_\alpha = \sum\_{l=1}^{L} \alpha\_l X\_{\cdot\cdot l} \in \mathbb{R}^{n \times p},$$

where *α* (constrained to norm one) is chosen to maximize

$$\sum\_{l=1}^{L} \alpha\_l \left< \overline{X}, X\_{\cdot\cdot l} \right>,$$

a weighted average of inner products<sup>3</sup> between each of the *L* tables and the naive-average table,

$$\overline{X} = \frac{1}{L} \sum\_{l=1}^{L} X\_{\cdot\cdot l}.$$

<sup>2</sup> Geometrically, the angle between original columns and the subspace, in the sense of **Figure 4**.

FIGURE 4 | A geometric view of Principal Component Analysis with Instrumental Variables (PCA-IV). The columns of the response *Y* are viewed as *n*-dimensional vectors. The gray plane is the span of *X*. Multivariate OLS simply projects the columns of *Y* onto the plane, while PCA-IV searches for a further subspace *V* on which to project all responses.

The optimal α can be derived using Lagrange multipliers (see *Derivation of PTA α*) and leads to the compromise table,

$$X\_{c} = \sum\_{l=1}^{L} \frac{\left< \overline{X}, X\_{\cdot\cdot l} \right>}{\sqrt{\sum\_{r=1}^{L} \left< \overline{X}, X\_{\cdot\cdot r} \right>^{2}}} X\_{\cdot\cdot l}.$$

We can try to interpret the compromise matrix geometrically. Suppose the *X*..*l* define an orthonormal basis, so that ⟨*X*..*l*, *X*..*l′*⟩ equals 1 if *l* = *l′* and 0 otherwise. Then, we can write the compromise table as

$$X\_{c} = \sqrt{L} \sum\_{l=1}^{L} \left< \overline{X}, X\_{\cdot\cdot l} \right> X\_{\cdot\cdot l} = \sqrt{L} \,\overline{X},$$

a scaled version of the mean.

If, however, the tables are not orthonormal, then we place more weight on directions that are correlated. For example, if *X*(1) = *X*(2), but the rest of the tables are orthogonal to each other and to these first two tables, then the compromise double counts the direction *X*(1). Therefore, compared to the naive average *X̄*, *Xc* upweights more highly represented tables.
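The compromise weights can be checked numerically. The sketch below is an illustration on synthetic data with a hypothetical function name (`pta_compromise`). It computes the weights from the inner products with the naive average, confirms the orthonormal case described above, and then duplicates one table to show the upweighting.

```python
import numpy as np

def pta_compromise(cube):
    """cube has shape (n, p, L); returns the compromise table and weights."""
    X_bar = cube.mean(axis=2)
    # <X_bar, X_..l> = tr(X_bar^T X_..l), computed for every l at once.
    inner = np.einsum("ij,ijl->l", X_bar, cube)
    alpha = inner / np.linalg.norm(inner)   # unit-norm weights
    return cube @ alpha, alpha              # sum_l alpha_l X_..l

rng = np.random.default_rng(0)
n, p, L = 5, 4, 4
# Tables that are orthonormal under the trace inner product.
Q, _ = np.linalg.qr(rng.normal(size=(n * p, L)))
cube = Q.reshape(n, p, L)
X_c, alpha = pta_compromise(cube)
print(np.allclose(X_c, np.sqrt(L) * cube.mean(axis=2)))

# Duplicating the first table increases the weight on that direction.
cube_dup = cube.copy()
cube_dup[:, :, 1] = cube_dup[:, :, 0]
_, alpha_dup = pta_compromise(cube_dup)
print(alpha_dup)
```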

#### Statico and Costatis

In the multivariate ecology literature, it is common to have a pair of data cubes, giving species abundances and environmental variables over time, respectively. We write these as *Y* ∈ ℝ<sup>*n*×*p*1×*L*</sup> and *X* ∈ ℝ<sup>*n*×*p*2×*L*</sup>. Costatis and Statico are two approaches for analyzing such data (Thioulouse, 2011). They are easiest to understand as divide-and-conquer approaches, where the general problem of analyzing a pair of data cubes is divided into two steps, one designed for analyzing individual cubes, and another for studying covariation across tables. In Statico, the covariation problem is dealt with first, then followed by a data cube analysis, while in Costatis, that order is reversed.

Specifically, in Statico, an empirical cross-covariance matrix is constructed at each time point,

$$Z\_l = \frac{1}{n} Y\_{\cdot\cdot l}^{T} X\_{\cdot\cdot l}.$$

That is, *Zl* measures the association between the environmental variables and the species counts at a specific time point *l*. The *L* matrices *Zl* are then input into a PTA, yielding a compromise table *Zc* that can then be studied with PCA.

<sup>3</sup> We are using ⟨*A*, *B*⟩ = tr(*A<sup>T</sup>B*).

Alternatively, in Costatis, a compromise table is constructed for each of the data cubes *Y* and *X*, using PTA. Call these *Yc* and *Xc*. These are now simply two matrices, each with *n* rows, and they can be analyzed by any two-table dimensionality reduction method, for example, CoIA.

Hence, we see that the only difference between these methods is the order in which CoIA and PTA are applied. Indeed, this is reflected in the names of the methods: Statis is an abbreviation for a PTA, and Statico performs a CoIA before a Statis while Costatis does the reverse.
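The difference in ordering can be made concrete with a short sketch. This is an illustration only: the function names are hypothetical, the tables are assumed to be centered within each time point, an uncentered SVD stands in for the PCA step, and an SVD of the cross-covariance between the two compromises stands in for a full CoIA (which in general also involves row and column weights).

```python
import numpy as np

def compromise(cube):
    """PTA compromise of a cube whose last axis indexes the L tables."""
    bar = cube.mean(axis=2)
    inner = np.einsum("ij,ijl->l", bar, cube)
    return cube @ (inner / np.linalg.norm(inner))

def statico(Y, X):
    """Cross-covariance at each time point first, then a PTA compromise."""
    n = Y.shape[0]
    Z = np.einsum("ial,ibl->abl", Y, X) / n   # Z[:, :, l] = Y_l^T X_l / n
    Z_c = compromise(Z)
    return Z_c, np.linalg.svd(Z_c, full_matrices=False)

def costatis(Y, X):
    """PTA compromise of each cube first, then a two-table analysis."""
    n = Y.shape[0]
    Y_c, X_c = compromise(Y), compromise(X)
    return Y_c, X_c, np.linalg.svd(Y_c.T @ X_c / n, full_matrices=False)

rng = np.random.default_rng(0)
n, p1, p2, L = 30, 6, 4, 5
Y = rng.normal(size=(n, p1, L))
X = rng.normal(size=(n, p2, L))
Y -= Y.mean(axis=0, keepdims=True)   # center within each time point
X -= X.mean(axis=0, keepdims=True)

Z_c, (U_s, s_s, Vt_s) = statico(Y, X)
Y_c, X_c, (U_c, s_c, Vt_c) = costatis(Y, X)
print(Z_c.shape, Y_c.shape, X_c.shape)
```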

#### MODERN MULTIVARIATE METHODS

Compared to classical approaches, modern multivariate methods are typically designed for more high-dimensional, heterogeneous settings. The two methods reviewed in this section are examples of this trend: Partial Least Squares (PLS) is well-suited for finding predictors in the presence of high-dimensional response matrices, while Canonical Correspondence Analysis (CCpnA) was designed to facilitate the joint analysis of heterogeneous continuous and count data. Unlike traditional statistical methods, neither approach is explicitly model-based, and both are iterative, requiring more extensive computation than earlier techniques.

#### Partial Least Squares

PLS sequentially derives a set of mutually orthogonal features (*zk*), *k* = 1,…, *K*, that characterizes the relationship between two tables, *Y* and *X* (Wold, 1985). To obtain the first PLS direction, *z*1, compute the first left singular vector *u*1 of the cross-covariance matrix between the two tables, Σ̂*YX* = (1/*n*)*Y<sup>T</sup>X*. Then, for each of the *p*2 columns of *X*, compute the univariate (i.e., partial) regression coefficient

$$\hat{\phi}\_j = \frac{x\_{\cdot j}^{T} u\_1}{\| x\_{\cdot j} \|\_2^2}, \quad j = 1, \dots, p\_2.$$

The first PLS direction is defined as

$$z\_1 = \sum\_{j=1}^{p\_2} \hat{\phi}\_j x\_{\cdot j},$$

a weighted average of the *x*.*j* according to their partial correlation with *u*1. To generate subsequent directions *zk*, orthogonalize both *Y* and *X* with respect to the current directions *z*1,…, *zk*–1, and repeat the process.

This procedure is appealing because, like PCA, it reduces a potentially high-dimensional matrix *X* with many correlated columns into a smaller set of orthogonal directions. Moreover, it achieves this reduction in a way that accounts for correlation with columns in *Y*: columns of *X* that are uncorrelated with *Y* will have no contribution to the PLS directions, even if they account for a large proportion of variation in *X*.

We have stated the procedure in the form it was originally proposed, but this algorithmic description is difficult to understand geometrically or probabilistically. However, interpretational aids have since been developed. Frank and Friedman (1993) and Stone and Brooks (1990) studied the case where *p*1 = 1, so *y* is a single column vector. By assuming that the rows of *y* and *X* are drawn i.i.d. from distribution ℙ*YX*, with marginals ℙ*<sup>Y</sup>* and ℙ*<sup>X</sup>*, they found that the *k*th PLS direction *zk* is the *z* that solves the optimization

$$\begin{aligned} \underset{z}{\text{maximize}} \quad & \mathbf{Corr}^2\_{\mathbb{P}\_{YX}} \left[ x\_{i}^{T} z, y\_{i} \right] \mathbf{Var}\_{\mathbb{P}\_{X}} \left[ z^{T} x\_{i} \right] \\ \text{such that} \quad & z^{T} X^{T} X z\_{j} = 0 \text{ for all } j \le k - 1 \\ & \|z\|\_{2} = 1. \end{aligned} \tag{6}$$

If the correlation term is omitted, the optimization is identical to the maximum variance problem that gives the principal component directions based on *X*. This formulation makes precise the idea that PLS is a version of principal components that accounts for correlation with *Y*.
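For the single-response case (*p*1 = 1), the sequential procedure can be sketched in a few lines. This is an illustration under stated assumptions rather than a reference implementation: the function name `pls1_directions` is hypothetical, the columns of *X* are standardized first, and the centered response *y* plays the role of *u*1 in the univariate regressions.

```python
import numpy as np

def pls1_directions(X, y, K):
    """Single-response PLS directions from repeated univariate regressions."""
    Xk = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized columns
    yk = y - y.mean()
    directions = []
    for _ in range(K):
        # Univariate coefficient of the response on each column of X.
        phi = (Xk.T @ yk) / np.sum(Xk ** 2, axis=0)
        z = Xk @ phi                             # weighted sum of columns
        directions.append(z)
        # Orthogonalize X and y with respect to the new direction.
        Xk = Xk - np.outer(z, z @ Xk) / (z @ z)
        yk = yk - z * (z @ yk) / (z @ z)
    return np.column_stack(directions)

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=n)
Z = pls1_directions(X, y, K=3)
Z_unit = Z / np.linalg.norm(Z, axis=0)
print(np.round(Z_unit.T @ Z_unit, 6))   # off-diagonal entries are near zero
```

The printed Gram matrix illustrates the mutual orthogonality of the derived directions.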

An alternative interpretation, due to Gustafsson (2001), is that PLS fits a particular latent variable model. Suppose *ξi* = (*ξi<sup>s</sup>*, *ξi<sup>X</sup>*) are drawn i.i.d. from a *K*1 + *K*2 = *K* dimensional spherical normal. PLS assumes the observed tables *Y* and *X* have rows drawn i.i.d. from

$$\begin{aligned} y\_i \mid \xi\_i &\sim \mathcal{N}(\mu\_Y + W\_Y \xi\_i^s, \sigma^2 I\_{p\_1}) \\ x\_i \mid \xi\_i &\sim \mathcal{N}(\mu\_X + W\_X \xi\_i^s + B\_X \xi\_i^X, \sigma^2 I\_{p\_2}). \end{aligned}$$

That is, each table is the sum of two components, one that is a table-specific linear combination of a shared latent variable, and another that is an arbitrary linear combination of a table-specific latent variable. The shared feature *ξs* is the object of interest, and is what PLS implicitly estimates.

#### Sparse Partial Least Squares

PLS suffers from two of the same problems as PCA:


Different regularized, sparse modifications of PCA have been proposed to remedy these issues in the PCA context (Jolliffe et al., 2003; Zou et al., 2006; Witten et al., 2009). For PLS, similar analysis leads to sparse PLS (Lê Cao et al., 2008; Chun and Keleş, 2010), and we briefly review this method here.

Directly regularizing the multiresponse version of the PLS optimization (6) leads to the problem

$$\begin{aligned} \underset{z\_k}{\text{maximize}} \quad & \sum\_{j=1}^{p\_1} \text{Cov}^2\_{\mathbb{P}\_{YX}} \left[ x\_i^T z\_k, y\_{ij} \right] \\ \text{such that} \quad & z\_k^T X^T X z\_j = 0 \text{ for all } j \le k - 1 \\ & \|z\_k\|\_2 = 1 \\ & \|z\_k\|\_1 \le \lambda, \end{aligned}$$

which can be applied to real data by replacing the objective with its sample version, *zk<sup>T</sup>Mzk*, where *M* = *X<sup>T</sup>YY<sup>T</sup>X*. This version of the problem falls into the Penalized Matrix Decomposition framework of Witten et al. (2009), reviewed in the section Penalized Matrix Decomposition.

However, Chun and Keleş (2010) argue that this formulation does not lead to "sparse enough" solutions. Instead, they adapt the SPCA approach of Zou et al. (2006) to PLS. The resulting objective identifies two sets of directions, a set (*ak*) that maximizes the PLS-defining covariance and another, (*zk*), that approximates the first set by a sparser alternative. Formally,

$$\begin{aligned} \underset{z\_k, a\_k}{\text{minimize}} \quad & -\kappa \|a\_k\|\_M^2 + (1 - \kappa) \|z\_k - a\_k\|\_M^2 \\ \text{such that} \quad & \|a\_k\|\_2^2 = 1 \\ & \|z\_k\|\_1 \leq \lambda\_1 \\ & \|z\_k\|\_2 \leq \lambda\_2 \end{aligned} \tag{7}$$

where we have defined ‖*x*‖<sup>2</sup><sub>*M*</sub> = *x<sup>T</sup>Mx*, and *κ*, *λ*1, and *λ*2 are tuning parameters. The first term in the objective is the PLS-defining covariance, the second ensures that the solutions *zk* and *ak* are similar, and the norm constraints induce sparsity and stability on *zk*. Note that while this objective is not convex, for fixed *ak*, it is an elastic-net regression, while for fixed *zk*, it is a type of eigenvalue problem.

#### Example

Next we apply the sparse partial least squares (SPLS) implementation of Chung et al. (2012) to the WELL-China body composition data. We use the body composition variables as the response *Y* and the microbiome community composition as *X*. In this direction, a well-fitting model would allow the microbiome community measurements *X* to serve as a proxy for the variables in *Y*, in case those data were not easily accessible. To an extent, however, this choice of directionality is arbitrary (regressing abundances on body composition variables would also be sensible) and reflects the basic limitations of using an asymmetric method to study a symmetric problem.

We subset to female subjects and filter species, keeping only those species with a count of at least 5 in at least 7% of samples. This leaves 372 species over 119 participants. All species abundances are variance-stabilized using the approach of Anders and Huber (2010). We cross-validate with five folds, searching through a grid over *K* ∈ {4,…,8} and *λ*1∈ {0, 0.05,…, 0.7}. This grid is used to prevent the model from regularizing to the point that there is no information to visualize. For example, if we set *K* = 1, every row of **Figure 6** would look identical. The predictive accuracy is poor, which is unsurprising considering the spike at 0 in the abundances histogram—the held out error is ≈ 1.29, after having scaled and centered the body composition variables.
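The prevalence filter described above is simple to express in code. The sketch below uses a small synthetic count matrix rather than the WELL-China data, and the function name `k_over_a_filter` is hypothetical. It keeps the species (columns) whose count is at least 5 in at least 7% of samples (rows).

```python
import numpy as np

def k_over_a_filter(counts, min_count=5, min_prop=0.07):
    """Keep columns with counts >= min_count in at least min_prop of rows."""
    prevalence = (counts >= min_count).mean(axis=0)
    keep = prevalence >= min_prop
    return counts[:, keep], keep

# Synthetic table: 100 samples by 3 species.
counts = np.zeros((100, 3), dtype=int)
counts[:8, 1] = 5      # reaches the count threshold in 8% of samples
counts[:6, 2] = 10     # reaches the count threshold in only 6% of samples
filtered, keep = k_over_a_filter(counts)
print(keep, filtered.shape)
```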

**Figure 6** displays fitted coefficients relating body composition variables with species abundances. By fitted coefficients, we mean we display *B̂* = *ZQ<sup>T</sup>*, where *Z* are the SPLS directions and a multiresponse linear regression model is used. Specifically, *Y* = *XB* + *E* = *XZQ<sup>T</sup>* + *E*, where *X* is a matrix with rows *xi*, *Y* is a matrix with columns *yj*, and *Z* is a matrix with columns *zk*.
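To make the construction of the coefficient matrix concrete, the sketch below forms *B̂* = *ZQ<sup>T</sup>* on synthetic data. It does not run SPLS: as a labeled stand-in for the SPLS directions, `Z` holds the top right singular vectors of *Y<sup>T</sup>X*, and `Qt` is obtained by least squares of *Y* on the scores *XZ*. All variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p2, p1, K = 60, 10, 4, 2
X = rng.normal(size=(n, p2))
Y = X[:, :3] @ rng.normal(size=(3, p1)) + 0.5 * rng.normal(size=(n, p1))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

# Stand-in directions (NOT sparse PLS): top-K right singular vectors of Y^T X.
_, _, Vt = np.linalg.svd(Y.T @ X, full_matrices=False)
Z = Vt[:K].T                                   # p2 x K, columns z_k

T = X @ Z                                      # n x K matrix of scores
Qt, *_ = np.linalg.lstsq(T, Y, rcond=None)     # K x p1, plays the role of Q^T
B_hat = Z @ Qt                                 # p2 x p1 coefficient matrix
print(B_hat.shape, np.linalg.matrix_rank(B_hat))
```

Because *B̂* factors through the *K* directions, its rank is at most *K*, which is why each row of a coefficient display would look the same (up to scale) when *K* = 1.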

Positive associations tend to occur across all responses simultaneously, while negative associations can be unique to either lean or fat mass. Most taxonomic families seem to have slightly more negative than positive associations, with the possible exception of Porphyromonadaceae.

To interpret these coefficients in the raw data, we can visualize individual species with strong associations to body composition. Specifically, we study associations with the android and gynoid fat mass variables. In the left panel of **Figure 7**, we display the abundances *X* for species against android fat mass. The species are chosen according to whether the two-dimensional coefficient across android and gynoid fat mass has large norm<sup>4</sup>. The main associations that are visible are those between the body composition and species presence or absence. That is, there don't seem to be any cases where a body composition feature varies smoothly as a species becomes more or less abundant. Instead, SPLS has identified species whose samples have lower or higher android or gynoid fat mass, depending on whether that species is present or absent.

#### CCpnA

CCpnA is a method, originally developed in ecology, useful for joint analysis of count and continuous data. The canonical application has a site-by-species count matrix *Y* ∈ ℝ<sup>*n*×*p*1</sup> and an environmental features matrix *X* ∈ ℝ<sup>*n*×*p*2</sup>, for example, historical rainfall and temperature measurements. In the WELL context, *Y* would be the samples by community abundance matrix, while *X* would contain the body composition measurements.

The scientific goal might be to identify species that are more abundant in sites with more rainfall or higher temperature. If these environmental variables were uncorrelated, it would be enough to fit a separate regression to each. This, however, is rarely the case, motivating the development of CCpnA.

Translating to the language of the WELL study, individual samples can be thought of as sites, and the supplemental data (that is, the body composition variables) are analogous to environmental variables.

CCpnA produces low-dimensional representations of both the rows and columns of *Y* (the samples and species), along with latent subspaces on which these representations are defined. Algorithmically, CCpnA first constructs the following matrices, where 1*r* denotes a column vector of *r* ones,

1. An overall frequency matrix,

$$F = \frac{1}{n^{Y}\_{\cdot\cdot}} Y,$$

where *n<sup>Y</sup>*.. is the sum of all counts in matrix *Y*.

2. A diagonal matrix of row (site) proportions,

$$D\_r = \text{diag}(F \mathbf{1}\_{p\_1}) \in \mathbb{R}^{n \times n}.$$

3. A diagonal matrix of column (species) proportions,

$$D\_c = \text{diag}(F^{T} \mathbf{1}\_{n}) \in \mathbb{R}^{p\_1 \times p\_1}.$$

4. A projection onto the columns of the supplemental matrix *X*, reweighting samples according to their species counts,

$$P\_X = D\_r^{\frac{1}{2}} X \left(X^T D\_r X\right)^{-1} X^T D\_r^{\frac{1}{2}} \in \mathbb{R}^{n \times n}.$$

<sup>4</sup> Specifically, ‖(*β*android, *β*gynoid)‖<sub>2</sub> > 0.065.

FIGURE 7 | A more focused view of the species with high loadings according to SPLS (left) and sparse CCA (right). Each panel corresponds to a species. Points are shaded according to each species' taxonomic family. The *x*-axis within panels corresponds to variance-stabilized species abundance, while the *y*-axis gives android fat mass. A linear smooth is provided to summarize the direction of associations. Panels are arranged according to the size of that species' absolute SPLS coefficient value or loading onto the first sparse CCA axis. The presence of certain species seems to correspond to increased or decreased levels of android fat mass.

With this notation, compute an SVD,

$$P\_X D\_r^{-\frac{1}{2}} \left(F - F\mathbf{1}\_{p\_1}\mathbf{1}\_{n}^T F\right) D\_c^{-\frac{1}{2}} = U S V^T,$$

and define row and column scores *Z* and *Q* by

$$Z = D\_r^{-\frac{1}{2}} U S$$

$$Q = D\_c^{-\frac{1}{2}} V S.$$
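The matrix construction can be sketched as follows on synthetic data. Two assumptions are made so that every product is conformable and *PX* is a genuine projection, and they should be checked against the original references: the projection is taken to be *PX* = *Dr*<sup>1/2</sup>*X*(*X<sup>T</sup>DrX*)<sup>−1</sup>*X<sup>T</sup>Dr*<sup>1/2</sup>, and it is applied on the left of the standardized residual matrix. The function name `ccpna` is hypothetical, and *X* is used as given, without any centering.

```python
import numpy as np

def ccpna(Y, X, K=2):
    """Sketch of the SVD-based construction of row and column scores."""
    F = Y / Y.sum()                       # overall frequency matrix
    r = F.sum(axis=1)                     # row (site) proportions, F 1
    c = F.sum(axis=0)                     # column (species) proportions, F^T 1
    A = np.sqrt(r)[:, None] * X           # D_r^{1/2} X
    P_X = A @ np.linalg.solve(A.T @ A, A.T)   # projection onto span(D_r^{1/2} X)
    # D_r^{-1/2} (F - r c^T) D_c^{-1/2}, computed entrywise.
    R = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, Vt = np.linalg.svd(P_X @ R, full_matrices=False)
    Z = (U[:, :K] * s[:K]) / np.sqrt(r)[:, None]      # row scores
    Q = (Vt[:K].T * s[:K]) / np.sqrt(c)[:, None]      # column scores
    return Z, Q, s[:K], P_X

rng = np.random.default_rng(0)
n, p1, p2 = 40, 15, 3
Y = rng.poisson(5.0, size=(n, p1)) + 1    # counts with no empty rows or columns
X = rng.normal(size=(n, p2))
Z, Q, s, P_X = ccpna(Y, X, K=2)
print(Z.shape, Q.shape, s)
```

Since the projected matrix has rank at most *p*2, at most *p*2 pairs of scores carry information in this construction.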

There are several ways to interpret this procedure. CCpnA was originally proposed as the solution to a fixed-point iteration called reciprocal averaging (Ter Braak, 1986). Later, Greenacre (1984) and Greenacre and Hastie (1987) provided a geometric view and Zhu et al. (2005) gave an exact probabilistic interpretation.

The intuition for the reciprocal averaging procedure is simple: the scores for different samples should be a weighted average of the species scores, with larger weights for the species that are more common at those sites. Similarly, species scores can be defined according to a weighted average of sample scores. That is,

$$\begin{aligned} z\_i &\approx \frac{1}{f\_{i\cdot}} \sum\_{j=1}^{p\_1} f\_{ij} q\_{j} \\ q\_j &\approx \frac{1}{f\_{\cdot j}} \sum\_{i=1}^{n} f\_{ij} z\_{i}, \end{aligned}$$

or, in matrix form,

$$Z \approx \text{diag} \left( F \mathbf{1}\_{p\_1} \right)^{-1} F Q$$

$$Q \approx \text{diag} \left( F^T \mathbf{1}\_n \right)^{-1} F^T Z.$$

This formulation suggests an algorithm for finding *Z* and *Q*—arbitrarily initialize one and iterate these calculations until convergence.
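A one-dimensional version of this iteration, without the supplemental table, can be sketched as below. The function name and the synthetic gradient data are illustrative assumptions. One practical detail is added that the description above leaves implicit: constant scores are a trivial fixed point of the averaging, so each pass removes the weighted mean of the site scores and rescales them.

```python
import numpy as np

def reciprocal_averaging(Y, n_iter=1000, seed=0):
    """One-dimensional reciprocal averaging on a count matrix Y."""
    F = Y / Y.sum()
    r, c = F.sum(axis=1), F.sum(axis=0)
    z = np.random.default_rng(seed).normal(size=F.shape[0])
    for _ in range(n_iter):
        q = (F.T @ z) / c              # species scores: averages of site scores
        z = (F @ q) / r                # site scores: averages of species scores
        z = z - r @ z                  # remove the trivial constant solution
        z = z / np.sqrt(r @ z ** 2)    # rescale to unit weighted variance
    return z, (F.T @ z) / c

# Synthetic sites and species arranged along a single gradient.
rng = np.random.default_rng(0)
g = np.linspace(-2, 2, 30)             # site positions
o = np.linspace(-2, 2, 12)             # species optima
mu = 20 * np.exp(-0.5 * (g[:, None] - o[None, :]) ** 2)
Y = rng.poisson(mu) + 1
z, q = reciprocal_averaging(Y)

# Compare with the SVD of the standardized residual matrix.
F = Y / Y.sum()
r, c = F.sum(axis=1), F.sum(axis=0)
R = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(R, full_matrices=False)
z_svd = U[:, 0] / np.sqrt(r)
print(abs(np.corrcoef(z, z_svd)[0, 1]))
```

The final line compares the iterated site scores with those from the leading singular vector, which should agree up to sign and scale once the iteration has converged.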

As is, this is not yet the setup that yields CCpnA—it does not use information in the supplemental table *X*. To recover CCpnA, a projection step needs to be inserted before the calculation of row scores,


$$\text{a. Solve } Q' \approx \text{diag}(F^T \mathbf{1}\_n)^{-1} F^T Z.$$

$$\text{b. Project } Q = P\_{X} Q'.$$

$$\text{c. Solve } Z \approx \text{diag}(F \mathbf{1}\_{p\_1})^{-1} F Q.$$

The fixed point of this iteration is the previously described CCpnA solution.

A second interpretation is due to Zhu et al. (2005). Suppose first that we are only interested in a one-dimensional score for rows and columns. Let *α* be a latent gradient, for example, between warm-dry and cold-wet sites, or low and high android fat mass samples. For each of the *p*1 species, define a normal density over the supplemental variables, *fj*(*xi*) = *N*(*xi* | *μj*, Σ*j*). The mode of this density represents the preferred environment for species *j*. Next, project these densities onto the gradient, giving a univariate density *fj<sup>α</sup>*(*zi*) = *N*(*zi* | *α<sup>T</sup>μj*, *α<sup>T</sup>*Σ*jα*) for each species. The *zi* represent the scores for sample *i* along the gradient *α*.

The generative model views species–sample pairs one at a time. For each pair involving sample *i* and species *j*, draw a score according to *fj<sup>α</sup>*(*zi*). Hence, each site *i* draws species according to a *p*1-class linear discriminant (LDA) model.

To use this idea to compute scores, we need to estimate the gradient *α*, which is also of interest in its own right. This is done by supposing equal covariances across species, Σ*j* = Σ for all *j*, and finding the *α̂* maximizing the between vs. total variance across species,

$$\frac{\alpha^{T}\Sigma\_{B}\alpha}{\alpha^{T}\Sigma\alpha},$$

where

$$\Sigma\_{B} = \sum\_{j=1}^{p\_1} f\_{\cdot j} (\mu\_j - \overline{\mu}) (\mu\_j - \overline{\mu})^T$$

is a between-species covariance matrix. Estimating *α̂* in this way and writing *zi* = *xi<sup>T</sup>α̂* gives the original site scores from CCpnA.

We have omitted a detailed numerical example of this method in this review, but note that code for applying this method is available in the GitHub repository associated with this review.

#### Penalized Matrix Decomposition

In high-dimensional settings, sparsity is a desirable property, for both qualitative interpretability and statistical stability. A regression model using only a few features is easier to understand than one involving a linear combination of all possible features. Further, regularized models typically outperform their unregularized counterparts in terms of both predictive accuracy and inferential power (Bühlmann and Van De Geer, 2011). In fact, an unregularized linear regression has no unique least-squares solution when the number of features is greater than the number of samples.

The Penalized Matrix Decomposition (PMD) is a general approach to adapting the regularization machinery developed around regression to the multivariate analysis setting (Witten et al., 2009). The CCA and MultiCCA instances of PMD have been particularly well-studied (Witten et al., 2009; Witten et al., 2013).

The general setup is as follows. Suppose we want a one-dimensional representation of the samples (rows) in *X* ∈ ℝ<sup>*n*×*p*</sup>. Recall that the first *k* eigenvectors recovered by PCA span a subspace that minimizes the *ℓ*<sup>2</sup>-distance from the original data to their projections onto that subspace. In particular, when *k* = 1, the associated PCA coordinates *u* ∈ ℝ*<sup>n</sup>* and eigenvector *v* are the optimal values in the problem

$$\begin{aligned} \underset{u \in \mathbb{R}^{n}, v \in \mathbb{R}^{p}, d \in \mathbb{R}}{\text{minimize}} \quad & \| X - d u v^{T} \|\_{F}^{2} \\ \text{subject to} \quad & \|u\|\_{2}^{2} = \|v\|\_{2}^{2} = 1. \end{aligned}$$

The PMD generalizes this formulation of rank-one PCA to enforce additional structure on *u* and *v*. The PMD solutions *u* and *v* are defined as the optimizers of

$$\begin{aligned} \underset{u \in \mathbb{R}^{n}, v \in \mathbb{R}^{p}, d \in \mathbb{R}}{\text{minimize}} \quad & \| X - d u v^{T} \|\_{F}^{2} \\ \text{subject to} \quad & \|u\|\_{2}^{2} = \|v\|\_{2}^{2} = 1 \\ & \text{Pen}\_{u}(u) \leq \mu\_{1} \\ & \text{Pen}\_{v}(v) \leq \mu\_{2}, \end{aligned} \tag{8}$$

where Pen*u* and Pen*v* are arbitrary constraints on *u* and *v*.

To choose the regularization parameters *μ*1 and *μ*2, Witten et al. (2009) applied cross-validation to the reconstruction errors after holding out random entries in *X*. To obtain a sequence of scores (*uk*) and (*vk*), *k* = 1,…, *K*, for *K* > 1, define *uk* and *vk* as the optimizers of the problem (equation 8) on the residual *X<sup>k</sup>* := *X*<sup>*k*−1</sup> − *d*<sub>*k*−1</sub>*u*<sub>*k*−1</sub>*v*<sub>*k*−1</sub><sup>*T*</sup>, where *dk* = *uk<sup>T</sup>X<sup>k</sup>vk* and *X*<sup>1</sup> = *X*.
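The deflation scheme can be illustrated with the unpenalized rank-one problem, where each step reduces to a leading singular triplet. The sketch below uses hypothetical function names and takes the rank-one solver as an argument, so that a penalized solver could be substituted. In the unpenalized case, the recovered *dk* coincide with the leading singular values of *X*.

```python
import numpy as np

def rank_one_svd(X):
    """Unpenalized rank-one solution: the leading singular vectors of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, 0], Vt[0]

def deflation_sequence(X, K, solver=rank_one_svd):
    """Repeatedly solve the rank-one problem on the current residual."""
    Xk = X.copy()
    us, vs, ds = [], [], []
    for _ in range(K):
        u, v = solver(Xk)
        d = u @ Xk @ v                      # d_k = u_k^T X^k v_k
        us.append(u)
        vs.append(v)
        ds.append(d)
        Xk = Xk - d * np.outer(u, v)        # residual passed to the next step
    return np.column_stack(us), np.column_stack(vs), np.array(ds)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
U_k, V_k, d_k = deflation_sequence(X, K=3)
print(d_k)
print(np.linalg.svd(X, compute_uv=False)[:3])
```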

This view can be specialized to develop regularized versions of a number of multivariate analysis problems. We consider applications to the CCA and MultiCCA problems. Recalling that ‖*A*‖<sup>2</sup><sub>*F*</sub> = tr(*A<sup>T</sup>A*) along with the linearity and the cyclic properties of the trace, the objective in equation (8) can be rewritten, using ≡ to mean equality up to terms constant in *u* and *v*,

$$\begin{aligned} \| X - d u v^{T} \|\_{F}^2 &= \text{tr}\left(\left(X - d u v^{T}\right)^{T}\left(X - d u v^{T}\right)\right) \\ &\equiv -2d\operatorname{tr}\left(X^{T} u v^{T}\right) + d^2\operatorname{tr}\left(v u^{T} u v^{T}\right) \\ &\equiv -2d v^{T} X^{T} u + d^2, \end{aligned}$$

where for the last equivalence we used that *v<sup>T</sup>v* = *u<sup>T</sup>u* = 1. From this expression, and by partially minimizing out *d* = *v<sup>T</sup>X<sup>T</sup>u*, we see that the PMD solutions *u* and *v* in equation (8) can be found as the optimizers of

$$\begin{aligned} \underset{u \in \mathbb{R}^{n}, v \in \mathbb{R}^{p}}{\text{maximize}} \quad & u^{T} X v \\ \text{subject to} \quad & \|u\|\_{2}^{2} = \|v\|\_{2}^{2} = 1 \\ & \text{Pen}\_{u}(u) \leq \mu\_{1} \\ & \text{Pen}\_{v}(v) \leq \mu\_{2}. \end{aligned} \tag{9}$$

Notice that, as long as the penalties are convex in *u* and *v*, the optimization is biconvex, so a local maximum can be found by alternately maximizing over *u* and *v*.
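With *ℓ*<sup>1</sup> penalties, each alternating update has a simple form: soft-threshold the current matrix-vector product and rescale it to unit length. The following is a minimal sketch in the spirit of the algorithm of Witten et al. (2009), not a reproduction of their software. The function names, the bisection search for the threshold, and the synthetic data are assumptions, and the bounds `c_u` and `c_v` are taken to lie between 1 and the square root of the corresponding dimension.

```python
import numpy as np

def soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def l1_unit_vector(a, c, n_bisect=60):
    """Unit l2-norm soft-thresholded version of a with l1 norm at most c."""
    u = a / np.linalg.norm(a)
    if np.abs(u).sum() <= c:
        return u                       # constraint inactive, no thresholding
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(n_bisect):          # bisection on the threshold level
        mid = (lo + hi) / 2
        s = soft_threshold(a, mid)
        if np.abs(s).sum() <= c * np.linalg.norm(s):
            hi = mid
        else:
            lo = mid
    s = soft_threshold(a, hi)
    return s / np.linalg.norm(s)

def pmd_rank_one(X, c_u, c_v, n_iter=100):
    """Alternate the two constrained maximizations of u^T X v."""
    v = np.linalg.svd(X, full_matrices=False)[2][0]   # initialize at the SVD
    for _ in range(n_iter):
        u = l1_unit_vector(X @ v, c_u)
        v = l1_unit_vector(X.T @ u, c_v)
    return u, v, u @ X @ v

# Synthetic rank-one signal whose right factor has 5 nonzero entries.
rng = np.random.default_rng(0)
n, p = 50, 40
u0 = rng.normal(size=n)
u0 /= np.linalg.norm(u0)
v0 = np.zeros(p)
v0[:5] = 1 / np.sqrt(5)
X = 10 * np.outer(u0, v0) + 0.1 * rng.normal(size=(n, p))

u, v, d = pmd_rank_one(X, c_u=np.sqrt(n), c_v=2.0)
print(np.count_nonzero(v), np.abs(v).sum())
```

Setting both bounds to the square roots of the dimensions makes the *ℓ*<sup>1</sup> constraints inactive, in which case the sketch returns the leading singular triplet of *X*.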

From this form, we can derive a sparsity-inducing version of CCA. Recall the maximal-covariance interpretation of CCA,

$$\begin{aligned} \underset{u \in \mathbb{R}^{p\_1}, v \in \mathbb{R}^{p\_2}}{\text{maximize}} \quad & u^T \hat{\Sigma}\_{XY} v \\ \text{subject to} \quad & u^T \hat{\Sigma}\_{XX} u = v^T \hat{\Sigma}\_{YY} v = 1. \end{aligned}$$

Witten et al. (2009) argue for diagonalized CCA, in which the variance constraints are replaced by unit norm constraints, and sparsity-inducing *ℓ*<sup>1</sup> constraints are added,

$$\begin{aligned} \underset{\boldsymbol{\mu}\in\mathbb{R}^{P},\boldsymbol{\mu}\in\mathbb{R}^{P^{2}}}{\text{maximize}} \boldsymbol{\mu}^{T}\boldsymbol{\hat{\sum}}\_{XY}\ \boldsymbol{\nu} \\ \text{subject to} \quad ||\boldsymbol{\mu}\||\_{2}^{2} = ||\boldsymbol{\nu}\||\_{2}^{2} = 1 \\ \quad ||\boldsymbol{\mu}\||\_{1} \leq \boldsymbol{\mu}\_{1} \\ \quad ||\boldsymbol{\mu}\||\_{1} \leq \boldsymbol{\mu}\_{2} \end{aligned}$$

which is exactly of the form of equation (9) with $X = \hat{\Sigma}_{XY}$.
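In practice, the matrix handed to the PMD problem is the sample cross-covariance between the two tables. The sketch below computes it in plain Python, assuming the 1/n scaling used elsewhere in this review; the function names and the toy tables are made up for illustration.

```python
def center_columns(M):
    """Subtract each column's mean from that column."""
    n = len(M)
    means = [sum(row[j] for row in M) / n for j in range(len(M[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in M]

def cross_covariance(X, Y):
    """Sample cross-covariance (1/n) * Xc^T Yc, of shape p1 x p2."""
    Xc, Yc = center_columns(X), center_columns(Y)
    n = len(X)
    return [
        [sum(Xc[i][a] * Yc[i][b] for i in range(n)) / n
         for b in range(len(Y[0]))]
        for a in range(len(X[0]))
    ]

# Toy tables measured on the same four samples (values are made up).
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
Y = [[2.0, 1.0, 0.0], [4.0, 1.0, 1.0], [6.0, 1.0, 0.0], [8.0, 1.0, 1.0]]
sigma_xy = cross_covariance(X, Y)
```

The resulting p1 × p2 matrix plays the role of *X* in equation (9), so the loadings *u* and *v* index the columns of the two original tables rather than samples.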

Multiple CCA can also be described in this framework, by replacing the objective with the sum over all pairwise covariances, $\sum_{l, l' = 1}^{L} c_1^{(l)\top} X^{(l)\top} X^{(l')} c_1^{(l')}$, and introducing constraints for each of the $c_1^{(l)}$.

#### Example

We apply the PMD formulation of sparse CCA to the WELL-China data. As before, we *k*-over-*A* filter the microbiome data, requiring species to have counts of at least 5 in at least 7% of samples. Further, we first variance-stabilize, center, and scale these species abundances. For the regularization parameters, we set *μ*1 = 0.7 for the body composition data and *μ*2 = 0.3 for the species count data. The reasoning behind the relative values of these two tuning parameters is that sparsity in species loadings is more important than sparsity across body composition variables, because the microbiome data are more high-dimensional. The choice of the tuning parameters' overall magnitude was guided by the overall number of factors that we wanted to retain.
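The *k*-over-*A* filter described above can be sketched as follows. The thresholds mirror the ones quoted in the text (counts of at least 5 in at least 7% of samples); the function name and the toy count table are assumptions for illustration, not the code used in the case study.

```python
def k_over_a_filter(counts, min_count=5, min_fraction=0.07):
    """Return the column indices of species that pass the filter.

    counts: list of samples, each a list of per-species counts.
    A species is kept if its count is >= min_count in at least
    min_fraction of the samples.
    """
    n_samples = len(counts)
    n_species = len(counts[0])
    keep = []
    for j in range(n_species):
        n_pass = sum(1 for i in range(n_samples) if counts[i][j] >= min_count)
        if n_pass >= min_fraction * n_samples:
            keep.append(j)
    return keep

# Toy table: 20 samples, 3 species (values are made up).
counts = [[0, 0, 4] for _ in range(20)]
counts[0][0] = 10   # species 0 reaches the count threshold in 2 samples (10%)
counts[1][0] = 10
counts[0][1] = 10   # species 1 reaches it in only 1 sample (5%)
kept = k_over_a_filter(counts)
```

Species 2 is present in every sample but never reaches a count of 5, so it is dropped along with species 1.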

We only compute the first three PMD directions, and the associated correlations between scores are (*d*1, *d*2, *d*3) *=* (0.700, 0.435, 0.632). Note that the correlation can increase in subsequent directions, since directions are computed iteratively and cannot be defined and sorted all at once.

The learned loadings and scores are displayed in **Figure 8**. The *x*-axis in the loadings differentiates between high android and gynoid fat mass. The *y*-axes in the loadings reflect a gradient between overall right and left body mass. The size of points corresponds to the third PMD direction, and it seems to highlight high BMI, ratio of fat to lean mass, and overall weight. We interpret species based on their positions relative to these body composition variables, as in an ordinary biplot. For example, genus 492, located in the center-top, seems to be more common among people with higher android and lower gynoid fat mass.

The associated scores are displayed in the right panel, shaded according to android fat mass. The gradient between android and gynoid fat mass suggested by the loadings is clearly visible from this display. The length of links reflects the correlation between sets of scores. They are somewhat longer in the sparse CCA compared to the ordinary CCA on a subset of species, but this is likely a consequence of regularization and overfitting on the part of ordinary CCA.

We can follow up these displays by focusing on species that seemed related to the CCA axes. In the right panel of **Figure 7**, we isolate species with loadings at a distance of at least 0.15 from the origin. These are the same ones that are labeled by text in **Figure 8**. We can see associations between abundance and android fat mass, as suggested by the loadings. Generally, there is a difference in android fat mass between people with and without particular species; there is no smooth relationship between the quantity of a species and android fat mass, even in these cases where an association exists. Further, no individual taxonomic group seems to dominate the set of associated species.

#### Multitable Mixed-Membership

In section CCA, a latent variable interpretation of CCA was provided as an alternative to the standard covariance maximization perspective. Since likelihood-based methods are easily adapted to different data types, it is natural to consider versions of CCA designed for non-Gaussian data, using section CCA as a starting point. We are particularly interested in data with the same structure as the WELL-China body composition and microbiome data, namely, two-table data where one table is continuous with Gaussian marginals and correlated columns and the other is a high-dimensional collection of counts, where many entries are exactly zero.

FIGURE 8 | … of points and text reflects the contribution of the third CCA dimension. Many loadings have at least one dimension that is exactly zero, due to *ℓ*1-regularization. For the sample scores (right), each point is a sample, positioned at their coordinates with respect to the first two learned sparse CCA directions. Points are shaded according to android fat mass, and their sizes are set according to the third sparse CCA direction's contribution. Evidently, the first two directions reflect a gradient across android fat mass, suggesting that this is a substantial contributor to covariation across microbiome and body composition tables.

As before, define a set of shared scores $\xi_i^{s} \in \mathbb{R}^{K}$, and two sets of within-table scores $\xi_i^{x} \in \mathbb{R}^{L_1}$ and $\xi_i^{y} \in \mathbb{R}^{L_2}$. As before, we model the body composition variables using essentially a Gaussian factor analysis model, $y_i \mid \xi_i^{s}, \xi_i^{y} \sim \mathcal{N}\left(B^{Y}\xi_i^{s} + W^{Y}\xi_i^{y}, \sigma_y^{2} I_{p_2}\right)$, with spherical Gaussian priors on the scores. For the counts matrix, we might consider a few different approaches:


Here, we focus on the LDA approach, though we suspect that the other two approaches are potentially interesting as well. Formally, this model supposes that counts are drawn according to

$$\begin{aligned} x_i \mid (\theta_{ik})_{k=1}^{K} &\sim \text{Mult}\left(x_i \,\Big|\, N_i, \sum_{k=1}^{K} \theta_{ik}\beta_k\right) \\ \theta_i &\sim \text{Dir}(\alpha) \\ \beta_k &\sim \text{Dir}(\gamma), \end{aligned}$$

where $N_i = \sum_{j=1}^{p_1} x_{ij}$ is the total count in sample *i*. This has the flavor of a factor analysis where $(\theta_{ik})_{k=1}^{K}$ are scores for the *i*th sample and $(\beta_k)$ are *K* underlying topics.
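To make the generative reading concrete, the following sketch simulates counts from this model using only the Python standard library. Dirichlet draws are obtained by normalizing independent gamma variates. The function names, the symmetric concentration values, and the fixed sequencing depth are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [g / total for g in draws]

def simulate_lda_counts(n_samples, n_species, n_topics, depth,
                        alpha=0.5, gamma=0.5, seed=0):
    """Simulate x_i ~ Mult(N_i, sum_k theta_ik * beta_k) with N_i = depth."""
    rng = random.Random(seed)
    # beta_k: one probability vector over species per topic.
    topics = [sample_dirichlet([gamma] * n_species, rng) for _ in range(n_topics)]
    counts, memberships = [], []
    for _ in range(n_samples):
        theta = sample_dirichlet([alpha] * n_topics, rng)   # mixed memberships
        probs = [sum(theta[k] * topics[k][j] for k in range(n_topics))
                 for j in range(n_species)]
        x = [0] * n_species
        for j in rng.choices(range(n_species), weights=probs, k=depth):
            x[j] += 1
        counts.append(x)
        memberships.append(theta)
    return counts, memberships, topics

counts, memberships, topics = simulate_lda_counts(
    n_samples=5, n_species=10, n_topics=2, depth=50)
```

Each row of `memberships` plays the role of a score vector, and each entry of `topics` plays the role of a loading, which is the factor-analysis analogy drawn in the text.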

The only complexity with using an LDA model of *X* together with a Gaussian factor analysis on *Y* is that the shared scores $\xi_i^{s}$ typically have different priors: a Dirichlet for LDA and a spherical Gaussian for factor analysis. In any formulation of probabilistic CCA that uses both models, this must be reconciled. One approach is to continue to place Dirichlet priors on all the scores, $\xi_i^{s}$, $\xi_i^{x}$, and $\xi_i^{y}$. While the model for the Gaussian data is no longer exactly traditional factor analysis, it has a similar interpretation. Alternatively, we could use a spherical Gaussian prior on all scores and then recover probability vectors by applying the softmax function, $[\mathcal{S}(v)]_k = \frac{\exp(v_k)}{\sum_{k'} \exp(v_{k'})}$, so that

$$x_i \mid \xi_i^{s}, \xi_i^{x} \sim \text{Mult}\left(x_i \mid N_i, \mathcal{S}\left(B^{X}\xi_i^{s} + W^{X}\xi_i^{x}\right)\right),$$

$$\xi_i^{s} \sim \mathcal{N}\left(\xi_i^{s} \mid 0, \tau^{2} I_{K}\right).$$

It is this second model that we use in our experiments below.
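As a small illustration of the softmax link in this second model, the sketch below maps Gaussian scores to species probabilities and then to expected counts. The loadings, scores, and depth are made-up numbers, and the function names are assumptions; fitting the model is a separate problem not addressed here.

```python
import math

def softmax(v):
    """Map a real vector to a probability vector.

    Subtracting max(v) leaves the result unchanged but avoids overflow.
    """
    shift = max(v)
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def expected_counts(B, W, xi_shared, xi_within, depth):
    """Mean of Mult(N_i, S(B xi_s + W xi_x)) for a single sample."""
    scores = [
        sum(b * s for b, s in zip(B[j], xi_shared))
        + sum(w * x for w, x in zip(W[j], xi_within))
        for j in range(len(B))
    ]
    return [depth * q for q in softmax(scores)]

# Three species, K = 2 shared factors, L1 = 1 within-table factor (made up).
B = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W = [[0.5], [0.0], [-0.5]]
mean_counts = expected_counts(B, W, xi_shared=[0.3, -0.2],
                              xi_within=[1.0], depth=1000)
```

Because the softmax is invariant to adding a constant to every score, the loadings are only identified up to such shifts, which is one more reason to interpret them comparatively rather than in absolute terms.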

#### Example

We illustrate this multitable mixed-membership approach on the WELL-China data. We choose *K* = 2 for the number of shared topics and *L*<sub>1</sub> = *L*<sub>2</sub> = 3 for the number of unshared topics per table. We initialize scores and loadings using results from the PMD formulation of sparse CCA. While the use of shared $\xi_i^{s}$ and unshared $(\xi_i^{x}, \xi_i^{y})$ scores gives more flexibility in modeling, it also leads to additional complexity in interpretation: there are both more scores and more loadings that need to be visualized.

Consider the loadings $W^{X}$ and $W^{Y}$, provided in the left panel of **Figure 9** and bottom three rows of **Figure 10**. Note that there is no notion of variance explained by different axes in this case.

The loadings $W^{X}$ of **Figure 9** summarize table-specific variation in bacterial abundances. Invariance under rotation and reflection complicates interpretation of these estimates. If we flip the sign of all the loadings axes, then the more abundant species have larger loadings, so the direction of different trends is irrelevant. The main distinction between the first and second loadings is the rate of decay in frequencies, especially among Lachnospiraceae and Ruminococcaceae. For example, topic 1 seems to include species from these taxonomic families that are not very abundant. The main characteristic of the third loading is that it has higher values for Porphyromonadaceae, so samples with high weight on this loading have decreased levels of these taxa.

Next, consider within-table body composition loadings, given in the bottom three rows of **Figure 10**, which suggest that the first and third axes of $W^{Y}$ capture variation between overall and android vs. gynoid fat mass. The first axis has high loadings for weight, BMI, and total fat mass, and the third contrasts areas with high android and high gynoid fat mass. The second axis distinguishes between right and left total lean and fat mass variation, while the third axis captures difference between mass in the trunk versus arms and legs.

These summaries could have been obtained by analyzing each table separately. Covariation between the two tables is captured by the shared scores $\xi_i^{s}$ and loadings $B^{X}$, $B^{Y}$. The shared body composition loadings are given in the top two rows of **Figure 10**. These loadings again differentiate between android and gynoid fat mass, learning contrasts between body mass in arms and legs, for example, though the effects are less pronounced than in the table-specific loadings.

The shared bacterial abundance loadings are given in the right panel of **Figure 9**. The most notable observation is that the first axis places more weight on rarer species, while the second places proportionally more weight on abundant species. Further, the two axes seem to have very different behaviors with respect to Prevotellaceae and Veillonellaceae.

In general, we find the results from the LDA–CCA approach less satisfying than those of the sparse CCA of section Penalized Matrix Decomposition. It seems that inference of a probabilistic model with shared and unshared parameters is more difficult than optimization of a single set of shared parameters. It may be possible to improve this approach through the following strategies:


#### Curds & Whey

The Curds & Whey (C&W) procedure is a "soft" version of reduced-rank regression, differentially shrinking the ordinary least squares (OLS) fits with respect to the response canonical correlation directions (Breiman and Friedman, 1997). This is in contrast to reduced-rank regression, whose projection onto the first *K* response canonical correlation directions is a hard-thresholding analog. Hence, C&W is to reduced-rank regression what ridge regression is to principal component regression.

More precisely, the C&W algorithm fits a table *Y* according to

$$
\hat{Y} = P_X Y V \Lambda V^{-1},
\tag{10}
$$

where again $V \in \mathbb{R}^{p_1 \times p_1}$ are the CCA directions associated with the response *Y* and $P_X$ is the projection operator onto the column space of *X*. Λ is defined to be a diagonal matrix that determines the degree of shrinkage for the different canonical directions.

The main difficulty in C&W is the choice of Λ, and Breiman and Friedman (1997) suggest several possibilities. One choice is derived from a generalized cross-validation point of view, and results in shrinkage towards the response canonical correlation directions, without assuming the form of equation (10) *a priori*. This derivation is provided in section *Derivation of Curds & Whey Shrinkage*.

#### Graph-Fused Lasso

An approach to multiresponse regression, introduced by Chen et al. (2010), incorporates prior knowledge about the relationship between responses. Specifically, they use the correlation network between responses to induce structured regularization on the regression parameters.

Let $Y \in \mathbb{R}^{n \times p_1}$ and $X \in \mathbb{R}^{n \times p_2}$, and assume a correlation network between the $p_1$ tasks. This is denoted by $G = (V, E)$, where $V = \{1, \ldots, p_1\}$. Each edge *e* is associated with a weight, $r_e$, giving the correlation between the pair of responses.

The graph-fused lasso estimates a coefficient matrix $B \in \mathbb{R}^{p_2 \times p_1}$ whose columns $\beta^{(r)}$ are the regression coefficients across tasks, but which have been pooled together, with the strength of the pooling depending on the separately computed strength of the relationship between tasks. Formally, $\hat{B}$ is defined as the solution to the optimization,

$$\underset{B \in \mathbb{R}^{p_2 \times p_1}}{\text{minimize}} \; \frac{1}{2}\|Y - XB\|_F^2 + \lambda \|B\|_1 + \gamma \sum_{e \in E} \sum_{j=1}^{p_2} |r_e| \left|\beta_j^{(e+)} - \text{sign}(r_e)\,\beta_j^{(e-)}\right|, \tag{11}$$

where $\|B\|_1$ is the sum of the absolute values of all entries of *B*, $\beta_j$ is the *j*th row of *B* (so that $\beta_j^{(r)}$ is its entry for response *r*), and $e-$ and $e+$ denote the nodes at either end of the edge *e*. The last regularization term in the objective is called the graph fused-lasso penalty, and it is this element that encourages pooling of information across regression problems.
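To see how the three terms of equation (11) trade off, the sketch below evaluates the objective for a given coefficient matrix. It only evaluates the criterion and does not solve the optimization. The function name, the edge-list format, and the toy data are assumptions for illustration.

```python
def gflasso_objective(Y, X, B, edges, lam, gamma):
    """Evaluate the graph-fused lasso objective of equation (11).

    Y: n x p1 responses, X: n x p2 predictors, B: p2 x p1 coefficients.
    edges: list of (m, l, r) giving two response indices and their correlation.
    """
    n, p1, p2 = len(Y), len(Y[0]), len(X[0])
    # Squared Frobenius norm of the residual Y - XB.
    fit = 0.0
    for i in range(n):
        for r in range(p1):
            pred = sum(X[i][j] * B[j][r] for j in range(p2))
            fit += (Y[i][r] - pred) ** 2
    # Entrywise l1 penalty.
    l1 = sum(abs(B[j][r]) for j in range(p2) for r in range(p1))
    # Fusion penalty: pull coefficients of correlated responses together
    # (or toward opposite signs when the correlation is negative).
    fusion = 0.0
    for m, l, corr in edges:
        sign = 1.0 if corr >= 0 else -1.0
        fusion += abs(corr) * sum(abs(B[j][m] - sign * B[j][l]) for j in range(p2))
    return 0.5 * fit + lam * l1 + gamma * fusion

# Two samples, two predictors, two strongly correlated responses (made up).
X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 1.0], [0.0, 0.0]]
edges = [(0, 1, 0.9)]
B_pooled = [[1.0, 1.0], [0.0, 0.0]]   # identical coefficients across responses
B_split = [[1.0, 0.5], [0.0, 0.0]]    # coefficients differ across responses
obj_pooled = gflasso_objective(Y, X, B_pooled, edges, lam=0.1, gamma=0.01)
obj_split = gflasso_objective(Y, X, B_split, edges, lam=0.1, gamma=0.01)
```

In this toy case the pooled coefficients incur no fusion penalty, while the split coefficients pay both a fit cost and a fusion cost.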

#### Example

We apply the graph-fused lasso to the body composition problem and compare it to a naive version of the lasso that does not share any information across responses. We consider predicting the body composition variables, many of which are strongly correlated with one another, using variance-stabilized bacterial abundances.

We filter out species that do not appear in at least 7% of samples, as in the original PCA approach. We set the smoothing parameter to *μ =* 0.01, while the *ℓ*1 and graph-regularization parameters are set to *λ* = 0.1 and *γ* = 0.01, respectively, after they were heuristically found to provide interpretable levels of sparsity and smoothness in the fitted coefficients.

The graph-fused lasso requires a correlation graph between response variables. We estimate such a graph using the graphical lasso (Friedman et al., 2008), since there are only ~100 samples with which to estimate the 36-dimensional covariance matrix. The estimated correlation matrix is displayed in **Figure 11**.

The fitted coefficients from the graph-fused lasso are given in the top panel of **Figure 12**. The analogous display when the problem is decoupled into parallel lasso regressions is given in the bottom panel of the same figure.

Generally, both approaches highlight the same directions and size of association between individual species and the response variables, though those returned by the graph-fused lasso are smoother across responses. This smoothing may obscure true variation that appears in the parallel-lasso approach (for example, the stronger association between height\_dxa and a few Ruminococcus species). On the other hand, regularization reduces the number of one-off nonzero coefficients, which are likely just noise.

There appear to be real associations between Lachnospiraceae and Ruminococcaceae and the body composition measurements. The strongest negative association between species abundance and fat mass occurs among a few species of Ruminococcaceae. Most species that have any association tend to have the same direction and magnitude of association across all body composition variables, not just those restricted to one mass type. This seems to be the case even in the parallel-lasso context, where such structure has not been directly imposed.

## DISCUSSION

In this work, we have studied the problem of multitable data analysis, reviewing both the algorithmic foundations and practical applications of various methods. We have described approaches that are usually confined to particular literature areas and highlighted certain similarities in the process—for example, PCA-IV (section PCA-IV) and the graph-fused lasso (section Graph-Fused Lasso) were proposed in very different contexts, but have similar goals. By writing short, self-contained descriptions of various methods, we hope to contribute to an effort to distill ideas from the wide multitable data analysis literature to make them easily understandable to researchers interested in entering this field and useful for scientists hoping to apply these methods. A "cheatsheet" summarizing some of the key properties of these methods is given in **Table 1**, and relevant packages can be found in **Table 2.**

In developing our WELL-China case study, we have both 1) described the types of interpretations facilitated by different approaches and 2) provided accessible implementations that can be incorporated into practical scientific workflows. Though our focus on a single application has allowed side-by-side comparisons of methods, we do not want to leave the reader with the impression that these methods are tied in any way to this particular biological analysis task. Indeed, the value of mathematical abstractions is that they can be applied to situations outside the imaginations of the original method designers. For example, consider these potential use cases:


multitable methods to describe ways these (multidimensional) perturbations are related to microbiome community structure (Dethlefsen and Relman, 2011).

Our case study includes carefully thought-through visualizations of model results, a step that is crucial in scientific study but often overlooked in methodological research, where model results are reduced to tables of performance metrics. Recognizing that a good deal of effort in statistical work goes into data preparation and visualization of model results, we have ensured that code for all steps is available, so that our work is fully reproducible.

TABLE 1 | A high-level comparison of the multitable analysis methods discussed in this review. The purpose of this table is to give rules-of-thumb that can guide practical application, where choices invariably depend on the scale and structure of the data, the goals of the analysis, the expected number of future workflow applications, and availability of programming and computation time.


TABLE 2 | Pointers to R packages that can be used to implement methods discussed in this survey. The vignettes in these packages go into more depth on the capabilities of these packages than do the short scripts used in our case study, available at https://github.com/krisrs1128/multitable\_review.


We have found that multitable data analysis problems have motivated a wide range of analysis approaches. This is not surprising, considering the variety of contexts in which they arise, and it speaks to the richness of this methodological problem. As new data sources arise and as science evolves, we expect these ideas will inspire future generations of multitable research advances.

# AUTHOR CONTRIBUTIONS

SH and KS conceived and designed the review, drafted the manuscript, and prepared all figures. KS implemented code for data analysis.

#### FUNDING

KS was supported by a Stanford University Weiland fellowship and the National Institutes of Health T32 grant 5T32GM096982-04. SH is supported by the National Institutes of Health TR01 grant AI112401.

#### **ACKNOWLEDGMENTS**

We thank the WELL-China study team for sharing the data appearing in this study and Yan Min for useful discussions.

An earlier version of this work first appeared in KS's PhD thesis (Sankaran, 2018).

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Sankaran and Holmes. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# APPENDIX

This appendix includes derivations and technical discussion of several methods surveyed in the main text: PCA-IV, PTA, and the C&W algorithm. While these methods can be understood and applied based on their computational description, these mathematical discussions provide motivation and context for their particular form.

#### DERIVATION DETAILS FOR PCA-IV

In this section, we provide the argument for why the generalized eigendecomposition $\hat{\Sigma}_{XY}\hat{\Sigma}_{YX} V = \hat{\Sigma}_{XX} V \Lambda$ provides the optimal $V$ used in PCA-IV.

First consider $K = 1$. For any $v$, the objective in equation (5) has the form

$$\begin{aligned} \text{tr}\left(\hat{\Sigma}_{YX} v \left(v^{T}\hat{\Sigma}_{XX} v\right)^{-1}\left(\hat{\Sigma}_{YX} v\right)^{T}\right) &= \frac{v^{T}\hat{\Sigma}_{XY}\hat{\Sigma}_{YX} v}{v^{T}\hat{\Sigma}_{XX} v} \\ &= \frac{w^{T}\hat{\Sigma}_{XX}^{-\frac{1}{2}}\hat{\Sigma}_{XY}\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-\frac{1}{2}} w}{\|w\|_{2}^{2}}, \end{aligned} \tag{12}$$

where we change variables $w = \hat{\Sigma}_{XX}^{\frac{1}{2}} v$. But to maximize equation (12), just choose $w$ to be the top eigenvector of $\hat{\Sigma}_{XX}^{-\frac{1}{2}}\hat{\Sigma}_{XY}\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-\frac{1}{2}}$, with eigenvalue $\lambda_1$, which implies that $v$ is the top generalized eigenvector of $\hat{\Sigma}_{XY}\hat{\Sigma}_{YX}$ with respect to $\hat{\Sigma}_{XX}$. Indeed, in this case,

$$\begin{aligned} \hat{\Sigma}_{XY}\hat{\Sigma}_{YX} v &= \hat{\Sigma}_{XY}\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-\frac{1}{2}} w \\ &= \hat{\Sigma}_{XX}^{\frac{1}{2}}\hat{\Sigma}_{XX}^{-\frac{1}{2}}\hat{\Sigma}_{XY}\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-\frac{1}{2}} w \\ &= \hat{\Sigma}_{XX}^{\frac{1}{2}}\lambda_{1} w \\ &= \lambda_{1}\hat{\Sigma}_{XX} v. \end{aligned}$$

Hence, in the case $K = 1$, the criterion is maximized by the top generalized eigenvector. For larger $K$, recall that the problem of maximizing $\frac{v^{T}Av}{\|v\|_{2}^{2}}$ over $v$ subject to being orthogonal to the first $K - 1$ eigenvectors of $A$ is solved by the $K$th eigenvector of $A$, and applying this fact to equation (12) in the argument above gives the result for general $K$.

#### DERIVATION OF PTA *α*

The Lagrangian of the optimization defined by PTA is

$$\mathcal{L}(\alpha, \lambda) = \sum_{l=1}^{L} \alpha_l \left\langle \overline{X}, X_{\cdot\cdot l} \right\rangle + \lambda\left(\|\alpha\|_2^2 - 1\right),$$

which, when differentiated with respect to $\alpha$ and set to zero, yields $\alpha_l = -\frac{1}{2\lambda}\left\langle \overline{X}, X_{\cdot\cdot l} \right\rangle$ for all $l$. The constraint that $\|\alpha\|_2^2 = 1$ implies that $\frac{1}{4\lambda^2}\sum_{l'=1}^{L}\left\langle \overline{X}, X_{\cdot\cdot l'} \right\rangle^2 = 1$. Taking the negative root, so that $\alpha$ is positively aligned with the inner products, gives $\lambda = -\frac{1}{2}\sqrt{\sum_{l'=1}^{L}\left\langle \overline{X}, X_{\cdot\cdot l'} \right\rangle^2}$, so

$$\alpha_l = \frac{\left\langle \overline{X}, X_{\cdot\cdot l} \right\rangle}{\sqrt{\sum_{l'=1}^{L}\left\langle \overline{X}, X_{\cdot\cdot l'} \right\rangle^2}}.$$
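The closed form for the weights is easy to check numerically. In the sketch below, each table is weighted by its Frobenius inner product with a reference table, and the weights are rescaled to unit length. The function names are hypothetical, and using the entrywise mean of the tables as the reference is an assumption made only for this example.

```python
import math

def frobenius_inner(A, B):
    """Sum of entrywise products of two equally sized tables."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def pta_weights(tables, reference):
    """alpha_l proportional to <reference, X_l>, rescaled to unit l2 norm."""
    inner = [frobenius_inner(reference, T) for T in tables]
    norm = math.sqrt(sum(v * v for v in inner))
    return [v / norm for v in inner]

# Three 2 x 2 tables (made up); the reference is their entrywise mean.
tables = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
reference = [
    [sum(T[i][j] for T in tables) / len(tables) for j in range(2)]
    for i in range(2)
]
alpha = pta_weights(tables, reference)
```

Tables that resemble the reference receive larger weights, and tables that are nearly orthogonal to it contribute little.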

#### DERIVATION OF CURDS & WHEY SHRINKAGE

Consider prediction across many related response variables. One way to pool information across responses is to define new fitted values from a linear combination of independent OLS fits. That is, to predict a response $y_i \in \mathbb{R}^{p_1}$, we set $\hat{y}_i^{\text{cw}} = B\hat{y}_i^{\text{ols}}$ for some square matrix $B \in \mathbb{R}^{p_1 \times p_1}$. But how to choose $B$?

One reasonable idea is to choose a $B$ that has the best performance in a generalized cross-validation (GCV). The GCV approximation is that the $h_{ii}$, the diagonal elements of the hat matrix $H$, can be approximated by their average across all diagonal elements of $H$: $h_{ii} \approx h := \frac{1}{n}\text{tr}(H)$ for all $i$. In this spirit, define $g = \frac{1}{1 - h}$ and approximate the leave-one-out prediction by

$$
\hat{y}_{-i} = (1 - g)\,y_i + g\,\hat{y}_i.
$$
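Before the averaging step, the same expression holds exactly with the sample-specific $g_i = 1/(1 - h_{ii})$. The sketch below checks this for a one-predictor regression without intercept, comparing an explicit refit against the shortcut. The data and function names are made up, and a single response is used for brevity; the identity applies to each response column separately.

```python
def ols_slope(x, y):
    """Least-squares slope for a regression through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def loo_prediction_direct(x, y, i):
    """Refit without sample i, then predict sample i."""
    x_rest = x[:i] + x[i + 1:]
    y_rest = y[:i] + y[i + 1:]
    return x[i] * ols_slope(x_rest, y_rest)

def loo_prediction_shortcut(x, y, i):
    """(1 - g_i) * y_i + g_i * yhat_i, with g_i = 1 / (1 - h_ii)."""
    h_ii = x[i] ** 2 / sum(a * a for a in x)   # leverage of sample i
    g_i = 1.0 / (1.0 - h_ii)
    fitted = x[i] * ols_slope(x, y)            # full-data OLS fit
    return (1.0 - g_i) * y[i] + g_i * fitted

x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
gaps = [abs(loo_prediction_direct(x, y, i) - loo_prediction_shortcut(x, y, i))
        for i in range(len(x))]
```

The GCV step replaces every $h_{ii}$ by the common average $h$, which is what allows the cross-validation error to be written in matrix form.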

Then, the leave-one-out CV error can be simplified to

$$\sum_{i=1}^{n} \|y_i - B\hat{y}_{-i}\|_2^2 = \sum_{i=1}^{n} \|y_i - B((1 - g)y_i + g\hat{y}_i)\|_2^2,$$

and differentiating with respect to $B$, we find that the optimal $\hat{B}^{\text{cw}}$ in this GCV framework must satisfy

$$\sum_{i=1}^{n} \left(y_i - B((1 - g)y_i + g\hat{y}_i)\right)\left((1 - g)y_i + g\hat{y}_i\right)^{T} = 0,$$

or equivalently

$$\sum_{i=1}^{n} y_i\left((1 - g)y_i + g\hat{y}_i\right)^{T} = \sum_{i=1}^{n} B\left((1 - g)y_i + g\hat{y}_i\right)\left((1 - g)y_i + g\hat{y}_i\right)^{T},$$

which in matrix form is

$$(1 - g)Y^{T}Y + gY^{T}\hat{Y} = B\left((1 - g)Y + g\hat{Y}\right)^{T}\left((1 - g)Y + g\hat{Y}\right), \tag{13}$$

where $\hat{Y} \in \mathbb{R}^{n \times p_1}$ has $i$th row $\hat{y}_i$.

Next, we can represent these cross-products in a way that is suggestive of CCA,

$$\begin{aligned} Y^{T}Y &= n\hat{\Sigma}_{YY}, \\ Y^{T}\hat{Y} &= Y^{T}HY = Y^{T}X\left(X^{T}X\right)^{-1}X^{T}Y = n\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-1}\hat{\Sigma}_{XY}, \\ \hat{Y}^{T}\hat{Y} &= Y^{T}P_X^{2}Y = Y^{T}P_XY = n\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-1}\hat{\Sigma}_{XY}. \end{aligned}$$

Substituting this into equation (13), expanding the product on the right-hand side (the cross terms and the last term combine as $2g(1 - g) + g^2 = 2g - g^2$), and ignoring the scaling $n$ yields

$$\begin{aligned} &(1 - g)\hat{\Sigma}_{YY} + g\,\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-1}\hat{\Sigma}_{XY} = \\ &B\left[(1 - g)^2\hat{\Sigma}_{YY} + (2g - g^2)\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-1}\hat{\Sigma}_{XY}\right]. \end{aligned}$$

Postmultiplying by $\hat{\Sigma}_{YY}^{-1}$ gives

$$(1 - g)I_{p_1} + g\hat{Q}^{T} = B\left[(1 - g)^2 I_{p_1} + (2g - g^2)\hat{Q}^{T}\right], \tag{14}$$

where

$$\hat{Q} := \hat{\Sigma}_{YY}^{-1}\hat{\Sigma}_{YX}\hat{\Sigma}_{XX}^{-1}\hat{\Sigma}_{XY} \in \mathbb{R}^{p_1 \times p_1}.$$

Now, we claim that we can decompose $\hat{Q} = VD^2V^{-1}$, where $V \in \mathbb{R}^{p_1 \times p_1}$ is the full matrix of CCA response directions and $D$ is diagonal with the canonical correlations. Indeed, the usual CCA response directions $V$ can be recovered by setting $V = \hat{\Sigma}_{YY}^{-\frac{1}{2}}\tilde{V}$, where $\tilde{V}$ comes from the SVD of $A := \hat{\Sigma}_{XX}^{-\frac{1}{2}}\hat{\Sigma}_{XY}\hat{\Sigma}_{YY}^{-\frac{1}{2}} = UD\tilde{V}^{T}$. Hence

$$\begin{aligned} \hat{Q} &= \hat{\Sigma}_{YY}^{-\frac{1}{2}}A^{T}A\,\hat{\Sigma}_{YY}^{\frac{1}{2}} \\ &= \hat{\Sigma}_{YY}^{-\frac{1}{2}}\tilde{V}D^2\tilde{V}^{T}\hat{\Sigma}_{YY}^{\frac{1}{2}} \\ &= VD^2V^{-1}, \end{aligned}$$

where we are able to write $V^{-1} = \tilde{V}^{T}\hat{\Sigma}_{YY}^{\frac{1}{2}}$ because $\tilde{V}$ is the full (untruncated) matrix of eigenvectors, so $\tilde{V}\tilde{V}^{T} = I$ in addition to the usual $\tilde{V}^{T}\tilde{V} = I$, which holds even for the truncated SVD.

Therefore, equation (14) can be expressed as

$$V^{-T}\left[(1 - g)I_{p_1} + gD^2\right]V^{T} = BV^{-T}\left[(1 - g)^2 I_{p_1} + (2g - g^2)D^2\right]V^{T},$$

and the $B$ satisfying the normal equations has the form

$$
\hat{B}^{\text{cw}} = V^{-T}\Lambda V^{T},
$$

where Λ is a diagonal matrix with entries

$$\lambda_{jj} = \frac{1 - g + g\,d_j^2}{(1 - g)^2 + (2g - g^2)\,d_j^2}.$$

Notice that when $n$ is large, $h = \frac{1}{n}\text{tr}(P_X)$ will be small, so $g = \frac{1}{1 - h} \approx 1$, the entries of Λ are close to 1, and there is less shrinkage. Recall that $\hat{B}^{\text{cw}}$ is used to pool across OLS fits, $\hat{y}_i^{\text{cw}} = \hat{B}^{\text{cw}}\hat{y}_i^{\text{ols}}$. That is,

$$
\hat{Y}^{\text{cw}} = \hat{Y}^{\text{ols}}\left(\hat{B}^{\text{cw}}\right)^{T} = \hat{Y}^{\text{ols}}V\Lambda V^{-1},
$$

which we can also view as $\hat{Y}^{\text{cw}}V = \left(\hat{Y}^{\text{ols}}V\right)\Lambda$. This means that the C&W coordinates along the canonical directions $V$ are set as the OLS fits $\hat{Y}^{\text{ols}}$ along the canonical directions $V$, with weights defined by Λ. The actual $\hat{Y}^{\text{cw}}$ are recovered by transforming back to the original coordinate system. A similar way to view the C&W fits is to note $\hat{Y}^{\text{cw}}V = P_X(YV)\Lambda$, which first shrinks the original data $Y$ along the canonical directions, then projects the shrunk data onto the subspace defined by the columns of $X$. In any case, we see that C&W pools across regression problems through a soft shrinkage weighted along canonical response directions.
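The shrinkage weights can be sketched numerically. The code below uses the form obtained by fully expanding the right-hand side of equation (13), $\lambda_j = \frac{1 - g + g d_j^2}{(1 - g)^2 + (2g - g^2) d_j^2}$, with $h$ taken as the number of predictors divided by $n$ (which assumes $X$ has full column rank). The function name, the example correlations, and the option to clip negative weights at zero are assumptions of this sketch rather than part of the derivation.

```python
def cw_shrinkage_weights(canonical_correlations, n_samples, n_predictors, clip=True):
    """GCV-style Curds & Whey weights, one per canonical correlation.

    h is the average hat-matrix diagonal, tr(P_X) / n, which equals
    n_predictors / n_samples when X has full column rank (assumed here).
    """
    h = n_predictors / n_samples
    g = 1.0 / (1.0 - h)
    weights = []
    for d in canonical_correlations:
        numerator = (1.0 - g) + g * d ** 2
        denominator = (1.0 - g) ** 2 + (2.0 * g - g ** 2) * d ** 2
        lam = numerator / denominator
        # Clipping at zero is an assumption of this sketch.
        weights.append(max(lam, 0.0) if clip else lam)
    return weights

# Made-up canonical correlations for a problem with 10 predictors.
weights_small_n = cw_shrinkage_weights([0.9, 0.5, 0.2], n_samples=100, n_predictors=10)
weights_large_n = cw_shrinkage_weights([0.9, 0.5, 0.2], n_samples=10**6, n_predictors=10)
```

Directions with small canonical correlations are shrunk the most, and the weights approach 1 as the sample size grows relative to the number of predictors.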

# A Generic Multivariate Framework for the Integration of Microbiome Longitudinal Studies With Other Data Types

*Antoine Bodein1†, Olivier Chapleur2†, Arnaud Droit1 and Kim-Anh Lê Cao3\**

*1 Molecular Medicine Department, CHU de Québec Research Center, Université Laval, Québec, QC, Canada, 2 Hydrosystems and Bioprocesses Research Unit, Irstea, Antony, France, 3 Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Melbourne, VIC, Australia*

#### *Edited by:*

*Himel Mallick, Merck, United States*

#### *Reviewed by:*

*Gholamali Ali Rahnavard, Broad Institute, United States Lingling An, University of Arizona, United States*

*\*Correspondence: Kim-Anh Lê Cao kimanh.lecao@unimelb.edu.au*

*†These authors have contributed equally to this work*

#### *Specialty section:*

*This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics*

*Received: 04 April 2019 Accepted: 10 September 2019 Published: 07 November 2019*

#### *Citation:*

*Bodein A, Chapleur O, Droit A and Lê Cao K-A (2019) A Generic Multivariate Framework for the Integration of Microbiome Longitudinal Studies With Other Data Types. Front. Genet. 10:963. doi: 10.3389/fgene.2019.00963*

Simultaneous profiling of biospecimens using different technological platforms enables the study of many data types, encompassing microbial communities, omics, and meta-omics as well as clinical or chemistry variables. Reduction in costs now enables longitudinal or time course studies on the same biological material or system. The overall aim of such studies is to investigate relationships between these longitudinal measures in a holistic manner to further decipher the link between molecular mechanisms and microbial community structures, or host-microbiota interactions. However, analytical frameworks enabling an integrated analysis between microbial communities and other types of biological, clinical, or phenotypic data are still in their infancy. The challenges include few time points that may be unevenly spaced and unmatched between different data types, a small number of unique individual biospecimens, and high individual variability. Those challenges are further exacerbated by the inherent characteristics of microbial communities-derived data (e.g., sparse, compositional). We propose a generic data-driven framework to integrate different types of longitudinal data measured on the same biological specimens with microbial community data and select key temporal features with strong associations within the same sample group. The framework ranges from filtering and modeling to integration using smoothing splines and multivariate dimension reduction methods to address some of the analytical challenges of microbiome-derived data. We illustrate our framework on different types of multi-omics case studies in bioreactor experiments as well as human studies.

Keywords: time course, data integration, splines, feature selection, dimension reduction, multi-omics

# INTRODUCTION

Microbial communities are highly dynamic biological systems that cannot be fully investigated in snapshot studies. The decreasing cost of DNA sequencing has enabled longitudinal and time-course studies to record the temporal variation of microbial communities (Knight et al., 2012; Faust et al., 2015). These studies can inform us about the stability and dynamics of microbial communities in response to perturbations or different conditions of the host or their habitat. They can also capture the dynamics of microbial interactions (Bucci et al., 2016; Ridenhour et al., 2017) or associated changes of microbial features, such as taxonomies or genes, to a phenotypic group (Metwally et al., 2018).

However, besides the inherent characteristics of microbiome data, including sparsity, compositionality (Aitchison, 1982; Gloor et al., 2017), its multivariate nature, and high variability (Lê Cao et al., 2016a), longitudinal studies suffer from irregular sampling and subject drop-outs. Thus, appropriate modeling of the microbial profiles is required, for example by using spline modeling. Methods including loess (Shields-Cutler et al., 2018), smoothing spline ANOVA (Paulson et al., 2017), negative binomial smoothing splines (Metwally et al., 2018), or Gaussian cubic splines (Luo et al., 2017) were proposed to model dynamics of microbial profiles across groups of samples or subjects. The aim of these approaches is to make statistical inferences about global changes of differential abundance across multiple phenotypes of interest, rather than at specific time points. These proposed methods are univariate and, as such, cannot infer ecological interactions (Morris et al., 2016). Other types of methods aim to cluster microbial profiles to posit hypotheses about symbiotic relationships, interaction, or competition. For example, Baksi et al. (2018) used a Jensen–Shannon divergence metric to visually compare metagenomic time series.

Multivariate ordination methods can exploit the interaction between microorganisms but need to be used with sparsity constraints, such as ℓ1 regularization (Tibshirani, 1996), to reduce the number of variables and improve interpretability through variable selection. Several sparse methods were proposed and applied to microbiome studies, such as sparse linear discriminant analysis (Clemmensen et al., 2011) and sparse partial least squares discriminant analysis (sPLS-DA, Lê Cao et al., 2016b), but for a single time point. Therefore, further developments are needed to combine time-course modeling with multivariate approaches to start exploring microbial interactions and dynamics.

In addition, current statistical methods have mainly focused on a single microbiome dataset, rather than the combination of different layers of molecular information obtained with parallel multi-omics assays performed on the same biological samples. Data derived from each omics technique are typically studied in isolation and disregard the correlation structure that may be present between the multiple data types. Hence, integrating these datasets enables us to adopt a holistic approach to elucidate patterns of taxonomic and functional changes in microbial communities across time. Some sparse multivariate methods have been proposed to integrate omics and microbiome datasets at a single time point and identify sets of features (multi-omics signatures) across multiple data types that are correlated with one another. For example, Gavin et al. (2018) used the DIABLO method (Singh et al., 2019) to integrate 16S amplicon microbiome, proteomics, and metaproteomics data in a type I diabetes study; Guidi et al. (2016) used sparse PLS (Lê Cao et al., 2008) to integrate environmental and metagenomic data from the Tara Oceans expedition to understand carbon export in oligotrophic oceans; and Fukuyama et al. (2017) used sparse canonical correlation analysis (Witten et al., 2009) to integrate 16S and metagenomic data. However, methods or frameworks to integrate multiple longitudinal datasets including microbiome data remain incomplete. Zhou et al. (2008) used principal component analysis (PCA) to summarize functional data, with the PC scores used for model fitting, prediction, and inference. However, only pairwise relationships were investigated and for a single type of data. Another type of modeling (loess regression) was used by Ribicic et al. (2018) in combination with sparse PCA to explore the link between chemistry and microbial community data in the biodegradation of chemically dispersed oil, but their approach was not designed to seek multi-omics signatures.

We propose a computational approach to integrate microbiome data with multi-omics datasets in longitudinal studies. Our framework, described in **Figure 1**, includes smoothing splines in a linear mixed model framework to model profiles across groups of samples and builds on the ability of sparse multivariate ordination methods to identify sets of variables highly associated across the data types, and across time. Our framework encompasses data pre-processing, modeling, data clustering, and integration. It is highly flexible in handling one or several longitudinal studies with a small number of time points, to identify groups of taxa with similar behavior over time and posit novel hypotheses about symbiotic relationships, interactions, or competitions in a given condition or environment, as we illustrate in two case studies.

# METHOD

Our proposed approach includes pre-processing for microbiome data, spline modeling within a linear mixed model framework, and a multivariate analysis for clustering and data integration (**Figure 2**).

## Pre-Processing of Microbiome Data

We assume the data are in raw count formats resulting from bioinformatics pipelines such as QIIME (Caporaso et al., 2010) or FROGS (Escudié et al., 2017) for 16S amplicon data. Here, we consider the operational taxonomic unit (OTU) level, but other levels can be considered, as well as other types of microbiome-derived data, such as whole genome shotgun sequencing. The data processing step is described in Lê Cao et al. (2016b) and consists of:


integrative method.

The centered log-ratio (CLR) transformation (eq. 1) is a log transformation of each element of the vector divided by its geometric mean *G*(*x*):

$$\text{clr}\left(x\right) = \left[\log\left(\frac{x_{1}}{G\left(x\right)}\right), \dots, \log\left(\frac{x_{p}}{G\left(x\right)}\right)\right] \tag{1}$$

where

$$G(x) = \sqrt[p]{x_1 \times x_2 \times \dots \times x_p}$$
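As an illustration of Eq. (1), the following minimal Python sketch applies the CLR transformation to a single sample. The function name `clr` and the pseudo-count `offset` (added so that zero counts can be log-transformed) are assumptions made for this example and are not taken from the original pipeline.

```python
import math

def clr(counts, offset=1.0):
    """Centered log-ratio transform of one sample (a vector of p counts).

    `offset` is a hypothetical pseudo-count added before taking logs so that
    zero counts do not produce log(0).
    """
    shifted = [c + offset for c in counts]
    logs = [math.log(v) for v in shifted]
    # The mean of the logs equals the log of the geometric mean G(x).
    log_gmean = sum(logs) / len(logs)
    return [value - log_gmean for value in logs]

sample = [10, 0, 5, 85]          # toy OTU counts for one sample
clr_values = clr(sample)
```

By construction, the CLR values of a sample sum to zero, which is a quick sanity check on any implementation.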

#### Time Profile Modeling

#### Linear Mixed Model Splines

The linear mixed model spline (LMMS) modeling approach proposed by Straube et al. (2015) takes into account between and within individual variability and irregular time sampling. LMMS is based on a linear mixed model representation of penalized splines (Durbán et al., 2005) for different types of models. Through this flexible approach of serial fitting, LMMS avoids under- or over-smoothing. Briefly, four types of models are consecutively fitted in our framework on the CLR data:


All four models are described in **Appendix 1**. Straube et al. (2015) showed that the proportion of profiles fitted with the different models increased in complexity with the organism considered. Different types of splines can be considered in models (2)–(4), including a cubic spline basis (Verbyla et al., 1999), a penalized spline and a cubic penalized spline. A cubic spline basis uses all inner time points of the measured time interval as knots and is appropriate when the number of time points is small (≤5), whereas the penalized spline and cubic penalized spline bases use the quantiles of the measured time interval as knots; see Ruppert (2002). In our case studies, we used penalized splines. The LMMS models are implemented in the R package lmms (Straube et al., 2016).

such as metabolic pathways measured with metabolomics, or information measured at a macroscopic level resulting from the aggregated actions of the microbiota.

#### Prediction and Interpolation

The fitted splines enable us to predict or interpolate time points that might be missing within the time interval (e.g., inconsistent time points between different types of data or covariates). Additionally, interpolation is useful in our multivariate analyses described below to smooth profiles, and when the number of time points is small (≤5). In the following section, we therefore consider data matrices *X* (*T* × *P*), where *T* is the number of (interpolated) time points and *P* the number of taxa. The individual dimension has thus been summarized through the spline fitting procedure, so that our original data matrix of size (*N* × *P* × *T*), where *N* is the number of biological samples, is now of size (*T* × *P*).
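The construction of the (*T* × *P*) matrix from profiles measured at unmatched time points can be sketched as follows. This is only an illustration: piecewise-linear interpolation is used here as a simple stand-in for the fitted LMMS splines, and the function name `interpolate`, the taxon names, and the values are all hypothetical.

```python
def interpolate(times, values, new_times):
    """Piecewise-linear interpolation inside the measured time interval.

    A stand-in for prediction from a fitted spline; `times` must be sorted
    in strictly increasing order.
    """
    out = []
    for t in new_times:
        if not times[0] <= t <= times[-1]:
            raise ValueError("only interpolation inside the measured interval")
        for k in range(len(times) - 1):
            t0, t1 = times[k], times[k + 1]
            if t0 <= t <= t1:
                w = (t - t0) / (t1 - t0)
                out.append(values[k] * (1 - w) + values[k + 1] * w)
                break
    return out

# Two summarized profiles measured at different (unmatched) time points.
measured = {
    "taxon_A": ([0, 2, 7], [1.0, 2.0, 4.5]),
    "taxon_B": ([0, 4, 7], [0.5, 0.1, -0.2]),
}
grid = [0, 1, 2, 3, 4, 5, 6, 7]  # common, evenly spaced time grid (T = 8)

columns = {name: interpolate(t, v, grid) for name, (t, v) in measured.items()}
# X has one row per interpolated time point and one column per taxon (T x P).
X = [[columns[name][r] for name in measured] for r in range(len(grid))]
```

Only interpolation within the measured interval is allowed in this sketch, in line with the text, which restricts prediction to time points inside the time interval.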

# Filtering Profiles After Modeling

A simple linear regression model (1) might be the result of highly noisy data. To retain only the most meaningful profiles, the quality of these models was assessed with a Breusch–Pagan test to indicate whether the homoscedasticity assumption of each linear model was met (Breusch and Pagan, 1979). We also used a threshold based on the mean squared error (MSE) of the linear models, by only including profiles whose MSE was below the maximum MSE of the more complex fitted models (2)–(4). The latter filter was only applied when a large number of linear models (1) were fitted and the Breusch–Pagan test was not considered stringent enough.
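The MSE-based filter can be illustrated with a few lines of Python. This sketch covers only the MSE threshold, not the Breusch–Pagan test, and the function name, profile names, and MSE values are hypothetical.

```python
def filter_linear_profiles(linear_mse, spline_mse):
    """Keep linear-model profiles whose MSE is below the largest MSE
    observed among the more complex spline models (2)-(4)."""
    threshold = max(spline_mse.values())
    return sorted(name for name, mse in linear_mse.items() if mse < threshold)

# Hypothetical MSE values per profile.
linear_mse = {"otu_1": 0.20, "otu_2": 1.50, "otu_3": 0.65}  # fitted with model (1)
spline_mse = {"otu_4": 0.30, "otu_5": 0.70}                 # fitted with models (2)-(4)

kept = filter_linear_profiles(linear_mse, spline_mse)
```

Here `otu_2` is discarded because its MSE exceeds the largest MSE of the spline-fitted profiles.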

#### Clustering Time Profiles

#### Principal Component Analysis and Sparse Principal Component Analysis

Multivariate dimension reduction techniques such as PCA (Jolliffe, 2011) and sparse PCA (Huang and Zheng, 2006) can be used to cluster taxa profiles. To do so, we consider as data input the *X* (*T*×*P*) spline fitted matrix. Let *t*1, *t*2, …, *t*H denote the *H* principal components of length *T* and their associated *v*1, *v*2, …, *v*H factors—or loading vectors, of length *P*. For a given PCA dimension *h*, we can extract a set of strongly correlated profiles by considering taxa with the top absolute coefficients in *vh*. Those profiles are linearly combined to define each component *th*, and thus, explain similar information on a given component. Different clusters are therefore obtained on each dimension *h* of PCA, *h =* 1… *H*. Each cluster *h* is then further separated into two sets of profiles which we denote as "positive" or "negative" based on the sign of the coefficients in the loading vectors (see Results section).

A more formal approach can be used with sparse PCA. Sparse PCA includes ℓ1 penalizations on the loading vectors to select variables that are key to defining each component and are highly correlated within a component (see Huang and Zheng, 2006 for more details).
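The sign-based grouping of profiles on a PCA loading vector can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the first loading vector is obtained by power iteration on the centered (*T* × *P*) matrix, and the function names, the toy profiles, and the optional `top` argument are assumptions made for the example.

```python
import math
import random

def first_loading(X, n_iter=200, seed=0):
    """Leading PCA loading vector (length P) of a T x P matrix, by power iteration."""
    T, P = len(X), len(X[0])
    means = [sum(row[j] for row in X) / T for j in range(P)]
    Xc = [[row[j] - means[j] for j in range(P)] for row in X]  # column-centered
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(P)]
    for _ in range(n_iter):
        t = [sum(row[j] * v[j] for j in range(P)) for row in Xc]          # component t = X v
        w = [sum(Xc[i][j] * t[i] for i in range(T)) for j in range(P)]    # X^T t
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def sign_clusters(loading, names, top=None):
    """Split the profiles with the largest absolute loadings by loading sign."""
    ranked = sorted(zip(names, loading), key=lambda item: abs(item[1]), reverse=True)
    kept = ranked[:top] if top is not None else ranked
    positive = sorted(n for n, c in kept if c > 0)
    negative = sorted(n for n, c in kept if c < 0)
    return positive, negative

times = [0, 1, 2, 3, 4]
names = ["taxon_A", "taxon_B", "taxon_C", "taxon_D"]
# Toy fitted profiles: two increasing and two decreasing over time.
X = [[1.0 * t, 2.0 * t + 0.1 * (t % 2), -1.0 * t, -1.5 * t + 0.2 * (t % 2)]
     for t in times]

loading = first_loading(X)
positive, negative = sign_clusters(loading, names)
```

The overall sign of a loading vector is arbitrary, so the labels "positive" and "negative" only distinguish two groups with opposite behavior on a component; they do not by themselves indicate an increase or decrease over time.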

#### Choice of the Number of Clusters in Principal Component Analysis

We propose to use the average silhouette coefficient (Rousseeuw, 1987) to determine the optimal number of clusters, or dimensions *H*, in PCA. For a given identified cluster and observation *i*, the silhouette coefficient of *i* is defined as

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\tag{5}$$

where *a*(*i*) is the average distance between observation *i* and all other observations within the same cluster, and *b*(*i*) is the average distance between observation *i* and all other observations in the nearest cluster. A silhouette coefficient is obtained for each observation, and the coefficients are then averaged; values range from −1 (poor clustering) to 1 (good clustering).

We adapted the silhouette coefficient to choose the number of components or clusters in PCA and sparse PCA (sPCA, i.e., 2×*H* clusters), as well as the number of profiles to select for each cluster. Each observation in Eq. (5) now represents a fitted LMMS profile, and the distance between two profiles is calculated using the Spearman correlation coefficient.

Within a given cluster, we calculate the silhouette coefficient of each LMMS profile and apply the following empirical rules for cluster assignment: a coefficient > 0.5 assigns the profile to the cluster, and a value between 0 and 0.5 indicates an uncertain assignment as the profile can be assigned to one or two clusters, while a negative value indicates that the profile should not be assigned to this particular cluster.

To choose the appropriate number of profiles per sPCA component, we proceed as follows: for each component, we set a grid of the number of profiles to be retained with sPCA and calculate the average silhouette coefficient per cluster (there are two clusters per component). The final number of profiles to select is arbitrarily set when we observe a sudden decrease in the average silhouette coefficient (see Results section).
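The adapted silhouette coefficient of Eq. (5) and the empirical assignment rules can be sketched in Python as below. The text does not state how the Spearman correlation is converted into a distance; this sketch assumes a distance of 1 − ρ, and the function names and toy profiles are hypothetical. Profiles that are constant over time are not handled, since their rank correlation is undefined.

```python
import math

def ranks(values):
    """Ranks starting at 1, with average ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def spearman_distance(a, b):
    # Assumption: distance = 1 - Spearman correlation (0 = identical ranking).
    return 1.0 - pearson(ranks(a), ranks(b))

def silhouette(profiles, labels):
    """Silhouette coefficient s(i) of each profile, following Eq. (5)."""
    out = []
    for i, p in enumerate(profiles):
        by_cluster = {}
        for j, q in enumerate(profiles):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(spearman_distance(p, q))
        own = by_cluster.get(labels[i])
        others = [sum(d) / len(d) for c, d in by_cluster.items() if c != labels[i]]
        if not own or not others:
            out.append(0.0)  # convention used here for singleton or single clusters
            continue
        a, b = sum(own) / len(own), min(others)
        out.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return out

def assignment(s):
    """Empirical rule from the text: > 0.5 assigned, 0 to 0.5 uncertain, < 0 not assigned."""
    if s > 0.5:
        return "assigned"
    return "uncertain" if s >= 0 else "not assigned"

profiles = [
    [1, 2, 3, 4, 5], [0.1, 0.4, 0.5, 0.9, 1.2],   # increasing
    [5, 4, 3, 2, 1], [2.0, 1.5, 1.1, 0.7, 0.2],   # decreasing
]
good = silhouette(profiles, ["pos", "pos", "neg", "neg"])
bad = silhouette(profiles, ["pos", "neg", "neg", "neg"])
average_good = sum(good) / len(good)
```

With the correct labels every profile obtains a coefficient of 1, whereas the second profile obtains a negative coefficient when it is wrongly placed with the decreasing profiles.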

#### Comparison With Functional Principal Component Analysis

Functional principal component analysis (fPCA) has been widely used to cluster longitudinal data by decomposing data matrices into temporal variation models (Hyndman and Ullah, 2007) and has been used in several biological applications (Silverman et al., 1996; Yao et al., 2005). fPCA first models longitudinal profiles into a finite basis of functions, then clusters the longitudinal profiles using the basis expansion coefficients of the fPCA scores. fPCA requires the user to choose the number of clusters; the number of components (based on the Akaike information criterion, the Bayesian information criterion, or the percentage of total explained variance); the approach used to estimate the fPCA scores (conditional expectation or numerical integration); and the method used to cluster the profiles. We used the "fdapace" R package, which includes two types of clustering methods: model-based clustering with a finite mixture of Gaussian distributions ("EMCluster") or a k-means algorithm based on the fPCA scores.

# Evaluation

#### Clustering

We can assess the quality of clustering with internal measures such as compactness (Dunn, Rand, and Jaccard indices) or cluster separation. For the latter case, the silhouette coefficient is recognized as an informative criterion (Wang et al., 2009) and can be used to compare several clustering results based on the same data. Thus, we used this criterion to assess different methods (PCA, sPCA, and fPCA), or to assess the same method with different parameters, for example to identify the appropriate number of clusters as described in the section "Choice of the Number of Clusters in Principal Component Analysis." The best clustering approach yields the highest silhouette coefficient.

#### Measure of Association for Compositional Data

Compositional data arise from any biological measurement made based on relative abundance (Lovell et al., 2015; Gloor et al., 2017). Microbiome data in particular are compositional for several reasons, including biological, technical, and computational. Thus, interpretation based on correlations between profiles must be made with caution as it is highly likely to be spurious. Proportional distances have been proposed as an alternative to measure association. The compositional data analysis field is an active field of research, but methods are critically lacking for longitudinal data. Here, we adopt a practical and *post hoc* approach to evaluate pairwise associations of microbial and omics profiles once they have been assigned to their clusters. We used the proportionality distance φs proposed by Lovell et al. (2015) and implemented in the "propr" R package (Quinn et al., 2017). For two LMMS profiles *xi* and *xj* , we define the pairwise proportionality distance as

$$\varphi_s \left( x_i, x_j \right) = \frac{\mathrm{var} \left( x_i - x_j \right)}{\mathrm{var} \left( x_i + x_j \right)}. \tag{6}$$

A small value indicates that, in proportion, the pair of profiles is strongly associated. We calculated the distance φs on the log-transformed LMMS modeled profiles within each identified cluster to exclude potentially spurious correlations and further guide the interpretation of the results. In addition, to evaluate the quality of our clustering approach, we compared the pairwise distances of the profiles within a particular cluster and profiles outside the cluster.
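Eq. (6) translates directly into code. The following sketch is a plain Python illustration and not the "propr" implementation; it assumes the sample variance (denominator *n* − 1), which cancels in the ratio, and returns infinity when the variance of the sum is zero. The function names and toy profiles are hypothetical.

```python
import math

def variance(v):
    """Sample variance (denominator n - 1)."""
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def phi_s(x_i, x_j):
    """Proportionality distance of Eq. (6) between two log-transformed profiles."""
    diff = [a - b for a, b in zip(x_i, x_j)]
    total = [a + b for a, b in zip(x_i, x_j)]
    var_total = variance(total)
    if var_total == 0:
        return math.inf  # perfectly opposed profiles: the ratio is undefined
    return variance(diff) / var_total

a = [1.0, 2.0, 3.0, 4.0]
shifted = [x + 1.0 for x in a]      # same shape, constant log-ratio
unrelated = [4.0, 1.0, 3.0, 2.0]

d_same = phi_s(a, shifted)          # strongly associated pair
d_other = phi_s(a, unrelated)       # weakly associated pair
```

Two profiles that differ only by a constant on the log scale have a distance of zero, while the distance grows as the difference between the profiles varies more than their sum.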

#### Integration

#### Multiblock Projection to Latent Structures Methods

To integrate multiple datasets (also called *blocks*) measured on the same biological samples, we used multivariate methods based on projection to latent structures (PLS) methods (Wold, 1975), which we broadly term *multiblock PLS* approaches. For example, we can consider generalized canonical correlation analysis (GCCA, Tenenhaus and Tenenhaus, 2011; Tenenhaus et al., 2014), which, contrary to what its name suggests, generalizes PLS for the integration of more than two datasets. Recently, we have developed the DIABLO method to discriminate different phenotypic groups in a supervised framework (Singh et al., 2019). In the context of this study, however, we present the sparse GCCA in an unsupervised framework, where input datasets are spline-fitted matrices.

We denote *Q* data sets *X*<sup>(1)</sup> (*T* × *P*<sub>1</sub>), *X*<sup>(2)</sup> (*T* × *P*<sub>2</sub>), …, *X*<sup>(*Q*)</sup> (*T* × *P<sub>Q</sub>*) measuring the expression levels of *P<sub>q</sub>* variables of different types (taxa, "omics," continuous response of interest), modeled on *T* (interpolated) time points, *q =* 1,…,*Q*. GCCA solves for each component *h =* 1,…,*H*:

$$\begin{aligned} \max_{a_h^{(1)}, \ldots, a_h^{(Q)}} \quad & \sum_{q, j=1, q \neq j}^{Q} c_{q, j} \, \text{cov} \left( X_h^{(q)} a_h^{(q)}, X_h^{(j)} a_h^{(j)} \right), \\ \text{s.t.} \quad & \left\| a_h^{(q)} \right\|_2 = 1 \text{ and } \left\| a_h^{(q)} \right\|_1 \leq \lambda^{(q)} \end{aligned} \tag{7}$$

where λ<sup>(*q*)</sup> is the ℓ1 penalization parameter, *a<sub>h</sub>*<sup>(*q*)</sup> is the loading vector on component *h* associated with the residual (deflated) matrix *X<sub>h</sub>*<sup>(*q*)</sup> of the data set *X*<sup>(*q*)</sup>, and *C =* {*c<sub>qj</sub>*} is the design matrix. *C* is a *Q*×*Q* matrix that specifies whether datasets should be correlated and includes values between zero (datasets are not connected) and one (datasets are fully connected). Thus, we can choose to take into account specific pairwise covariances by setting the design matrix (see Rohart et al., 2017 for implementation and usage) and model a particular association between pairs of datasets, as expected from prior biological knowledge or experimental design. In our integrative case study, we used sparse PLS, a special case of Eq. (7), to integrate microbiome and metabolomic data, as well as sparse multiblock PLS to also integrate variables of interest. Both methods were used with a fully connected design.

The multiblock sparse PLS method was implemented in the mixOmics R package where the ℓ1 penalization parameter is replaced by the number of variables to select, using a soft-thresholding approach (see more details in Rohart et al., 2017).
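The idea of replacing the ℓ1 penalty by a number of variables to select can be sketched with a generic soft-thresholding operator. This is a hedged illustration of the principle, not the mixOmics code: the function name `soft_threshold_keep` and the choice of threshold (the largest absolute loading among the variables that are not kept) are assumptions made for this example.

```python
import math

def soft_threshold_keep(loading, keep):
    """Soft-threshold a loading vector so that at most `keep` entries stay non-zero.

    The threshold is set to the (keep + 1)-th largest absolute loading, every
    entry is shrunk toward zero by that amount, and the result is rescaled to
    unit l2 norm.
    """
    abs_sorted = sorted((abs(a) for a in loading), reverse=True)
    lam = abs_sorted[keep] if keep < len(loading) else 0.0
    shrunk = [math.copysign(max(abs(a) - lam, 0.0), a) for a in loading]
    norm = math.sqrt(sum(a * a for a in shrunk))
    return [a / norm for a in shrunk] if norm > 0 else shrunk

loading = [0.7, -0.5, 0.1, 0.05, -0.3]
sparse = soft_threshold_keep(loading, keep=2)
selected = [j for j, a in enumerate(sparse) if a != 0]
```

In this toy example only the two variables with the largest absolute loadings remain non-zero, and they keep their original signs.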

#### Parameter Tuning

The integrative methods require choosing the number of components *H*, where each component is defined as *t<sub>h</sub>*<sup>(*q*)</sup> = *X<sub>h</sub>*<sup>(*q*)</sup>*a<sub>h</sub>*<sup>(*q*)</sup>, and the number of profiles to select on each PLS component and in each dataset. We generalized the GCCA approach by using the silhouette coefficient based on a grid of parameters for each dataset and each component.

#### Simulation and Case Studies

#### Simulation Study Description

A simulation study was conducted to evaluate the clustering performance of multivariate projection-based methods such as PCA, and the ability to interpolate time points in LMMS.

Twenty reference time profiles were generated on nine equally spaced time points and assigned to four clusters (five profiles each). These ground truth profiles were then used to simulate new profiles. We generated 500 simulated datasets.

#### *Clustering Performance*

We first compared profiles simulated then modeled with or without LMMS:


Clustering was obtained with PCA and compared to the reference cluster assignments in a confusion matrix. The clustering was evaluated by calculating the accuracy of assignment, (TP + TN)/(TP + FP + TN + FN), from the confusion matrix, where for a given cluster, TP (true positive) is the number of profiles correctly assigned in the cluster, FN (false negative) is the number of profiles that have been wrongly assigned to another cluster, TN (true negative) is the number of profiles correctly assigned to another cluster, and FP (false positive) is the number of profiles incorrectly assigned to this cluster. Besides accuracy, we also calculated the Rand index (Rand, 1971) as a similarity metric to assess the clustering performance of PCA. The clustering results from fPCA were poor, even for a low level of noise (**Supplementary Figure 1**); thus, fPCA was not compared against PCA.
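Both evaluation measures are simple to compute. The sketch below is illustrative only; the function names and toy label vectors are hypothetical, and the Rand index is computed in its basic (unadjusted) form as the fraction of profile pairs on which two clusterings agree.

```python
from itertools import combinations

def accuracy(tp, tn, fp, fn):
    """Accuracy of assignment for one cluster: (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)

def rand_index(labels_a, labels_b):
    """Fraction of pairs that are grouped together in both clusterings
    or separated in both clusterings."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]     # same partition, different cluster names
partly_wrong = [0, 1, 1, 1]  # one profile moved to the other cluster

ri_perfect = rand_index(truth, relabeled)
ri_partial = rand_index(truth, partly_wrong)
acc = accuracy(tp=4, tn=14, fp=1, fn=1)
```

The Rand index does not depend on how clusters are named, which is convenient when the cluster labels returned by a method do not match the reference labels.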

#### *Interpolation of Missing Time Points*

To evaluate the ability of LMMS to predict the value of a missing time point for a given feature over time, we randomly removed 0 to 4 measurement points in the simulated datasets described above in step A. We compared the PCA clustering performance with or without LMMS interpolation.

#### Infant Gut Microbiota Development

The gastrointestinal microbiome of 14 babies during the first year of life was studied by Palmer et al. (2007). The authors collected an average of 26 stool samples from healthy full-term infants. As infants quickly reach an adult-like microbiota composition, we focused our analyses on the first 100 days of life. Infants who received an antibiotic treatment during that period were removed from the analysis, as antibiotics can drastically alter microbiome composition (Dudek-Wicher et al., 2018).

The dataset we analyzed included 21 time points on average for 11 selected infants (vaginal delivery = 6, C-section = 5; see **Figure 3**). Samples were collected daily during days 0–14 and weekly after the second week. We separated our analyses based on the delivery mode (C-section or vaginal), as this is known to have a strong impact on gut microbiota colonization patterns and diversity in early life (Rutayisire et al., 2016). The purpose of our statistical analysis was to identify a bacterial signature that describes the dynamics of a baby's microbial gut development in the first days of life, as well as compare differences in signatures between babies born by vaginal delivery or by C-section. As this study is single omics, we applied our framework depicted in **Figure 2** with sPCA.

#### Waste Degradation Study

Anaerobic digestion (AD) is a highly relevant microbial process to convert waste into valuable biogas. It involves a complex microbiome that is responsible for the progressive degradation of molecules into methane and carbon dioxide. In this study, the AD of biowaste was monitored across time (more than 150 days) in three lab-scale bioreactors as described in Poirier et al. (2016).

We focused our analysis on days 9 to 57, which correspond to the most intense biogas production. Degradation performance was monitored through four parameters: methane and carbon dioxide production (16 time points) and the accumulation of acetic and propionic acid in the bioreactors (5 time points). Microbial dynamics were profiled with 16S rRNA gene metabarcoding as described in Poirier et al. (2016) and included 4 time points and 90 OTUs. A metabolomic assay was conducted on the same biological samples at four time points with gas chromatography coupled to mass spectrometry (GC-MS) after solid phase extraction to monitor substrate degradation (Limam et al., 2010). The XCMS R package (version 1.52.0) was used to process the raw metabolomics data (Smith et al., 2006). GC-MS analyses focused on 20 peaks of interest identified by the National Institute of Standards and Technology database. Data were then log-transformed. The purpose of the study was to investigate the relationship between biowaste degradation performance and microbial and metabolomic dynamics across time. The aim of our statistical analysis was to identify highly associated multi-omic signatures characterizing waste degradation dynamics in the three bioreactors. This study involves the integration of two omics datasets and degradation performance measures; thus, we applied sPLS and multiblock sPLS, as shown in our workflow in **Figure 2**.

FIGURE 3 | Infant gut microbiota development study: stool samples were collected from six male and five female babies over the course of 100 days. Samples were collected daily during days 0–14 and weekly thereon until day 100. Time is indicated on the x-axis in days. As delivery method is known to be a strong influence on gut microbiome colonization, the data are separated according to either C-section or vaginal birth.

## RESULTS

#### Simulation Study

#### Clustering Performance

**Figure 4** shows the clustering performance of PCA with an increasing amount of noise in the simulated profiles. Unsurprisingly, PCA gave optimal clustering performance when noise was absent, with or without profile modeling to take into account individual variability. When noise increased, PCA performed better with modeling, which acts as a denoising process. Finally, a high level of noise showed the limitation of the modeling approach, as similar clustering results were obtained with or without LMMS modeling. However, the PCA clustering performance was still very good, with a mean accuracy of 0.7 when the level of noise was maximum.

#### Interpolation of Missing Time Points

We evaluated the ability of LMMS to interpolate an increasing number of missing time points (up to four). Interpolation is important in our framework as it allows the estimation of evenly spaced time points as well as time points that may be missing in one data set but not in the other (e.g., biowaste degradation study). Interpolation did not seem to affect the clustering performance of PCA (**Figure 5** and **Supplementary Figure 2**). Rather, the level of noise had the largest impact on clustering: the mean accuracy was close to 1 when the noise was nonexistent but decreased as the number of missing time points and noise increased. In the latter scenarios, LMMS interpolation seemed to give, on average, better clustering than without interpolation. When the number of missing time points increased, we observed a better classification accuracy with noise compared to no noise. This can be explained by the LMMS modeling of straight lines in the latter case that led to poor clustering (**Supplementary Figure 3**).

(LMMS) modeling: five new profiles were generated per reference, and without modeling: only one profile was simulated per reference. We evaluated the ability of principal component analysis clustering to correctly assign the simulated profiles in their respective reference clusters based on mean accuracy: without noise, both approaches lead to a perfect clustering; with noise < 1, LMMS modeling acts as a denoising process with better performance than no modeling; and with a high level of noise ≥ 1, the performances of both approaches decrease.

#### Clustering Time Profiles: Infant Gut Microbiota Development Study

#### Pre-Processing and Modeling

A total of 2,149 taxa were identified in the raw data (**Table 1**). After the pre-processing steps illustrated in **Figure 2**, a smaller number of OTUs were found in fecal samples of babies born by C-section than by vaginal delivery. Similarly, a smaller proportion of OTUs was fitted with a simple linear regression model in babies born *via* C-section (73%) than by vaginal delivery (81%), and this was also observed after the filtering step (**Table 1**).

#### Comparison of Principal Component Analysis and Functional Principal Component Analysis

According to our tuning criteria, we obtained four clusters with PCA (i.e., two components). We therefore set the same number of clusters in fPCA for comparative purposes. PCA clustering outperformed fPCA for each delivery mode dataset that was analyzed (see **Table 2**). The resulting fPCA clustering is displayed in **Figure 6** for babies born *via* vaginal delivery. We found that the EM approach in fPCA tended to cluster a larger number of uncorrelated OTUs compared to the *k*-CFC approach (average silhouette coefficient = 0.07 for EM and 0.61 for *k*-CFC).

TABLE 1 | Infant gut microbiota development study: number of operational taxonomic units (OTUs) identified and linear model types fitted according to delivery mode.

We used sPCA to select key OTU profiles for each cluster. This step is essential for discarding profiles that are distant from the average cluster profile and thus not informative. As expected, we observed an overall increase in the average silhouette coefficient for the sPCA clustering compared to PCA, indicating a better clustering capability (see **Table 2**). According to the average silhouette coefficient, vaginal delivery showed the best partitioning for PCA clustering (0.87; **Table 2**). Cluster 1 (denoted "component 1 positive" in **Figure 7A**) showed a relative increase in abundance of species, including some that are characteristic of a healthy "adult-like" gut microbiome composition such as the clade *Bacteroidetes* (Thursby and Juge, 2017). The proportionality distance within cluster 1 was low (**Supplementary Table 1**), with a strong association between *Bacteroides* and *Fusobacteria* (φs = 0.04), as well as between *Actinobacter* with *Bacteroides* (φs = 0.02) and *Fusobacteria* (φs = 0.09). According to this distance, there might have been a spurious correlation identified between the genus *Bacteroides* and an environmental uncultured bacterium *(clone HuCA36)* (φs = 14.81); see **Supplementary Table 2**. In cluster 2 ("component 1 negative"), relative profile abundance tended to decrease and corresponded to genera found in vaginal and skin microbiota, such as *Lactobacillus* and *Propionibacterium* (Grice and Segre, 2011; Bing et al., 2012). According to the proportionality distance, *Propionibacterium* and *Lactobacillus* were highly associated (φs = 0.29) as well as with *Campylobacter* (φs = 0.39, see **Supplementary Table 2**). Clusters 3 and 4 (denoted "component 2 positive and negative") highlighted taxa profiles with negative association.

TABLE 2 | Infant gut microbiota development study: average silhouette coefficient according to clustering method.

points were removed. We compared the ability of linear mixed model spline (LMMS) to interpolate missing time points. When there are no time points missing, both interpolated and non-interpolated approaches gave a similar performance. When the number of time points increases, the classification accuracy decreases. Without noise and with several time points removed, LMMS tended to model straight lines, resulting in poor clustering (see also Supplementary Figure 3).

A cladogram representing all OTUs and those selected by sPCA for each cluster is shown in **Figure 8** and illustrates that most families are represented in our OTU selection. In addition, we can observe specific patterns between clusters and families, as discussed above.

Thus, with this preliminary PCA analysis, we were able to rebuild a partial history behind the development of the gut microbiota. Vaginal species that initially colonized the gut progressively disappeared, giving way to species that characterize adult gut microbiota.

For babies born by C-section, four clusters were identified by PCA (**Figure 7D**; cladogram visualization is available in **Supplementary Figure 4**). The median values of the proportionality distance within the different clusters were significantly lower than between the selected OTUs in the clusters and all the other OTUs (**Supplementary Table 3**). For example, the median value within cluster 1 was 0.11 compared to 1.36 outside the cluster. Clusters 1 and 2 ("component 1 positive and negative") displayed either an increase or decrease in relative abundance. However, none of the cluster 2 species are known to characterize, or were found in, vaginal delivery, suggesting that the infant gut was first colonized by the operating room microbes as already demonstrated by Shin et al. (2015). Cluster 3 ("component 2 positive") revealed transitory states of increase then decrease of relative abundance profiles, while cluster 4 ("component 2 negative") showed the reverse trend.

When comparing the dynamics of the two delivery methods, we found a higher diversity in the intestinal microbiota of babies born vaginally (117 modeled profiles) than by C-section (107). For vaginal delivery, the modeling step identified a larger proportion of straight lines, which may indicate a greater inter-individual variability compared to C-section delivery. The clusters denoted "component 1 positive" in both delivery modes showed an increased relative abundance over time, with 32 OTUs assigned to this cluster in vaginally born babies, compared to 11 in C-section (**Table 3**). Despite the relatively sterile environment of the operating room, it was surprising to observe a similar number of OTUs in cluster "component 1 negative" for both delivery modes (vaginal: 38, C-section: 35), as we would have expected to identify a larger number of opportunistic microorganisms colonizing babies born vaginally (e.g., *Propionibacterium acnes, Campylobacter*). These include species found on the surface of the skin and in the vaginal flora. However, for babies born by C-section, we observed a large number of microorganisms from various origins (e.g., *Staphylococcus*, *Rickettsia*, *Rhodobacter*).

In summary, sparse PCA clustering of LMMS modeled profiles enabled the identification of groups of microorganisms with increasing relative abundance over time. These microorganisms are characteristic of an adult gut microbiota. We also identified groups of opportunistic microorganisms with a decreasing relative abundance over time. Finally, we found that, during the first year of life, the gut microbiota was more diverse for babies born vaginally than by C-section.

#### Clustering Omics: Waste Degradation Study Pre-Processing and Modeling

A total of 90 OTUs were identified in the 12 samples of the initial dataset (**Table 4**). After pre-processing, 51 OTUs were retained. Approximately 60% (resp. 50%) of the OTUs (resp. metabolites) were fitted with linear regression models (1), and 40% (resp. 50%) were modeled by more complex spline models (2)–(4). All performance measures were also modeled by splines. During the filtering step, seven OTUs and four metabolites that were fitted with linear regression models were discarded. The small number of profiles that were filtered out indicated that the variability between the three bioreactors was relatively low.

#### Sparse PCA on Concatenated Datasets

As a first and naive attempt to jointly analyze microbial, metabolomic, and performance measures, all three datasets were concatenated then analyzed with sPCA. Only a very small number of profiles from the different datasets were selected. This small selection is likely due to the high variability in each data type. Selected variables included mainly OTUs and performance measures. These were assigned to four clusters and included respectively 1, 3, 2, and 3 OTUs with 0, 1, 2, and 0 metabolites and 2, 0, 1, and 0 performance measures. The average silhouette coefficient was 0.744, a potentially suboptimal clustering compared to our analyses presented in the next section. This preliminary investigation highlighted the limitation of sPCA to identify a sufficient number of associated profiles from disparate sources.
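The average silhouette coefficient used to compare clusterings can be illustrated with a minimal sketch. The Euclidean distance, the function names, and the toy profiles below are assumptions made for illustration; the distance actually used on the modeled profiles is not stated in this section.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_silhouette(profiles, labels):
    """Mean silhouette coefficient over all profiles."""
    scores = []
    for i, (profile, label) in enumerate(zip(profiles, labels)):
        # Collect distances from profile i to every other profile, per cluster.
        by_cluster = {}
        for j, (other, other_label) in enumerate(zip(profiles, labels)):
            if i != j:
                by_cluster.setdefault(other_label, []).append(euclidean(profile, other))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            scores.append(0.0)  # convention for singletons or a single cluster
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical fitted profiles (values at three time points).
profiles = [[0.0, 1.0, 2.0], [0.1, 1.1, 2.1], [2.0, 1.0, 0.0], [2.1, 0.9, 0.1]]
good = average_silhouette(profiles, [1, 1, 2, 2])  # increasing vs decreasing
bad = average_silhouette(profiles, [1, 2, 1, 2])   # mixed assignment
```

Values close to 1 indicate compact, well-separated clusters, whereas values near or below 0 indicate profiles that sit closer to another cluster than to their own.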

#### Microbiome-Metabolomic Integration With sPLS

The results from the sPLS analysis are shown in **Supplementary Figure 5**. Four clusters of variables were identified, and the average silhouette coefficient of 0.954 confirmed that sPLS led to better clustering of the different types of profiles than sPCA. The proportionality distances of the profiles within each cluster are presented in **Table 5** and in **Supplementary Figure 6**. Their low values indicated strong associations between profiles within each cluster, compared to any association outside each of the clusters. A cladogram representing the selected OTUs only, according to each sPLS cluster is shown in **Supplementary Figure 7**.

TABLE 3 | Infant gut microbiota development study: number of operational taxonomic units (OTUs) per cluster identified with principal component analysis (PCA) clustering and OTUs selected in brackets with sparse PCA.

TABLE 5 | Waste degradation study: proportionality distance for clusters identified with sparse PLS. The median distance between all pairs of profiles, within cluster, and with the entire background set (outside a given cluster) is reported. A Wilcoxon test p-value assesses the difference between the medians.

The first cluster (denoted "component 1 negative") included 10 OTUs and 4 metabolite variables and showed increasing relative abundance until a plateau was reached at approximately 40 days. Median value of the proportionality distance within the cluster was 0.42, which was compared to 1.11 between the variables selected in the cluster and all the other variables, indicating strong associations within this cluster. The OTUs were microorganisms often recovered during AD of biowaste, such as methanogenic archaea of *Methanosarcina* genus or bacteria of *Clostridiales*, *Acholeplasmatales*, and *Anaerolineales* orders. These were reported as being involved in the different steps of AD (Poirier et al., 2016). Their relative abundance increased while biowaste was degraded, until there was no more biowaste available in the bioreactor.
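The comparison of within-cluster and outside-cluster proportionality distances can be sketched as follows. We assume here the symmetric form φs(a, b) = var(a − b) / var(a + b) computed on log-transformed profiles; the exact transformation applied in the study is not shown in this section, and the profile names and values are hypothetical.

```python
import math
from itertools import combinations
from statistics import median, variance


def phi_s(x, y):
    """Symmetric proportionality distance on log-transformed positive profiles."""
    a = [math.log(v) for v in x]
    b = [math.log(v) for v in y]
    diff = [p - q for p, q in zip(a, b)]
    total = [p + q for p, q in zip(a, b)]
    return variance(diff) / variance(total)


def median_distances(profiles, cluster):
    """Median phi_s within a cluster and between the cluster and the rest."""
    inside = [phi_s(profiles[i], profiles[j])
              for i, j in combinations(sorted(cluster), 2)]
    outside = [phi_s(profiles[i], profiles[j])
               for i in cluster for j in profiles if j not in cluster]
    return median(inside), median(outside)


# Hypothetical profiles: otu_a and otu_b are proportional, met_c runs opposite.
profiles = {
    "otu_a": [1, 2, 4, 8],
    "otu_b": [2, 4, 8, 16],
    "met_c": [8, 5, 2, 1],
}
inside, outside = median_distances(profiles, {"otu_a", "otu_b"})
```

A value near 0 means the two profiles stay proportional over time, so a low within-cluster median relative to the outside median supports the cluster assignment.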

From the proportionality distances, we found that their abundance across time was, in proportion, similar, indicating a synchronized role during this biological process. In particular, of all the proportionality distances between the profiles of archaea of *Methanosarcina* genus and bacteria of *Clostridiales* order, the *Syntrophomonadaceae* family was the lowest which made sense as these microorganisms have already been reported as syntrophs (Liu et al., 2011); see **Supplementary Table 4**.

Their abundance was also highly associated, in proportion, to the intensity of various metabolites produced during the AD process, such as benzoic acid that is formed during the degradation of phenolic compounds (Hoyos-Hernandez et al., 2014), or phytanic acid, known to be produced during the fermentation of plant materials in the ruminant gut (Watkins et al., 2010), as well as indole-2-carboxylic acid. Thus, the identified microorganisms were likely responsible for the production of these compounds. Cluster 2 (component 1 positive) included 10 OTUs and 4 metabolites. The median value of the proportionality distance within the cluster was also very low compared to the proportionality distance outside the cluster (0.29 and 0.97; **Table 5**). Profiles of cluster 2 were negatively correlated to cluster 1, and their relative abundance decreased with time. OTUs mainly belonged to the *Bacteroidales* order. They were present in the initial inoculum but did not survive in this experiment, as the operating conditions or the substrate were not optimal for their growth, as observed in other studies (Madigou et al., 2019). Consequently, their relative abundance progressively decreased over time. Metabolites identified in cluster 2 were present in the biowaste and were degraded during the experiment. They included fatty acids (decanoic and tetradecanoic acids) that can be found in oil, or 3-(3-hydroxyphenyl)propionic acid, arising from the digestion of aromatic amino acids or breakdown product of lignin or other plant-derived phenylpropanoids. As their profile was negatively correlated to those from cluster 1, it is likely that these metabolites were consumed by OTUs assigned to cluster 1 (Torres et al., 2003). Cluster 3 (component 2 negative) included one OTU and five metabolites. Profiles relative abundance decreased slowly with time until reaching a stable abundance after 20 days. 
One OTU of *Clostridiales* order appeared to have been out-competed by other OTUs or to have been active only during the first days of the degradation, which corresponds to the degradation of complex biopolymers contained in biowaste (Poirier et al., 2016). Among the metabolites of this cluster, hydrocinnamic and 3,4-dihydroxyhydrocinnamic acids are commonly found in plant biomass and its residues (Boerjan et al., 2003). Their molecular structure may have contributed to their slower degradation compared to other molecules, which may explain their stable abundance in the digesters until day 30. Finally, cluster 4 (component 2 positive) included 11 OTUs and 3 metabolites with slow relative abundance increase. OTUs from this group were very varied with eight orders represented. They may have had slower growth rates than OTUs of cluster 1 or were possibly involved in the degradation of molecules from cluster 3. Their abundance may also have had a slow increase as they fed on specific molecules that are only formed during the digestion process. Metabolites included N-acetylanthranilic acid and dehydroabietic acid that were likely produced by microorganisms and accumulated during the AD process, suggesting they could not be metabolized by other microorganisms.

#### Integration of Microbiome, Metabolomic and Performance Data with MultiBlock sPLS

**Figure 9** illustrates the results from the integration of the three datasets, where the performance data are considered as the response of interest. Similar to the sPLS analysis, block sPLS assigned profiles to four clusters, with an average silhouette coefficient of 0.909. The proportionality distances are summarized in **Figure 10** and in **Supplementary Table 5** and show a greater level of association between profiles within each cluster, compared to the associations with all other profiles outside the cluster (see **Supplementary Figure 8** per omic variable).

Two performance variables (methane and carbon dioxide productions) were assigned to cluster 1 (component 1 negative). This result is biologically relevant, as biogas is the final output of the AD reaction and is known to be associated with microbial activity and growth. Moreover, it is produced by archaea, such as *Methanosarcina*, which is also selected in this cluster. The proportionality distance between this OTU and methane was very low (φs = 0.25; **Supplementary Table 4**) confirming a strong association. Cluster 1 therefore represented the progress of the degradation process. In Cluster 2 (component 1 positive), we identified acetate produced by bacteria in the early days of the incubation and consumed by archaea (cluster 1) to produce biogas. It was logically negatively associated to cluster 1 representing the progress of the degradation. Propionate was assigned to cluster 3 (component 2 positive). Its degradation was delayed compared to the molecule of cluster 1. It was expected as, for thermodynamical reasons, its degradation usually only starts when all acetate is degraded (Chapleur et al., 2014). It was biologically relevant to find it associated with hydrocinnamic and 3,4-dihydroxyhydrocinnamic acids, which are also difficult to degrade. Cluster 4 (component 2 negative) was composed of only OTUs and metabolites and was similar to the one obtained with sPLS on component 2 positive.

… of OTUs, metabolites and performance measures selected by multiblock sPLS across time. OTUs, metabolites and performance measures were clustered according to their contribution on each component. The clusters were further separated into profiles denoted "positive" or "negative" that refer to the sign of the loading vector from multiblock sPLS.

In summary, our framework allowed us to integrate different omic datasets measured longitudinally and identify subsets of relevant microorganisms that were highly associated with metabolites abundance and performance measures through the biodegradation process. These analyses constitute a first step toward generating novel hypotheses about the biological mechanisms underpinning the dynamics in AD.

## DISCUSSION

Advances in technology and reduced sequencing costs have resulted in the emergence of new and more complex experimental designs that combine multiple omic datasets and several sampling times from the same biological material. Thus, the challenge is to integrate longitudinal, multi-omic data to capture the complex interactions between these omic layers and obtain a holistic view of biological systems. In order to integrate longitudinal data from microbial communities with other omics, meta-omics, or other clinical variables, we proposed a data-driven analytical framework to identify highly associated temporal profiles between these multiple and heterogeneous datasets.

The application of this method allows the identification of similar expression profiles within a particular dataset (e.g., infant gut microbiota development study) but also across heterogeneous data types (16S amplicon microbiome data, metabolomics, chemical data in the waste degradation study). The clustering of longitudinal profiles helps identify groups of biological entities that may be functionally related and thus generate novel hypotheses about the regulatory mechanisms that take place within the ecosystem.

In the proposed framework, the microbial counts of the microbiota's constituent species are normalized for uneven sequencing library sizes and compositional data. Modeling with linear mixed model splines enables us to reduce the dimension of the data across the different biological replicates and take into account the individual variability due to either technical or biological sources. This approach also enables us to compare data analyzed at different time points (e.g., the waste degradation study). Lastly, we clustered the data using multivariate dimension reduction techniques on the spline models that further allowed integration between different data types, and the identification of the main patterns of longitudinal variation.

Ribicic et al. (2018) proposed an approach similar to ours, but they applied individual PCA or sPCA on each dataset (chemical loss and microbial community) after local polynomial regression modeling. Integration was performed in a second stage of the analysis with PLS by using hierarchical clustering (Cluster Image Maps visualization) to identify correlations between the two datasets. In comparison, we offer a more complete framework that accommodates complex scenarios, across several omics and across replicates, and handles compositional data. The LMMS allows for the modeling of expression over time for each compound across biological replicates while taking into account the overall individual variability. We used sPCA, sPLS, and block sPLS as clustering means by leveraging on the loading vectors from these methods while selecting meaningful profile signatures.
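The idea of using loading vectors as a clustering device can be sketched as below. This is a simplified illustration that starts from loadings already estimated by a (sparse) multivariate method; the assignment rule (largest absolute loading, then its sign), the function name, and the toy values are assumptions rather than the exact implementation.

```python
def assign_clusters(loadings):
    """Assign each profile to a cluster named after a component and a sign.

    loadings: dict mapping a profile name to its loadings, one per component.
    Profiles with all-zero loadings (not selected by a sparse method)
    are left unassigned (None).
    """
    clusters = {}
    for name, values in loadings.items():
        # Component on which the profile contributes most, in absolute value.
        k = max(range(len(values)), key=lambda c: abs(values[c]))
        if values[k] == 0:
            clusters[name] = None
        else:
            sign = "positive" if values[k] > 0 else "negative"
            clusters[name] = f"component {k + 1} {sign}"
    return clusters


# Hypothetical loadings on two components.
loadings = {
    "otu_a": [0.8, 0.1],
    "otu_b": [-0.7, 0.2],
    "met_c": [0.0, -0.5],
    "met_d": [0.0, 0.0],
}
clusters = assign_clusters(loadings)
```

With two components this rule yields at most four clusters, which matches the "component 1 positive/negative" and "component 2 positive/negative" labels used in the case studies.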

Integrating different types of microbiome longitudinal data (e.g., abundance, activity, metabolic pathways, or macroscopic output) can be naively performed by concatenating all datasets. However, we showed that this approach was unsuccessful at selecting a sufficiently large number of profiles of different types and thus did not shed light on the holistic view of the ecosystem dynamics (bioreactor study). Our integrative multivariate methods sPLS and block sPLS were better suited for the integration task, as they do not merge but rather statistically correlate components built on each dataset, and thus avoid unbalance in the signature when one dataset is either more informative, less noisy, or larger than the other datasets.

When compared with fPCA, which uses either *k*-CFC or EM clustering algorithms, we showed that our approach led to better clustering performance. In addition, the sparse multivariate approaches sPCA and block sPLS enabled the identification of key profiles to improve biological interpretation. Note however that fPCA might be better suited than our approach for a large number of time points, as we discuss next.

We have identified several limitations in our proposed framework. First, a high individual variability between biological replicates limits the LMMS modeling step, resulting in simple linear regression models to fit the data. While a straight line model may accurately describe temporal dynamics, it could also be due to a poor quality of fit. We have implemented the Breusch–Pagan test to address this issue. Alternatively, in the case of a very high inter-individual variability that prevents appropriate smoothing, one could consider *N of One* analyses as proposed by Gerber et al. (2012) and Äijö et al. (2017) with time dynamical probabilistic models.
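To make the role of the Breusch–Pagan test concrete, the sketch below applies a studentized (Koenker) variant to a straight-line fit over time: squared residuals are regressed on time and the statistic n·R² is compared with a chi-square distribution with one degree of freedom. How the test is wired into the filtering step is not shown in this section, so the function names, the toy data, and the choice of variant are assumptions.

```python
import math
from statistics import mean


def ols(x, y):
    """Intercept and slope of a simple least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def breusch_pagan(x, y):
    """Return (LM statistic, p-value) for heteroscedasticity of y ~ x."""
    a, b = ols(x, y)
    sq_res = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression of the squared residuals on x.
    a2, b2 = ols(x, sq_res)
    m = mean(sq_res)
    ss_tot = sum((r - m) ** 2 for r in sq_res)
    ss_res = sum((r - (a2 + b2 * xi)) ** 2 for xi, r in zip(x, sq_res))
    r2 = max(0.0, 1 - ss_res / ss_tot) if ss_tot > 0 else 0.0
    lm = len(x) * r2
    # Survival function of a chi-square with 1 degree of freedom.
    return lm, math.erfc(math.sqrt(lm / 2))


# Hypothetical profiles: constant noise versus noise that grows with time.
time = list(range(20))
steady = [2 * t + (1 if t % 2 == 0 else -1) for t in time]
fanning = [2 * t + (t if t % 2 == 0 else -t) for t in time]
_, p_steady = breusch_pagan(time, steady)
_, p_fanning = breusch_pagan(time, fanning)
```

A small p-value suggests that the residual spread changes over time, flagging a straight-line fit that may reflect noise rather than a genuinely linear trend.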

Second, a large number of time points can result in the modeling of noisy profiles and clusters, often due to high individual variability. Highly variable and vastly different profiles can also be difficult to cluster appropriately. Therefore, this framework is recommended when the number of time points remains small (5–10) and when regular and similar trends are expected from the data.

Third, even though our simulation results showed that the LMMS interpolation of missing time points did not seem to impact clustering, the overall performance of the approach would be optimal for regularly spaced time points in the omics longitudinal experiments.

Fourth, we have not fully addressed the issue of analyzing time-course compositional data. Indeed, when working with relative abundances, fluctuations in the abundance of a particular microorganism might result in spurious fluctuations in the abundance of other microorganisms. This issue is not specific to microbiome data only, as other sequencing-based data are intrinsically compositional (Gloor et al., 2017). Thus, when looking for associations between longitudinal profiles, the optimal solution could be to analyze absolute abundances. However, such data require spike-ins and are currently rarely available. Badri et al. (2018) have investigated normalization strategies and their effect in correlation analysis but for a single time point, while Metwally et al. (2018) proposed three normalization strategies that ignore the compositional data problem. No method for longitudinal compositional data analysis has been proposed as yet. The proportionality measure proposed by Lovell et al. (2015) is a promising solution to reduce spurious correlations. However, it has not been developed for longitudinal problems, and the metric is not suitable in our context to perform variable selection. Instead, we chose to use the proportionality distance as a *post hoc* evaluation in our framework, not only to reduce potential spurious associations between profiles assigned in each cluster, but also to improve and help interpretation with respect to proportional and relative abundance of the profiles.
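The spurious fluctuations induced by relative abundances can be shown on a toy example. In the sketch below, which uses hypothetical absolute abundances for three taxa, only taxon A grows; after closure to proportions, taxon B appears to decline even though its absolute abundance is constant, while the log-ratio between B and C is unaffected.

```python
import math


def closure(sample):
    """Convert absolute abundances to relative abundances summing to 1."""
    total = sum(sample)
    return [v / total for v in sample]


def clr(sample):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(v) for v in sample]
    center = sum(logs) / len(logs)
    return [value - center for value in logs]


# Hypothetical absolute abundances of taxa (A, B, C) at three time points.
absolute = [[10, 50, 40], [100, 50, 40], [1000, 50, 40]]
relative = [closure(sample) for sample in absolute]

rel_b = [sample[1] for sample in relative]                      # spurious decline
log_ratio_bc = [math.log(s[1] / s[2]) for s in relative]        # stays constant
clr_profiles = [clr(sample) for sample in relative]
```

This is why log-ratio based quantities, such as the proportionality distance, are more robust to compositional effects than correlations computed directly on relative abundances.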

Finally, our framework does not include time delay analysis, even though dynamic delays between different types of molecules (e.g., DNA, RNA, or metabolites) can be expected. For example, 16S data describes the abundance of the microorganisms, with metabolites as the consequence of their activity, and performance as the macroscopic resulting output. Potential delays between these molecules can be detected using other techniques, such as the fast Fourier transform approach from Straube et al. (2017), and will be further investigated in our future work.
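A simple way to look for such delays is a lagged cross-correlation between two profiles. The sketch below is a direct time-domain computation, not the fast Fourier transform approach of Straube et al. (2017); the function names and the toy profiles are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_lag(x, y, max_lag):
    """Return (lag, r) such that y[t + lag] correlates best with x[t]."""
    best = None
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[:len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[:len(y) + lag]
        r = pearson(xs, ys)
        if best is None or r > best[1]:
            best = (lag, r)
    return best


# Hypothetical profiles: the metabolite follows the OTU two time points later.
otu = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
metabolite = [0, 0, 0, 0, 1, 3, 1, 0, 0, 0]
lag, r = best_lag(otu, metabolite, max_lag=3)
```

With few time points, each lag is estimated from even fewer overlapping observations, so such estimates should be read with caution.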

To summarize, we have proposed one of the first computational frameworks to integrate longitudinal microbiome data with other omics data or other variables generated on the same biological samples or material. The identification of highly associated key omics features can help generate novel hypotheses to better understand the dynamics of biological and biosystem interactions. Thus, our data-driven approach will open new avenues for the exploration and analyses of multi-omics studies.

# DATA AVAILABILITY STATEMENT

Infant gut microbiota phylochip raw data can be found in Palmer et al. (2007). The microbiome and performance datasets for the bioreactor study can be found in Poirier and Chapleur (2018); metabolomic data are available on request. In-house scripts and code to conduct both case study analyses are available in a GitHub public repository: https://github.com/abodein/timeOmics

## AUTHOR CONTRIBUTIONS

All authors contributed to the design of the study. AB and OC performed the statistical analyses. AB, OC, and K-ALC wrote the manuscript. All authors read and approved the submitted version.

# FUNDING

Waste degradation study was supported in part by the Digestomic project funded by the French National Research Agency (ANR-16-CE05-0014). K-ALC was supported in part by the National Health and Medical Research Council (NHMRC) Career Development fellowship (GNT1159458). K-ALC and OC scientific travels were supported in part by the France-Australia Science Innovation Collaboration (FASIC) Program Early Career Fellowships from the Australian Academy of Science. AB and AD are supported by Research and Innovation chair L'Oreal in Digital Biology.

## ACKNOWLEDGMENTS

We thank Angéline Guenne for analytical support with GC-MS analysis, Kodjovi Dodji Mlaga for the biological interpretations of the infant study, and Zoe Welham for proof-reading the manuscript. We thank the reviewers for their constructive comments.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00963/full#supplementary-material


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The handling editor and reviewer LA declared their involvement as co-editors in the Research Topic, and confirm the absence of any other collaboration.

*Copyright © 2019 Bodein, Chapleur, Droit and Lê Cao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities

*Duo Jiang1, Courtney R. Armour2, Chenxiao Hu1, Meng Mei1, Chuan Tian1, Thomas J. Sharpton1,2 and Yuan Jiang1\**

1 Department of Statistics, Oregon State University, Corvallis, OR, United States, 2 Department of Microbiology, Oregon State University, Corvallis, OR, United States

The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.

Keywords: compositionality, heterogeneity, microbiome networks, multi-omics data integration, network analysis, normalization, sparsity

# INTRODUCTION

The microbiological sciences have undergone a research transformation in recent years as extensive volumes of microbiome data have been generated. By coupling environmental DNA sequencing procedures with bioinformatic and data analytic approaches, scientists have begun to disentangle the composition, diversity, and function of microbiomes (The Human Microbiome Project Consortium, 2012; Sunagawa et al., 2015; Thompson et al., 2017). However, the complexity of microbial systems, which frequently include diverse taxa and ecological covariates, continues to challenge the discovery of biological signal in these massive data sets. One common goal is to resolve how the microbiome influences or responds to its environment (Alivisatos et al., 2015; Blaser et al., 2016). To disentangle these mechanisms among the complex milieu of microbiome features, researchers have developed a rich array of analytical procedures, with one of the most widely used being microbiome network reconstruction.

#### Edited by:

Himel Mallick, Merck, United States

#### Reviewed by:

Angela Re, Italian Institute of Technology, Italy

Yuehua Cui, Michigan State University, United States

#### \*Correspondence:

Yuan Jiang yuan.jiang@stat.oregonstate.edu

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 13 February 2019 Accepted: 18 September 2019 Published: 08 November 2019

#### Citation:

Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ and Jiang Y (2019) Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities. Front. Genet. 10:995. doi: 10.3389/fgene.2019.00995


Networks can be used to itemize interactions between community members, between communities, and between community members and some set of covariates (Follows et al., 2007; Faust et al., 2012; Gaulke et al., 2016; Tapio et al., 2017; Gould et al., 2018; Mandakovic et al., 2018). As a result, they offer a mapping of how information flows among the members of the microbiome or its environment (Röttjers and Faust, 2018). These networks have been most widely applied to microbiome taxonomic data and are traditionally assembled by correlating microbiome features and establishing linkages between features based on the significance or magnitudes of these correlations (Faust and Raes, 2012; Röttjers and Faust, 2018). Networks can then be visualized or analyzed using a variety of techniques to resolve, for example, taxa that potentially co-depend on one another, taxa that potentially compete with one another, or keystone taxa (Faust and Raes, 2012; Layeghifard et al., 2017). More analytically rigorous methods for inferring these taxonomic interactions have recently been developed to resolve the biologically relevant interactions and to account for unique statistical features of microbiome data (Dohlman and Shen, 2019).

While the analysis of networks representing microbe-microbe interactions has transformed our knowledge of how uncultured microbes potentially interact with one another in their environment, a small but growing number of studies increasingly leverage multi-omics networks to infer how microbial taxa interact with features of their environment (Kint et al., 2010; McHardy et al., 2013; Theriot et al., 2014; Morgun et al., 2015; Heintz-Buschart et al., 2016; Pfalzer et al., 2016; Maier et al., 2017). Microbiome multi-omics data involve collecting multiple types of high-dimensional biological data—including 16S, metagenomic, metatranscriptomic, metabolomics, etc.—from a microbiome sample and its environment or host. While these approaches often remain relatively expensive, technological transformations continue to reduce the cost of generating diverse data constructs, which, in turn, increases the rate at which researchers can apply these multi-omics approaches. This increased accessibility is fortunate, as the integration of multiomics data holds potential to resolve functional mechanisms of the microbiome (Rodrigues et al., 2018; Wang et al., 2019). For example, these data integrative networks can clarify how changes in the relative abundance of a taxon relates to the expression of genes across a microbial community (i.e., the metatranscriptome), the pool of metabolites, or the phenotype of the microbiome's host. However, there remain relatively few tools that investigators can rely on to integrate and understand these data.

Multi-omics network integration offers an opportunity to resolve how specific members of the microbiome functionally relate to specific environmental features, which, in turn, helps researchers key in on pathways of information flow that may ultimately transform our ability to manipulate, rescue, or mimic microbiomes. However, their application remains nascent. Most studies in this area thus far apply measures of correlation (such as Spearman's rank correlation) to resolve microbial taxa that correlate with specific environmental or host features. This approach has specifically been used to clarify how gut microbial abundance relates to the pool of intestinal metabolites (McHardy et al., 2013), discern possible connections between mucosal bacterial abundance and intestinal gene expression in association with inflammatory bowel disease (Morgan et al., 2015), resolve which specific microbes on the human skin may produce metabolites of interest (Bouslimani et al., 2015), and uncover how ocean microbes express transcripts (Aylward et al., 2015). However, this relatively simplistic statistical approach does not necessarily meet the assumptions of microbiome data or address the needs of the problems that arise from such data and may yield inappropriate conclusions.
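The correlation-based approach described above can be sketched as a bipartite network in which an edge links a taxon and a metabolite when the absolute Spearman correlation exceeds a threshold. The threshold of 0.75, the feature names, and the toy data are assumptions; as noted above, this simple approach does not account for the specific properties of microbiome data.

```python
import math


def ranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def bipartite_edges(taxa, metabolites, threshold):
    """Signed, weighted edges between features of different types only."""
    edges = []
    for taxon, abundance in taxa.items():
        for metabolite, level in metabolites.items():
            rho = spearman(abundance, level)
            if abs(rho) >= threshold:
                edges.append((taxon, metabolite, rho))
    return edges


# Hypothetical measurements across five samples.
taxa = {"taxon_1": [1, 2, 3, 4, 5], "taxon_2": [5, 3, 4, 1, 2]}
metabolites = {"met_a": [2, 4, 8, 16, 32], "met_b": [3, 1, 2, 5, 4]}
edges = bipartite_edges(taxa, metabolites, threshold=0.75)
```

In practice the threshold would typically be replaced or complemented by a significance test with correction for multiple comparisons.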

To promote the innovation of statistical approaches that are more appropriate and specific for microbiome multi-omics network analysis, we present a comprehensive review of the currently available network-based statistical methods and discuss their application to multi-omics data integration. In addition, we consider the unique features of microbiome data and microbiome multi-omics data integration and further explore the reviewed network-based statistical methods in terms of their appropriateness and limitations when applied to microbiome multi-omics data integration. We conclude with remarks on the major challenges and research opportunities in the innovation of statistical approaches for microbiome multi-omics network analysis.

# OVERVIEW OF NETWORKS

Network data structures are often complex and involve rich and unusual terminology. In this section, we orient readers to basic concepts and terms associated with network data science, with the goal of improving comprehension of the subsequent discussion of network-based statistical approaches (Section "Review of Available Network-Based Procedures").

Networks, which are also called graphs, are useful data structures for examining how components of a system interact with or relate to one another. These interactions are commonly derived using statistical approaches that reveal associations between pairs of components and are further illustrated graphically as edges that connect pairs of nodes that represent the components of a system. Networks can also represent empirical interactions between components that have been experimentally validated. However, in the case of microbiome research, limitations in the number of cultured taxa and the complexity of most microbial communities restrict the application of such empirical approaches. Networks have been effectively used in a variety of fields. Examples include infectious disease research (Silk et al., 2017), social interaction analysis applied to marketing (Liu et al., 2019) and political science (Cranmer et al., 2017), analysis of neuroimaging data (Fujita et al., 2017), information flow through the internet (Dorogovtsev and Mendes, 2003), and genomics data analysis (Kleaveland et al., 2018). In microbiome science, network data structures have been used in a variety of contexts (as reviewed by Faust and Raes, 2012, and Layeghifard et al., 2017), including efforts to evaluate interactions between members of a microbial community (Faust et al., 2012), associate taxa with metabolite production (Bouslimani et al., 2015), and determine which taxa interact with host bile acid metabolism (Theriot et al., 2016).

Networks adopt a variety of terms and properties, some of which we define here to orient readers. The components of the system being modeled by a network are represented as nodes or vertices. In microbiome research, nodes can be biological features such as microbial taxa, genes, metabolites, and proteins. Nodes may also represent environmental or host features, such as pH and markers of immune status. The presence of an edge between a pair of nodes indicates an association between the nodes, such as a correlation between the abundance of two taxa. Such edges may suggest a dependency between the taxa by indicating, for example, that when one taxon increases in abundance, the other does as well, possibly due to cross-feeding. We note that an inferred edge itself does not imply a causal dependency between the features, the inference of which requires a controlled experiment. If the associations differ in strength, edges can be weighted to illustrate the strength of association and guide interpretation. The distinction between positive and negative associations can also be captured by weights of different signs. In some cases, the interactions being modeled by a network are directed, meaning that a change to one component causes a change in another connected component. In such instances, directed network edges are represented by arrows and can be used to depict the cause and effect relationships among components. It is worth noting that causality can be challenging or impossible to infer in many genomic investigations depending on the study design. In those cases, the directionality of the relationship might be pre-specified based on prior knowledge to construct a bipartite network (e.g., in some regression models, see Section "Regression-Based Methods") or inferred using the data in a probabilistic framework as a way of representing the information propagation in the system (e.g., in Bayesian networks, see Section "Bayesian Networks").

In this article, our main interest is in the problem of estimating or constructing a network by integrating two or more types of omics data including microbiome data. In the rest of this article, the variables representing the components corresponding to each data type will be referred to as "features." Variables from different types of omics data will be said to belong to different "feature types." Examples of feature types include but are not limited to microbiome taxonomic, transcriptomic, and metabolomic features. The corresponding features within these feature types may include the abundance of a microbial taxon, the expression level of a gene, and the concentration of a metabolite. Depending on the scientific question of interest and the analytical approach used, there are various types of networks that can be constructed based on multi-omics data. When considering associations between distinct feature types, a bipartite network can be used where the edges are drawn between nodes of different types (as reviewed by Pavlopoulos et al., 2018). Alternatively, it is possible to construct a network among features of a single type where data from another type are incorporated in the analysis as additional information or covariates to improve the estimation of the network. Examples of this approach include studies conducted by Li et al. (2012) and Chun et al. (2013), a more detailed discussion of which can be found in Section "Methods Based on Graphical Models".

Once networks are estimated from the data, there are numerous metrics that can be quantified on the networks to summarize the overall structure of the system. One of the primary metrics used is degree, which is the count of edges that connect one node to all the others. Nodes with higher degree represent features that are relatively highly connected to other features in the system being modeled. Such nodes may have more influence on the system's dynamics and may represent, for example, keystone taxa in a community. Most real-world networks have a right-skewed degree distribution where most vertices have low degree, and few have high degree. When the degree distribution follows a power law, decreasing monotonically over its entire range, the network is referred to as a scale-free network. In a scale-free network, some nodes can have significantly higher degree than others. Such nodes are often referred to as "hubs" because they are strong participants in the interactions in the network. Another way to identify important nodes is through measures of betweenness. To calculate betweenness, the shortest path between each pair of nodes in the network is first identified. Then, the betweenness for each node is measured as the number of times the node in question lies on the shortest path between two other nodes. Nodes with high betweenness are potentially influential in the network since they come between many pairs of nodes. Nodes with high betweenness can also have high degree; however, that is not always the case. High-betweenness nodes are often interpreted as bottlenecks of the information flow in the network. Various other topological properties of the network can also be assessed to glean interesting biological insights into a system, such as modularity, which aims to identify clusters of nodes densely connected to each other, with relatively low connectivity to the rest of the network.
We refer the readers to the papers of Newman (2010), Ma'ayan (2011), and Charitou et al. (2016) for more in-depth discussions of topological analysis techniques. In this review, we will focus on the statistical estimation of networks instead of the topological analysis of an estimated network.
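
As a minimal sketch of the two node-level metrics described above, the following Python code computes degree and shortest-path betweenness on a small hypothetical network. The node labels and edges are invented for illustration, and the betweenness routine uses a standard breadth-first accumulation scheme for unweighted, undirected graphs (often called Brandes' algorithm), counting each unordered pair of nodes once.

```python
from collections import deque

def degree(adj):
    """Number of edges attached to each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def betweenness(adj):
    """Shortest-path betweenness for an unweighted, undirected graph."""
    score = {node: 0.0 for node in adj}
    for source in adj:
        # Breadth-first search from `source`, counting shortest paths.
        order = []
        parents = {node: [] for node in adj}
        n_paths = {node: 0 for node in adj}
        dist = {node: -1 for node in adj}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    parents[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        carried = {node: 0.0 for node in adj}
        for w in reversed(order):
            for v in parents[w]:
                carried[v] += n_paths[v] / n_paths[w] * (1.0 + carried[w])
            if w != source:
                score[w] += carried[w]
    # Each unordered pair was visited from both of its endpoints.
    return {node: s / 2.0 for node, s in score.items()}

# Hypothetical network: two triangles (A, B, C) and (D, E, F) bridged by node G.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "G"),
         ("G", "D"), ("D", "E"), ("D", "F"), ("E", "F")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

deg = degree(adj)
btw = betweenness(adj)
```

In this toy network the bridging node G has only two edges, yet it lies on every shortest path between the two triangles, which illustrates how a node can have high betweenness without having high degree.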

# REVIEW OF AVAILABLE NETWORK-BASED PROCEDURES

In recent years, integrative network analysis has increased in popularity, particularly for multi-omics data sets. The statistical methods utilized in these analyses lend perspective to how microbiome multi-omics networks can be inferred. In this section, we review network-based statistical methods with an emphasis on their applications to multi-omics data integration. We categorize commonly adopted methods into six types and present a detailed review of each type. **Table 1** provides a summary and a comparison of the six types of methods alongside software packages that enable their implementation.

# Marginal Correlation Analysis

The most commonly applied statistical method for constructing biological networks is marginal correlation analysis. In this analysis, the relationship between two biological features, such as genes, transcripts, proteins, metabolites, and microbes, is described by the correlation of their expression, concentration, or abundance levels inferred from multiple statistically independent observations, such as biological replicates or samples. Technically, this relationship can be quantified by any statistical measure of correlation, including but not limited to Pearson's correlation, Spearman's rank correlation, and Kendall's tau, as long as the approach is meaningful for a given biological context. Marginal correlation analysis is also useful when integrating multiple biological feature types (e.g., genes, transcripts, and proteins) to uncover relationships across feature types (Heintz-Buschart et al., 2016; Bakker et al., 2018; Frost and Amos, 2018; McGrail et al., 2018).

TABLE 1 | Summary of available network-based procedures.

Marginal correlation analysis can also be extended to observations that are statistically dependent. For example, consider the case wherein two biological features are observed over time (i.e., two time series of measures). One might want to assess the correlation of the features across the time series. In this case, it is essential that the correlation measures account for the longitudinal nature of the observations. One approach to this problem is the so-called local similarity analysis of two time series (Ruan et al., 2006). In this approach, both time series are first transformed separately to their normal scores. Then, for any subsequence of the first time series starting from the beginning, all subsequences of the same length from the second time series are identified within some predefined time delay. Pearson's correlations are then calculated between each pair of subsequences across the two time series. Finally, the local similarity score is defined as the maximum correlation for all such possible pairs of subsequences, aiming to find associations with possible delays between the two time series. Local similarity analysis has proven useful for detecting co-varying pairs of microbes as well as the association between a microbe and an environmental factor (e.g., temperature), especially when the variations between features are not synchronous (Ruan et al., 2006).
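
The idea of scanning for delayed associations can be illustrated with a deliberately simplified sketch. The Python code below is not the algorithm of Ruan et al. (2006): it only compares the full overlapping segments of two normal-score-transformed series at each delay within a window, rather than all subsequences. The function names, the tie-free ranking, and the simulated series are assumptions made for illustration.

```python
from statistics import NormalDist
import numpy as np

def normal_scores(x):
    """Rank-based normal scores (assumes no ties in x)."""
    ranks = np.argsort(np.argsort(x)) + 1          # ranks 1..n
    n = len(x)
    return np.array([NormalDist().inv_cdf(r / (n + 1)) for r in ranks])

def lagged_similarity(x, y, max_delay):
    """Largest-magnitude Pearson correlation over delays in [-max_delay, max_delay].

    A positive delay d pairs x[t] with y[t + d], i.e., y lags behind x.
    Returns (correlation, delay).
    """
    xs, ys = normal_scores(x), normal_scores(y)
    n = len(xs)
    best_r, best_d = 0.0, 0
    for d in range(-max_delay, max_delay + 1):
        if d >= 0:
            a, b = xs[:n - d], ys[d:]
        else:
            a, b = xs[-d:], ys[:n + d]
        r = np.corrcoef(a, b)[0, 1]
        if abs(r) > abs(best_r):
            best_r, best_d = r, d
    return best_r, best_d

# Simulated example: y repeats x two time points later.
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = np.roll(x, 2)
score, delay = lagged_similarity(x, y, max_delay=3)
```

With the simulated series, the synchronous correlation is weak while the scan recovers the two-step delay, which is the kind of asynchronous association that a plain correlation coefficient would miss.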

While the abovementioned methods are purely data-driven, other methods construct biological networks based on both statistical correlations and existing biological knowledge. For example, to create a bipartite network describing the relationship between mRNAs and miRNAs, Gade et al. (2011) combined two p-values for each pair of mRNA and miRNA expression values: (a) a p-value measuring the statistical correlation of the observed data and (b) a p-value obtained from an existing database of miRNA-target predictions (e.g., miRBase) (Griffiths-Jones et al., 2008). The authors applied a truncated product method of combining p-values (Zaykin et al., 2002), which they then transformed to weights and viewed as the adjacency matrix of a bipartite network describing the relationship between mRNAs and miRNAs.

In order to produce a biological network that facilitates meaningful interpretation, studies often include an edge in the network only if the absolute value of the corresponding correlation coefficient exceeds a threshold, which is usually arbitrarily determined, or if its associated p-value is less than a significance level such as 0.05. In the latter case, some applications simply use the raw p-values, which tend to yield excessive false positive edges, while other applications more carefully control false positives by adjusting the p-values with a multiple testing correction for familywise error rate (FWER) or false discovery rate (FDR). A biological network is then constructed by connecting those pairs of biological features with a statistically robust correlation and leaving all other pairs unconnected.
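
The thresholding procedure can be sketched as follows, assuming a samples-by-features matrix, Pearson correlations, approximate p-values from the Fisher z-transformation, and the Benjamini-Hochberg adjustment as the FDR correction. These specific choices, the function name, and the simulated data are illustrative assumptions; any correlation measure or multiple testing procedure discussed above could be substituted.

```python
import math
import numpy as np

def correlation_network(X, alpha=0.05):
    """Unweighted network from pairwise correlations with FDR control.

    X is an (n samples) x (p features) array. Returns the 0/1 adjacency
    matrix, the correlation matrix, and the adjusted p-values for the
    upper-triangular feature pairs.
    """
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices(p, k=1)
    r = np.clip(R[iu], -0.999999, 0.999999)
    # Approximate two-sided p-values via the Fisher z-transformation.
    z = np.arctanh(r) * math.sqrt(n - 3)
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    # Benjamini-Hochberg adjusted p-values.
    m = len(pvals)
    order = np.argsort(pvals)
    scaled = pvals[order] * m / np.arange(1, m + 1)
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    qvals = np.empty(m)
    qvals[order] = np.minimum(adjusted_sorted, 1.0)
    # Connect the pairs whose adjusted p-value falls below alpha.
    A = np.zeros((p, p), dtype=int)
    keep = qvals < alpha
    A[iu[0][keep], iu[1][keep]] = 1
    return A + A.T, R, qvals

# Simulated data: feature 1 tracks feature 0; the rest are independent noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)
A, R, qvals = correlation_network(X)
```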

The abovementioned thresholding procedure produces a biological network that is unweighted, in the sense that an edge either exists or not between any pair of nodes. Weighted networks based on marginal correlation analysis have also attracted recent attention, such as in the case of Weighted Gene Co-expression Network Analysis (WGCNA) (Zhang and Horvath, 2005; Langfelder and Horvath, 2008). In this method, an edge in a network is weighted by a soft thresholding function of the inferred correlation (e.g., the sigmoid function, the power adjacency function, etc.) on a continuous scale. Many topological analysis methods have also been extended from unweighted networks to weighted networks, such as node connectivity (Barrat et al., 2003; Amano et al., 2018), network modules (Newman, 2004; Li et al., 2011; Lecca and Re, 2015), clustering coefficient (Opsahl and Panzarasa, 2009), and scale-free topology (Tan and Lei, 2013; Zhang et al., 2015). Because weighted networks encode additional information in the form of connection strengths as compared to unweighted networks, weighted networks have been shown to be a useful option for many biological datasets, including but not limited to microarray data (Kadarmideen et al., 2011; Mohammadnejad et al., 2019), single cell RNA-Seq data (Xue et al., 2013), DNA methylation data (Horvath et al., 2012; Wang et al., 2016), and microbiome data (Tong et al., 2013; Li et al., 2019).
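
To make soft thresholding concrete, the sketch below applies the power adjacency function, which raises the absolute correlation to a power beta, to a small made-up correlation matrix. The value beta = 6 and the matrix entries are illustrative assumptions; in practice the exponent is chosen from the data.

```python
import numpy as np

def power_adjacency(R, beta=6):
    """Weighted adjacency a_ij = |r_ij| ** beta, with a zero diagonal."""
    A = np.abs(np.asarray(R, dtype=float)) ** beta
    np.fill_diagonal(A, 0.0)
    return A

# Hypothetical correlation matrix for three features.
R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, -0.5],
              [0.1, -0.5, 1.0]])
A = power_adjacency(R, beta=6)

# Weighted analogue of degree: the sum of edge weights at each node.
connectivity = A.sum(axis=1)
```

Raising correlations to a power keeps strong associations close to their original scale while pushing weak ones toward zero, so no hard cutoff is needed.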

Marginal correlation analysis is probably the most commonly used method to infer biological networks due to its computational simplicity. However, the approach is limited by the fact that it can only infer relationships between pairs of biological features and does not consider how the observed relationship may depend upon other variables or features. As a result, marginal correlation analysis can lead to spurious correlations: two features that independently interact with a third, but not with one another, may appear to correlate. Therefore, marginal correlation analysis is known to be prone to false positives when seeking to identify direct interactions or causal effects among the features. It is important to keep this limitation in mind and to critically assess the risk of confounding factors before drawing conclusions about biological interactions that result from marginal correlation analysis.
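
The spurious-correlation pitfall can be reproduced with a short simulation. In the sketch below, two hypothetical features x and y each depend on a third variable z but not on one another; the variable names, noise levels, and sample size are arbitrary choices for illustration. The marginal correlation between x and y is strong, whereas the correlation that remains after regressing both on z is close to zero.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after regressing each on z (with an intercept)."""
    Z = np.column_stack([np.ones(len(z)), z])
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                 # shared driver, e.g., an environmental factor
x = z + 0.5 * rng.normal(size=n)       # depends on z only
y = z + 0.5 * rng.normal(size=n)       # depends on z only

marginal = np.corrcoef(x, y)[0, 1]     # strong, although x and y do not interact
partial = partial_corr(x, y, z)        # near zero once z is accounted for
```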

# Dimension Reduction Methods

Dimension reduction, such as the widely used method principal component analysis (PCA), is a useful statistical tool that aims to reduce the dimension of a set of variables while retaining as much information from the original data as possible. It is also useful when the relationships between two feature types are investigated, in which case data associated with each feature type are reduced to a lower dimension in a way that captures as much association between the two feature types as possible. We refer the readers to the review papers of Burges (2009) and Engel et al. (2011) as two statistical reviews on dimension reduction and to the review paper of Meng et al. (2016) as a review on the application of dimension reduction to the integrative analysis of multi-omics data.

Commonly used dimension reduction tools include canonical correlation analysis (CCA), partial least squares regression (PLS), and co-inertia analysis (CIA) (Meng et al., 2016). These tools share the same goal of summarizing the variables in each feature type by using a small number of linear combinations so as to maximize the association between the two feature types as demonstrated by these linear combinations. Different measures of association correspond to different tools in this category. More specifically, CCA uses Pearson's correlation to capture the association between two linear combinations (or equivalently, all linear combinations are normalized to have a unit variance), PLS uses covariance to quantify the association with the constraint that the linear combination from one feature type has a unit variance, and CIA uses covariance to represent the similarity with no variance constraint. CCA, PLS, and CIA have all been applied to infer biological networks from multi-omics data. For example, CCA was used to construct gene co-expression networks by considering linear combinations of gene expression at the exon or base pair level for each gene obtained from an RNA-seq dataset (Hong et al., 2013). In this study, the authors then calculated the canonical correlation between each pair of genes, ranked the correlations based on their magnitude, and constructed a co-expression network by retaining a predetermined percentage of edges. In other studies, CIA was applied to mRNA and microRNA data to determine which microRNAs regulate gene expression (Jovanović et al., 2014) as well as to microbiome and metabolomic data sets to understand the impact of a short-term increase in dietary fiber intake on the gut microbial community (Tap et al., 2015).
PLS has also been utilized in multi-omics studies, for example, to analyze the associations between biomarkers for insulin sensitivity and a variety of omic data, including gut microbiota, adipose gene expression, and metabolomic data (Dao et al., 2019).
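
As a sketch of the CCA criterion described above, the following code computes the first pair of canonical weight vectors by whitening each block of variables and taking a singular value decomposition of the whitened cross-covariance. It assumes more samples than variables in each block so that both covariance matrices are invertible; the function name and the simulated two-block data are illustrative assumptions.

```python
import numpy as np

def first_canonical_pair(X, Y):
    """First canonical weights (a, b) and canonical correlation for blocks X and Y."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1)
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via the eigendecomposition of S.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    a = Kx @ U[:, 0]      # weights for X; X @ a has unit sample variance
    b = Ky @ Vt[0]        # weights for Y; Y @ b has unit sample variance
    return a, b, s[0]

# Simulated blocks that share one latent factor t.
rng = np.random.default_rng(0)
n = 300
t = rng.normal(size=n)
X = np.outer(t, [1.0, 0.5, 0.0]) + rng.normal(size=(n, 3))
Y = np.outer(t, [0.8, 0.0, 0.0, 0.3]) + rng.normal(size=(n, 4))
a, b, rho = first_canonical_pair(X, Y)
```

The returned value rho equals the Pearson correlation between the two linear combinations X @ a and Y @ b, which is the quantity that CCA maximizes. Note that both weight vectors are generally dense, with a nonzero entry for every variable.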

These methods suffer from a few limitations, which recent efforts have sought to overcome. The first limitation stems from the fact that a linear combination found by CCA, PLS, and CIA tends to include every variable under consideration, albeit with varying weights. This tendency to include every variable results in poor interpretability as it can be difficult to determine which variables contribute to the canonical correlations and which do not. Therefore, a desirable extension is to introduce sparsity to the linear combinations, where the coefficients for variables with less contribution are shrunk to zero. Recent methods that apply such a strategy include sparse canonical correlation analysis (SCCA) (Parkhomenko et al., 2007; Waaijenborg et al., 2008; Parkhomenko et al., 2009; Witten and Tibshirani, 2009; Witten et al., 2009; Hardoon and Shawe-Taylor, 2011; Suo et al., 2017), sparse partial least squares (SPLS) (Lê Cao et al., 2008; Chun and Keleş, 2010; Chung and Keles, 2010; Lee et al., 2011), and sparse co-inertia analysis (SCIA) (Min et al., 2018). These methods try to balance between maximizing the correlation between linear combinations defined for different feature types and minimizing the number of variables included in each linear combination. These methods share the same basic idea of incorporating variable selection techniques, such as lasso and elastic net (Tibshirani, 1996; Zou and Hastie, 2005), into traditional dimension reduction methods. As a result, these methods produce a sparse linear combination for each group of variables, although they each differ in either the problem formulation or computational details. These methods have been used to integrate SNP and gene expression data with the goal of identifying a group of SNPs that explain the variation in gene expression across a group of genes while keeping the group sizes sufficiently small to aid biological interpretation (Parkhomenko et al., 2007; Parkhomenko et al., 2009).

Another limitation of the traditional dimension reduction tools is that they can only consider two feature types, i.e., two groups of variables. Extensions of SCCA have been proposed to accommodate the analysis of multiple groups of variables (Witten and Tibshirani, 2009; Tenenhaus et al., 2014). Meng et al. (2014) proposed the multiple CIA method and used it to integrate transcriptomic, proteomic, and metabolomic data. All of these methods aim to find a linear combination from each group of variables so as to maximize the sum of squared pairwise correlations or the sum of squared covariances between each linear combination and a synthetic axis that is also parametrically optimized.

The third limitation of the traditional dimension reduction tools is that they only replace the original features by their linear combinations. Nonlinear dimension reduction tools have been proposed to overcome this limitation, such as kernel-based dimension reduction methods including kernel principal component analysis (KPCA) (Schölkopf et al., 1997), kernel canonical correlation analysis (KCCA) (Lai and Fyfe, 2003), and kernel fusion methods (Daemen et al., 2009). For example, Reverter et al. (2012) applied KPCA to classify disease types using the kernel principal components estimated from gene expression profiles. Daemen et al. (2009) proposed a kernel fusion method for clinical decision support that transforms multi-omics data into a linear combination of their corresponding kernel matrices and implements a classifier based on the combined result.

A common feature of the aforementioned dimension reduction tools for multi-omics data integration is that they are all based on the integration of two or more types of observed data. They are thus sometimes referred to as data-driven methods. Another class of dimension reduction tools tries to integrate the observed data with external knowledge and are therefore called knowledge-driven methods. As an example, Yang et al. (2009) proposed a method called knowledge-based matrix factorization (KMF). In this study, the authors used KMF to build a gene co-expression network based on pairwise correlations between gene expression levels while incorporating existing pathway information from external databases such as Gene Ontology (GO) (Gene Ontology Consortium, 2004). To incorporate this external knowledge, KMF finds the best low-rank factorization of the correlation matrix so that it is decomposed into the product of three matrices. The left and right matrices are transposes of each other and approximate the membership of genes in pathways, while the center matrix captures the relationship between the pathways. This procedure allows KMF to construct a gene-gene correlation network whose structure is consistent with external pathway information while also identifying interactions between the pathways.

In summary, dimension reduction methods look for a combination of the features to represent each feature type while maximizing the correlation or covariance between the resulting combinations. Therefore, dimension reduction methods can be regarded as a multivariate extension of marginal correlation analysis. As a result, these methods are subject to the same pitfall that marginal correlation analysis faces (see Section "Marginal Correlation Analysis"); for example, they may lead to spurious correlations caused by confounding factors. In addition, although sparse versions of dimension reduction methods have been developed, lack of interpretability remains a limitation because each combination includes multiple, if not all, biological features in a group, and thus, the inferred relationships cannot be attributed to a specific pair of features.

# Regression-Based Methods

Network inference in multi-omics data has also been formulated as a regression problem. In this case, a series of regression models are fitted by taking one feature type as the response variable and another type as the predictor variable. Associations identified by these regression models are often interpreted as a directed relationship in which the feature type serving as the predictor is considered to affect or explain the feature type serving as the response. However, this inferred effect does not necessarily demonstrate a cause and effect relationship among the variables. For example, to assess the extent to which mRNA abundance was able to explain protein abundance, Nie et al. (2006b) fitted a linear model for each protein-mRNA pair with the former as the response and the latter as a predictor, incorporating multiple sequence features as additional covariates. For noncontinuous data, generalized linear models such as Poisson regression have also been employed to elucidate interactions between genomic features (Nie et al., 2006a). More recently, Yuan et al. (2018) proposed a regression model that aims to infer gene regulatory networks by incorporating DNA methylation and copy number variation as well as their interactions. Regression-based methods have also been used to integrate other types of multi-omics data. Recent examples include a somatic eQTL analysis using linear regression to model the association between gene expression and the mutation status of linked loci while accounting for various covariates including DNA methylation and gene copy number variation (Zhang et al., 2018). Moore and Hoen (2019) discussed the use of the regression framework to analyze RNA-protein interactions.

As opposed to considering a single predictor at a time, each regression model can also simultaneously include a large number of predictors, possibly from multiple feature types, to identify a set of variables that best predict the response. Typically, in these methods, a feature type of interest is regarded as the response data, with the other feature types regarded as the explanatory data. In each regression model, one feature is taken as the response variable, which is fitted against all variables in the explanatory data as predictors. The resulting high dimensionality leads to an underdetermined regression problem and thereby renders ordinary least squares and maximum likelihood estimation ill-posed. Therefore, variable selection techniques are needed to estimate the model parameters.

Regularized regression, of which the lasso (Tibshirani, 1996) is the most representative method, is commonly used for variable selection to overcome these limitations (as reviewed by Bickel and Li, 2006, and by Wu et al., 2019, for its application to multi-omics integration). In this case, a penalty term is incorporated in the usual least squares or maximum likelihood objective function in order to shrink some of the parameter estimates to zero, hence inducing sparsity in the regression coefficients. This strategy achieves variable selection and parameter estimation simultaneously. Each coefficient estimated to be nonzero is then represented by an edge in the network between the associated predictor and the response. There have been many applications of this approach to multi-omics studies. For example, Kim et al. (2014) and Yuan et al. (2018) estimated networks between DNA methylation, copy number variation, and gene expression based on a set of regularized linear regressions where separate L1 penalties were imposed on the three feature types. Qin et al. (2014) integrated ChIP-seq and transcriptome data to infer gene regulatory networks using a regularization method where the L1 penalty is replaced by L0 and L0.5 penalties.
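
The mapping from nonzero lasso coefficients to network edges can be sketched with a small coordinate-descent solver. The code below minimizes (1/2n)||y - Xb||^2 + lam * ||b||_1 for each response in turn and records an edge for every nonzero coefficient. The penalty value, the simulated predictors and responses, and the function names are assumptions for illustration; in practice the penalty is usually tuned, for example by cross-validation.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, returning 0 if it crosses zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso coefficients by cyclic coordinate descent on centered data."""
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    resid = y.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]            # remove predictor j's contribution
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            resid -= X[:, j] * beta[j]            # add the updated contribution back
    return beta

def regression_network(X, Y, lam):
    """Directed edges (predictor index, response index) for nonzero coefficients."""
    edges = set()
    for k in range(Y.shape[1]):
        beta = lasso_cd(X, Y[:, k], lam)
        edges.update((j, k) for j in np.flatnonzero(beta))
    return edges

# Simulated example: 8 predictors (e.g., one feature type), 2 responses (another type).
rng = np.random.default_rng(0)
n, p = 200, 8
X = rng.normal(size=(n, p))
Y = np.column_stack([
    2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * rng.normal(size=n),
    2.0 * X[:, 5] + 0.5 * rng.normal(size=n),
])
edges = regression_network(X, Y, lam=0.3)
```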

Another type of regression-based method for integrative network inference uses a technique called multivariate regression (Kim et al., 2009; Peng et al., 2012), which includes a multivariate response (i.e., multiple response variables) in a single model. When a multivariate response is modeled against a set of predictors, the unknown coefficients come in the form of a matrix, where an entry is assigned to relate each response variable to each predictor. Constraints are often imposed either on the sparsity or the rank of this coefficient matrix, or both, to ensure that the model can be fitted despite the limited sample size in comparison to the number of parameters. Applications of this approach to multi-omics data usually combine variables from one feature type, which serves as the multivariate response, while another type of omics features serves as the predictor variables. As with methods based on univariate regression, a directed network can be constructed with edges corresponding to nonzero coefficients. However, unlike univariate methods, which involve a large number of separate regression models, multivariate regression only fits one joint model, which allows more realistic modeling and simplified understanding of the biological mechanisms *via* sparsity and rank constraints. For example, Goh et al. (2017) proposed a multivariate regression method, which was used to fit time-course mRNA data for >500 genes against binding information of the target genes for >100 transcription factors. Sparsity and low-rank constraints were imposed to account for the fact that many transcription factors are not related to the genes and the samples are correlated due to the study design.

Regression-based methods are widely used to construct biological networks mainly because they are relatively straightforward to implement. Compared with marginal correlation analysis and dimension reduction methods, regression models have the advantage of being able to incorporate relevant covariate information. A regression framework is also equipped with many well-studied statistical tools to flexibly handle specific analytical needs. For example, random effects can be incorporated to account for correlation between samples due to study design (Zhang et al., 2013) and to correct for data heterogeneity due to unobserved confounders (Furlotte et al., 2011). The regression-based approach is also empowered by the recent statistical developments in penalized regression to handle high-dimensional data. However, most regression-based methods require that each feature (or feature type) be designated as either a response variable or a predictor, which can be a nontrivial choice to make, especially when the underlying biology is poorly understood for the system being studied.

# Methods Based on Graphical Models

Gaussian graphical models are widely applied in network analysis (as reviewed by Drton and Maathuis, 2017). Specifically, in a multivariate Gaussian distribution, two variables are statistically independent conditional on all the other variables if and only if the corresponding entry in the inverse covariance matrix of the distribution is zero. Therefore, constructing a network in which each edge represents the conditional dependence between two features given all other features is equivalent to identifying the nonzero entries of the inverse covariance matrix of the multivariate Gaussian distribution. In practice, the data are often high-dimensional with more variables than samples, which leads to a degenerate sample covariance matrix and makes the estimation of the inverse covariance matrix challenging.
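
The correspondence between zeros of the inverse covariance matrix and missing edges can be checked numerically. The sketch below uses a hypothetical three-variable chain whose inverse covariance matrix has a zero in the entry linking the first and third variables. The matrix values and the sample size are illustrative, and simply inverting the sample covariance as done here is only feasible when there are many more samples than variables.

```python
import numpy as np

# Hypothetical inverse covariance (precision) matrix for a chain 0 - 1 - 2:
# the zero in position (0, 2) encodes conditional independence given variable 1.
K = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
Sigma = np.linalg.inv(K)   # the covariance matrix itself has no zero entries

def partial_correlations(precision):
    """Partial correlations -k_ij / sqrt(k_ii * k_jj), with ones on the diagonal."""
    d = np.sqrt(np.diag(precision))
    P = -precision / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P

# With many samples and few variables, the sample covariance can be inverted directly.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
K_hat = np.linalg.inv(np.cov(X, rowvar=False))
P_hat = partial_correlations(K_hat)

marginal_02 = np.corrcoef(X[:, 0], X[:, 2])[0, 1]   # clearly nonzero
partial_02 = P_hat[0, 2]                            # approximately zero
```

Variables 0 and 2 are marginally correlated because both are linked to variable 1, but their estimated partial correlation is near zero, so a Gaussian graphical model would place no edge between them.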

There are two major statistical approaches for estimating the inverse covariance matrix in the high-dimensional Gaussian graphical model: the neighborhood selection method (Meinshausen and Bühlmann, 2006) and the graphical lasso method (Yuan and Lin, 2007; Friedman et al., 2008). Both methods yield a sparse estimator of the inverse covariance matrix, whose nonzero entries can be used to construct a network that denotes the conditional dependency between the variables in the Gaussian graphical model. To apply Gaussian graphical models to the integration of multi-omics data, a naive strategy combines all variables from multiple feature types into one vector, which is assumed to follow a multivariate normal distribution (Shin et al., 2014). However, this approach effectively treats all variables as exchangeable, and, in turn, ignores the potentially important information about their group structure.

One typical application of Gaussian graphical models to multi-omics data is the joint Gaussian graphical model, which simultaneously estimates multiple graphical models under some constraints among them. The constraints are often determined by some prior knowledge for the multiple inverse covariance matrices such as their similarity in magnitudes or sparsity or the membership of nodes in biological pathways (Guo et al., 2011; Danaher et al., 2014; Kim et al., 2017). This idea has been applied to find biological networks from different groups simultaneously, e.g., disease subtypes or experimental conditions. For example, Kim et al. (2017) used a joint Gaussian graphical model to estimate multiple mRNA expression networks from different datasets. Zhang et al. (2016) further extended the idea of joint graphical models to a two-dimensional joint graphical lasso model. This model imposed a joint penalty function to simultaneously estimate two gene expression networks that are patient group-specific from gene expression profiles collected from different data generation platforms. After obtaining the gene networks, the differential networks between the two patient groups were constructed by calculating the differences of dependencies between two group-specific networks (i.e., one differential network for each platform).

Bayesian inference based on joint Gaussian graphical models has also been used to construct networks by applying a G-Wishart prior on the inverse covariance matrix (Peterson et al., 2015). In this particular case, a Markov random field prior was imposed to encourage common edges between joint graph structures. This procedure enabled the identification of which groups have a shared network structure by placing a spike-and-slab prior on parameters which measure network relatedness.

Conditional graphical models represent another class of graphical model approaches that are useful for solving data integration problems. Different from the traditional graphical models, the conditional graphical model incorporates an additional conditioning step to remove spurious dependence that may be caused by common external factors. For example, two genes may depend on each other only because they are regulated by the same DNA markers and have no relationship otherwise. Along this research direction, Li et al. (2012) proposed a method which infers such a conditional graphical model in two steps. It first estimates the conditional covariance matrix and then uses penalized maximum likelihood to obtain the inverse conditional covariance estimator. The authors used their method to define a gene expression network conditional upon eQTL data. Moreover, Chun et al. (2013) extended the same idea to multiple conditional graphical models, allowing the integration of gene expression data from different sources, say, heart and fat tissues. Other similar research includes the covariate-adjusted graphical models that use genetic markers (SNPs) as covariates to correct both false positives and false negatives in gene regulatory networks (Cai et al., 2013; Gao and Cui, 2015). In these methods, the effect of genetic variation is estimated in the first step. Then, the graphical structure is estimated in the second step while adjusting for the genetic effects.

Like graphical lasso, most joint or conditional graphical models incorporate the sparsity assumption to tackle the high dimensionality problem in the context of inverse covariance matrix estimation, but often rely on the assumption of a multivariate Gaussian distribution. Zhang et al. (2017) is one of the few studies that estimate the inverse covariance matrix under a mixed model that includes different biological feature types by accommodating both discrete and continuous variables. Due to the computational complexity of discrete variables, the authors used the pseudolikelihood method instead of the usual likelihood method for parameter estimation. In spite of these innovations, methods based on graphical models still need to account for the unique characteristics of microbiome data when applied to microbiome multi-omics data integration (Section "Unique Challenges of Microbiome Multi-Omics Network Analysis").

# Bayesian Networks

Like Gaussian graphical models, Bayesian networks are probabilistic graphical models and are increasingly used as a statistical and machine learning tool for analyzing genomic data. In a Bayesian network, a graph with directed edges is used to represent the conditional relationships in the joint probability distribution of a set of variables: for each variable X, given its parent variables (i.e., nodes pointing to X), X is conditionally independent of all variables other than its descendants (i.e., nodes reachable from X by following directed edges). These conditional independence constraints serve to cut down, frequently substantially, the number of parameters needed to jointly model the variables. We refer the readers to a review paper (Koski and Noble, 2014) for a more thorough introduction to Bayesian networks.
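To make the parameter saving concrete, the conditional independence constraints imply the standard factorization of the joint distribution shown below. The notation $\mathrm{pa}(X_i)$ for the parents of $X_i$ is ours and is introduced only for illustration.

$$
P(X_1, \dots, X_p) \;=\; \prod_{i=1}^{p} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
$$

As a worked example, assume that all $p$ variables are binary. An unrestricted joint distribution then needs $2^p - 1$ free parameters, whereas the factorized form needs $\sum_{i=1}^{p} 2^{|\mathrm{pa}(X_i)|}$, which is at most $p \, 2^{k}$ when no node has more than $k$ parents.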

In the past decade, Bayesian networks have seen many applications in genomic data integration. For example, Akavia et al. (2010) introduced an algorithm based on Bayesian networks (CONEXIC) to identify driver mutations in cancer by integrating gene expression data with matched copy number data. QTLnet (Chaibub-Neto et al., 2010) is a method that uses a Bayesian network that includes both phenotype and genotype variables as nodes to jointly estimate the causal network between multiple phenotypes and their respective genetic architecture. In order to improve the recovery of gene interaction networks based on experimental data, Isci et al. (2014) proposed a hierarchical method called BNP where a Bayesian network is nested within a classical Bayesian modeling framework. This approach enables the incorporation of rich external knowledge about gene interactions as the prior information in the Bayesian inference procedure. More recently, Khanna et al. (2018) applied a Bayesian network to elucidate the interplay between genotype information, neuroimaging measurements, and clinical data to help uncover biological mechanisms underlying Alzheimer's disease.

The Bayesian network approach has several appealing advantages when applied to multi-omics data analysis. First, because of the structure of the underlying probabilistic model, Bayesian networks are usually considered akin to directed networks, in that causal relationships are often inferred among nodes. In particular, network edges are often interpreted to represent how information propagates between variables or components in a biological process. We note that, although causal interpretation of Bayesian networks is appealing and widespread, there has been growing skepticism over the liberal use of such interpretation because a Bayesian network does not guarantee causality (Korb and Nicholson, 2008). Second, Bayesian networks can incorporate prior knowledge about plausible relationships among variables within or between feature types (Ni et al., 2014). Third, Bayesian networks may be set up in a way that allows for simultaneous modeling of variables following different types of distributions. For example, Chaibub-Neto et al. (2010) modeled a Bayesian network where the nodes consist of a mixture of continuous phenotype variables and discrete genetic variables. The ability to handle disparate data types is an attractive feature as multi-omics studies frequently involve feature types that are more appropriately modeled using different distributions such as continuous, count, and binary data.

However, a major challenge limiting the use of Bayesian networks in genomic studies is their steep computational cost. The estimation of the structure of a Bayesian network usually involves the optimization of a complicated objective function over a large, nonconvex search space. As the number of variables increases, the computational burden increases super-exponentially. Consequently, in most applications of Bayesian networks to multi-omics data, either only a small to moderate number of omics variables are considered or dimension reduction techniques are applied to reduce the number of variables before implementing Bayesian networks.
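To give a sense of this growth, the number of candidate directed acyclic graphs on n labeled nodes can be computed with Robinson's recurrence. The short sketch below is an illustration added here (the function name `count_dags` is ours); it only counts candidate structures and does not perform any structure learning.

```python
from math import comb


def count_dags(n):
    """Number of labeled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    a = [1]  # a[0] = 1: the empty graph
    for m in range(1, n + 1):
        total = 0
        for k in range(1, m + 1):
            # k = number of nodes with no incoming edge (inclusion-exclusion)
            sign = 1 if k % 2 == 1 else -1
            total += sign * comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
        a.append(total)
    return a[n]


for n in range(1, 8):
    print(n, count_dags(n))
```

Even for five variables there are already 29,281 possible structures, which is why exhaustive search quickly becomes infeasible as the number of omics variables grows.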

# Network Integration

A key goal of multi-omics data integration is to create a comprehensive view of a biological process from diverse types of omics data. Network integration approaches seek to solve this problem by integrating multiple, distinct biological networks assembled from different data types. There are many network integration strategies and we review below a representative subset of these approaches.

One approach to this problem, as illustrated by the method GeneMANIA (Mostafavi et al., 2008), is to build a composite association network by taking a weighted average of multiple association networks between features, such as genes, where the weights are selected based upon the composite network's ability to reconstruct referential characteristics of the features. For example, GeneMANIA uses ridge regression (Hoerl and Kennard, 1970) to find the weights of individual association networks to minimize the difference between the composite network and a target network constructed from known gene functions (such as GO functional categories), while incorporating the prior information of the weights in the ridge penalty.
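The weighting step can be sketched as a small ridge regression on vectorized networks. This is a simplified illustration under our own assumptions, not the GeneMANIA implementation: the function name `ridge_network_weights`, the penalty parameter `lam`, the prior weight vector `prior`, and the simulated networks are all hypothetical, and any further details of the published algorithm are not reproduced.

```python
import numpy as np


def ridge_network_weights(networks, target, lam=1.0, prior=None):
    """Ridge estimate of one weight per association network.

    Minimizes ||target - sum_k w_k * networks[k]||^2 + lam * ||w - prior||^2,
    using each node pair (upper triangle) once.
    """
    n = target.shape[0]
    iu = np.triu_indices(n, k=1)
    X = np.column_stack([net[iu] for net in networks])  # pairs x networks
    y = target[iu]
    K = X.shape[1]
    w0 = np.zeros(K) if prior is None else np.asarray(prior, dtype=float)
    # Closed-form ridge solution, shrunk towards the prior weights w0
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y + lam * w0)


rng = np.random.default_rng(0)


def random_symmetric(n):
    a = rng.random((n, n))
    return (a + a.T) / 2


# Simulated association networks and a target built from known weights
nets = [random_symmetric(30) for _ in range(3)]
true_w = np.array([0.6, 0.3, 0.1])
target = sum(w * net for w, net in zip(true_w, nets))

w_hat = ridge_network_weights(nets, target, lam=1e-6)
composite = sum(w * net for w, net in zip(w_hat, nets))  # weighted composite network
print(w_hat)
```

With a very small penalty and a target that is an exact combination of the inputs, the estimated weights should be close to the simulated ones; with real data the penalty and the prior weights would have to be chosen for the application at hand.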

Diffusion component analysis (DCA) (Cho et al., 2015; Wang et al., 2015; Cho et al., 2016) is another network integration method that targets heterogeneous networks with different connectivity patterns. In DCA, the diffusion state of each node is analyzed with the random walk with restart (RWR) method and is stored as a probability vector (a point on the probability simplex) whose entries are the probabilities that an RWR that starts at one node will end up at another node in equilibrium. A similar diffusion state between two nodes implies that the nodes are in similar positions within the network with respect to other nodes. Next, the node-specific diffusion states in individual networks are represented by two low-dimensional latent vectors: one that is shared across all networks and another that encodes the intrinsic topological property using multinomial logistic models. These shared low-dimensional node-specific latent vectors represent the homogeneous topological property across the network and can be used in other machine learning methods to derive further insights of the nodes. DCA has been applied to the functional analysis of genes (Cho et al., 2016) and drug-target interaction networks (Luo et al., 2017).
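The first step, computing RWR diffusion states, can be sketched as follows. This is a minimal illustration of the random walk with restart only, assuming an undirected network with no isolated nodes; the function name `rwr_diffusion_states`, the restart probability of 0.5, and the toy star-shaped network are our own choices, and the low-dimensional embedding step of DCA is not shown.

```python
import numpy as np


def rwr_diffusion_states(adj, restart=0.5):
    """Equilibrium RWR distributions; row i is the diffusion state of node i."""
    adj = np.asarray(adj, dtype=float)
    # Row-normalize the adjacency matrix into a transition matrix B
    B = adj / adj.sum(axis=1, keepdims=True)
    n = adj.shape[0]
    # Fixed point of s_i = (1 - restart) * s_i @ B + restart * e_i, for all i at once
    return restart * np.linalg.inv(np.eye(n) - (1 - restart) * B)


# Toy star network: node 0 is the hub, nodes 1-3 are leaves
adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
S = rwr_diffusion_states(adj)
print(S.round(3))
```

Each row of `S` is a probability vector, so it can be compared across nodes or passed on to a dimension reduction step.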

While GeneMANIA and DCA integrate networks of features, similarity network fusion (SNF) (Wang et al., 2014) constructs a merged network between objects (e.g., biological samples) by combining multiple feature types measured for each object. In particular, SNF first creates a network for the same set of samples from each data type, such as mRNA expression, DNA methylation, and microRNA expression. Then, it fuses these networks into one similarity network. The key idea of fusion is to update one network by utilizing two pieces of information: (a) the local affinity of the network and (b) the average similarity matrix of all the other networks. An iterative fusion process takes place, which increases the similarity between networks with each iteration until SNF achieves a final network by taking the average of all networks. In summary, SNF makes use of a network's local structure, integrating both common and complementary information across networks. SNF has been applied to identify cancer subtypes and predict survival (Wang et al., 2014).

More network integration methods have been applied in genomics research in addition to the ones reviewed here, although they are often application specific and differ substantially from one another. For a more substantial review, we refer the readers to the review paper of Wani and Raza (2018). In general, network integration methods offer a simple and straightforward solution whereby similar nodes (e.g., genes and proteins) across multiple networks are integrated by merging different types of edges from multiple networks. Although simple, they are less efficient when it comes to preserving the relationships across multiple networks, particularly when the networks are heterogeneous and do not share the same biological mechanism.

# UNIQUE CHALLENGES OF MICROBIOME MULTI-OMICS NETWORK ANALYSIS

Microbiome data science is often challenged by various statistical properties of microbiome data, including its compositionality, heterogeneity, and sparsity. These properties impact how statistical methods are applied to microbiome data and require careful consideration to ensure appropriate analysis. In this section, we discuss these various properties and how they impact the application of the approaches described in Section "Review of Available Network-Based Procedures" to microbiome data, especially with respect to microbiome multi-omics data integration. Our hope is that this discussion helps readers identify opportunities to transform microbiome multi-omics network analysis.

# Compositionality

One of the unique characteristics of microbiome data is its compositionality. Microbiome data are often presented as the abundances of different microbial taxa contained in a microbial community. However, microbiome data only carry information about the relative abundances of the taxa instead of their true abundances. This is because the total sequence count of all taxa for each sample, known as the sequencing depth of the sample, is an experimental technicality imposed by the sequencing instrument and bears no biological relevance. Therefore, the abundance count of a taxon in a sample only reflects the relative abundance of the taxon compared against all other taxa, rather than the absolute count of molecules in the underlying community attributable to the taxon. As a result, these data exist under an arbitrary sum constraint and are thus referred to as compositional data. This feature is also visualized in **Figure 1A**.

When modeling compositional data, it is important to account for the fact that the sum is uninformative about (i.e., ancillary for) the parameters of interest, and therefore, it may be desirable to consider the conditional distribution of the data regarding the sequencing depths as pre-fixed quantities. For example, a common strategy to acknowledge compositionality of microbiome data is to convert the abundance count of each taxon into proportions or relative abundances that sum up to one for each sample. A consequence of the sum constraint is that the features will tend to be negatively correlated even if the underlying (unobserved) true abundances are independent.
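A small simulation illustrates this induced negative correlation. The sketch below assumes hypothetical absolute abundances drawn independently from a gamma distribution (our choice, purely for illustration) and compares correlations before and after conversion to relative abundances.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_taxa = 5000, 3

# Hypothetical absolute abundances, drawn independently for each taxon
absolute = rng.gamma(shape=5.0, scale=1.0, size=(n_samples, n_taxa))
# Closure: convert to relative abundances that sum to one per sample
relative = absolute / absolute.sum(axis=1, keepdims=True)

corr_abs = np.corrcoef(absolute, rowvar=False)
corr_rel = np.corrcoef(relative, rowvar=False)

off_diag = ~np.eye(n_taxa, dtype=bool)
print("mean off-diagonal correlation, absolute:", corr_abs[off_diag].mean())
print("mean off-diagonal correlation, relative:", corr_rel[off_diag].mean())
```

The absolute abundances are uncorrelated by construction, yet the relative abundances are clearly negatively correlated. The effect is strongest when there are few taxa and weakens as the number of taxa grows.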

The traditional marginal correlation analysis methods in Section "Marginal Correlation Analysis" such as Pearson's, Spearman's, and Kendall's correlations do not consider microbiome data compositionality. The key issue is that there exists a constraint on the correlations between one taxon and all other taxa due to the compositionality of the data, which can yield spurious inferences of interaction. For example, because the relative abundances sum to one, the covariances between any given taxon and all other taxa must sum to the negative of that taxon's variance, so some of its correlations are forced to be negative regardless of how this taxon interacts with the rest of the microbiome. Recently, new methods have been proposed to account for data compositionality when constructing microbial networks. For example, SparCC (Friedman and Alm, 2012) employs a log-ratio transformation for every pair of taxa being correlated to remove compositionality: the ratio of the abundances of two taxa is independent of which other taxa are included in the analysis, a property termed subcompositional coherence. SparCC also uses an iterative algorithm that identifies the pair of taxa with the strongest correlation in each step and terminates iterations when a relatively sparse network structure is obtained. More recently, CCLasso (Fang et al., 2015) and REBACCA (Ban et al., 2015) use global optimization procedures that estimate the correlation network of all species while imposing an explicit constraint caused by the compositionality of the data and a sparsity constraint on the network. While these approaches are effective at controlling for data compositionality, they are only designed to reconstruct taxon-taxon interaction networks. To the best of our knowledge, no existing approaches consider compositionality when constructing microbiome multi-omics networks.
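The constraint behind this example can be written out explicitly. In the short derivation below, $x_1, \dots, x_p$ denote the relative abundances of the $p$ taxa in a sample, $\sigma_i$ the standard deviation of $x_i$, and $\rho_{ij}$ the Pearson correlation between $x_i$ and $x_j$; this notation is ours and is introduced only for illustration.

$$
\sum_{j=1}^{p} x_j = 1
\;\;\Longrightarrow\;\;
0 = \operatorname{Cov}\Bigl(x_i, \sum_{j=1}^{p} x_j\Bigr)
  = \operatorname{Var}(x_i) + \sum_{j \neq i} \operatorname{Cov}(x_i, x_j)
$$

$$
\sum_{j \neq i} \operatorname{Cov}(x_i, x_j) = -\operatorname{Var}(x_i)
\;\;\Longrightarrow\;\;
\sum_{j \neq i} \rho_{ij}\,\frac{\sigma_j}{\sigma_i} = -1
$$

The correlations of taxon $i$ with the other taxa, weighted by the ratios of standard deviations, therefore always sum to $-1$, whatever the underlying biology.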

The compositionality of microbiome data has also been considered in methods based on graphical models (Section "Methods Based on Graphical Models"). Given that the major goal of graphical modeling is to infer microbial interactions through the estimation of the inverse covariance matrix between species, it is harder to correct for data compositionality as compared to marginal correlation analysis. The unique challenge here is that the sum constraint in compositional data induces linear dependency between features and thus gives rise to a degenerate covariance matrix, meaning that the inverse covariance matrix does not exist. To overcome this challenge, Kurtz et al. (2015) proposed a method called SPIEC-EASI that first converts raw counts into relative abundances, i.e., the proportions of each taxon's abundance within a sample, and then uses the centered log-ratio transformation on the relative abundances. They further argue that the covariance matrix of the transformed relative abundances is a good approximation to that of the log-transformed raw counts. SPIEC-EASI uses both neighborhood selection (Meinshausen and Bühlmann, 2006) and graphical lasso (Friedman et al., 2008) to infer a sparse inverse covariance matrix for a network. In addition, Yang et al. (2017) proposed a method called mLDM that uses a hierarchical Bayesian model (lognormal-Dirichlet-multinomial) on the compositional counts and then estimates a sparse inverse covariance matrix between the species through maximizing the L1 penalized posterior distribution.
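The degeneracy and the centered log-ratio (CLR) transformation can both be illustrated in a few lines. The sketch below is not the SPIEC-EASI implementation: the function name `clr`, the pseudo-count of 0.5 used to avoid taking the logarithm of zero, and the simulated Poisson counts are assumptions made for this example.

```python
import numpy as np


def clr(counts, pseudo=0.5):
    """Centered log-ratio transform of a samples-by-taxa count matrix."""
    x = np.asarray(counts, dtype=float) + pseudo  # pseudo-count avoids log(0)
    log_rel = np.log(x / x.sum(axis=1, keepdims=True))
    # Subtract the per-sample mean log proportion (log of the geometric mean)
    return log_rel - log_rel.mean(axis=1, keepdims=True)


rng = np.random.default_rng(2)
# Simulated counts: 200 samples, 5 taxa with different mean abundances
counts = rng.poisson(lam=[50, 20, 10, 5, 1], size=(200, 5))

# Relative abundances sum to one, so each row of their covariance sums to zero
rel = counts / counts.sum(axis=1, keepdims=True)
cov_rel = np.cov(rel, rowvar=False)
print("row sums of covariance:", cov_rel.sum(axis=1))

transformed = clr(counts)
print("CLR values of first sample:", transformed[0].round(3))
```

Because every row of the covariance matrix of the proportions sums to zero, the vector of ones lies in its null space and the matrix cannot be inverted, which is the degeneracy described above.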

Compositionality is also important to consider in regression-based methods. In Section "Review of Available Network-Based Procedures", we reviewed several regression methods to construct biological networks. To apply these methods to integrate microbiome data and another data type, it is possible to use microbiome data as either predictors or responses. Therefore, we discuss these two situations separately. In the case that microbiome data are used as predictors, there are two major challenges: the high dimensionality of the data and a sum constraint on the predictors imposed by the compositional nature of the data. Lin et al. (2014) proposed an L1 regularization method for the linear log-contrast model that meets these unique challenges of compositional data to study the association between the microbial compositions and the response variable. Moreover, Shi et al. (2016) extended the previous method to consider the subcompositions of taxa, i.e., the composition of taxa that belong to a given higher level taxonomic rank, and studied whether the observed subcompositions are associated with the response variable. On the other hand, if microbiome data are used as responses, it is essential to incorporate an appropriate distribution in the model to reflect the compositionality. For example, Chen and Li (2013) applied the Dirichlet-multinomial regression to investigate the association between microbiome composition and environmental covariates. Furthermore, Xia et al. (2013) proposed to use the logistic normal multinomial regression model to link covariates with taxonomic counts, given that the logistic normal distribution has a more flexible covariance structure than the Dirichlet distribution. The mLDM method (Yang et al., 2017) also investigates the association between the taxonomic counts and the environmental factors in their lognormal-Dirichlet-multinomial model.

As mentioned above, many network analysis methods have been proposed to account for the compositionality of microbiome data. However, very few of them have been applied to network analyses that integrate multi-omics data alongside microbiome measures. We anticipate that this will be an active research area in the near future. Moreover, technological developments in microbiome data science, including the estimation of absolute cellular abundances from microbiome sequence data (Vandeputte et al., 2017), may help offset the need to correct for data compositionality when reconstructing microbiome networks.

FIGURE 1 | Visualizing the unique challenges of microbiome data. A mock set of bacterial samples from two populations where each colored shape is a bacterial taxon. (A) Compositionality. The taxon abundance table depicts the count of each observed taxon in each sample. When sequencing microbiome samples, the resulting counts of taxa are not representative of the actual taxa counts in the sample due to constraints of sequencing. Due to this, relative abundances are generally used in analysis of microbiome data. The bar plots illustrate the difference in community representation between raw counts (top) and relative abundances (bottom). (B) Normalization. Due to the constraints of sequencing, the overall sequencing depth of a sample can impact the results. For example, shallow sequencing may miss rare taxa such as the green taxon V in the example sample A that is present in low abundance in the community. (C) Sparsity. Microbiome data are often very sparse, where most observations are zero. This is illustrated by the histogram of taxa counts for each sample where most counts are zero and there are few taxa with high counts. This can also be seen in the table for part A, where many entries are zero. (D) Heterogeneity. The table summarizes the taxonomic heterogeneity in the mock dataset between the two populations. Each sample has a unique taxonomic composition, but there are also population specific signatures. The samples in each population are dominated by a few taxa, and these dominant taxa are different for the two populations. Additionally, there are taxa that are highly abundant in one sample and absent from the rest, such as the purple taxon Y in sample A.

# Normalization

Similar to many other omics data, microbiome data can exhibit strong heterogeneity from one study to another or from one biological sample to another even in the same study. For example, microbiome data may be collected from different geographic populations and they may have very different taxonomic distributions (He et al., 2018). In addition, varying data generation and processing procedures for microbiome data can also lead to heterogeneity across studies. For example, different sequencing technologies will result in different sequence lengths across studies, which can impact the discovery of taxa. Moreover, different studies may apply different data processing procedures (e.g., how sequences are assigned to taxonomic units or phylotypes) that may impact the distribution of taxa across studies.

One unique heterogeneity between studies or between samples in microbiome data is the variation of sequencing depths, as visualized in **Figure 1B**. Sequencing depth, the total count of sequences generated across all taxa for a biological sample, is an experimental technicality and often varies considerably across samples in a microbiome sequencing experiment. As with other omics data, normalization is an important and often the first analytical step. The traditional approach for normalizing microbiome data is either to transform count-based measures of taxa into relative abundances (i.e., proportions) of the taxa or to rarefy the counts, i.e., to subsample without replacement from each sample such that all samples have the same number of total counts across taxa. In addition, alternative normalization methods using other criteria are also used in the microbiome research community, including upper quantile normalization (Bullard et al., 2010), CSS normalization (Paulson et al., 2013), variance stabilizing transformation (Love et al., 2014), and trimmed mean of M-values normalization (Robinson et al., 2009; McCarthy et al., 2012). Most of these alternative normalization methods are borrowed from the techniques for RNA-seq data analysis. While these alternative methods are advocated in studies that focused on differential abundance testing, the traditional approaches of proportion- and rarefaction-based normalization provide more accurate community-level comparisons (McKnight et al., 2018).
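The two traditional normalizations can be sketched as below. This is a minimal illustration rather than any particular package's implementation: the function names `to_relative` and `rarefy`, the fixed random seed, and the toy count table are our own, and the subsampling relies on NumPy's `Generator.multivariate_hypergeometric`, which draws without replacement.

```python
import numpy as np


def to_relative(counts):
    """Convert a samples-by-taxa count matrix to per-sample proportions."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def rarefy(counts, depth, seed=0):
    """Subsample each sample without replacement to a common depth."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    if depth > counts.sum(axis=1).min():
        raise ValueError("depth exceeds the shallowest sample")
    # One multivariate hypergeometric draw per sample keeps counts <= originals
    return np.vstack([rng.multivariate_hypergeometric(row, depth) for row in counts])


# Toy table: two samples with very different sequencing depths
counts = np.array([
    [120, 30, 0, 50],
    [10, 5, 2, 3],
])
proportions = to_relative(counts)
rarefied = rarefy(counts, depth=15)
print(proportions.round(3))
print(rarefied)
```

After rarefying, every sample has the same total count, at the cost of discarding reads from the deeper samples.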

Studies have also assessed the influence of sequencing depth on the quality of microbiome data. For example, Jovel et al. (2016) measured the minimum sequencing depth that can still provide a consistent taxonomic classification by randomly sampling from a sequencing library with different depths, while Nayfach et al. (2015) conducted a similar analysis for the functional annotation of metagenomes. Zaheer et al. (2018) evaluated the impact of sequencing depth on the characterization of the microbiome and resistome and indicated that the relative proportions of sequence assignments remained fairly constant regardless of depth. Although these studies show that taxonomic and functional annotation is fairly stable regardless of the sequencing depth, McMurdie and Holmes (2014) argued that current practice in the normalization of microbiome count data is inefficient in the statistical sense. One key issue with rarefaction is that while it maintains the mean of the taxonomic proportions it ignores the variation of the proportions. For example, two equal proportions of an OTU in two samples can have unequal variances due to the different sequencing depths between the two samples. This problem of unequal variances is called "heteroscedasticity" and is not accounted for during typical rarefaction approaches. Heteroscedasticity could impact downstream analysis such as differential abundance analysis and construction of microbial networks.
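The unequal variances can be made explicit under a simple sampling assumption that we introduce for illustration: suppose the count $X$ of an OTU in a sample with sequencing depth $N$ is binomial with true proportion $p$, and the proportion is estimated by $\hat{p} = X / N$.

$$
X \sim \operatorname{Binomial}(N, p), \qquad
\operatorname{E}(\hat{p}) = p, \qquad
\operatorname{Var}(\hat{p}) = \frac{p\,(1 - p)}{N}
$$

For example, with $p = 0.1$ the variance of $\hat{p}$ is $9 \times 10^{-5}$ at a depth of $N = 1{,}000$ but only $9 \times 10^{-7}$ at $N = 100{,}000$, so two samples with the same observed proportion carry very different amounts of uncertainty.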

In Section "Compositionality", we reviewed statistical models such as Dirichlet-multinomial regression (Chen and Li, 2013), logistic normal multinomial regression (Xia et al., 2013), and mLDM (Yang et al., 2017). These models not only consider the compositionality of microbiome data but also take the heteroscedasticity into account because the sequencing depth is explicitly modeled in the multinomial distribution. However, most of the above methods are applied to identify the association between the taxonomic composition and the environmental factors. While these models are potentially applicable to network analyses that integrate microbiome and other omics data, further investigations are warranted, especially considering the scale of the dimensionality of multi-omics data.

# Sparsity

Taxonomic abundance data are typically sparse in nature, meaning that a high proportion of the counts are zeros (Paulson et al., 2013). This feature of microbiome data frequently poses challenges to common statistical methods, and tailored techniques are often required to properly analyze microbiome data and to integrate them with other omics data. For example, due to the compositionality of microbiome data (see Section "Compositionality"), many statistical methods utilize transformations that involve taking logarithms on the counts or ratios between them. However, zero counts cause a technical problem for these transformations. To circumvent this issue, a widely used strategy is to add a small constant to all count measures, known as a pseudo-count (Kurtz et al., 2015; Mandal et al., 2015), or to replace the zeros by an estimated value (Palarea-Albaladejo and Martín-Fernández, 2015; Gloor et al., 2016). Some recent work has studied the problem of how to best choose the pseudo-count and how to find the estimated value (Martín-Fernández et al., 2003; Martín-Fernández et al., 2011; Martín-Fernández et al., 2015). However, more research is needed to determine how these techniques impact integrative network estimation for microbiome multi-omics data.

The sparsity of microbiome data also challenges modeling. The excess zeros, coupled with a high frequency of very low counts per taxon, result in a heavily skewed distribution of taxon counts across samples, with a large point mass at zero and a long right tail. This is also visualized *via* a mock dataset in **Figure 1C**. Consequently, network estimation methods that work well for continuous data, including those assuming that the counts follow a Gaussian distribution such as graphical lasso, may not work well when directly applied to such data because of poor model fit. Nonparametric correlation measures such as Spearman's rank correlation and Kendall's tau can be used to avoid an assumption of normality and tackle highly skewed data. However, the power of such methods may deteriorate when the data have a point mass at zero, as this mass of zeros leads to a large number of ties that complicate rank-based measures of correlation (Huson, 2007). In addition, agglomeration of taxon measures into higher order taxonomic groups may reduce the effects of sparsity and improve alignment between the observed data distributions and model assumptions. However, such agglomerative procedures can erode resolution of specific taxonomic units that manifest important and nuanced relationships with other study covariates.

In recent years, a variety of probability models have been developed for microbiome count data. The Poisson or negative binomial distributions have been useful for analyzing count data from other types of sequencing studies, such as transcriptomic studies using RNA sequencing. However, microbiome data often, though not always, exhibit more zeros and heavier skewness than expected from these models. To this end, zero-inflated models (Sharpton et al., 2017) and hurdle models (Hu et al., 2011) have been proposed. For example, the zero-inflated Poisson distribution considers a mixture of a Poisson distribution and a probability mass at zero to account for the large frequency of zeros in microbiome data (Xu et al., 2015). However, most of these methods focus on modeling the marginal distribution of a single taxon at a time and are not directly applicable to the joint modeling of multiple taxa and therefore cannot be used for microbial network estimation.
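The zero-inflated Poisson mixture can be written down directly. The sketch below is a minimal illustration of its probability mass function; the function name `zip_pmf` and the parameter values (a Poisson mean of 2 and a structural-zero probability of 0.6) are hypothetical choices for this example.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson.

    With probability pi the count is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi * (k == 0) + (1 - pi) * poisson


lam, pi = 2.0, 0.6
p_zero_poisson = math.exp(-lam)      # zero probability under a plain Poisson
p_zero_zip = zip_pmf(0, lam, pi)     # zero probability with zero inflation
print("P(Y = 0), Poisson:", round(p_zero_poisson, 3))
print("P(Y = 0), ZIP:    ", round(p_zero_zip, 3))
```

The extra point mass raises the probability of observing a zero well above what the Poisson component alone would allow, which is the feature these models use to accommodate sparse taxon counts.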

Another type of model used for microbiome count data is the Dirichlet-multinomial model and its zero-inflated versions. It has been used in a number of methods to model the multivariate distribution of the counts of a collection of taxa (Holmes et al., 2012; Chen and Li, 2013; Tang and Chen, 2018). However, a criticism of these methods is that the Dirichlet-multinomial distribution imposes a negative correlation between the abundances of any given pair of taxa. This inflexibility in the correlation structure makes such methods particularly problematic when used to infer the interaction between taxa. A promising approach to addressing this pitfall is to consider a hierarchical model where the conditional distribution of the observed counts is modeled by a multivariate count distribution such as the multinomial distribution or the Dirichlet-multinomial distribution, whose parameters are linked to a multivariate continuous distribution, such as the multivariate normal distribution, that allows a flexible and realistic correlation structure (Xia et al., 2013; Yang et al., 2017).
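The forced negative correlation can be read off the covariance of the Dirichlet-multinomial distribution. In the expressions below, $n$ is the total count of a sample, $\alpha_1, \dots, \alpha_p$ are the Dirichlet parameters, $\alpha_0 = \sum_k \alpha_k$, and $p_i = \alpha_i / \alpha_0$; the notation is ours and is introduced only for illustration.

$$
\operatorname{Var}(X_i) = n\,p_i\,(1 - p_i)\,\frac{n + \alpha_0}{1 + \alpha_0},
\qquad
\operatorname{Cov}(X_i, X_j) = -\,n\,p_i\,p_j\,\frac{n + \alpha_0}{1 + \alpha_0} \;<\; 0
\quad (i \neq j)
$$

Every off-diagonal covariance is negative whatever the parameter values, so the model cannot represent a pair of taxa whose counts tend to increase together.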

Despite the success of the aforementioned models for microbiome count data, their use has for the most part been limited to differential abundance analysis, where the abundance of individual or groups of taxa is associated with an environmental factor of interest. Further work is needed to explore their applicability to multi-omics data and integrative network analysis. We see it as a great research opportunity to combine these models with cutting edge multi-omics network estimation methods to make the latter more appropriate for microbiome studies.

# Heterogeneity

Related to the issue of sparsity is the heterogeneity exhibited in studies that survey the composition of microbial communities. The composition of microbial communities often varies tremendously across hosts and environments. For example, it is not uncommon to observe a taxon that is relatively abundant in one person's gut while being completely absent in another's; for a given taxon, it is often the case that only a proportion of the samples have nonzero abundance. While the number of observed taxa from the entire data set may be large, the microbiota in any given sample tend to be dominated by only a relatively small number of taxa with high abundance, with the rest of the taxa having zero or very low counts. Moreover, the set of dominant taxa can vary drastically from individual to individual. We call the above phenomena taxonomic heterogeneity, as visualized in **Figure 1D**. It results in a unique characteristic of microbiome data sets: features (i.e., taxa) present in all samples are rare, and those present in only a small proportion of samples prevail. This is in contrast to most other types of omics data such as transcriptomic data, where the majority of genes are expected to have nonzero expression levels in all samples.

Different approaches have been applied to account for taxonomic heterogeneity when measuring the interaction between two microbial taxa or between a taxon and another biological feature (e.g., a metabolite). The most commonly used strategy is to include the data from all biological samples, regardless of whether the taxon of interest is present or not. An alternative strategy is to exclude the samples in which the given taxon is not present and only consider those abundance data that are nonzero for the taxon. A third strategy focuses on the dichotomous outcome of whether a taxon is present or absent in individual samples, while ignoring the actual abundance (Mainali et al., 2017; Albayrak et al., 2018). The first approach regards a sample where a taxon is absent as having "zero abundance" of the taxon, which is only quantitatively, but not qualitatively, different from a sample where the abundance of the taxon is very low. This approach's main advantage is that no information is discarded from the data, whereas the latter two approaches each discard part of the data. Most methods using the first approach assume that, if a biological interaction exists between a microbial taxon T and another feature M (e.g., a metabolite), the feature M is associated with the abundance of T in the same way that it is associated with the occurrence of T in a community. However, the biological process in which M is involved in the introduction or establishment of T may in theory be very different from the one in which M impacts its abundance. For example, M may promote the growth of T in a person's gut microbiome only if it already contains T. It is also possible that elevated levels of M are associated with increasing a person's chance of exposure to T and consequently its presence in the gut, but do not affect its abundance. For these types of relationships, the latter two strategies may have merits.
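The three strategies can be compared on simulated data. The sketch below is purely illustrative: the function name `three_strategy_correlations`, the use of Pearson correlation, and the simulation (in which a hypothetical feature M raises the chance that taxon T is present but does not affect its abundance once present) are all our own assumptions.

```python
import numpy as np


def three_strategy_correlations(taxon, feature):
    """Pearson correlation between a taxon and a feature under three strategies."""
    taxon = np.asarray(taxon, dtype=float)
    feature = np.asarray(feature, dtype=float)
    present = taxon > 0
    return {
        # Strategy 1: all samples, absences treated as zero abundance
        "all_samples": np.corrcoef(taxon, feature)[0, 1],
        # Strategy 2: only samples in which the taxon is present
        "nonzero_only": np.corrcoef(taxon[present], feature[present])[0, 1],
        # Strategy 3: presence/absence only, abundance ignored
        "presence_absence": np.corrcoef(present.astype(float), feature)[0, 1],
    }


rng = np.random.default_rng(3)
n = 2000
m = rng.normal(size=n)                                  # hypothetical feature M
present = rng.random(n) < 1 / (1 + np.exp(-2 * m))      # M drives presence of T
abundance = np.where(present, rng.poisson(20, size=n) + 1, 0)  # abundance independent of M

res = three_strategy_correlations(abundance, m)
for name, value in res.items():
    print(f"{name}: {value:.3f}")
```

In this setting the presence/absence strategy detects the association, while the nonzero-only strategy correctly shows that M is unrelated to abundance among carriers; the all-samples strategy mixes the two effects into a single number.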

In addition to taxonomic heterogeneity, functional heterogeneity is another feature of microbiome data that challenges statistical methods for network inference. Most current methods for microbial network estimation, such as those by Kurtz et al. (2015) and Yang et al. (2017), assume that there exists a common microbial network underlying all samples in the data. However, the interaction between two microbial taxa or between a taxon and another type of feature may be context dependent and may vary from sample to sample. For example, the interaction between taxa in the human gut may depend upon the enterotypic context of the individual's gut microbiome. Recent statistical developments have been made on the joint estimation of multiple graphical models, which assumes the samples are from several known subpopulations (e.g., corresponding to several biological conditions) and allows a different network to be inferred for each subgroup (Chun et al., 2015; Lin et al., 2017). In addition, some emergent methods have been applied to genomic data to allow network heterogeneity among all samples, between or within biological conditions. For example, Luo and Wei (2018) developed a nonparametric Bayesian method to estimate dynamic transcription factor networks by borrowing information across biological conditions and meanwhile allowing heterogeneity across samples. Another example is mixGlasso (Städler et al., 2017), a latent variable extension of graphical lasso, which uses a mixture model to allow samples to be clustered into groups that can have different networks. Despite these recent statistical developments, methods have not been established to address the unique needs of microbiome data analysis and for the purpose of integrating microbiome multi-omics data.

# DISCUSSION

This review focuses on statistical network analysis methods that have been applied or have great potential to be applied to multiomics integration of microbiome data. Therefore, this review does not cover some of the other analytical methods and tools that are either not directly relevant to statistical network analysis or not specific to microbiome data but are still applicable to general multi-omics integration. For these more general methods and tools, we refer the readers to the following review papers. Bersanelli et al. (2016) categorized various data integration methods into four classes according to whether they are Bayesian and whether they are network-based, and they reviewed each class of methods focusing on their mathematical and methodological aspects. Li et al. (2018) provided a comprehensive review on omics and clinical data integration techniques from a machine learning perspective. Huang et al. (2017) separately reviewed unsupervised, supervised, and semisupervised data integration tools and their applications to predicting patient survival. Zeng and Lumley (2018) reviewed the traditional statistical methods of exploratory and supervised learning as well as their variations tailored to multi-omics studies. Mirza et al. (2019) discussed state-of-the-art machine learning-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

While our review focuses on data analysis, it is important to note that study design and data collection can impact data integration-based investigations. For example, in a multi-omics study, it is rarely the case that researchers are able to collect a complete data set in the sense that all feature types are measured for all samples. This incomplete coverage of samples can dramatically reduce the set of samples subject to integration. In a longitudinal multi-omics study of the gut microbial ecosystem in inflammatory bowel diseases (Lloyd-Price et al., 2019), 132 participants were followed for one year and their stool samples were collected every two weeks, resulting in 1,785 stool samples. However, given the difficulty of collecting all feature types (for example, metagenomics, metatranscriptomics, proteomics, and metabolomics) at each timepoint, the final data include only 305 samples that yielded all stool-derived feature types, whereas 791 samples offered paired metagenomic and metatranscriptomic data. As exemplified in this study, to derive networks depicting the relationships between certain pairs of feature types, one may need to rely on separate sets of samples for the two feature types. This strategy, compared with one in which paired multi-omic data are available on a common set of samples, would impact the accuracy and interpretation of the resulting networks. In addition to the above practical issue of missing data, considerations of study design can impact integration, such as whether the samples were collected longitudinally or cross-sectionally. Given that this is a very broad topic, we refer readers to additional review papers (Franzosa et al., 2015; Buescher and Driggers, 2016; Haas et al., 2017; Hasin et al., 2017) for more detailed discussions about how study design impacts multi-omics investigations.

The recent work by the Integrative Human Microbiome Project (iHMP, https://hmpdacc.org/ihmp/) exemplifies the power and promise of microbiome multi-omic data integration. As the second phase of the NIH Human Microbiome Project, iHMP aimed to link interactions between humans and their microbiomes to health-related outcomes by analyzing data sets on microbiome and host activities in longitudinal studies of disease-specific cohorts (Integrative HMP (iHMP) Research Network Consortium, 2014; Integrative HMP (iHMP) Research Network Consortium, 2019). Fortunately for the research community, the iHMP has made these measures publicly available as downloadable datasets that can serve as resources to test and evaluate new models, methods, and analyses, including the network methods reviewed in this paper. In fact, many of the individual studies conducted as part of iHMP have applied and/or developed network-based methods for integrating multi-omics data. For example, Lloyd-Price et al. (2019) applied integrative analysis to identify microbial, biochemical, and host factors central to the functional dysbiosis in the gut microbiome during inflammatory bowel disease activity. They constructed networks for associations of features from 10 feature types: metagenomic species, species-level transcription ratios, functional profiles at the Enzyme Commission level (metagenomes, metatranscriptomes, and proteomes), metabolites, host transcription (rectal and ileal separately), serology, and fecal calprotectin. In particular, they used mixed-effects regression models (which belong to the regression-based methods discussed in Section "Regression-Based Methods") to remove subject-specific random effects and covariate effects from each feature type, and then applied Spearman correlation (which belongs to the marginal correlation analysis methods discussed in Section "Marginal Correlation Analysis") to the resulting residuals to construct cross-feature type interactions.

We conclude this review with some final thoughts about microbiome multi-omics network analysis. Integrative network analysis holds great potential to resolve how microbes interact among themselves and with their environment. However, the application of such analyses to microbiome data remains nascent, and the requisite analytical tools have only begun to emerge. Fortunately, a growing number of statistical methods have been developed in the fields of network estimation and multi-omics data analysis, which provide a promising pool of ideas and methodologies to potentially borrow from. However, when applying these existing tools to microbiome multi-omics network inference, it is important to consider the limitations of the underlying methodologies and their applicability to microbiome studies. In particular, the unique features of microbiome data present pressing statistical challenges and often call for tailored computational tools. A thorough understanding of the unmet statistical needs and specific properties of microbiome data is critical to the innovation of efficient, robust, and scalable network inference methodologies suitable for microbiome multi-omics network inference. Meanwhile, awareness of the analytical challenges associated with microbiome data can facilitate the development of new study designs and technologies that have the potential to mitigate some of the major limitations currently hindering microbiome data analytics. An emerging example is the coupling of 16S data with measures of the total abundance of microorganisms in a sample, which is a possible way of circumventing the compositionality constraint in microbiome data. Going forward, joint statistical, scientific, and technological efforts will help promote the application of multi-omics network analysis to solve pressing problems in microbiome science.

# AUTHOR CONTRIBUTIONS

DJ, TS, and YJ led and conducted the review. CA, CH, MM, and CT contributed equally to the review and wrote the first draft of sections of the manuscript. DJ, TS, and YJ wrote the first draft of sections of the manuscript and contributed to the manuscript revision. All authors read and approved the final version.

# FUNDING

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM126549. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Jiang, Armour, Hu, Mei, Tian, Sharpton and Jiang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Reads Binning Improves Alignment-Free Metagenome Comparison

*Kai Song1†\*, Jie Ren2†‡ and Fengzhu Sun2\**

1 School of Mathematics and Statistics, Qingdao University, Qingdao, China, 2 Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, United States

Comparing metagenomic samples is a critical step in understanding the relationships among microbial communities. Recently, next-generation sequencing (NGS) technologies have produced a massive amount of short reads data for microbial communities from different environments. The assembly of these short reads can, however, be time-consuming and challenging. In addition, alignment-based methods for metagenome comparison are limited by incomplete genome and/or pathway databases. In contrast, alignment-free methods for metagenome comparison do not depend on the completeness of genome or pathway databases. Still, the existing alignment-free methods, $d_2^S$ and $d_2^*$, which model *k*-tuple patterns using only one Markov chain for each sample, neglect the heterogeneity within metagenomic data wherein potentially thousands of types of microorganisms are sequenced. To address this imperfection in $d_2^S$ and $d_2^*$, we organized NGS sequences into different reads bins and constructed several corresponding Markov models. Next, we modified the definition of our previous alignment-free methods, $d_2^S$ and $d_2^*$, to make them more compatible with a scheme of analysis which uses the proposed reads bins. We then used two simulated and three real metagenomic datasets to test the effect of the *k*-tuple size and Markov orders of background sequences on the performance of these *de novo* alignment-free methods. For dependable comparison of metagenomic samples, our newly developed alignment-free methods with reads binning outperformed alignment-free methods without reads binning in detecting the relationship among microbial communities, including whether they form groups or change according to some environmental gradients.

#### Edited by:

Lingling An, University of Arizona, United States

#### Reviewed by:

Marc Sze, Merck, United States Bryan David Martin, University of Washington, United States

#### \*Correspondence:

 Kai Song ksong@qdu.edu.cn Fengzhu Sun fsun@usc.edu

†These authors have contributed equally to this work

‡Present Address:

Jie Ren Google Inc., Mountain View, CA, United States

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 10 January 2019 Accepted: 22 October 2019 Published: 21 November 2019

#### Citation:

Song K, Ren J and Sun F (2019) Reads Binning Improves Alignment-Free Metagenome Comparison. Front. Genet. 10:1156. doi: 10.3389/fgene.2019.01156

Keywords: alignment-free methods, metagenomic samples, Markov model, reads binning, beta-diversity

# INTRODUCTION

Understanding the impact of environmental factors on the composition of microbial communities, along with the effects of microbes on their hosts, is a crucial problem in microbiological studies. Traditional culture-dependent techniques can obtain pure isolates of individual microbes, but such techniques are low-throughput and can capture only a tiny fraction of microbes in a microbial community. With the rapid development of next-generation sequencing (NGS) technology, whole metagenome shotgun sequencing (WMGS) has become a widely used and powerful approach to investigate complex microbial communities (Qin et al., 2010; Qin et al., 2012; Xie et al., 2016; Mehta et al., 2018). Several large-scale international metagenomics projects, including the Human Microbiome Project (HMP) (Lloyd-Price et al., 2019) and the TARA ocean project (Brum et al., 2015; Sunagawa et al., 2015), have been carried out, and most of the metagenomic samples have metadata available. Metagenomic data provide the whole genetic information from microbial communities. A metagenomic sample usually contains millions of short reads, each several hundred base pairs long and randomly sampled from a genomic region of a microbial genome in the community. Given the massive amount of metagenomic data, computational methods are in great demand to infer the relationships between microbes and environmental factors/hosts. Accurately quantifying the similarities and differences among microbial communities from multiple environments/hosts is one of the most important steps in metagenomic data analysis.

The most common approach to analyzing metagenomic data is based on alignment methods, such as the Smith-Waterman algorithm (Smith and Waterman, 1981) and BLAST (Altschul et al., 1990), which first map NGS reads to known genomes or pathways in existing public protein databases, such as non-redundant (NR), Kyoto Encyclopedia of Genes and Genomes (KEGG), and evolutionary genealogy of genes: Non-supervised Orthologous Groups (eggNOG), and then compare the abundance of different microbial organisms or functional categories between samples (Qin et al., 2010; Muegge et al., 2011; Qin et al., 2012). However, many microbial genomes and gene families are unknown, making it impossible to map all reads to the known genomes or pathways in many environments, which in turn makes the comparison of metagenomic samples incomplete. Based on the current literature, on average about 40% of reads from the human gut microbiome cannot be assigned (Qin et al., 2010; Qin et al., 2012), and up to 50% of reads cannot be assigned to reference databases in ocean samples (Marchetti et al., 2012). Apart from alignment-based methods, assembly-based analytical methods reconstruct bacterial genomes by assembling short reads. However, assembly is time-consuming and challenging, especially for metagenomic samples, because bacterial genomes can share similar regions and a short read is not long enough to resolve the ambiguity. These limitations leave alignment-free methods as promising alternative approaches for microbial community comparison because they eliminate the need for reference sequences or *de novo* assembly.

Although alignment-free methods can be defined as any methods that do not depend on sequence alignment, one of the major types of alignment-free methods is based on the frequencies of *k*-tuples (*k*-words or *k*-mers), as recently reviewed (Song et al., 2014; Zielezinski et al., 2017; Ren et al., 2018). A *k*-tuple is a segment consisting of *k* consecutive nucleotide bases. The effectiveness of these alignment-free methods for genome and metagenome comparison is based on the fact that relative *k*-tuple frequencies are similar across different regions of the same genome, but differ between genomes (Karlin et al., 1997). Similarly, the relative *k*-tuple frequencies of closely related genomes are more similar than those of distantly related genomes. The alignment-free dissimilarity measures, $d_2^S$ and $d_2^*$, were developed for high-throughput sequencing data comparison, and they were then used for phylogenetic tree construction (Song et al., 2013), followed by successful applications in the comparison of metagenomic samples (Jiang et al., 2012; Liao et al., 2016) and gene regulatory regions (Song et al., 2013), identification of horizontal gene transfer (Tang et al., 2018b) and virus-host interactions (Ahlgren et al., 2017), and improving contig binning for metagenomes (Wang et al., 2017). Recently, they have also been used to identify the geographic origin of white oak trees (Tang et al., 2018a) and sources of viruses (Li and Sun, 2018). A user-friendly interface for alignment-free genome and metagenome comparison, aCcelerated Alignment-FrEe (CAFÉ) (Lu et al., 2017b), has now been developed. Many other alignment-free methods have been developed, including the delta-distance between dinucleotide relative frequencies of different genomes (Karlin and Burge, 1995; Karlin and Mrázek, 1997) and CVTree (Qi et al., 2004a; Qi et al., 2004b). Ren et al. (2018) and Zielezinski et al. (2017) presented the most recent reviews of alignment-free methods for genome and metagenome comparisons and their many applications. Zielezinski et al. (2019) recently compared the performance of 74 alignment-free methods for protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. However, the authors did not evaluate the performance of these methods on metagenome comparison.

While the previous alignment-free methods were successful in comparing metagenomic samples, these methods (Jiang et al., 2012; Liao et al., 2016) only considered metagenomic sequencing data as a whole from which to extract *k*-tuple frequencies and calculate their expectations using a common Markov model. However, microbial communities contain thousands of microorganisms and the relative abundance profiles of the microbial communities were shown to change across many environmental factors, such as geographic distance, temperature, oxygen, pH, and biotic factors (Lozupone and Knight, 2007; Steele et al., 2011; Philippot et al., 2013). Different microbial organisms have varied nucleotide frequencies; therefore, it is unreasonable to use only one Markov chain to model the sequences in a microbial community and to calculate the probability of *k*-tuples. Instead, the present study posits that different Markov models can be used; accordingly, we first organized sequenced bacterial genomes and used them to construct the Markov models. These models were then used for grouping NGS reads into different bins, followed by extracting the *k*-tuples and calculating their expectation in each bin. Markov models have been used extensively for genome modeling (Narlikar et al., 2013), motif discovery (D'haeseleer, 2006), computational gene search (Lomsadze et al., 2005), classification of metagenomic sequences (Brady and Salzberg, 2009) and alignment-free sequence comparison (Chang and Wang, 2011). Next, we extended the definition of our previous alignment-free measures, $d_2^S$ and $d_2^*$, to make them more compatible with a scheme of analysis that uses the proposed reads binning datasets. We then used two simulated and three real metagenomic datasets to test the effect of *k*-tuple size and Markov orders of background sequences on the performance of these *de novo* alignment-free methods. For dependable comparison of metagenomic samples, our alignment-free methods with reads binning outperformed alignment-free methods without reads binning in detecting the relationships among metagenomic samples, whether they form groups or change according to environmental gradients. For detecting group relationships among samples, the triplet distance between the inferred tree and the gold standard tree is reduced by over 10%. For detecting gradient relationships among the samples, the Pearson correlation coefficient (PCC) between the first principal coordinate and the gradient is increased by 10%. The software is available at https://github.com/songkai1987/MetaBin.

## MATERIALS AND METHODS

The framework of our method is given in **Figure 1**. First, the bacterial sequences are divided into several bins and a Markov model is used to model the sequences in each bin. Second, each read in the metagenomic samples is assigned to the bin whose model has the highest probability of generating the read. Third, the *k*-tuple counts and their expectations are calculated in each bin of the NGS reads. The dissimilarities $d_2^S$ and $d_2^*$ (Eqs. 1 and 2) are then calculated between each pair of samples. Finally, the samples are clustered using the dissimilarity matrix obtained from $d_2^S$ and $d_2^*$. Details of each of the steps are given below.

#### The k-Tuple Count Vectors and Alignment-Free Comparison Measures

**FIGURE 1 |** […] has the highest likelihood. The *k*-tuple counts and their expectations are calculated in each bin of the NGS reads. $d_2^S$ and $d_2^*$ are calculated between each pair of samples. Finally, the samples are clustered using the dissimilarity matrix obtained from $d_2^S$ and $d_2^*$.

In our previous studies (Jiang et al., 2012; Song et al., 2013), the first step toward comparing metagenomic samples involved counting the number of occurrences of each *k*-tuple. Since a read could be from the forward or reverse strand of a genome, we considered each read together with its complement when calculating the occurrences of each *k*-tuple. Thus, for metagenomic data, we have a finite alphabet set $S = \{A, C, G, T\}$ and consider all possible *k*-tuples in the reads of metagenomic samples. Let $X = (X_1, X_2, \ldots, X_{4^k})$ and $Y = (Y_1, Y_2, \ldots, Y_{4^k})$ be the *k*-tuple count vectors of two metagenomic samples *X* and *Y*, respectively. Then, we define the centralized count variables by using Markov model-based expectation as

$$\begin{aligned} \overline{X}_i &= X_i - n_X p_{X,i} \\ \overline{Y}_i &= Y_i - n_Y p_{Y,i} \end{aligned}$$

where $n_X$ is the total count of *k*-tuples in sample *X*, and $p_{X,i}$ is the probability of the *i*-th *k*-tuple under the Markov model of order *r* for that sample; $n_Y$ and $p_{Y,i}$ are defined analogously for sample *Y*. The idea behind subtracting the expected *k*-tuple count from the observed count is that the *k*-tuples responsible for the similarity between microbial communities will stand out after subtraction. Then, the two measures $d_2^S$ and $d_2^*$ can be defined as

$$D_2^S(X,Y) = \sum_{i=1}^{4^k} \frac{\overline{X}_i \overline{Y}_i}{\sqrt{\overline{X}_i^2 + \overline{Y}_i^2}}$$

$$d_2^S(X,Y) = \frac{1}{2} \left( 1 - \frac{D_2^S(X,Y)}{\sqrt{\sum_{i=1}^{4^k} \frac{\overline{X}_i^2}{\sqrt{\overline{X}_i^2 + \overline{Y}_i^2}} \sum_{i=1}^{4^k} \frac{\overline{Y}_i^2}{\sqrt{\overline{X}_i^2 + \overline{Y}_i^2}}}} \right) \tag{1}$$

and

$$D_2^*(X, Y) = \sum_{i=1}^{4^k} \frac{\overline{X}_i \overline{Y}_i}{\sqrt{n_X p_{X,i}} \sqrt{n_Y p_{Y,i}}}$$

$$d_2^*(X, Y) = \frac{1}{2} \left(1 - \frac{D_2^*(X, Y)}{\sqrt{\sum_{i=1}^{4^k} \frac{\overline{X}_i^2}{n_X p_{X,i}} \sum_{i=1}^{4^k} \frac{\overline{Y}_i^2}{n_Y p_{Y,i}}}}\right) \tag{2}$$

The first statistic $D_2^S$ is based on the observation by Shepp (2006) that for two independent normal random variables *X* and *Y* with mean zero, $XY/\sqrt{X^2 + Y^2}$ is also normally distributed. The second statistic $D_2^*$ is motivated by Pearson correlation, where the mean and variance of each tuple are calculated based on a Poisson distribution assumption for the *k*-tuples. When the two samples are more similar, the *k*-tuple frequency profiles are more similar and the values of $D_2^S$ and $D_2^*$ are higher. The ranges of $D_2^S$ and $D_2^*$ can depend on the nucleotide frequencies. In order to make their range independent of nucleotide frequencies, we normalize them to dissimilarities, $d_2^S$ and $d_2^*$, respectively, so that they have a range between 0 and 1 according to the Cauchy inequality. When two samples are similar, the values of $d_2^S$ and $d_2^*$ are close to 0.
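As an illustration of Eqs. 1 and 2, the following sketch computes both dissimilarities from count vectors and *k*-tuple probabilities. It is a minimal reading of the formulas rather than the authors' implementation; the function names and the toy single-nucleotide (*k* = 1) counts are hypothetical, and the code assumes that every expected count is positive and that each sample has at least one nonzero centralized count.

```python
from math import sqrt

def centralize(counts, probs):
    """Return X_i - n_X * p_{X,i} for every k-tuple i, with n_X the total count."""
    n = sum(counts)
    return [x - n * p for x, p in zip(counts, probs)]

def d2S(xc, yc):
    """d_2^S (Eq. 1) from two centralized count vectors."""
    num = sx = sy = 0.0
    for a, b in zip(xc, yc):
        norm = sqrt(a * a + b * b)
        if norm == 0.0:  # this k-tuple contributes nothing in either sample
            continue
        num += a * b / norm
        sx += a * a / norm
        sy += b * b / norm
    return 0.5 * (1.0 - num / sqrt(sx * sy))

def d2star(xc, yc, ex, ey):
    """d_2^* (Eq. 2); ex and ey hold the expected counts n_X*p_{X,i} and n_Y*p_{Y,i}."""
    num = sx = sy = 0.0
    for a, b, e1, e2 in zip(xc, yc, ex, ey):
        num += a * b / (sqrt(e1) * sqrt(e2))
        sx += a * a / e1
        sy += b * b / e2
    return 0.5 * (1.0 - num / sqrt(sx * sy))

# Toy example with k = 1 and a uniform background model (hypothetical numbers).
probs = [0.25, 0.25, 0.25, 0.25]
x = centralize([30, 20, 20, 30], probs)   # AT-rich sample
y = centralize([20, 30, 30, 20], probs)   # GC-rich sample
expected = [25.0] * 4                     # n * p for both samples (n = 100)

same = d2S(x, x)                          # identical profiles
opposite = d2S(x, y)                      # mirrored profiles
opposite_star = d2star(x, y, expected, expected)
```

Identical profiles give a dissimilarity of 0 and exactly mirrored profiles give 1, matching the stated range of the normalized measures.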

#### The Alignment-Free Measures Based on a Mixture of Markov Models Learned From Reads Bins

Metagenomic samples consist of a mixture of many different microbial genomes; thus, it is unreasonable to expect that all these reads can be modeled using only one single Markov model for each sample. To address this difficulty, we first group these reads into different bins. Then, we count the *k*-tuple vectors and obtain the expectation of each *k*-tuple for the reads in each bin individually.
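The counting step applied to each bin can be sketched as follows. This is an assumption-laden illustration: the helper names are ours, words containing characters other than A, C, G, and T are skipped (a choice not stated in the text), and the "complement" of a read is interpreted as its reverse complement, which is the usual way of representing the opposite strand.

```python
from collections import Counter

ALPHABET = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Opposite-strand sequence read in the 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def count_ktuples(reads, k):
    """Count k-tuples over each read and its reverse complement."""
    counts = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                word = strand[i:i + k]
                if set(word) <= ALPHABET:  # skip words with N or other symbols
                    counts[word] += 1
    return counts

# Hypothetical read: "AAC" contributes AA, AC and, from its reverse complement GTT, GT and TT.
example = count_ktuples(["AAC"], 2)
```

Applying `count_ktuples` separately to the reads of each bin yields the per-bin count vectors used in the binned measures.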

We used the bacterial genomic sequences to train the Markov models. First, we calculated the guanine-cytosine (GC) frequency of each bacterial genomic sequence and then grouped these bacterial genomic sequences into different bins using the quantiles of the GC frequency distribution, so that each bin has the same number of bacterial genomes. The Markov model for each bin was then constructed using the *k*-tuple vectors counted from all the genomic sequences in that bin. For a set of genomic sequences in a bin, let $X_w$ be the count of *k*-tuple *w* in all these genomes and their complementary sequences. The Markov model of order *r* is defined as a $4^r \times 4$ matrix of transition probabilities. The transition probabilities can be estimated from the counts of *r*-tuples and (*r*+1)-tuples: the estimated probability of observing nucleotide $w_{r+1}$ given the preceding nucleotides $w_1 w_2 \cdots w_r$ is

$$P_{M_r}(w_{r+1} \mid w_1 w_2 \cdots w_r) = \frac{X_{w_1 w_2 \cdots w_r w_{r+1}}}{X_{w_1 w_2 \cdots w_r}}$$

where $X_{w_1 w_2 \cdots w_r}$ and $X_{w_1 w_2 \cdots w_r w_{r+1}}$ are the counts of the *r*-tuple $w_1 w_2 \cdots w_r$ and the (*r*+1)-tuple $w_1 w_2 \cdots w_r w_{r+1}$, respectively.
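The training procedure described above (GC-quantile binning followed by transition estimation) might be sketched as below. Several details are our assumptions rather than the authors' stated choices: the function names, the use of the reverse complement for the complementary strand, the optional `pseudocount` parameter, the uniform fallback for contexts that never occur, and the use of the summed (*r*+1)-tuple counts as the denominator, which equals the *r*-tuple count except at sequence ends.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_quantile_bins(genomes, n_bins):
    """Split genomes into n_bins groups of (near-)equal size ordered by GC content."""
    ordered = sorted(genomes, key=gc_fraction)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def train_markov(sequences, r, pseudocount=0.0):
    """Estimate order-r transition probabilities P(next base | r-base context)."""
    counts = Counter()
    for seq in sequences:
        for strand in (seq, reverse_complement(seq)):
            for i in range(len(strand) - r):
                counts[strand[i:i + r + 1]] += 1  # (r+1)-tuple counts
    model = {}
    for letters in product("ACGT", repeat=r):
        context = "".join(letters)
        row = {b: counts[context + b] + pseudocount for b in "ACGT"}
        total = sum(row.values())
        if total > 0:
            model[context] = {b: v / total for b, v in row.items()}
        else:  # unseen context: fall back to a uniform row (our assumption)
            model[context] = {b: 0.25 for b in "ACGT"}
    return model

# Hypothetical toy "genomes".
bins = gc_quantile_bins(["AAAA", "GGGG", "ACGT", "AAGT"], 2)
model = train_markov(["AAAA"], 1)
```

A positive `pseudocount` keeps every transition probability strictly positive, which is convenient when the model is later used to score reads on a log scale.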

Once we have *C* different Markov models of order *r*, $(M_r^1, M_r^2, \ldots, M_r^C)$, to model the bacterial genomic sequences, we assign each read in a metagenomic sample to the bin with the highest log-likelihood score. In particular, suppose $Y = y_1 y_2 \cdots y_N$ represents a read of length *N* in a metagenomic sample; then, the log-likelihood of the read under the Markov chain $M_r$ could be calculated as

$$LL(Y \mid M_r) = \sum_{i=1}^{N-r} \log P_{M_r}(y_{i+r} \mid y_i y_{i+1} \cdots y_{i+r-1})$$

Then, the read is assigned to the bin whose model gives the largest log-likelihood, or

$$\lambda = \underset{c=1,\ldots,C}{\arg\max} \, LL(Y \mid M_r^c) \tag{3}$$

where *λ* is the predicted bin to which the read belongs.
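The read classification rule of Eq. 3 can be sketched as follows. The model representation (a dictionary mapping each *r*-base context to a dictionary of next-base probabilities), the function names, and the toy models are hypothetical; the sketch also assumes strictly positive transition probabilities and reads containing only A, C, G, and T.

```python
from math import log

def log_likelihood(read, model, r):
    """Sum of log transition probabilities along the read under an order-r model."""
    total = 0.0
    for i in range(len(read) - r):
        context, nxt = read[i:i + r], read[i + r]
        total += log(model[context][nxt])  # requires a positive probability
    return total

def assign_bin(read, models, r):
    """Index (0-based) of the model with the largest log-likelihood (Eq. 3)."""
    scores = [log_likelihood(read, m, r) for m in models]
    return max(range(len(models)), key=scores.__getitem__)

def toy_model(p_at):
    """Hypothetical order-1 model in which every context emits A or T with total probability p_at."""
    row = {"A": p_at / 2, "T": p_at / 2, "C": (1 - p_at) / 2, "G": (1 - p_at) / 2}
    return {context: dict(row) for context in "ACGT"}

models = [toy_model(0.8), toy_model(0.2)]   # bin 0 is AT-rich, bin 1 is GC-rich
at_read_bin = assign_bin("ATATTAAT", models, 1)
gc_read_bin = assign_bin("GCGCCGGC", models, 1)
```

Ties are resolved in favor of the lowest bin index here, which is an arbitrary choice of this sketch.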

Next, we calculate the *k*-tuple count and its expectation in each bin of NGS reads. The centralized count variables, obtained from the Markov model-based expectations and combined over all *C* bins, are as follows:

$$
\overline{X}_w = \sum_{c=1}^{C} (X_w^c - n_X^c p_{X,w}^c) \tag{4}
$$

and

$$
\overline{Y}_w = \sum_{c=1}^{C} (Y_w^c - n_Y^c p_{Y,w}^c)
$$

where the superscript *c* indicates that the quantity is calculated from the *c*-th bin. Therefore, the two measures $d_2^S$ and $d_2^*$ could be defined using the new versions of $\overline{X}_w$ and $\overline{Y}_w$.
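To make Eq. 4 concrete, the sketch below pools centralized counts over bins. The function name, the dictionary-based representation of per-bin counts and probabilities, and the toy numbers (with *k* = 1) are hypothetical; how the per-bin *k*-tuple probabilities are obtained from each bin's Markov model is left outside the sketch.

```python
def pooled_centralized_counts(bin_counts, bin_probs):
    """Eq. 4: for each k-tuple w, sum over bins of (X_w^c - n^c * p_w^c)."""
    pooled = {}
    for counts, probs in zip(bin_counts, bin_probs):
        n_c = sum(counts.values())  # total k-tuple count in this bin
        for w, p in probs.items():
            pooled[w] = pooled.get(w, 0.0) + counts.get(w, 0) - n_c * p
    return pooled

# Two hypothetical bins of one sample, each with its own background model.
bin_counts = [
    {"A": 6, "C": 4, "G": 4, "T": 6},   # AT-rich bin, matches its model exactly
    {"A": 1, "C": 5, "G": 3, "T": 1},   # GC-rich bin, excess of C over G
]
bin_probs = [
    {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
]
binned = pooled_centralized_counts(bin_counts, bin_probs)

# For comparison: the same reads centralized with one uniform model for the whole sample.
whole = pooled_centralized_counts(
    [{"A": 7, "C": 9, "G": 7, "T": 7}],
    [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}],
)
```

With bin-specific models, only the C and G deviations in the second bin remain after centralization, whereas a single model for the whole sample also leaves residual deviations for A and T that merely reflect the mixture of compositions.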

#### Comparison With Other Reads Binning Approaches Without Reference Genomes

In addition to the above reads binning method, we also considered creating reference-free reads binning by first assembling reads into contigs and grouping contigs into bins. Metagenomic reads are then classified to different bins based on their similarity to the contigs in those bins. MetaSPAdes (Bankevich et al., 2012; Nurk et al., 2017) was used to cross-assemble the reads in the simulated datasets using the default setting. Contig coverages [Fragments Per Kilobase per Million reads (FPKMs)] were determined by mapping reads with Bowtie2 (Langmead and Salzberg, 2012), using the default settings, and were averaged for each bin. Sequence COmposition, read CoverAge, CO-alignment, and paired-end read LinkAge (COCACOLA) (Lu et al., 2017a) and MetaBAT (Kang et al., 2015) were used to cluster these assembled contigs (≥500 bp) based on sequence tetra-nucleotide frequencies and contig coverages normalized by contig length and number of mapped reads in samples, respectively. MetaBAT performed better than other approaches in the CAMI study (Meyer et al., 2018). The simulated reads were mapped to the set of contigs using Burrows-Wheeler-Aligner (BWA) software (Li and Durbin, 2009) to obtain the classification labels. The unmapped reads were binned together as an extra bin. We calculated the *k*-tuple counts and their expectation in each bin and then calculated the values of $d_2^S$ and $d_2^*$.

#### Comparison With Other Reads Binning Approaches With Reference Genomes

We compared our method with two reference genome-based reads binning approaches, Kraken (Wood and Salzberg, 2014) and MBMC (Wang et al., 2016), to classify the metagenomic reads. Kraken is a program for assigning taxonomic labels to metagenomic DNA sequences and it has been shown to perform better than other binning approaches, such as Megablast (Chen et al., 2015), PhymmBL (Brady and Salzberg, 2009), NBC (Rosen et al., 2008) and MetaPhlAn (Segata et al., 2012). The core of Kraken is a database consisting of *k*-tuples and the lowest common ancestor (LCA) of all organisms whose genomes contain the *k*-tuples. Sequences are classified by querying the database for each *k*-tuple in a sequence, and then using the resulting set of LCA taxa to determine an appropriate label for the sequence. To compare with our method, the 100 bacterial genomes in simulations were used to construct the genome library for *k*-tuples and their LCAs in Kraken. MBMC is a recent approach for binning reads by measuring the similarity of reads to the trained Markov chains for different taxa using the ordinary least squares (OLS) method. Similarly, the 100 bacterial genomes in simulations were also used for constructing the Markov chains in MBMC. Each of the two approaches was then used to classify reads into different bins individually. We calculated the *k*-tuple counts and their expectations in each bin and then calculated the values of $d_2^S$ and $d_2^*$.

#### Beta-Diversity Analysis and Evaluation Methods

Detection of group relationships among metagenomic samples and the identification of external gradients driving shifts in microbial community structure are two major types of analytical tasks in microbial community comparison. Therefore, we evaluated the performance of our new alignment-free measures in metagenomic sample comparison by assessing how well they would detect the known group relationships or identify known environmental gradients.

For clustering analysis, we used the unweighted pair-group method with arithmetic means (UPGMA) algorithm (Murtagh, 1984) to cluster metagenomic samples based on the pairwise dissimilarity defined using our alignment-free measures, and then we compared the clustering tree with the true group relationship among the samples. We used the R package "phangorn" (Schliep, 2011) for clustering samples given the input of the pairwise dissimilarity matrix. The triplet distance was used to measure the distance between the tree built using our methods and the ground truth. The triplet distance was proposed by Critchlow et al. (1996) as a measure of the distance between two rooted bifurcating phylogenetic trees, and it can be used for measuring the distance between binary (Critchlow et al., 1996) or non-binary trees (Bansal et al., 2011). This measure first decomposes the topologies of the input trees into triplets, i.e., all three-element subsets of the set of leaves, and then counts how many triplets have different topologies in the two trees. Because triplets are the basic building blocks of rooted and unrooted trees, in the sense that they are the smallest topological units that completely identify a phylogenetic tree, triplet-based distances provide a robust and fine-grained measure of the dissimilarities between trees (Bansal et al., 2011). The triplet distance is implemented in the TreeCmp toolbox (Bogdanowicz et al., 2012).
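The triplet decomposition can be made concrete with a brute-force sketch. This is a minimal illustration for small rooted trees and is not the TreeCmp implementation; the nested-tuple tree encoding and the function names are assumptions made here. For each three-leaf subset, the induced topology is identified by the pair of leaves with the deepest common ancestor, or marked unresolved when no pair is closer than the others.

```python
from itertools import combinations

def leaf_paths(tree, prefix=()):
    """Map each leaf label to its path of child indices from the root.

    A tree is a nested tuple; anything that is not a tuple is a leaf label.
    """
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths

def lca_depth(path_a, path_b):
    """Depth of the lowest common ancestor = length of the shared prefix."""
    depth = 0
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            break
        depth += 1
    return depth

def triplet_topology(paths, a, b, c):
    """Return the closest pair of the triplet, or None if it is unresolved."""
    depths = {
        frozenset(pair): lca_depth(paths[pair[0]], paths[pair[1]])
        for pair in ((a, b), (a, c), (b, c))
    }
    deepest = max(depths.values())
    closest = [pair for pair, depth in depths.items() if depth == deepest]
    return closest[0] if len(closest) == 1 else None

def triplet_distance(tree1, tree2):
    """Count the leaf triplets whose induced topology differs between trees."""
    paths1, paths2 = leaf_paths(tree1), leaf_paths(tree2)
    if set(paths1) != set(paths2):
        raise ValueError("trees must have the same leaf set")
    return sum(
        triplet_topology(paths1, *t) != triplet_topology(paths2, *t)
        for t in combinations(sorted(paths1), 3)
    )

reference = ((("A", "B"), "C"), "D")
inferred = ((("A", "C"), "B"), "D")
distance = triplet_distance(reference, inferred)
```

The enumeration visits every three-leaf subset, so it is only practical for small trees; it is meant to show what is being counted rather than how to count it efficiently.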

For the study of gradient relationships among the samples, the shift of metagenomic samples is visualized by PCoA (Principal Coordinates Analysis), which is a multidimensional scaling (MDS) method that converts a between-sample dissimilarity matrix into two-dimensional or three-dimensional coordinates of samples and arranges the samples in this coordinate space. We used the MASS package in R for PCoA (Anderson, 2003). Then, the influence of environmental gradient(s) on microbial communities could be investigated by calculating a correlation, such as the Pearson correlation coefficient (PCC), between the first principal coordinate and the gradient axis. In this way, the performance of the alignment-free methods could be evaluated, as long as the gradient driving microbial communities is known.
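The evaluation step above can be sketched in a few lines. The following is a pure-Python stand-in for illustration and is not the R workflow used in the study: it double-centers the squared dissimilarities, extracts the first principal coordinate by power iteration (which assumes the dominant eigenvalue is positive), and then computes the PCC against a known gradient. All function names are hypothetical.

```python
import math

def double_center(dist):
    """Return B = -1/2 * J * D^2 * J for a symmetric dissimilarity matrix."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in squared]
    grand_mean = sum(row_mean) / n
    return [
        [-0.5 * (squared[i][j] - row_mean[i] - row_mean[j] + grand_mean)
         for j in range(n)]
        for i in range(n)
    ]

def first_principal_coordinate(dist, iters=200):
    """First PCoA axis via power iteration on the double-centered matrix."""
    b = double_center(dist)
    n = len(b)
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    eigenvalue = 0.0
    for _ in range(iters):
        prod = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in prod))
        if norm == 0.0:
            break
        vec = [x / norm for x in prod]
        eigenvalue = norm
    return [math.sqrt(eigenvalue) * x for x in vec]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy check: samples placed along a one-dimensional gradient.
gradient = [0.0, 1.0, 2.0, 3.0]
dist = [[abs(a - b) for b in gradient] for a in gradient]
pc1 = first_principal_coordinate(dist)
pcc = pearson(pc1, gradient)
```

Because the sign of a principal coordinate is arbitrary, the magnitude of the correlation is the quantity of interest when comparing dissimilarity measures in this sketch.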

#### Simulated Metagenomic Datasets

We simulated two NGS metagenomic datasets using the Next-generation Sequencing Simulator for Metagenomics (NeSSM) (Jia et al., 2013), which supports single-end and paired-end sequencing for both 454 and Illumina platforms. Paired-end short reads of length 150 bp were generated in the Illumina MiSeq setting mode based on abundance profiles. Because 1) the database of reference genomes is not complete and 2) new genomes can be discovered in the future, we mimicked this situation by splitting the reference genomes at May 2015, such that the genomes before this date were used for training the Markov chain models, and the genomes after this date were used to simulate the metagenomic datasets for testing. A set of 100 bacterial species randomly sampled from the 5,865 sequenced bacterial reference genomes from NCBI was used for simulation (**Table S1**). We designed two sets of metagenomic samples representing the two types of relationships among samples, as was done by Jiang et al. (2012): the group relationship involving species abundance levels of the samples belonging to different groups and the gradient relationship involving species abundance levels that change continuously with some environmental variables, such as temperature or location.

In Simulation 1, we simulated 60 samples belonging to three groups. For each group, we randomly chose 100 genomes and assigned the *m*-th genome a relative abundance generated from the power-law (Zipf's) distribution

$$f(m; \alpha, N) = \frac{1/m^{\alpha}}{\sum_{n=1}^{N} 1/n^{\alpha}}, \qquad m = 1, 2, \ldots, N,$$

where N = 100, and α is the value of the exponent characterizing the distribution. We set α = 0.3 and generated three relative abundance vectors from the power-law distribution by randomly ordering the 100 genomes as the centers of the three groups. We next added to each component the absolute value of a Gaussian noise with mean zero and variance equal to 10 times each component and then renormalized each component to sum to 1. Each relative abundance vector was randomized and renormalized 20 times, and a total of 60 relative abundance vectors were obtained. Then, we used the relative abundance vectors to simulate 60 metagenomic samples.
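The abundance-profile construction for Simulation 1 can be sketched as follows. This is a minimal illustration of the description above, not the authors' simulation code: the function names and the random seed are hypothetical, and the noise standard deviation is taken as the square root of 10 times each component, which is one reading of "variance equal to 10 times each component". Read simulation with NeSSM is not covered.

```python
import random

def zipf_abundance(n_genomes, alpha):
    """Power-law (Zipf) relative abundances f(m; alpha, N), m = 1..N."""
    weights = [1.0 / (m ** alpha) for m in range(1, n_genomes + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def perturb(center, variance_factor, rng):
    """Add |Gaussian noise| with variance = factor * component, renormalize."""
    noisy = [
        p + abs(rng.gauss(0.0, (variance_factor * p) ** 0.5))
        for p in center
    ]
    total = sum(noisy)
    return [x / total for x in noisy]

def simulate_groups(n_groups=3, per_group=20, n_genomes=100,
                    alpha=0.3, seed=0):
    """Return (group index, abundance vector) pairs for all samples."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    base = zipf_abundance(n_genomes, alpha)
    samples = []
    for group in range(n_groups):
        center = base[:]
        rng.shuffle(center)  # random ordering of genomes = group center
        for _ in range(per_group):
            samples.append((group, perturb(center, 10.0, rng)))
    return samples

samples = simulate_groups()
```

Each group center is the same Zipf profile under a different random ordering of the genomes, so groups differ in which genomes are abundant rather than in the overall shape of the abundance distribution.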

In **Simulation 2**, we generated 20 samples consisting of the same 100 genomes, and the relative abundance vector of 100 genomes was generated by the power-law (Zipf's law) distribution as defined in the above simulation. In order to mimic the gradient model, the relative abundance vector shifts along a gradient axis of *α* from 0.30 to 0.70 by step 0.02. Again, absolute values of Gaussian noises were added to each component of the 20 abundance vectors with mean 0 and standard deviation equal to the value of that component. The vectors were renormalized after adding the noises. We generated 20 metagenomic samples according to these relative abundance vectors using NeSSM.

In all simulations, we generated datasets at two sequencing depths: 0.1M and 0.5M sequencing reads per sample. At each setting, we generated 30 duplicated datasets to simulate possible stochastic effects in real NGS data.

We analyzed three real shotgun metagenomic sequencing datasets published in recent years. For real datasets, we used all genomic sequences to train the Markov models.

#### The Human Gut Datasets

The first dataset includes 107 fecal microbiome samples from Asia (Kurokawa et al., 2007; Qin et al., 2012), Europe (Qin et al., 2010) and North America (Turnbaugh et al., 2009). The dataset includes samples from two countries (China and Japan, n = 45 and 13) in Asia, two countries (Denmark and Spain, n = 21 and 10) in Europe, and one country (USA, n = 18) in North America. The accession numbers for the samples are given in **Table S2** in the supplementary material. We investigated this dataset at two levels. First, we considered the samples from different continents and studied the relationships among these samples. Then, we considered the samples from different countries and studied the relationships among these samples with respect to their countries of origin.

#### The Human Microbiome Datasets

The second dataset includes 60 microbiome samples from four body sites: buccal mucosa, supragingival plaque, tongue dorsum and stool (Lloyd-Price et al., 2017). The accession numbers for the samples are given in **Table S3** in the supplementary material. We investigated the relationships among these microbial samples from different body sites.

#### The Soil Metagenomic Dataset

This dataset includes 16 soil metagenomic samples from 16 sites: 3 from hot deserts, 6 from Antarctic cold deserts, and 7 from temperate and tropical forests, a prairie grassland, a tundra, and a boreal forest (Fierer et al., 2012). The accession numbers of these samples are given in **Table S4** in the supplementary material. The sites span a wide range of ecologically distinct microbiomes to examine how cold desert soils compare with those from hot deserts, forests, prairie, and tundra. We investigated the relationships among these different ecologically distinct microbiomes and explored their relationship to environmental factors, such as pH values.

# RESULTS

We conducted a series of computational experiments including both intensive simulations and real dataset analyses to study the effect of *k*-tuple-based alignment-free methods with or without reads binning on identifying group and gradient relationships of metagenomic samples. To accomplish this, we first simulated two types of metagenomic datasets to investigate the performance of our newly developed alignment-free measures $d_2^S$ and $d_2^*$, and the effect of several factors, such as the *k*-tuple size and Markov orders of background sequences, on their performance. The simulated datasets were generated based on sampling reads from one hundred bacterial genomes randomly chosen from those detected after June 2015 with different abundance levels. The genomes discovered before May 2015 were used for training the Markov models for reads binning. We binned bacterial genomes by their GC content, and then, for each bin, we trained a Markov chain to model sequences in that bin. For reads in the simulated metagenomic samples, we classified them into different bins based on their likelihood evaluated under the corresponding Markov models [Eq. (3)]. The *k*-tuple frequency vectors were counted and normalized individually for each group [Eq. (4)]. Finally, the pairwise alignment-free dissimilarities, $d_2^S$ and $d_2^*$, were computed between samples based on Eqs. (1, 2), and β-diversity analysis was implemented to evaluate how well the true underlying relationship among samples could be recovered by our method. We also compared our newly developed methods with the original versions of the alignment-free measures of Jiang et al. (2012) and Song et al. (2013), which were based on *k*-tuples but without reads binning. In addition, we compared our approach with two reference-free binning methods, COCACOLA and MetaBAT, and two other reference-based binning methods, Kraken and MBMC.
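The likelihood-based reads binning described above can be sketched as a generic maximum-likelihood rule. This is a minimal illustration and not necessarily the exact form of Eq. (3): the add-one pseudocount, the omission of the initial-state probability, the restriction to the alphabet ACGT, and the toy training sequences are all assumptions made here, and the function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"

def train_markov(sequences, order, pseudocount=1.0):
    """Estimate log transition probabilities of an order-r Markov chain."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0
    model = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[context][base] for base in ALPHABET)
        total += len(ALPHABET) * pseudocount
        model[context] = {
            base: math.log((counts[context][base] + pseudocount) / total)
            for base in ALPHABET
        }
    return model

def log_likelihood(read, model, order):
    """Log-likelihood of a read, ignoring the first `order` bases."""
    return sum(
        model[read[i - order:i]][read[i]] for i in range(order, len(read))
    )

def assign_bin(read, models, order):
    """Index of the Markov model under which the read is most likely."""
    return max(
        range(len(models)),
        key=lambda b: log_likelihood(read, models[b], order),
    )

# Toy training sets standing in for a low-GC and a high-GC genome bin.
low_gc = train_markov(["ATATATTATAATATAT" * 5], order=1)
high_gc = train_markov(["GCGCGGCGCCGCGCGC" * 5], order=1)
models = [low_gc, high_gc]
bin_of_at_read = assign_bin("ATTATATA", models, order=1)
bin_of_gc_read = assign_bin("GCGGCGCC", models, order=1)
```

In practice each model would be trained on all genomes in one GC-content bin, and every read in a sample would be passed through `assign_bin` before *k*-tuples are counted per bin.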

#### Simulation 1: Detecting Group Relationships Among Metagenomic Samples

In some situations, metagenomic samples may form different groups. For example, gut samples may group based on diet, and soil samples may group based on locations. In order to evaluate the ability of dissimilarity measures to detect such group relationships, we simulated datasets of 60 metagenomic samples belonging to three different groups (20 samples in each group), similar to the simulation design of Jiang et al. (2012). Each sample was generated by simulating NGS reads from a mixture of 100 bacterial genomes detected after June 2015 with different abundance levels (see Materials and Methods for details).

We applied our newly developed alignment-free measures $d_2^S$ and $d_2^*$ to detect group relationships of the 60 samples by clustering analysis. We studied various factors, including the number of bins, the order of the Markov model for the background sequences, the tuple size *k*, and sequencing depth, all affecting the performance of $d_2^S$ and $d_2^*$ in recovering the group relationships among the samples. **Figure 2** showed that both $d_2^S$ and $d_2^*$ dissimilarity measures with reads binning outperform the original versions without reads binning. The best clustering result with the smallest triplet distance is obtained by $d_2^S$ with reads binning using tuple size *k* = 5, Markov order 3 (**Figure 3**). To test if the lowest triplet distance is statistically significantly lower than the second lowest triplet distance, we generated 10 duplicated datasets to simulate possible stochastic effects in real NGS data and obtained the triplet distances between the inferred clustering and the reference cluster for each duplication. Using a paired t-test, the resulting one-sided p-value is less than 0.0005, indicating that the lowest and the second lowest triplet distances are statistically significantly different. In **Table 1**, we fixed the tuple size at 5 for $d_2^S$ and $d_2^*$, and compared the effect of the number of read bins on recovering group relationships. The results showed that alignment-free methods without reads binning had the largest values of triplet distance, i.e., the worst performance, compared to alignment-free methods with reads binning into 2 to 5 bins, which improved performance. Reads binning into 3, 4, or 5 bins achieved similar performance. The simulations using relatively shallow sequencing with 100,000 paired-end reads also gave results similar to those of deeper sequencing with 500,000 paired-end reads (**Figure S3**).

TABLE 1 | The triplet distances between the reference and the clustering trees using various numbers of bins for the reads with tuple size k = 5 and background sequence Markov order from 0 to 3 for Simulation 1 at sequencing depth of 500,000 next-generation sequencing paired-end reads.

grouped to 4 bins, tuple size k = 5, and background sequence Markov order = 3.

The two lowest triplet scores are in boldface.

We next investigated the effects of sequencing errors on the performance of our methods, and the results are shown in **Figure S1(a, b)** in the supplementary material. As expected, sequencing errors could affect the accuracy of the reads assembly and contig binning, which in turn affect the clustering results. The triplet distance did not increase significantly with the sequencing error rate until the rate reached 0.05 (**Figure S1**, p-value < 0.05 for t-tests). For reference, the sequencing error rates of Illumina and 454 platforms are ~0.001 and ~0.01, respectively (Glenn, 2011), so sequencing errors only slightly impact the performance of the measures at the reported error rates for the NGS technologies.

We next considered other reference-independent and reference-dependent ways to construct Markov chain models. We cross-assembled the reads from the 60 metagenomic samples and used COCACOLA (Lu et al., 2017a) and MetaBAT (Kang et al., 2015), two reference-independent contig binning methods, to bin these contigs, respectively. We also used two reference-based reads binning methods, Kraken (Wood and Salzberg, 2014) and MBMC (Wang et al., 2016), based on bacterial genomes to group the metagenomic reads into different bins. Then, Markov chain models were constructed for each contig bin, and reads were then classified in the same way to each contig bin based on their likelihood under different Markov models. We compared these reads binning schemes with our approach. **Figure 2** shows the corresponding results. It can be seen that all these reads binning schemes are better than the original version without any reads binning procedure, but they do not perform as well as the above scheme based on binning from Markov chains.

#### Simulation 2: Revealing Environmental Gradients From Metagenomic Samples

The second simulation experiment was designed to evaluate the effectiveness of the alignment-free methods for analyzing gradient variation of microbial communities. A set of 20 metagenomic samples was generated by simulating NGS reads from 100 bacterial species also used in the above simulations with varying abundance levels. We designed the proportion of the 100 genomes to vary from sample 1 to sample 20 in a way that would mimic gradient variation across the samples, and then, we evaluated the performance of the alignment-free methods in terms of revealing such gradient variations from the metagenomics data.

Dissimilarity matrices were calculated using the alignment-free methods with different *k*-tuple sizes and Markov orders of background sequences as above. PCoA (Anderson, 2003), an effective approach to display β-diversity among multiple samples, mapped the 20 samples to a two-dimensional space. Then, the PCC was calculated between the first principal coordinate (PC1) given by PCoA and the predetermined gradient axis built into the simulation model. PCC can be taken as an index of how well the alignment-free method reveals the gradient variation in samples (see *Materials and Methods* for details). A higher PCC indicates better performance of the dissimilarity measure in recovering the gradient among the microbial samples.

Similar to Simulation 1, we generated two sequencing depths of 100,000 and 500,000 paired-end reads per sample. **Figure 4** showed the average PCC of the different dissimilarity measures at different tuple sizes and Markov orders of background sequences. Similar to the results in Simulation 1, reads binning improved the results compared to no binning for both alignment-free measures, $d_2^S$ and $d_2^*$. The PCC values increased with tuple size and Markov order. For a fixed bin number of reads and tuple size, the PCC values increased more than 0.10 from order 0 to order 4, indicating that higher order Markov chains could model the genomic sequences better. The performance of $d_2^*$ is slightly better than that of $d_2^S$ for gradient detection. The best result with the largest PCC value was obtained by $d_2^*$ with reads binning using tuple size *k* = 9 and background Markov order 4. To test if the highest PCC is statistically significantly higher than the second highest PCC, we generated 10 duplicated datasets to simulate possible stochastic effects in real NGS data and obtained the PCC for each duplication. Using a paired t-test, the resulting one-sided p-value is less than 0.0005. In **Table 2**, we fixed the tuple size at 9 for $d_2^S$ and $d_2^*$, and compared the effect of the number of read bins on recovering gradient relationships. Again, results showed that the alignment-free methods without reads binning had the lowest values of PCC, i.e., worst performance, while methods with reads binning into 2 to 5 bins improved performance. For a given order of Markov chain, the PCCs corresponding to binning reads into 3, 4, or 5 bins are similar, indicating that the number of reads bins does not markedly affect the performance of our methods when the bin number is at least 3.
The simulations using relatively shallow sequencing with 100,000 paired-end reads also gave results similar to those of deeper sequencing with 500,000 paired-end reads (**Figures S4** and **S5**). **Figure S1(c, d)** showed that the PCC values only decreased significantly when the sequencing error rate was 0.05, suggesting that sequencing errors only slightly impact the performance of the measures. **Figure 4** also shows that the other reads binning schemes (COCACOLA, MetaBAT, Kraken, and MBMC) are better than the original version without any reads binning, but they do not perform as well as the above scheme based on binning from Markov chains.

### Detecting Group Relationships Among Human Gut Samples

We applied the alignment-free methods to analyze human gut metagenomic datasets from different countries. These datasets include 107 fecal microbiome samples from Asia (Kurokawa et al., 2007; Qin et al., 2012), Europe (Qin et al., 2010) and North America (Turnbaugh et al., 2009). Two countries (China and Japan, n = 45 and 13) are from Asia, two countries (Denmark and Spain, n = 21 and 10) are from Europe, and one country (USA, n = 18) is from North America. In the simulation results, we found that the triplet distance and PCC values of the alignment-free dissimilarity measures $d_2^S$ and $d_2^*$ could achieve the best performance when the NGS reads were classified to four bins. Consequently, in the real data analysis, we used all the bacterial genomic sequences both before May 2015 and after June 2015 to construct four different Markov models to bin these NGS reads.

TABLE 2 | The Pearson correlation between the first principal coordinate and the simulated environmental gradient using different numbers of bins for the reads with tuple size k = 9 and Markov order from 0 to 4 for Simulation 2 at sequencing depth of 500,000 next-generation sequencing paired-end reads.


The two highest Pearson correlations are in boldface.

FIGURE 4 | The relative performance (Pearson correlation coefficient) of various reads binning methods in recovering gradient relationships of the metagenomic samples for Simulation 2 at sequencing depth of 500,000 next-generation sequencing paired-end reads. The background sequence Markov orders were two (a1, a2), three (b1, b2) and four (c1, c2). The dissimilarity measures $d_2^S$ and $d_2^*$ with binning into 4 bins outperform other binning methods in most situations. The corresponding figures based on Markov order zero and one are presented as Figure S4 in Supplementary Material.

First, we used alignment-free measures, $d_2^S$ and $d_2^*$, with tuple size 9 and Markov order 4 to explore the relationship among these human gut metagenomic samples. Similar to the simulation studies, we used UPGMA to cluster the samples based on the dissimilarity matrix, as defined by different dissimilarity measures based on sequence signatures. **Figure S6** showed that these human gut samples could be clustered into four different groups labeled with different colors. The Japanese and American samples could be clearly separated from other groups with no overlaps. Most Chinese and European samples could be grouped separately, but with some overlaps. The samples from Denmark and Spain could not be distinguished from each other. A previous study (Costea et al., 2018) showed that the gut microbial community of both Chinese and European samples was enriched with *Firmicutes*, *Bacteroides*, and *Prevotella*; however, the American samples all indicated a high-fat diet and were enriched with only *Bacteroides*. Therefore, both Chinese and European samples had similar microbial composition and should first be clustered together and then clustered again with the Japanese samples. The American samples have distinct gut microbial composition and should be separated from other samples.

We next calculated the triplet distance based on the four divided groups for $d_2^S$ and $d_2^*$. The results of triplet distance scores for the different dissimilarity measures are summarized in **Table 3**. The smallest triplet distance score was achieved with $d_2^S$ coupled with tuple size *k* = 6 and the fourth order Markov chain model of background sequences. When the order of Markov chains was four, the triplet distances were all lower than 30,000 for tuple size *k* from 6 to 9. In addition, triplet distance decreased with increasing Markov order for any fixed tuple size. The best performance was achieved when tuple size was *k* = 6 or 7 and Markov order = 4, similar to the *k*-tuple in Simulation 1. **Figure 5** showed the cluster tree using UPGMA for $d_2^S$ with tuple size *k* = 6 and Markov order 4. **Table S5** showed the confusion matrix for $d_2^S$ with tuple size *k* = 6 and Markov order 4. **Figure S7** showed the PCoA plot of these 107 samples. In this rooted tree, we found that American samples were separated from other samples and that the Japanese samples were separated from the Chinese and European samples. Although some European samples were mixed with the Chinese samples, most European samples clustered together.

#### Detecting Group Relationships Among Human Body Sites

We applied the alignment-free methods to analyze human metagenomic datasets from four body sites: buccal mucosa, supragingival plaque, tongue dorsum, and stool (Lloyd-Price et al., 2017). Each body site had fifteen samples. We calculated the pairwise $d_2^S$ and $d_2^*$ dissimilarities for any pair of samples and built a hierarchical clustering tree. We next calculated the triplet distance between the clustering tree and the four divided groups based on body sites. **Table 4** showed that the smallest triplet distance score was achieved with $d_2^S$ coupled with tuple size *k* = 6 and the fourth order Markov model of background sequences. **Figure 6** showed the cluster tree using UPGMA for $d_2^S$ with tuple size *k* = 6 and Markov order 4. **Table S6** showed the confusion matrix for $d_2^S$ with tuple size *k* = 6 and Markov order 4. In this rooted tree, we found that supragingival plaque and tongue dorsum samples were first grouped together and then clustered with the stool samples and buccal mucosa samples, consistent with the results from a previous study (Lloyd-Price et al., 2017).

American samples.

TABLE 3 | The triplet distance between the reference and the clustering trees for the 107 human fecal metagenomic samples using various reads binning methods with tuple size k = 5–9 and background sequence Markov order from 0 to 4.

The two lowest triplet distances are in boldface.

#### Detecting Group and Gradient Variations in Soil Metagenomic Data

We next applied the alignment-free methods to analyze the metagenomic data of soil microbial communities collected from different geographic locations, spanning a wide range of ecologically distinct biomes, to examine how cold desert soils would compare with hot desert soils, forests, prairie, and tundra (Fierer et al., 2012).

The 16 soil samples form three ecologically distinct groups: hot deserts (n = 3), cold deserts (n = 6), and worldwide forests (n = 7). We conducted clustering analysis with sequence signatures of these samples and used triplet distance to study how well the grouping information was revealed (**Table 5**). Again, for all tuple size values, it can be seen that the performance of the alignment-free methods improved along with reads binning. Under reads binning, $d_2^*$ coupled with tuple size *k* = 6 and the fourth order Markov model of background sequences achieved the best performance (**Tables 5** and **S7**, **Figure 7**). We observed that the three major groups identified by the alignment-free methods, $d_2^S$ and $d_2^*$, reflected three major ecologically distinct conditions. The main factor that differentiates these soil samples is pH which, in polar and hot deserts, is higher than 7.00, but in worldwide forests lower than 7.00. These three groups of samples had different ranges of pH values. The pH of polar desert ranged from 8.15 to 9.95, while the pH values of hot desert ranged from 7.90 to 8.38. The pH values of worldwide forests ranged from 4.12 to 6.37. In the forest soil samples, the two samples from tropical forest (PE6) and Arctic tundra (TL1) with lowest pH values (4.12 and 4.58) were first clustered together and then clustered again with other forest samples. In order to test whether pH was the main environmental driver of microbial community composition, we tested the correlation between pH values and the first principal coordinate of these samples, and a highly significant negative correlation was found, as shown in **Figure S8** (Pearson correlation = −0.856, p-value = 0.0001). We also examined the correlations of the first to fourth principal coordinates of these samples with other environmental factors, including mean annual precipitation (MAP), mean annual temperature (MAT), organic Carbon content (%C), Nitrogen content (%N), and Carbon : Nitrogen ratio (C:N ratio). The first principal coordinate was also associated with the %C, %N, and C:N ratio (p-values < 0.01). But for the second, third, and fourth principal coordinates, the associations were not significant (**Table S8**).

The two lowest triplet distances are in boldface

# DISCUSSION

In this study, we developed new alignment-free measures $d_2^S$ and $d_2^*$ for the comparison of metagenomes that model metagenomic reads as arising from a mixture of multiple Markov chains. We investigated the applications of the new alignment-free measures to compare metagenomic samples. Because of the high complexity of metagenomic data, the previous version of alignment-free measures $d_2^S$ and $d_2^*$ in (Jiang et al., 2012) that used only one background Markov model could not capture data heterogeneity. We proposed to first group reads in metagenomic samples into various bins using different Markov models. Then, *k*-tuple frequency vectors were counted and normalized individually in each bin. With the newly developed mixture model for computing the *k*-tuple expectations, we found that the modified $d_2^S$ and $d_2^*$ measures with reads binning outperformed the old ones in terms of recovering group and gradient relationships among samples from different environments. We extensively tested the methods on two sets of simulated metagenomic data and three sets of real metagenomic data, including metagenomes of human gut samples, human body sites, and worldwide soil samples. The effects of tuple size *k*, Markov order, and the bin number on the performance of our newly developed alignment-free measures were investigated, and the optimal ranges of those parameters were obtained.
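For orientation, the way centered *k*-tuple counts enter the two dissimilarities can be sketched as follows. This sketch uses the single-background forms of $d_2^*$ and $d_2^S$ as they are commonly stated in the earlier alignment-free literature, which is an assumption here; it does not reproduce Eqs. (1, 2) or the mixture-model expectations of this study. The function names, and the toy counts and expected counts, are hypothetical, and inputs are assumed to give nonzero normalizing terms.

```python
import math

def d2_star(x, y, ex, ey):
    """d2*-type dissimilarity from k-tuple counts and expected counts.

    Counts are centered by their expectations and scaled by the square
    root of the expected counts (single-background form, assumed here).
    """
    xt = [a - e for a, e in zip(x, ex)]
    yt = [b - e for b, e in zip(y, ey)]
    num = sum(a * b / math.sqrt(p * q)
              for a, b, p, q in zip(xt, yt, ex, ey))
    norm_x = math.sqrt(sum(a * a / p for a, p in zip(xt, ex)))
    norm_y = math.sqrt(sum(b * b / q for b, q in zip(yt, ey)))
    return 0.5 * (1.0 - num / (norm_x * norm_y))

def d2_shepp(x, y, ex, ey):
    """d2S-type dissimilarity: centered counts are scaled by
    sqrt(xt^2 + yt^2) instead of the expected counts (assumed form)."""
    xt = [a - e for a, e in zip(x, ex)]
    yt = [b - e for b, e in zip(y, ey)]
    scale = [math.sqrt(a * a + b * b) for a, b in zip(xt, yt)]
    terms = [(a, b, s) for a, b, s in zip(xt, yt, scale) if s > 0.0]
    num = sum(a * b / s for a, b, s in terms)
    norm_x = math.sqrt(sum(a * a / s for a, _, s in terms))
    norm_y = math.sqrt(sum(b * b / s for _, b, s in terms))
    return 0.5 * (1.0 - num / (norm_x * norm_y))

# Toy k-tuple counts with a shared (hypothetical) expectation vector.
expected = [10.0, 10.0, 10.0, 10.0]
sample_a = [12.0, 8.0, 11.0, 9.0]
sample_b = [8.0, 12.0, 9.0, 11.0]   # deviations opposite to sample_a
same_star = d2_star(sample_a, sample_a, expected, expected)
opposite_star = d2_star(sample_a, sample_b, expected, expected)
same_shepp = d2_shepp(sample_a, sample_a, expected, expected)
opposite_shepp = d2_shepp(sample_a, sample_b, expected, expected)
```

Both quantities range from 0 (identical deviation patterns) to 1 (exactly opposite deviation patterns). With reads binning, the counts and expectations would be obtained per bin before being combined, a step this sketch leaves out.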

There are several limitations of the current study. First, the performance of the new $d_2^S$ and $d_2^*$ measures depends on the number of bins for the reads. In this study, we let the number of bins be 1 to 5 and found that the optimal number of bins for the reads is between 3 and 5 in both simulation and real studies. In practice, we suggest setting the number of bins for the reads as 4. More studies are needed to see if this conclusion is robust for most comparative studies of metagenomic datasets. Second, the tuple size *k* can markedly impact the performance of the new $d_2^S$ and $d_2^*$ measures, and the optimal range of *k* can increase with sequencing depth. In general, the tuple size from 6 to 9 can give reasonable results. Third, the optimal range of Markov order is between 3 and 4 in most of our studies. Finally, $d_2^S$ and $d_2^*$ have similar performance, but $d_2^S$ slightly outperforms $d_2^*$ in most studied scenarios. This result is consistent with the finding that the old version of $d_2^S$ slightly outperforms the old version of $d_2^*$ without reads binning.

TABLE 5 | The triplet distance between the reference and the clustering trees for the 16 soil metagenomic samples from three ecologically distinct groups using various reads binning methods with tuple size k = 5–9 and background sequence Markov order from 0 to 4.

The two lowest triplet scores are in boldface

In this study, we focused on the comparison of metagenomic samples using alignment-free methods with reads binning. However, compared to alignment-based methods, which map the reads to known genome or pathway databases and then compare the genome and pathway abundance profiles, alignment-free methods cannot give insights about the genomes and pathways responsible for the differences. From this perspective, alignment-free and alignment-based methods for metagenome comparison complement each other and should be used together to understand the dynamics of microbial communities.

#### REFERENCES

Ahlgren, N. A., Ren, J., Lu, Y. Y., Fuhrman, J. A., and Sun, F. Z. (2017). Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. *Nucleic Acids Res.* 45, 39–53. doi: 10.1093/nar/gkw1002

FIGURE 7 | The best clustering tree for the 16 soil metagenomic samples from three ecologically distinct groups based on the newly developed dissimilarity measure $d_2^S$ coupled with tuple size k = 6 and background sequence Markov order = 4. Red squares: polar desert samples; blue squares: hot desert samples; green squares: forest samples.

## AUTHOR CONTRIBUTIONS

KS and FS conceived of the project and developed the methods. KS and JR performed the computations. All authors discussed the results and contributed to the final manuscript.

# FUNDING

The research was supported by the National Natural Science Foundation of China (11701546), U.S. National Institutes of Health (R01GM120624), and National Science Foundation (DMS-1518001).

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.01156/full#supplementary-material




**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Song, Ren and Sun. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# An Information-Based Approach for Mediation Analysis on High-Dimensional Metagenomic Data

Kyle M. Carter<sup>1</sup>, Meng Lu<sup>1</sup>, Hongmei Jiang<sup>2</sup> and Lingling An<sup>1,3,4</sup>\*

<sup>1</sup> Interdisciplinary Program in Statistics and Data Science, The University of Arizona, Tucson, AZ, United States, <sup>2</sup> Department of Statistics, Northwestern University, Evanston, IL, United States, <sup>3</sup> Department of Epidemiology and Biostatistics, The University of Arizona, Tucson, AZ, United States, <sup>4</sup> Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States

The human microbiome plays a critical role in the development of gut-related illnesses such as inflammatory bowel disease and clinical pouchitis. A mediation model can be used to describe the interaction between host gene expression, the gut microbiome, and clinical/health status (e.g., diseased or not, inflammation level) and may provide insights into underlying disease mechanisms. Current mediation regression methodology cannot adequately model high-dimensional exposures and mediators or mixed data types. Additionally, regression-based mediation models require assumptions about the model parameters, and the relationships are usually assumed to be linear and additive. With the microbiome as the mediator, these assumptions are violated. We propose two novel nonparametric procedures utilizing information theory to detect significant mediation effects with high-dimensional exposures and mediators and varying data types while avoiding standard regression assumptions. Compared with available methods through comprehensive simulation studies, the proposed method shows higher power and lower error. The method is also applied to clinical pouchitis data, where interesting results are obtained.

Keywords: high-dimension, mediation analysis, information, nonparametric, microbiome, host genome

#### Edited by:

Mariza De Andrade, Mayo Clinic, United States

#### Reviewed by:

Jaya M. Satagopan, Rutgers, The State University of New Jersey, United States

Brandon Jason Coombes, Mayo Clinic, United States

\*Correspondence: Lingling An, anling@email.arizona.edu

#### Specialty section:

This article was submitted to Statistical Genetics and Methodology, a section of the journal Frontiers in Genetics

Received: 23 July 2019; Accepted: 10 February 2020; Published: 13 March 2020

#### Citation:

Carter KM, Lu M, Jiang H and An L (2020) An Information-Based Approach for Mediation Analysis on High-Dimensional Metagenomic Data. Front. Genet. 11:148. doi: 10.3389/fgene.2020.00148

# INTRODUCTION

Humans maintain a close symbiotic relationship with trillions of microorganisms that live upon and within their bodies. The human body relies on assorted communities of microbes to develop bodily functions such as metabolism and immune response as well as to protect the body from infections from harmful pathogens. Researchers have begun to recognize the importance of the interactions between host and microbiota and how they may impact human health. In particular, studying this interaction has become a key topic in numerous fields of research such as immunology (Rogers and Wesselingh, 2016; Rooks and Garret, 2016), oncology (Taur and Parmer, 2016), and metabolomics (Rostami et al., 2015; Galla et al., 2017; Kurilshikov et al., 2017). The current Integrative Human Microbiome Project (IHMP) aims to record behavior over time for host biology and the metagenome for the onset of Inflammatory Bowel Disease and Type 2 Diabetes as well as for neonatal development. With progressively more data available, a growing research interest has emerged for integrative analysis of multiple omics data, for example, host transcriptome and human microbiome data.

One popular approach for integrating multiple omics datasets is mediation analysis. A mediation model aims to extract the mechanisms by which an exposure impacts the outcome variable by considering a set of potential variables which may mediate the effect. Identifying these mechanisms is a vital step in developing effective medication and therapy as well. In particular, the microbial community could be easier to manipulate using antibiotics and probiotics.

Simple mediation models with only one exposure and one mediator have been widely used in psychology for several decades (MacKinnon et al., 2006; Agler and De Boeck, 2017), with most recent notable development focused on models with multiple mediator variables (Daniel et al., 2015). However, the application of mediation models for biological data has introduced additional challenges, including the difficulty of incorporating multiple, high dimensional omics datasets with varying data structures. In this research, we aim to develop a nonparametric framework for mediation analysis to avoid the assumptions and pitfalls of current mediation models.

#### MATERIALS AND METHODS

#### Background

A simple mediation model aims to explain the mechanisms that underlie the relationship between an exposure variable (X) and a response variable (Y) by considering a third variable, the mediator (M), which may mediate the effect of the exposure on the response (Figure 1). The total effect of the exposure variable can be decomposed into the direct effect, which is the effect from exposure to response directly, and the indirect effect, which is the effect of the exposure that is mediated by the mediator variable.

A mediation model is most commonly examined parametrically utilizing a linear structural equation model (LSEM):

$$Y = \gamma^\prime X + \varepsilon \tag{1}$$

$$M = \alpha X + \varepsilon_M \tag{2}$$

$$\begin{aligned} Y &= \gamma X + \beta M + \varepsilon_Y \\ &= (\gamma + \alpha \beta)X + \beta \varepsilon_M + \varepsilon_Y \end{aligned} \tag{3}$$

where γ′ and γ represent the total effect and direct effect, respectively. Baron and Kenny (1986) proposed to detect whether an indirect effect exists by testing either the product αβ = 0 or the difference between the total and direct effects γ′ − γ = 0. In addition to the traditional mediation assumptions of causal direction (i.e., additive effects and no unmeasured confounders or sequential confounders) (MacKinnon et al., 2006; Vanderwheele and Vansteelandt, 2014; Preacher, 2015), the LSEM approach requires standard regression assumptions such as linearity, no collinearity, known link function, exponential distribution of the error term, and sample size larger than parameter space. While the LSEM structure has seen widespread use and success in psychology applications where mediation analysis includes a single mediator and continuous exposure variables, many of these assumptions are violated in the context of genomics and metagenomics studies with count data.
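The equivalence of the two tests mentioned above (product αβ versus difference γ′ − γ) can be checked numerically. The following is a minimal sketch, assuming a single exposure, a single mediator, and ordinary least squares fits with intercepts; the simulated coefficients (0.8, 0.3, 0.5), the sample size, and the helper names `slope` and `two_predictor_slopes` are illustrative choices, not values or functions from the study.

```python
import random

def slope(x, y):
    """OLS slope of y on x (model with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_predictor_slopes(x, m, y):
    """OLS slopes of y on (x, m) with intercept, via the 2x2 normal equations."""
    n = len(y)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    smm = sum((b - mm) ** 2 for b in m)
    sxm = sum((a - mx) * (b - mm) for a, b in zip(x, m))
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    smy = sum((b - mm) * (c - my) for b, c in zip(m, y))
    det = sxx * smm - sxm ** 2
    gamma = (smm * sxy - sxm * smy) / det  # direct effect of X on Y
    beta = (sxx * smy - sxm * sxy) / det   # effect of M on Y given X
    return gamma, beta

# Hypothetical data following Eqs. (2) and (3); coefficients are arbitrary.
rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
m = [0.8 * xi + rng.gauss(0, 1) for xi in x]
y = [0.3 * xi + 0.5 * mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

gamma_total = slope(x, y)                            # gamma' in Eq. (1)
alpha = slope(x, m)                                  # alpha in Eq. (2)
gamma_direct, beta = two_predictor_slopes(x, m, y)   # gamma, beta in Eq. (3)

indirect_product = alpha * beta
indirect_difference = gamma_total - gamma_direct
print(indirect_product, indirect_difference)
```

For least squares fits on the same sample, the two quantities agree up to floating-point error, which is why either one can serve as the indirect-effect estimate in the linear setting.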

In response to these challenges, new statistical methods have been developed in the last few decades in an attempt to apply mediation modeling approaches to neural and biological data. Boca et al. (2014) constructed a distribution of the correlation between parameters by permuting the outcome in each of the LSEM equations. Huang and Pan (2015) developed a Monte Carlo procedure to evaluate the mediation effect of high-dimensional continuous mediators. Huang et al. (2015) performed an omnibus test by comparing L1 normalized terms from three logistic regression models based on the structural equation model. Kim et al. (2016) and Nguyen et al. (2016) utilized binary exposure to generate natural direct and indirect effect measures via expectation differences. Zhang et al. (2016) used minimax concave penalty regularized logistic regression models to estimate the β effect (in Eq. 3). Recently, Sohn and Li (2019) proposed a causal composition mediation model (CCMM) specifically for microbiome mediators, which utilized a bootstrap covariance matrix to perform log-contrast compositional regression. While these approaches may avoid concerns associated with the n ≪ p paradigm (i.e., the sample size is smaller than the parameter space), they often require a single exposure variable and a linear relationship between parameters. Many additionally enforce certain data types, such as binary exposures or continuous responses.

In this research, we aim to evaluate the presence of indirect effects by developing a nonparametric framework based on information transfer. While applications of information theory in a biological context have been rare, they have achieved some success in feature selection for gene expression data (Meyer et al., 2008; Radovic et al., 2017). Recent advances in this field include alternatives for finding the relative contribution of variables using entropy methods. Radovic et al. (2017) approached this problem by introducing a penalty term for mutual information shared between selected variables. Liu et al. (2016) assigned a measure of feature quality by comparing the conditional information of a variable on an outcome, conditioned upon k-nearest-neighbor variables. With information-based methods, there is no need to assume underlying distributions or data types for the genomic/metagenomic data or the response variable (e.g., clinical outcome), and nonlinear or non-additive relationships between variables can be explored.

#### Methods

Recent research has discovered that the abundance and diversity of the microbiome have an impact on the expression of human genes (Blekhman et al., 2015; Bonder et al., 2016; Davenport, 2017). In this study, we will focus on treating microbes as mediators for host genes. However, the proposed method itself is very general and can be applied in other types of studies, e.g., genomic or epigenomic study, or even studies in other fields.

To discover which microbial taxa mediate the effect of gene expression on a clinical outcome, we propose a nonparametric framework based on information theory feature reduction techniques, termed Nonparametric Entropy Mediation (NPEM). Information theory compares joint distributions of two or more variables with the marginal distributions of subsets to measure association between variables. This can capture nonlinear and non-additive associations by observing changes in the distribution of the outcome, in contrast to distance-based and regression modeling approaches, which can only capture linear association with the outcome (Roulston, 1999). The information can be measured using Shannon Entropy and Mutual Information (MI) (Shannon, 1949). Shannon entropy represents the uncertainty, or potential information, of a discrete random variable or random vector, and is defined as the amount of information produced by a stochastic process:

$$H(X) = -\sum_{x \in X} p(x) \log p(x), \tag{4}$$

where p(x) represents the probability of observing X = x (if the variable is continuous, the summation over the domain of events is replaced by the integral of the continuous density function over its domain). Shannon entropy of a multivariate process between two variables X and Y can be calculated using joint Shannon entropy:

$$H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y), \tag{5}$$

where p(x,y) represents the probability of observing X = x and Y = y (note: the notations X and Y here are just two common variables, different from the notations in the LSEM in Background).

Mutual information (MI) is defined as the overlap of information produced by multiple stochastic processes:

$$\begin{aligned} MI(X,Y) &= H(Y) + H(X) - H(X,Y) \\ &= \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}. \end{aligned} \tag{6}$$

Mutual information can be used as a measure of dependency between the variables in a multivariate stochastic process. If the included variables are independent, the information metric is zero.
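Equations (4) to (6) can be illustrated with a small discrete example. The sketch below assumes the joint distribution is already known and given as a probability table (in the study it is estimated from data); the two 2×2 tables and the function names are hypothetical and only serve to show that independence gives zero mutual information.

```python
import math

def entropy(pmf):
    """Shannon entropy (natural log) of a probability vector, Eq. (4)."""
    return -sum(p * math.log(p) for p in pmf if p > 0)

def joint_entropy(joint):
    """Joint Shannon entropy of a 2-D probability table, Eq. (5)."""
    return -sum(p * math.log(p) for row in joint for p in row if p > 0)

def mutual_information(joint):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y), Eq. (6)."""
    px = [sum(row) for row in joint]          # marginal of X (rows)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (columns)
    return entropy(px) + entropy(py) - joint_entropy(joint)

# Hypothetical joint tables: rows index X, columns index Y.
independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x, y) = p(x) p(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]         # Y is fully determined by X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

In the fully dependent table the mutual information equals the entropy of either variable (log 2 in natural-log units), which is the largest value it can take for two binary variables.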

To capture the unique mutual information from a variable X, we additionally define the contributed information to be the mutual information of one variable given a set of measured variables (W):

$$C(X, Y, \mathbf{W}) = MI(X, Y) - \sum_{w \in \mathbf{W}} \frac{MI(X, w)}{\|\mathbf{W}\|^{2}} \tag{7}$$

To investigate the mediated relationship between host gene expression and a clinical outcome, we propose to construct the mediation model as a multivariate stochastic process generating the set of I genes (X = {X<sub>1</sub>,…,X<sub>I</sub>}), the set of J microbial taxa (M = {M<sub>1</sub>,…,M<sub>J</sub>}), and a clinical outcome Y (throughout the text of this paper we use bold symbols to represent sets of variables). If we maintain the causal direction and no intermediary confounding assumptions, we can examine the relationship between variables using the mutual information between variables from the stochastic processes. To mimic the current LSEM structure, we define γ′ as a label of the relationship between X and Y, α as the relationship between X and M, and β as the relationship between M and Y when X is also included. Thus, we use these labels to represent the relationships between the variables based on information theory in Figure 2.

Consider the β effect from M to Y as the overlap in information contained by M and Y; it can then be decomposed into β<sub>1</sub>, representing the overlap of α and β, and β<sub>2</sub>, representing the unique information from M as shown in Figure 2, such that β = β<sub>1</sub> + β<sub>2</sub>. Note that β<sub>2</sub> represents the value βε<sub>M</sub> in equation (3). If β<sub>2</sub> ≠ 0, then it follows that β ≠ 0. Consider two possible outcomes when β<sub>2</sub> = 0: 1) if β<sub>1</sub> = 0 and β<sub>2</sub> = 0, then M does not offer any information about Y and there is no mediation effect. This is equivalent to β = 0 and by extension αβ = 0 in the LSEM framework; 2) if β<sub>1</sub> ≠ 0 and β<sub>2</sub> = 0, all information M provides about Y is also contained in X. Due to perfect collinearity, no conclusion can be drawn about the existence of mediation effects. For the purposes of our study, we will consider this scenario as not a mediation effect. Thus, the overlap of all variables is not sufficient and any scenario where β<sub>2</sub> = 0 would not be considered a mediation effect. The existence of mediation effects can be captured by measuring α and β<sub>2</sub>. The two relationships α and β<sub>2</sub> as shown in Figure 2 can be expressed in terms of mutual information as MI(X, M) and MI(M, Y), respectively.

In order to capture the effect of each gene or each taxon individually, we additionally consider collinearity between the variables. We will use contributed information to measure the relationship between gene i and taxon j, α<sub>i,j</sub>, as C(X<sub>i</sub>, M<sub>j</sub>, S), and the relationship between taxon j and the response Y (for the purpose of explanation we use one clinical response variable), β<sub>2</sub>, as C(M<sub>j</sub>, Y, T), where S and T represent a subset of other genes and other microbial taxa, respectively.

To non-parametrically estimate the mutual information and contributed information metrics, we employ kernel density estimation to approximate the distribution of each variable or a set of variables. To allow for varying data types in a joint distribution, we employ kernel product estimation developed by Li and Racine (2003). The choice of kernel will depend on the structure of the data. For continuous data, the distribution will be approximated using a second order Gaussian kernel, which is a common choice due to its smoothness and an ideal choice when integration is required. Distributions of discrete data will be approximated using an Aitchison-Aitken kernel to handle discrete entry frequencies. To avoid overfitting, bandwidths for kernels are approximated using Silverman's Rule of Thumb (Silverman, 1986). To get an accurate density estimator we only need to know the data type but not the shape.
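The kernel choices described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the normal-reference form of Silverman's rule (h = 1.06 · sd · n^(−1/5)), a hand-picked smoothing parameter `lam` for the Aitchison-Aitken kernel, and a simple product of one continuous and one discrete kernel in the spirit of Li and Racine (2003). The toy data and all function names are hypothetical.

```python
import math
import statistics

def silverman_bandwidth(xs):
    """Normal-reference rule of thumb: h = 1.06 * sd * n^(-1/5) (assumed variant)."""
    return 1.06 * statistics.stdev(xs) * len(xs) ** (-1 / 5)

def gaussian_kernel(u):
    """Second-order Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def aitchison_aitken(z, zi, n_categories, lam):
    """Aitchison-Aitken kernel: 1 - lam on a match, lam / (c - 1) otherwise."""
    return 1.0 - lam if z == zi else lam / (n_categories - 1)

def gaussian_kde(xs):
    """Density estimate for continuous data."""
    h = silverman_bandwidth(xs)
    return lambda x: sum(gaussian_kernel((x - xi) / h) for xi in xs) / (len(xs) * h)

def mixed_density(xs, zs, n_categories, lam):
    """Product-kernel estimate of the joint density of (continuous x, discrete z)."""
    h = silverman_bandwidth(xs)
    def density(x, z):
        total = sum(
            gaussian_kernel((x - xi) / h) * aitchison_aitken(z, zi, n_categories, lam)
            for xi, zi in zip(xs, zs)
        )
        return total / (len(xs) * h)
    return density

# Toy data: a continuous variable and a binary (two-category) variable.
xs = [1.2, 0.7, 2.5, 1.9, 0.3, 1.1]
zs = [0, 1, 1, 0, 0, 1]
density = gaussian_kde(xs)
joint = mixed_density(xs, zs, n_categories=2, lam=0.1)
print(density(1.0), joint(1.0, 0), joint(1.0, 1))
```

Because the Aitchison-Aitken weights sum to one over the categories, summing the joint estimate over the discrete variable recovers the continuous marginal estimate, which is a convenient sanity check when mixing data types.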

In high resolution sequencing studies, limited genetic material and PCR amplification biases can lead to many OTUs (operational taxonomic units) with zero count, even when those taxa exist within a subject's gut microbiota. However, a concentration of counts at zero can lead to a problem when estimating the distribution using a Gaussian kernel density estimator. Most notably, the decreased variance can lead to smaller estimates for the kernel bandwidth. We propose two approaches for mediation testing using mutual information. In the simplest case, we use a single Gaussian kernel to estimate the distribution of OTU abundance and to calculate the contributed information. We refer to this single kernel approach as a univariate entropy measure. To better represent the microbiome data and to avoid some of the potential pitfalls of kernel density estimation, we propose a bivariate approach which decomposes the microbiome data into two parts: presence-absence represented by an Aitchison-Aitken kernel and nonzero counts represented by a Gaussian kernel. Contributed information metrics can be calculated separately for both presence-absence and nonzero counts, providing two measurements for each mediator. We refer to this two-kernel approach as a bivariate entropy measure.
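The decomposition used by the bivariate measure can be sketched as follows. This is only an illustration under the assumption that the nonzero part keeps track of which samples it came from, so that it can later be paired with the matching gene expression values; the OTU counts and the function name `decompose_counts` are hypothetical.

```python
def decompose_counts(counts):
    """Split one taxon's counts into a presence-absence part and a nonzero part."""
    presence = [1 if c > 0 else 0 for c in counts]           # Z: one value per sample
    nonzero_index = [i for i, c in enumerate(counts) if c > 0]
    nonzero_counts = [counts[i] for i in nonzero_index]      # M': only samples with c > 0
    return presence, nonzero_index, nonzero_counts

# Hypothetical OTU counts for ten samples, with many zeros.
otu = [0, 0, 12, 0, 3, 0, 0, 45, 0, 1]
presence, nonzero_index, nonzero_counts = decompose_counts(otu)
print(presence)
print(nonzero_index, nonzero_counts)
```

The presence-absence vector would be handled with the discrete kernel and the nonzero counts with the Gaussian kernel, giving two contributed information scores per mediator.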

#### Univariate Entropy Measure

When calculating mutual information, theoretically, the information metric should be zero if the variables are independent; however, finite sample sizes and bandwidth approximation for the kernel density estimates may lead to a bias in the observed information. Out of a large number of taxa in a study, generally only some of them have a mediating effect. Under this very general assumption, a vast majority of the signals observed are due to this bias effect. Therefore, we can search for information metrics which are substantially higher than the expected bias, as this indicates a true relationship between variables. For a particular taxon (j) to be a mediating taxon, there must be significant relationships from at least one gene through it to the response. Just like the regression model in Eq. (2), where all exposure variables X are included for each mediator variable M<sub>j</sub>, α<sub>i,j</sub> (representing the relationship between the exposure variable X<sub>i</sub> and mediator M<sub>j</sub>) must be evaluated across all exposures simultaneously within each fixed taxon j. For each taxon j the hypotheses are:

$$\begin{aligned} H_0 &: C(X_i, M_j, \mathbf{S}) \le \varphi_{\alpha,j}, \ \forall\, i \in \{1, \dots, I\} \quad \text{OR} \quad C(M_j, Y, \mathbf{T}) \le \varphi_{\beta_2} \\ H_a &: \exists\, i \in \{1, \dots, I\}: C(X_i, M_j, \mathbf{S}) > \varphi_{\alpha,j} \quad \& \quad C(M_j, Y, \mathbf{T}) > \varphi_{\beta_2} \end{aligned}$$

The parameters φ<sub>α,j</sub> and φ<sub>β2</sub> represent the expected bias for contributed information with a fixed taxon j and with Y, respectively. Since the mutual information score should be zero for independent random variables, the bias terms φ<sub>α,j</sub> and φ<sub>β2</sub> are conservatively estimated as the mean contributed information scores over the currently unselected genes (for a fixed taxon j) and over the currently unselected taxa, respectively, as defined below:

$$\varphi_{\alpha,j} = \sum_{X_i \in (\mathbf{X} - \mathbf{S})} \frac{C(X_i, M_j, \mathbf{S})}{\|\mathbf{X} - \mathbf{S}\|} \tag{8}$$

$$\varphi_{\beta_2} = \sum_{M_j \in (\mathbf{M} - \mathbf{T})} \frac{C(M_j, Y, \mathbf{T})}{\|\mathbf{M} - \mathbf{T}\|} \tag{9}$$

where X − S represents the set of genes which are currently unselected and M − T represents the set of OTUs which are currently unselected. For our definition, both the contributed information and the expected bias depend on the components of set S or T. We propose to iteratively select the best predictive genes or taxa based on their contributed information and update S or T respectively after each selection by using a greedy search algorithm. Under this paradigm, we compare the largest contributed information to the average contributed information as defined in equations (8) and (9). This lends itself naturally to outlier detection tests which compare the maximum value to the mean for potential outlier points. Since there could be multiple features which contain true contributed information signals, we opt to use an iterative one-sided Extreme Studentized Deviate (ESD) test (Grubbs, 1950), which was developed for unusually high value detection. We evaluate a series of G statistics (Grubbs, 1950) as follows:

$$G = \frac{C_{(1)}(\ldots) - \overline{C(\ldots)}}{\mathrm{sd}(C(\ldots))}$$

where C<sub>(1)</sub> represents the highest contributed information to be compared, either for the relation between taxon (j) and genes, or for the relation between the outcome and taxa. The overlined C(…) stands for the average of the contributed information and sd represents the standard deviation. Under the null hypothesis, the G statistic follows a central t-distribution with df − 2 degrees of freedom, where df represents the number of remaining unselected features. However, since the contributed information could change at each step, there is still uncertainty about when the hypothesis test should be performed. We propose Algorithm 1, which performs the hypothesis test at each iteration of the greedy search algorithm (NPEM:UV). To be specific, at each step of the algorithm, the contributed information from each gene to a fixed taxon or from each taxon to the clinical outcome is re-evaluated to identify the most informative feature. The highest value of contributed information is recorded, the hypothesis test is performed, and the selected feature is removed from the set of explanatory variables and added to the set of priors S or T. A modified version which performs the hypothesis test after the completion of the greedy search is provided in the Supplementary File as Algorithm 1′ (NPEM:UVS). The details and trade-offs of each algorithm are elaborated in the Supplementary File.

ALGORITHM 1 | Non-Parametric Entropy Mediation: Univariate Test (NPEM:UV).

Input: A = {A<sub>1</sub>, A<sub>2</sub>, …, A<sub>K</sub>}: Set of explanatory variables; B: Response variable

1. Initialize an empty set W.

2. Evaluate Contributed Information C<sub>i</sub> = C(A<sub>i</sub>, B, W) for each A<sub>i</sub> which is not in W. When W is empty, C<sub>i</sub> = MI(A<sub>i</sub>, B).

3. Let C denote the vector of the C<sub>i</sub> values, and C<sub>(1)</sub> denote the largest Contributed Information.

4. Calculate Grubbs' ESD Test Statistic: G = (C<sub>(1)</sub> − C̄) / sd(C), where C̄ is the average value and sd represents standard deviation.

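5. Perform significance test with the distribution t<sub>df−2</sub> to obtain p-value, where df is the length of C.

6. If the p-value is below a threshold (e.g., 0.05), move the variable A<sub>(1)</sub> corresponding to the largest Contributed Information C<sub>(1)</sub> into set W.

7. Repeat steps 2 through 6 until a specified threshold is reached (e.g., 0.05) or until two or fewer variables remain.

8. For the variables which do not belong to W, assign the p-value to be 1.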

9. For each response variable, apply FDR correction (Benjamini and Hochberg, 1995) to the p-values of all explanatory variables.
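The greedy search with a test at every iteration can be sketched as below. This is a minimal sketch under stated assumptions rather than the authors' code: the contributed information is supplied by the caller through a hypothetical `contributed_info(name, selected)` callable, the p-value uses the t distribution with df − 2 degrees of freedom as stated in the text, the stopping rule (stop when the p-value exceeds the threshold or when two or fewer variables remain) follows the wording of Algorithm 2 and is an assumption here, and the FDR step is left out. The toy scores are invented for illustration.

```python
import math
import statistics

def t_sf(t, df, steps=2000):
    """Upper-tail probability of Student's t, by Simpson's rule (stdlib only)."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: const * (1 + u * u / df) ** (-(df + 1) / 2)
    b = abs(t)
    h = b / steps
    total = pdf(0.0) + pdf(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    upper = max(0.0, 0.5 - total * h / 3)   # area from |t| to infinity
    return upper if t >= 0 else 1.0 - upper

def npem_uv(names, contributed_info, threshold=0.05):
    """Greedy selection with an ESD-style test at each iteration.

    Returns a dict of raw p-values; variables never selected keep p = 1.
    """
    selected = []
    remaining = list(names)
    pvalues = {name: 1.0 for name in names}
    while len(remaining) > 2:
        scores = {a: contributed_info(a, selected) for a in remaining}
        values = list(scores.values())
        sd = statistics.stdev(values)
        if sd == 0:
            break
        best = max(scores, key=scores.get)
        g = (scores[best] - statistics.mean(values)) / sd   # G statistic
        p = t_sf(g, len(values) - 2)
        if p >= threshold:
            break
        pvalues[best] = p
        selected.append(best)       # move the variable into W
        remaining.remove(best)
    return pvalues

# Toy scores (hypothetical): one strong signal among background bias.
toy_scores = {"g1": 3.0, "g2": 0.10, "g3": 0.12, "g4": 0.08, "g5": 0.11,
              "g6": 0.09, "g7": 0.10, "g8": 0.13, "g9": 0.07, "g10": 0.10}
# This toy callable ignores the selected set; a real one would apply Eq. (7).
pvalues = npem_uv(list(toy_scores), lambda name, selected: toy_scores[name])
print(pvalues)
```

In this toy run only the first variable stands out from the background, so the search stops at the second iteration and every other variable keeps a p-value of 1.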

This algorithm is general and can be applied to evaluate the significance of all α and β<sub>2</sub> relationships defined in Methods. For the α relationship, A is the full gene set X and B is an individual microbial taxon (M<sub>j</sub>), and the resulting p-value p<sub>α,j</sub> is FDR corrected. For the β<sub>2</sub> relationships, A is the set of all microbial taxa M and B is the clinical response (Y). The resulting p-value p<sub>β,j</sub> is FDR corrected. To complete the hypothesis test for mediation effects, we combine the results using the conservative measure p<sub>j</sub> = max(p<sub>α,j</sub>, p<sub>β,j</sub>), which represents the final p-value for testing the mediation effect of taxon j.
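The final combination step can be illustrated with a short sketch of the Benjamini-Hochberg adjustment followed by the maximum rule. The p-values below are invented, and how the per-gene p-values for a taxon are summarized into a single p<sub>α,j</sub> is not spelled out in this section, so one value per taxon is simply taken as given; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four taxa.
raw_alpha = [0.01, 0.04, 0.03, 0.5]     # gene-to-taxon side, one value per taxon
raw_beta = [0.02, 0.001, 0.6, 0.3]      # taxon-to-outcome side

p_alpha = benjamini_hochberg(raw_alpha)
p_beta = benjamini_hochberg(raw_beta)

# Conservative combination: a taxon is a mediator only if both sides are significant.
p_final = [max(a, b) for a, b in zip(p_alpha, p_beta)]
print(p_final)
```

Taking the maximum means a taxon is only reported when both the gene-to-taxon and the taxon-to-outcome relationships pass the threshold, which matches the "OR" structure of the null hypothesis.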

#### Bivariate Entropy Measure

When we represent the abundance of each microbial taxon by decomposing the feature into presence-absence and nonzero counts, the contributed information can be calculated for both presence-absence and nonzero counts individually. Our final decision will leverage both contributed information scores. To test whether a relationship is significant or not, we propose a general hypothesis as follows:

$$H_0: \left\| \overrightarrow{C} \right\| \le \varphi \quad \text{vs.} \quad H_a: \left\| \overrightarrow{C} \right\| > \varphi$$

where ‖C⃗‖ represents any norm or distance metric for the vector C⃗ of the two contributed information metrics from the presence-absence and nonzero counts. To account for the difference in scale and correlation between presence-absence and nonzero counts, we will utilize the Mahalanobis distance (Mahalanobis, 1936):

$$MD\left(\overrightarrow{C}\right) = \sqrt{\left(\overrightarrow{C} - \overrightarrow{\mu}\right)' \Sigma^{-1} \left(\overrightarrow{C} - \overrightarrow{\mu}\right)}$$

where μ⃗ represents the vector of means for C⃗ and Σ represents the covariance of the two contributed information scores in C⃗. The Mahalanobis distance is a distance metric which projects data along its principal components. Each axis is re-scaled to ensure a mean value of zero and a variance of 1. By projecting the two contributed information scores onto their principal components, we no longer need to consider the correlation between scores. We can now rewrite our hypothesis using the distance from the expected bias:

$$H\_0: MD\left(\overrightarrow{\mathcal{C}}\right) \leq \varphi \text{ vs. } H\_a: MD\left(\overrightarrow{\mathcal{C}}\right) > \varphi$$

As in the univariate case (i.e., without separating the zero and nonzero counts for each taxon) in Univariate Entropy Measure, for a particular taxon to be a mediating taxon, there must be a significant mediation structure or bridge from at least one gene and then through the taxon to the clinical response. For each fixed taxon j, the hypotheses are as follows:

$$\begin{aligned} H_0 &: MD\left(\overrightarrow{C_{\alpha; i,j}}\right) \le \varphi_{\alpha,j}, \ \forall\, i \in \{1, \ldots, I\} \quad \text{OR} \quad MD\left(\overrightarrow{C_{\beta_2, j}}\right) \le \varphi_{\beta_2} \\ H_a &: \exists\, i \in \{1, \ldots, I\}: MD\left(\overrightarrow{C_{\alpha; i,j}}\right) > \varphi_{\alpha,j} \quad \& \quad MD\left(\overrightarrow{C_{\beta_2, j}}\right) > \varphi_{\beta_2} \end{aligned}$$

Since the Mahalanobis projection has two dimensions (i.e., for zero and nonzero parts), we compare the Mahalanobis distance to the Chi-Square distribution with 2 degrees of freedom to identify unusually high contributed information values (De Maesschalck et al., 2000). We provide Algorithm 2 below, which performs the hypothesis test at each iteration of the greedy search algorithm (termed NPEM:BV). A modified version which performs the hypothesis test after the greedy search algorithm has completed is provided in the Supplementary File as Algorithm 2′ (NPEM:BVS). The algorithm follows the same logic as the univariate case, except that we evaluate the contributed information twice, once for the presence-absence data and once for the nonzero counts data, with the most informative feature being decided by the largest Mahalanobis distance. The details for obtaining the final p-values are the same as for the univariate test approach.
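The distance and its reference distribution can be sketched for the two-dimensional case. The sketch assumes the sample mean and sample covariance of the score pairs are used for μ⃗ and Σ, and it applies the Chi-Square survival function to the squared distance, which is the standard result for Gaussian data; whether the distance is squared before the comparison is not stated in the text, so that step is an assumption. The score pairs and function names are hypothetical.

```python
import math

def mean_and_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, mu, cov):
    """Mahalanobis distance of a 2-D point, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)

def chi2_sf_2df(x):
    """Survival function of the Chi-Square distribution with 2 degrees of freedom."""
    return math.exp(-x / 2)

# Hypothetical (presence-absence, nonzero-count) contributed information pairs.
scores = [(0.10, 0.21), (0.12, 0.18), (0.09, 0.22), (0.11, 0.19),
          (0.10, 0.20), (0.13, 0.23), (0.08, 0.17), (0.45, 0.60)]
mu, cov = mean_and_cov(scores)
distances = [mahalanobis(s, mu, cov) for s in scores]
p_values = [chi2_sf_2df(d ** 2) for d in distances]   # squared distance: assumption
print(max(distances), min(p_values))
```

With two degrees of freedom the survival function has the closed form exp(−x/2), so no statistical library is needed for this step.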

#### ALGORITHM 2 | Non-Parametric Entropy Mediation: Bivariate Test (NPEM:BV).

Input: A = {A<sub>1</sub>, A<sub>2</sub>, …, A<sub>K</sub>}: Set of explanatory variables; B: Response variable

1. Initialize an empty set W.

2. For each mediator, decompose into presence-absence and nonzero count (Z, M′).

3. Evaluate Contributed Information for both parts (e.g., C⃗<sub>i</sub> = {C<sub>Z</sub> = C(A<sub>i</sub>, Z, W), C<sub>M′</sub> = C(A<sub>i</sub>, M′, W)}) for each A<sub>i</sub> which is not in W.

4. Evaluate the Mahalanobis distance for each vector of contributed information scores C⃗<sub>i</sub>.

5. Move variable A<sub>k</sub> into set W.

6. Calculate the Chi-Square Test Statistic: χ<sup>2</sup> = MD(C⃗<sub>(1)</sub>).

7. If the p-value is below a threshold (e.g., 0.05), move the variable A<sub>(1)</sub> corresponding to the largest Mahalanobis distance MD(C⃗<sub>(1)</sub>) into set W.

8. Repeat steps 3 through 7 until a specified threshold is reached (e.g., 0.05) or until two or fewer variables remain.

9. For the variables which do not belong to W, assign the p-value to be 1.

10. For each response variable, apply FDR correction to the p-values of all explanatory variables.

## Data

#### Simulation Studies

To evaluate the performance of NPEM, we compare our method to two existing methods: a nonparametric permutation test, MedTest (Boca et al., 2014), and a method developed to handle SNP count data, the Integrative Genome Wide Association Study, iGWAS (Huang et al., 2015). We simulate biological data for a dichotomous clinical outcome (e.g., healthy or diseased) under various model settings. Gene expression data were simulated for 300 genes using a normal distribution. The first 150 were generated using a standard deviation of 0.5, and the second half with 2.0. Taxon counts were generated using a negative binomial distribution with excess zeros added, with the probability of excess zeros weighted by the log ratio of abundance to population mean (see the Supplementary File). The relationships between variables are presented in Table 1 below.
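The data-generating step described above can be sketched roughly as follows. This is a simplified illustration, not the simulation used in the study: the gene means are set to zero, the negative binomial is drawn as a gamma-Poisson mixture with a placeholder `size` parameter, and the excess-zero probability is a constant rather than being weighted by the log ratio of abundance to population mean (that weighting is described in the Supplementary File and is not reproduced here). All names and parameter values are assumptions.

```python
import math
import random
import statistics

def simulate_genes(n_samples, rng, n_genes=300):
    """Normal gene expression: first half with sd 0.5, second half with sd 2.0."""
    half = n_genes // 2
    sds = [0.5] * half + [2.0] * (n_genes - half)
    return [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n_samples)]

def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for moderate means)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def zero_inflated_nb(mean, size, zero_prob, rng):
    """Negative binomial count (gamma-Poisson mixture) with constant excess zeros."""
    if rng.random() < zero_prob:
        return 0
    lam = rng.gammavariate(size, mean / size)   # shape = size, scale = mean / size
    return poisson(lam, rng)

rng = random.Random(2020)
genes = simulate_genes(80, rng)
counts = [zero_inflated_nb(mean=20.0, size=2.0, zero_prob=0.8, rng=rng)
          for _ in range(500)]
zero_fraction = counts.count(0) / len(counts)
print(len(genes), len(genes[0]), zero_fraction)
```

The constant zero probability of 0.8 mirrors the high excess-zero setting, but the mean and size values here are placeholders chosen only to keep the example small.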

TABLE 1 | Existence of relationships for combinations of gene and taxon indices. True mediation effects require γ′ (total effect), α, and β<sub>2</sub> relationships. Here taxa 1–10 are the true mediators for genes 1–20, and taxa 151–160 are the mediators for genes 151–170. The remaining taxa are not mediators.

Three separate simulation studies are performed to examine the behaviour of NPEM under different scenario settings:

i. The first study investigates the performance of different models with various sample sizes (40 and 80 per group) and excess zero probabilities at a high level (80%) or low level (50%), for a total of four data scenarios. The signal strength is fixed at 50%, which is defined as follows:

$$\text{signal strength} = \frac{\delta}{\sigma}$$

where δ represents the average difference between healthy and diseased groups and σ represents the standard deviation of the noise.


$$\kappa = \frac{c}{\sqrt{\lambda + 1}}.$$

where λ represents the mean count and the constant c is set to 1000 for high dispersion and 100 for low dispersion. We fix the sample size at 40 per group and the excess zero proportion at the high level (80%) to capture the worst case. Signal strength ranges from 10% to 50% as in simulation (ii).

For each scenario a total of 20 data sets are generated and evaluated. The results of the simulation studies are presented in Results.

#### Pouchitis Data

Pouchitis, inflammation of a post-operation ileal pouch, affects almost half of all ileal pouch-anal anastomosis recipients, with up to 20% of these patients developing chronic pouchitis. We apply NPEM to pouchitis patient data from Morgan et al. (2015), including host gene expression, microbial abundance, and clinical diagnosis, to investigate the relationship between host gene expression and the microbiome. While extensive research has shown that host gene expression and the microbiome can influence pouchitis, the causal mechanisms and interactions are not well studied, and the authors found only a weak association between host gene expression and the microbiome's effects on the clinical diagnosis.

The clinical data include samples from 219 patients with information about body location, inflammatory score, antibiotic use, and a clinical diagnosis of "No Pouchitis", "Acute Pouchitis", "Chronic Pouchitis", "Crohn's Disease-Like", or "Familial Adenomatous Polyposis". For comparison purposes, we limit our study to patients with a diagnosis of either "No Pouchitis" or "Acute Pouchitis" and no prescribed antibiotics. This results in an effective sample size of 101 patients. The gene expression data contain 33,297 genes. Transcripts were filtered to remove genes with no annotation, and a conservative log2 fold-change cut-off of 0.15 was used to trim the gene set. After filtering, 1,103 genes remained. The high-throughput next-generation sequencing microbiome abundance data recorded 293 operational taxonomic units (OTUs) at the genus level. OTUs that were absent in over 90% of patients were removed, resulting in 103 OTUs.
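The OTU prevalence filter described above can be sketched as follows. This is a minimal illustration on a toy count table, not the code used for the pouchitis analysis; the function name, the dictionary layout, and the OTU labels are hypothetical.

```python
def filter_by_prevalence(counts, max_absent_fraction=0.90):
    """Drop OTUs that are absent (zero count) in more than max_absent_fraction of patients.

    counts: dict mapping an OTU name to a list of per-patient counts.
    """
    kept = {}
    for otu, values in counts.items():
        absent_fraction = sum(1 for v in values if v == 0) / len(values)
        if absent_fraction <= max_absent_fraction:
            kept[otu] = values
    return kept

# Toy table with 20 hypothetical patients (not the pouchitis data).
toy_counts = {
    "OTU_common": [5, 0, 3, 8, 0, 2, 7, 1, 0, 4] * 2,  # absent in 30% of patients
    "OTU_border": [0] * 18 + [1, 2],                   # absent in exactly 90%
    "OTU_rare": [0] * 19 + [6],                        # absent in 95%
}
kept = filter_by_prevalence(toy_counts)
print(sorted(kept))
```

Because the rule removes OTUs absent in *over* 90% of patients, an OTU absent in exactly 90% is retained in this sketch.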

## RESULTS

#### Simulation Study Results

With a false positive rate of 5%, the NPEM algorithms have higher power than MedTest, while iGWAS fails to discover any significant mediators (Figure 3). In study (i), where the signal strength is high, the UV version of the univariate approach consistently performs best, and the UVS does not perform as well as the other NPEM algorithms. In particular, for a high proportion of zeros and a small sample size, the UV surpasses the others. As the signal strength decreases from 50% to 10% (Figure 4), the performance of this univariate test decreases, regardless of the proportion of zeros. The bivariate approach, however, maintains better performance. In particular, the single test (BVS) of the bivariate approach is the most consistent and has the highest power when the proportion of zeros in the dataset is high; for a lower proportion of zeros, the BV approach is recommended.

For the overdispersion study (i.e., setting iii), the lower the overdispersion, the higher the power (Figure 5). The UV approach always outperforms the alternatives when the signal strength is higher, regardless of the overdispersion level; the BVS is always the superior method when the signal strength is lower.

For all simulation settings and all methods, the empirical false positive rates are well controlled at the pre-specified level. For instance, under simulation setting (i) with an adjusted p-value cut-off of 0.05, the false positive rates are well controlled (Figure 6). The results for settings (ii) and (iii) are available in the Supplementary File.
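To make the evaluation criteria concrete, the sketch below computes empirical power and false positive rate at an adjusted p-value cut-off of 0.05. The Benjamini-Hochberg adjustment, the function names, and the toy p-values are assumptions for illustration; this section does not state which adjustment was used or exactly how the rates were tallied.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def power_and_fpr(pvalues, is_true_mediator, cutoff=0.05):
    """Empirical power and false positive rate at an adjusted p-value cut-off."""
    selected = [p <= cutoff for p in bh_adjust(pvalues)]
    true_pos = sum(s and t for s, t in zip(selected, is_true_mediator))
    false_pos = sum(s and not t for s, t in zip(selected, is_true_mediator))
    n_true = sum(is_true_mediator)
    n_null = len(is_true_mediator) - n_true
    return true_pos / n_true, false_pos / n_null

# Toy example: eight hypothetical taxa, the first three being true mediators.
toy_p = [0.001, 0.004, 0.03, 0.20, 0.45, 0.60, 0.75, 0.90]
truth = [True, True, True, False, False, False, False, False]
power, fpr = power_and_fpr(toy_p, truth)
print(f"power = {power:.2f}, false positive rate = {fpr:.2f}")
```

In a simulation study these two quantities would be averaged over the generated data sets for each scenario.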

#### Pouchitis Study Results

Because the pouchitis OTU data have zero proportions ranging from 20% to 90%, a moderate sample size, and small expected signals, we applied the proposed BVS approach to this dataset. Six mediating OTUs were detected at the 5% FDR level, and the corresponding genera are summarized in Table 2. To visualize the relationship between the detected genera and their significantly associated genes, a network plot using the significant relationships identified by NPEM:BVS is provided in Figure 7.

TABLE 2 | Top 6 selected genera with adjusted P-values from the NPEM:BVS algorithm.

While research on how bacteria impact the body is still ongoing, the selected microbial genera are well known to be related to intestinal health. Fusobacterium and Stenotrophomonas are well known to be pro-inflammatory (Sasaki and Klapproth, 2012; Shaw et al., 2016), while Propionibacterium has recently been found to regulate the inflammatory response (Ple et al., 2015; Colliou et al., 2017). Fusobacterium and Adlercreutzia have also been found to relate to the health of the host mucosal wall (Shaw et al., 2016). Degraded mucosal walls may lead to a greater risk of infection due to bacteria growing in the folds of the intestinal wall. Scardovia and Spirochaeta have commonly been found to be associated with ulcerative and ischemic colitis (Lee et al., 1971; Sasaki and Klapproth, 2012; Xun et al., 2018), two of the primary diseases resulting in ileal pouch-anal anastomosis. Though the exact mechanisms are not yet understood, these selections are consistent with existing findings and suggest that further research is necessary.

Among the selected genes, we see a few distinctive patterns. A number of genes, particularly those related to Scardovia and Stenotrophomonas, are located only on the Y chromosome. Patient gender was not included in the provided metadata, so we were not able to test whether this effect is related to gender or to the specific genes. Many selected genes are in the Caspase (CASP) or Small Nucleolar RNA, C/D Box (SNORD) groups. CASP genes regulate the inflammatory response (Scott and Saleh, 2007), as expected. The SNORD gene group regulates the expression of other gene groups. In particular, recent research has found a correlation between SNORD-116 segments and gut metabolism (Qi et al., 2016). These genes may be prime candidates for future research.

## DISCUSSION

In this paper, we propose nonparametric entropy models to discover significant mediation structures for microbial mediators. This method is flexible and capable of handling continuous, discrete, and mixed data types for any variable in the model. Though we only discuss continuous and categorical data here, ordinal data may be used in the model by applying a modified Wang-van Ryzin kernel as proposed by Li and Racine (2003) or any other appropriate kernel type. Through simulation studies, we have shown that NPEM outperforms the existing nonparametric test and the count-based regression model. In application, our method identifies unique mediation structures, undiscovered in the original report, relating inflammatory bacteria to host gut health.

The performance of NPEM depends on the data characteristics and the selected test statistic. The signal strength in the data is the largest factor separating the performance of the univariate and bivariate options. The bivariate single test (BVS) method is recommended when the signal is weak. Regarding test statistic selection, the poor performance of a single Grubbs' test is expected: Grubbs' test is designed to detect one outlier at a time and thus requires sequential selection. Comparison between the bivariate chi-square tests is not straightforward, since the correlation structure is re-evaluated at each step of the sequential selection algorithm. The proportion of zeros in the data also affects the test selection. When the excess-zero proportion is high, a single test performs better than a sequential test. It is important to recognize that the Mahalanobis distance metric does not consider directionality, so unusually low signals may also be selected. A detailed check may be helpful when the noise signals are large.
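The point about directionality can be illustrated with a small sketch. Assuming a bivariate statistic with a known mean and covariance and an approximately bivariate normal null distribution (assumptions made here for illustration; the construction of the NPEM statistics is not restated in this section), the squared Mahalanobis distance is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exp(-x/2). The function names and numerical values below are hypothetical.

```python
import math

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance of a 2-D point, with cov a 2x2 matrix."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Quadratic form with the closed-form inverse of a 2x2 matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square distribution with 2 degrees of freedom."""
    return math.exp(-x / 2.0)

mean = (0.0, 0.0)
cov = ((1.0, 0.5), (0.5, 1.0))           # hypothetical null covariance
high = (3.0, 3.0)                        # unusually high on both statistics
low = (-3.0, -3.0)                       # unusually low on both statistics

for label, point in (("high", high), ("low", low)):
    d2 = mahalanobis_sq(point, mean, cov)
    print(f"{label}: distance^2 = {d2:.2f}, p = {chi2_sf_2df(d2):.4f}")
```

Both points receive the same distance and the same p-value, which is why a candidate flagged by the distance alone may need a follow-up check of the sign of its deviation.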

We also attempted the alternative causal compositional mediation model, CCMM (Sohn and Li, 2019); however, due to the high proportion of zeros and the large number of taxonomic units in our experiment, the CCMM algorithm failed to converge. In toy data experiments with no zero counts, CCMM displays higher power in detecting mediating taxa, but it produces much higher false positive rates for associations between host gene expression and taxonomic abundance because the method does not correct for correlation between exposures. The NPEM methods perform much better at detecting the correct associations for this particular path a. CCMM is proposed for a continuous response, though theoretically a logit link function could handle a binary response.

The performance of our model may be improved through further tuning. The Gaussian kernel is chosen for approximating the log-expression density functions because of its smoothness and continuity. Other kernel types may provide a more accurate fit of the true distribution. Further research is necessary to decide conclusively on the optimal kernel structures for a given dataset. Additionally, the information metrics may be estimated more accurately by implementing leave-one-out cross-validation, at the cost of decreased computation speed. Nevertheless, this work is the first to explore the mediation effect from a new point of view, that of information-based theory.

# DATA AVAILABILITY STATEMENT

16S sequence data for this project were downloaded from Bioproject PRJNA269954 [https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA269954]. Microarray data are available from GEO as GSE65270 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65270]. Metadata are available at [http://huttenhower.sph.harvard.edu/pouchitis2015].

# AUTHOR CONTRIBUTIONS

LA and KC conceived the study. KC designed the methods and algorithms. LA and ML assisted in tuning and critiquing proposed methods. LA and HJ proposed the real data analysis and KC performed the real data analysis. KC drafted the manuscript and all authors edited it.

# FUNDING

This work was partially supported by the National Science Foundation [DMS-1222592 to LA] and the United States Department of Agriculture [ARZT-1360830-H22-138 and ARZT-1361620-H22-149 to LA].

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2020.00148/full#supplementary-material


Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Carter, Lu, Jiang and An. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
