HARMONIES: A Hybrid Approach for Microbiome Networks Inference via Exploiting Sparsity

The human microbiome is a collection of microorganisms. They form complex communities and collectively affect host health. Recently, the advances in next-generation sequencing technology enable the high-throughput profiling of the human microbiome. This calls for a statistical model to construct microbial networks from the microbiome sequencing count data. As microbiome count data are high-dimensional and suffer from uneven sampling depth, over-dispersion, and zero-inflation, these characteristics can bias the network estimation and require specialized analytical tools. Here we propose a general framework, HARMONIES, Hybrid Approach foR MicrobiOme Network Inferences via Exploiting Sparsity, to infer a sparse microbiome network. HARMONIES first utilizes a zero-inflated negative binomial (ZINB) distribution to model the skewness and excess zeros in the microbiome data, as well as incorporates a stochastic process prior for sample-wise normalization. This approach infers a sparse and stable network by imposing non-trivial regularizations based on the Gaussian graphical model. In comprehensive simulation studies, HARMONIES outperformed four other commonly used methods. When using published microbiome data from a colorectal cancer study, it discovered a novel community with disease-enriched bacteria. In summary, HARMONIES is a novel and useful statistical framework for microbiome network inference, and it is available at https://github.com/shuangj00/HARMONIES.

by introducing latent auxiliary variables to specify how each sample (in terms of log s i ) is assigned to any of the inner and outer mixture components. More specifically, we can introduce an n × 1 vector of assignment indicators g, with g i = m indicating that log s i is a sample from the m-th component of the outer mixture. The weight ψ m determines the probability of each value g i = m, with m = 1, . . . , M . Similarly, we can consider an n × 1 vector of binary elements i , where i = 1 indicates that, given g i = m, log s i is drawn from the first component of the inner mixture, i.e. N(ν m , σ 2 s ) with probability t m , and i = 0 indicates that log s i is drawn from the second component of the inner mixture, i.e. N − tmνm 1−tm , σ 2 s , with probability 1 − t m . Thus, the Dirichlet process prior (DPP) model can be rewritten as where t and ν denote the collections of t m and ν m , respectively. Therefore, the update of the size factor s i , i = 1, . . . , n can proceed by using a random walk Metropolis-Hastings algorithm. We propose a new log s * i from N(log s i , τ 2 s ) and accept it with probability min(1, m MH ), where .
Note that the last term, which is the proposal density ratio, equals 1 for this random walk Metropolis update. Since g, , t, and ν have conjugate full conditionals, we use Gibbs samplers to update them one after another: • Gibbs sampler for updating g i , i = 1, . . . , n, by sampling from the normalized version of the following conditional: • Gibbs sampler for updating i , i = 1, . . . , n, by sampling from the normalized version of the following conditional: • Gibbs sampler for updating t m , m = 1, . . . , M : I(g i = m)I( i = 0)).
• Gibbs sampler for updating ν m , m = 1, . . . , M : • Gibbs sampler for updating ψ m , m = 1, . . . , M by stick-breaking process: For the sake of convenience, we have copied Equation (3) in the main text here, Update of normalized abundance: We update each α ij , i = 1, . . . , n, j = 1, . . . , p by using a Metropolis-Hastings random walk algorithm. We first propose a new α ij * from N(α ij , τ 2 α ), and then accept the proposed value with probability min(1, m MH ), where Note that the last term, which is the proposal density ratio, equals 1 for this random walk Metropolis update.

SUPPLEMENT TO SECTION 3.2: ANALYSIS OF MICROBIOME DATA FROM COLORECTAL CANCER PATIENTS
We carried out a simple sensitivity analysis to evaluate the model performance to the choice of the filtering threshold. As discussed in Section 3.2, we filtered out the genera with more than 50% of nonzero counts across the samples. Here, we changed the threshold from 50% to 10%. The new threshold left 92 and 84 genera in the CRC and control group, respectively. Figure S2 shows the network inferred under this new setting.
Our first observation is that there were a large number of similarities between the networks. For instance, in the CRC group, the relatively stronger associations remained the same, such as positive associations between Bacteroides and Alistipes, Barnesiella and Alistipes, Blautia and Dorea, Streptococcus and Haemophilus. Though the new CRC network did not show any negative partial correlations (denoted by the red edges), the negative associations in the original network were relatively weak and might not be as stable as the other edges. Notably, the "positive triangle" among the three genera Fusobacterium, Peptostreptococcus, and Parvimonas was again confirmed here. In the control group, the strong associations can be still detected, such as the positive associations between Alistipes and Parabacteroides, Alistipes and Bacteroides, and negative associations between Bacteroides and Dorea. It is interesting to point out that the networks based on the new threshold tended to have fewer edges.
We also observed that increasing the number of genera in the networks could introduce novel associations. For example, genera Pantoea and Escherichia were dropped from the original CRC group network, yet they established a positive association in the new network. Similarly, genera Pantoea, Escherichia, Abiotrophia, Oxalobacter, and Desulfovibrio were of low abundance in the healthy controls and hence were not considered in the model before. However, they were connected in the new network. These results estimated from those highly sparse genera may hint further biological validations.

INFER THE NORMALIZED ABUNDANCES FOR MULTIPLE GROUPS
In practice, when there are two groups of subjects in a microbiome study (e.g., subjects with two distinct phenotypes), the sequencing data usually include measurements on the same taxonomic features for all the subjects. Then, if the abundance of a taxon j does not differ between two groups, we can improve the posterior influence of log α ·j by merging two groups to increase the sample size. On the other hand, if the taxon is associated with subject's condition, i.e., a taxon that changes its abundance between two groups in the study, the inference of log α ·j should rely on each subject group.
With the goal of borrowing information to improve the posterior inference for certain taxa, we combine the original count matrix from two different groups, to generate the count matrix Y n×p . Here, the sample size is n = n 1 +n 2 , with n 1 , n 2 representing the number of subjects in the first and the second groups, respectively. Meanwhile, we let z = (z 1 , . . . , z n ) T to allocate the n subjects into two groups, with z i = 1 or 2 indicating the group label of subject i. In practice, if taxon j is irrelevant to the subject's phenotype, its abundances should not be differentiating between two groups. However, if taxon j is associated with the disease, its abundance could either increase or decrease from healthy subjects to patients. Therefore, we model the normalized abundance α ij as following: (S1) Here, γ j is a latent binary variable, with γ j = 1 if taxon j is differentially abundant between two groups, and γ j = 0 otherwise. For the taxa with γ j = 0, we can borrow information between groups to increase the sample size in estimating the corresponding posterior of log α ·j . As an extension to Section 2.1 where we assume log α ij ∼ N(µ j , σ 2 j ), the current model includes µ 0j , µ 1j , and µ 2j as the mean parameters for the normal mixture model. Again, we take the conjugate Bayesian approach and impose the following priors for the parameters in the normal mixture model: µ 0j ∼ N(0, h 0 σ 2 0 ), σ 2 0j ∼ IG(a 0 , b 0 ), µ kj ∼ N(0, h k σ 2 k ) and σ 2 kj ∼ IG(a k , b k ) for k = 1, 2.
The estimation of γ j 's determines the resulted normalized abundance matrix. Specifically, for taxon j with γ j = 0, we can impute the zeros due to missing by the posterior mean of log α ·j calculated using information from both groups. As an extension to Equation (3) in the main text , the posterior of α ·j |γ j is as following: Frontiers Therefore, we can obtain the posterior mean of log α ·j by averaging over the log-transformed MCMC samples of α ·j after burn-in.

(a) (b)
Microbiome network for CRC patients  Figure S1. CRC case study: The estimated networks by HARMONIES for (a) CRC patients and (b) healthy controls. All nodes are labeled in their genus names.

(a) (b)
Microbiome network for CRC patients (10% nonzero)  Figure S2. CRC case study: The estimated networks by HARMONIES for (a) CRC patients and (b) healthy controls. The nodes are genera that have at least 10% nonzero observations across the samples in each group.