A Bayesian Semiparametric Regression Model for Joint Analysis of Microbiome Data

The successional dynamics of microbial communities are influenced by the synergistic interactions of physical and biological factors. In our motivating data, ocean microbiome samples were collected from the Santa Cruz Municipal Wharf, Monterey Bay at multiple time points and then 16S ribosomal RNA (rRNA) sequenced. We develop a Bayesian semiparametric regression model to investigate how microbial abundance and succession change with covarying physical and biological factors including algal bloom and domoic acid concentration level using 16S rRNA sequencing data. A generalized linear regression model is built using the Laplace prior, a sparse inducing prior, to improve estimation of covariate effects on mean abundances of microbial species represented by operational taxonomic units (OTUs). A nonparametric prior model is used to facilitate borrowing strength across OTUs, across samples and across time points. It flexibly estimates baseline mean abundances of OTUs and provides the basis for improved quantification of covariate effects. The proposed method does not require prior normalization of OTU counts to adjust differences in sample total counts. Instead, the normalization and estimation of covariate effects on OTU abundance are simultaneously carried out for joint analysis of all OTUs. Using simulation studies and a real data analysis, we demonstrate improved inference compared to an existing method.


INTRODUCTION
Microbial communities are influenced by several factors whether they live in the host's guts or other occupied niches. Their successional dynamics could further change in response to perturbations of the host or of the surrounding environments (Turnbaugh et al., 2009;Needham and Fuhrman, 2016). Understanding how abiotic and biotic factors influence the dynamics of microbial communities is of great interest in the field of microbiome studies. Recent revolutionary advances in next-generation sequencing (NGS) technologies along with rapidly decreasing costs, have facilitated the accumulation of large datasets of 16S ribosomal RNA (rRNA) amplicon sequences across various disciplines such as medicine, biology, ecology, and environmental sciences (Woo et al., 2008). Sequencing data is usually pre-treated for quality filtering, noise removal and chimera checking through bioinformatics algorithms and the filtered sequences are clustered into Operational Taxonomic Units (OTUs), which represent similar organisms (microbial species) based on sequence homology (called OTU picking). An OTU abundance table is generated, recording counts for OTUs in samples. Further statistical data analyses are then performed using the OTU table to answer biological and ecological questions.
Analysis of huge NGS data is computationally expensive and challenging. One of the key challenges is the normalization of counts across samples. Total counts (often called library size or sequencing depth) may vastly vary across different samples due to technical reasons. Thus, observed counts are not directly comparable across samples and cannot be used as a measure of the abundance of an OTU. Normalized counts through rarefaction or relative frequencies are commonly used for easy comparison of OTU abundance across samples. However, such ad hoc normalization procedures have been criticized from a statistical perspective since using pre-normalized quantities may undermine the performance of downstream analysis (McMurdie and Holmes, 2014). Another challenge is high dimensionality and sparsity in OTU count data. A dataset typically includes hundreds or thousands of OTUs and a majority of them has zero or very low frequencies in most of samples. For example, Figure 1A illustrates a heatmap of OTU counts in our motivating dataset described in section 2.3. It shows that a majority of OTUs has very low counts (gray) in a sample, and the set of OTUs having large counts (blue) vary across samples. Due to such sparsity in data, borrowing strength across OTUs through joint analysis of all OTUs is crucial for improved inference. Recently, various statistical methods including Romero et al. (2014), Chen and Li (2016), Gibbons et al. (2017), and Zhang et al. (2017) have been developed for microbiome studies using NGS data. For example, Zhang et al. (2017) used a negative binomial mixed regression model to study interactions between the microbiome and host environmental/clinical factors. Random effects are used to induce correlation among samples from a group. Common to most of recent methods including Zhang et al. (2017) is separately analyzing each OTU at a time.
We develop a Bayesian semiparametric generalized linear regression model to study the effects of physical and biological factors on abundance of microbes. The proposed method performs mode-based normalization through a hierarchical model, which enables direct modeling of OTU counts. Furthermore, the hierarchical model facilitates borrowing strength between OTUs, between samples, and between time points through joint analysis and improves inference on the effects of covariates X on OTU abundance which are the parameters of our primary interest. Specifically, a negative binomial (NB) distribution parameterized by a mean parameter µ and an overdispersion parameter s is assumed for OTU counts. The NB distribution flexibly accommodates overdispersion often seen in NGS data and is commonly used as a robust alternative to a Poisson distribution (Anders and Huber, 2010). The expected count µ of an OTU is decomposed as a product of factors, a baseline mean count g and a nonnegative function η(X) of covariates that describes their effects on the mean count. We use the log link function for η(X) and assume that change in a covariate has a multiplicative effect on mean count, where the associated coefficient quantifies the size and direction of the effect. We consider a Laplace prior for the coefficients, a shrinkage prior that is essential in a high dimensional regression setting. Shrinkage priors in regression yield sparse point estimates of the coefficients, where many of the coefficients have values close to zero and few have large values. The sparse estimates improve out-of-sample prediction and produce more interpretable models (Park and Casella, 2008). In addition, shrinkage priors such as a Laplace prior in a regression problem mitigate potential problems by multicollinearity and yield improved coefficient estimates when covariates are high-dimensional and potentially highly correlated (Polson and Scott, 2012). For baseline mean counts, we develop a nonparametric model to combine all OTUs for joint analysis. Baseline mean counts may vary across samples and OTUs. Also, as in our motivating data for which samples were taken over time, there may be temporal dependence in baseline mean counts. To tackle the problem, we further decompose the baseline count g into sample size factor (r), OTU size factor (α 0 ), and OTU and time factor (α t ), that is, g = r × α 0 × α t . Due to the overparametrization of the baseline mean abundance, individual factors are not identifiable. To avoid identifiability issues, we place the regularizing priors with mean constraints (Li et al., 2017) for sample size factor r and OTU size factor α 0 . In addition, we model a temporal dependence structure between the baseline expected counts for an OTU through a convolutional Gaussian process (Higdon, 1998). The process convolution approach is often used as an alternative approach of the Gaussian process to construct a dependent process due to its efficient computation (Lee et al., 2005;Liang and Lee, 2014). Through simulation studies, we show that estimates of individual parameters r, α 0 , and α t are not fully interpretable under the proposed model, but baseline mean counts g are identifiable. The model also provides a posterior distribution of g for uncertainty quantification.
The rest of the paper is organized as follows. In section 2 we describe the proposed model and discuss the prior formulations and the resulting posterior inference. We perform simulation studies to assess the proposed model and perform comparison with an existing method that analyzes one OTU at a time. We then apply the proposed model to an ocean microbiome dataset. Section 3 presents the performance of the proposed model from the simulation experiment and the ocean microbime data. Section 4 concludes the paper with a discussion on limitations and possible extensions.

Bayesian Semiparametric Regression Model
Suppose that samples are taken at n different time points, 0 ≤ t i ≤ T, i = 1, . . . , n, and with K i replicates at time point t i . We consider count y t i ,k,j of OTU j in replicate k taken at time t i , where i = 1, . . . , n, k = 1, . . . , K i , and j = 1, . . . , J. A sample is thus indexed by t i and k. We let the total number of samples N = n i=1 K i . Let Y = [y t i ,k,j ] denote the N × J matrix of counts, where y t i ,k,j is integer-valued and nonnegative. Also, suppose that covariates X t i = (X t i ,1 , . . . , X t i ,P ) ′ are recorded at time t i . For example, covariates are physical and biological factors in our motivating data.

Sampling Model
Count data by NGS methods is often modeled through a Poisson distribution. The assumption under the Poisson distribution that the variance is equal to the mean is often too restrictive to accommodate overdispersion that variation in data exceeds the mean. The negative binomial (NB) distribution is a popular and convenient alternative to address the overdispersion problem and is widely recognized as a model that provides improved inference to NGS count data (for example, see Robinson and Smyth, 2007;Anders and Huber, 2010). A NB distribution can be characterized by mean and overdispersion parameters. We suppress index i for simpler notation and assume a NB model for count y t,k,j of OTU j in replicate k at time t, where mean count µ t,k,j > 0 and overdispersion parameter s j > 0. The model in Equation (1) implies that count of OTU j in replicate k at time t has mean E(y t,k,j | µ t,k,j ) = µ t,k,j and variance Var(y t,k,j | µ t,k,j , s j ) = µ t,k,j + µ 2 t,k,j s j . The model allows different dispersion levels across OTUs through OTU-specific overdispersion parameters s j . In the limit as s j → 0, the model in Equation (1) yields the Poisson distribution with mean µ t,k,j . We assume a gamma distribution for a prior distribution of s j , s j iid ∼ Ga(a s , b s ), j = 1, . . . , J, with fixed a s and b s . Frontiers in Microbiology | www.frontiersin.org

Model for Regression
We next model the mean count µ t,k,j of y t,k,j . We decompose the mean count into factors, a baseline mean count and a function of covariates, µ t,k,j = g t,k,j η j (X t ). Here parameter g t,k,j denotes the baseline mean abundance of OTU j in sample (t, k) and η j (X t ) is a function of covariates X t for OTU j to model the covariate effects. We construct a generalized regression model by letting log(η j (X t )) = X ′ t β j , where β j = (β j1 , . . . , β jP ) ′ is a P-dimensional vector of regression coefficients of OTU j (Lawless, 1987;McCullagh and Nelder, 1989). The coefficient β j,p quantifies the effect of covariate p X p on the mean abundance of OTU j. A vector β j close to the zero vector produces a value of η j (X t ) close to 1, and the mean count remains similar to the baseline mean count g t,k,j , implying insignificant covariate effects. A negative (positive) of β j,p implies a negative (positive) association between mean counts and the p-th covariate, and a larger value of X j,p decreases (increases) the mean count, while holding the other covariates constant. We consider a Laplace prior on β j . Specifically, we express the Laplace distribution as a scale mixture of normals and assume for j = 1, . . . , J and p = 1, . . . , P, where a λ , b λ , a σ , and b σ are fixed. σ 2 j and φ j,p denote the global and local shrinkage parameters, respectively, for OTU j. After integrating φ j,p out, the prior distribution of β j,p is the Laplace distribution with location parameter 0 and scale parameter σ 2 j /λ j , that is, p(β j,p | λ 2 j , σ 2 j ) ∝ exp −λ j |β j,p |/ σ 2 j . Compared to a normal distribution that is a common choice for the prior of β j,p , the Laplace distribution has more concentration around zero but allows heavier tails. The regularized regression through the Laplace prior more shrinks the coefficients of insignificantly related covariates into zero and less pulls the coefficients of important covariates toward zero. Shrinkage of β estimates through the model in Equation (2) mitigates possible issues due to multicollinearity and efficiently improves estimation of β in a high dimensional setting (Polson and Scott, 2012).

Model for Baseline Mean Count
We next build a prior probability model for the baseline mean count g t,k,j of OTU j in sample (t, k). We assume g t,k,j = r t,k α 0,j α t,j to separate sample (r t,k ), OTU (α 0,j ), and OTU-time (α t,j ) factors. Sample total counts y t,k,· = J j=1 y t,k,j may greatly differ for different samples possibly due to experimental artifacts. For example, counts of an OTU even in the replicates taken at a time point may vastly differ. Sample specific size factors r t,k account for different total counts in different samples and expected counts normalized by r t,k are comparable across samples. Factor α 0,j explains variabilities in baseline mean abundances of OTUs and α t,j models temporal dependence of the mean counts for an OTU, respectively. Factors α 0,j and α t,j are not indexed by replicate k and account for stochastic change over time in normalized baseline expected counts of OTU j. Collecting all, we write the mean count as The model for g t,k,j in Equation (3) is overparameterized and the individual parameters are not identifiable. To avoid potential identifiability issues, many of NB models rely on some form of approximation for the baseline mean counts. For example, one may find the maximum likelihood estimates (MLEs) of baseline mean abundance under some constraints and plug in those estimates to infer the mean abundance levels µ t i ,j of OTUs (Witten, 2011). Plugging in MLEs is simple but may not be robust. In particular, the inference is greatly affected by a small change in a few OTUs that have large counts. Moreover, the errors introduced in the baseline mean count estimation will not be reflected in the inference. Several approaches to robustify the estimates are proposed (for example, see Anders and Huber, 2010;Witten, 2011). To circumvent the identifiability issue and provide uncertainty quantification for estimation of g t,k,j , we take an alternative in Li et al. (2017) by imposing regularizing priors with mean constraints for r t,k and α 0,j . We let the logarithm of the factorsr t,k = log(r t,k ) andα 0,j = log(α 0,j ), and assume the regularizing prior distribution with mean constraints, where φ(η, v 2 ) is the probability density function of the normal distribution with mean η and variance v 2 , constraints for the mixture weights L r ℓ=1 ψ r ℓ = L α ℓ=1 ψ α ℓ = 1 with 0 < ψ r ℓ < 1 and 0 < ψ α ℓ < 1 , 0 < w r ℓ < 1, and 0 < w α ℓ < 1 for all ℓ. Mixture models as in Equation (4) are often used as a basis to approximate any distribution. Each component in Equation (4) is further composed of a mixture of two normals, N(η ℓ , v 2 ) and N (c−w ℓ η ℓ ) (1−w ℓ ) , v 2 with weights w ℓ and 1 − w ℓ , respectively, and the mean of the component is c. In consequence, the prior and posterior ofr andα under the model in Equation (4) satisfy their prespecified mean constraints c r and c α , respectively. Li et al. (2017) showed that the model in Equation (4) flexibly accommodates various features in a distribution such as skeweness or multi-modality while satisfying the constraints. Furthermore, the model based normalization through Equation (4) enables joint analysis of all OTUs and can further improve estimation of the covariate effects. With the regularizing priors, baseline mean counts g t,k,j are identifiable, while r t,k , α 0,j , and α t,j are not directly interpretable. More importantly, the parameters of primary interest η j (X t ) can be uniquely estimated and β j,p 's keep their interpretation as parameters that quantify the effects of covariates on mean counts. We used an empirical approach to fix the mean constraints c r and c α . Sensitivity analyses were conducted to assess the robustness to the specification of c r and c α and show that the model provides reasonable estimates of g t,k,j and moderate changes in the values of c r and c α minimally change the estimates. More details of the specification of c r and c α are discussed in section 3.1. We fix the numbers of mixture components, L r and L α and variances v 2 r and v 2 α . We let η r Recall that samples are collected over time points t 1 , . . . , t n in [0, T] and α t,j accounts for temporal dependence in the baseline mean count for an OTU. We letα t,j = log(α t,j ) a function in time t and use a stochastic process to model temporal dependence among µ t,k,j . The Gaussian process (GP) is one of the most popular stochastic models for the underlying process in spatial and spatio-temporal data (for example, see Cressie, 1992; Banerjee et al., 2014 among many others). The GP effectively represents the underlying phenomenon in a variety of applications, but it has some drawbacks such as a complex computation that requires a matrix decomposition and problematic estimation of the parameters in its covariance function, potentially leading to difficulties in exploring the posterior distribution (Lee et al., 2005;Liang and Lee, 2014). To alleviate such difficulties of GP models while still maintaining their flexibility and adaptiveness, we use a convolution approach with a kernel function developed in Higdon (1998Higdon ( , 2002. For each OTU, we specify the latent process θ j (t) to be nonzero only at the time points u 1 , . . . , u M in [0, T]. Specifically, we consider the GP convolution model, }. The number of basis points M, their locations u m and the range parameter γ can be treated as random variables by placing prior distributions, e.g., consider a gamma prior for γ . For simplicity, we fix them as follows. We first choose a value for M and let u m evenly spaced over time (2015), we let the range parameter γ 2 = ((2T + t ′ 1 + t ′ 2 )/M) 2 , that is, the range parameter depends on the value of M. Through simulations, we studied the impact of different values of M on the posterior inference of g t,k,j . A discussion is included in section 3.1. Given the number of basis points M, we assume θ m,j | τ 2 j indep ∼ N(0, τ 2 j ) and τ 2 j iid ∼ IG(a τ , b τ ), m = 1, . . . , M and j = 1, . . . , J.
We implement posterior inference on the parametersθ = (β j , σ 2 j , λ 2 j , φ j,p ,r t,k , ψ r , w r ℓ , η r ℓ ,α 0,j ,α t,j , ψ α , w α ℓ , η α ℓ , θ j , τ 2 j , s j ) via a Markov chain Monte Carlo (MCMC) method based on Metropolis-Hastings algorithm and Gibbs sampling. Each of the parameters is iteratively updated conditional on the currently computed values of all other parameters to simulate a sample from the posterior distribution. The parametersr andα 0 jointly determine baseline mean counts and joint updating ofr andα 0 may greatly improve the mixing. In our ocean microbiome data, some discretized covariates are missing. We treat them as random variables by assuming a uniform distribution over possible categories, and impute their values in MCMC simulation. Full details of our MCMC algorithm are given in Supplementary section 1. We diagnose convergence and mixing of the described posterior MCMC simulation using trace plots and autocorrelation plots of imputed parameters. For the upcoming simulation examples and the data analysis, we found no evidence of practical convergence problems. An R package of the code used for simulations and the analysis of the ocean microbiome dataset in the following sections is available from the authors website https://users.soe.ucsc.edu/~juheelee/.

Simulation Experiment: Data Generation and Comparative Study
We conducted simulation studies to assess the performance of our model. We compared the model to an alternative model, the negative binomial mixed model (NBMM) in Zhang et al. (2017). We assumed a sample of J = 200 OTUs. We used the same time points (t i ) and numbers of replicates (K i ) of our ocean microbiome data as shown in Figure 1B. We let β TR j,p = 0 with probability 0.85. For β TR j,p = 0 we simulated β TR j,p from either of N(−1.5, 0.05 2 ) or N(1.5, 0.05 2 ) with equal probability, where N(a, b 2 ) denotes the normal distribution with mean a and variance b 2 . It implies that a covariate has no effect on OTU abundance with probability 0.85 or may significantly affect mean abundance with the remaining probability 0.15. To specify r TR t,k and α TR 0,j , we did not assume any distribution and used their classical estimates from our ocean microbiome data; following Witten (2011), we first computed estimates of sample size factors r ′ t i ,k and OTU size factors α ′ 0,j using the ocean microbiome data, y t i ,k,j and y ··· = J j=1 y ··,j . We then randomly sampled from the pool of r ′ t,k and α ′ 0,j to specify the true values. To simulate temporal dependence in OTU abundance, we let  (1, 10). We used the covariate matrix of the ocean microbiome data illustrated in Figure 2 for the simulation study. For the missing covariates in the data, we generated a value of possible categories with equal probability. We finally simulated OTU counts y t i ,k,j from the negative binomial distribution y t i ,k,j indep ∼ NB(µ TR t i ,k,j , s TR j ), where µ TR t i ,k,j = r TR t i ,k α TR 0,j exp(α TR t i ,j + X TR t β TR j ). For comparison, we used the negative binomial mixed model (NBMM) in Zhang et al. (2017). Similar to the proposed model, the NBMM uses a negative binomial distribution with mean µ NBMM and shape parameter θ NBMM to model OTU counts and assumes log(µ NBMM t,k,j ) = log(y t,k,· )+β NBMM 0,j where X t and Z t,k are the covariate matrices for fixed effects and random effects, respectively. It assumes random effects b NBMM j ∼ N(0, ). By letting the replicates at a time point share the same random effect, OTU abundances in the replicates at a time point are correlated. The NBMM normalizes OTU counts by sample total counts. That is, sample total counts y t,k,· are used as an offset to adjust for the variability in total counts across samples. Similar to other existing methods, the NBMM performs separate analyses of OTUs. An iterative weighted least squares algorithm is developed to produce the MLEs under the NBMM and implemented in a R function glmm in R package BhGLM.

Ocean Microbiome Data: Data Description and Preprocessing
We applied the proposed statistical method to ocean microbiome data. Seawater samples were collected weekly at the end of Santa Cruz Municipal Wharf (SCW), Monterey Bay ( For bacterial RNA samples, three depth-integrated (0, 5, and 10 ft) water samples were collected at a total of 55 time points between April 2014 and November 2015. Two or three samples are sequenced at each time point. The numbers of replicates at the time points are illustrated in Figure 1B. Microbial RNA in the samples was extracted for 16S rRNA sequencing. Postprocessing of sequences was performed using the Quantitative Insights Into Microbial Ecology (QIIME 1.9.1) pipeline. A total of nearly 39,823 OTUs were obtained in data after removing singletons. We restricted our attention to OTUs that have greater than or equal to five counts on average. The rule leaves in the end J = 263 OTUs for the 150 samples for the analysis. A heatmap of the counts in the filtered data is shown in Figure 1. The primary goal of the study is to investigate the effects of physical and biological factors on abundance levels of OTUs, while accounting for baseline abundance levels of OTUs in samples.

Simulation Experiment: Model Fitting and Comparison
To fit the proposed model for the simulated data designed in section 2.2, we specified hyperparameter values of the model as follows; for the Laplace prior of β j,p , we let a λ = b λ = 0.5 for a gamma prior of λ 2 j (with mean a λ /b λ and variance a λ /b 2 λ ) and a σ = b σ = 0.3 for an inverse gamma prior for common variances σ 2 j . For the regularizing priors ofr t i ,k andα 0,j , we fixed d α = d r = 10, a r = b r = a α = b α = 1, ω 2 r = ω 2 α = 1.0, v 2 r = 1, and v 2 α = 2.0. We also fixed the number of mixture components for the regularizing priors L r = 30 and L α = 50.
To specify values of the mean constraints c r and c α , we took an empirical approach. We used the simulated y t i ,k,j , computed estimates of r t i ,k,j and α j,0 as described in section 2.2 and fixed the mean constraints at the means of the logarithm of the estimates, respectively. Note that the specified values of c r and c α were very different from the means of their true values. For the process convolution prior of OTU-time factorα t i ,j , we chose a value of M such that the kernel function at a basis point is not entirely located in a place where no sample is obtained. We let the number of basis M = 13 and basis u m , m = 1, . . . , M evenly spaced between −10 and T i + 10. For overdispersion parameter s j we let a s = 1 and b s = 2. To run MCMC simulation, we initialized the parameters by simulating with their prior distributions. We then implemented posterior inference using MCMC simulation over 25,000 iterations, discarding the first 10,000 iterations as burn-in and choosing every other sample as thinning. Figure 3 illustrates the comparison of posterior estimates of β j,p to their true values β TR j,p for some selected covariates. In the figure, dots and blue dashed lines represent posterior meansβ j,p of β j,p and their 95% credible intervals, respectively.β j,p s are around the 45 degree line (red dotted line) for most of (j, p) and most of the interval estimates captures the true values. It implies that the proposed model reasonably recovers β TR j,p . For categories 3 and 4 of X 4 in Figures 3I,J, the credible intervals tend to be wider due to their low frequencies in the data as shown FIGURE 2 | Ocean microbiome data. Bar plots of discretized covariates, concentration levels of Alexandrium (Ax, X 1 ) and Dinophysis (Dp, X 2 ), Pseudo-nitzchia (Pn, X 3 ), domoic acid (DA, X 4 ) in (A-D). The values of −1, 0, 1, 2, 3, and 4 represent a missing value, none and low, medium, high, and highest concentration levels, respectively. Histograms of continuous covariates, concentration levels of ammonia (NH 4 , X 5 ), nitrate (N, X 6 ), phosphate (P, X 7 ), silicate (Si, X 8 ), water temperature (T, X 9 ), concentration level of chlorophyll (Chl, X 10 ) in (E-J). The variables are standardized to have mean 0 and variance 1 prior to analysis. A scatterplot of the concentrations of ammonia and silicate and a scatterplot of the concentrations of phosphate and silicate are shown in (K,L), respectively.
in Figure 2D. The insert plot in each panel illustrates a scatter plot ofβ j,p for (j, p) with β TR j,p = 0 . It shows that the proposed regression model with the Laplace prior effectively shrinks β j,p with β TR j,p = 0 to zero, as is desired in our simulation setup. Supplementary Figure 1 has plots for all covariates.
Figures 4A-C illustrate plots of g TR t,k,j vs estimates of g t,k,j with their means (black dots) and 95% credible intervals (blue vertical lines) for some selected OTUs, j = 8, 34, and 48. Recall that we do not attempt to recover the true values of individual r t i ,k , α 0,j , and α t,j , but we rather aim to reasonably recover the true baseline mean counts, g TR t i ,k,j = r TR t i ,k α TR 0,j exp(α TR t i ,j ). In the figure the estimates are tightly around the 45 degree line, providing evidence that reasonable estimates of baseline mean counts are obtained under the proposed model. Figure 4D has a histogram of averaged differences between baseline mean count estimates and their true values, t i ,k,j )/N. The averaged differences are around zero, implying that the proposed model provides reasonable estimates of baseline mean counts for most of OTUs. We further examined individual parameters. Figures 4E-G shows the comparison of estimates ofα 0,j +α t,j to their true values over time for the same OTUs in Figures 4A-C. Black dots and blue vertical lines represent estimates of posterior means ofα 0,j +α t,j and their 95% credible intervals, respectively. Red squares represent their true values. From the figure, the estimates ofα 0,j +α t,j are consistently greater than their true values at all time points, but capture their overall temporal trend. Figure 4H   intervals denote estimates of posterior means and 95% credible intervals, respectively, and the gray horizontal line is at c r used for analysis. Different from the estimates ofα 0,j +α t,j , the estimates ofr t,k fall below the 45 degree line approximately by the same distance for all OTUs. It shows that estimates ofα 0,j +α t,j and r t,k have discrepancies from their true values but in the opposite direction and the model can produce reasonable estimates of g t,k,j as seen in Figures 4A-D. The true overdispersion parameters s TR j are reasonably well estimated as shown in Figure 4I. We check the posterior predictive distribution of Y t,k,j . The posterior predicted values of Y t i ,k,j with their 95% predictive intervals for OTUs j = 8, 34, and 48 are compared to their observed values in Supplementary Figure 2. The figure indicates a reasonable model fit.
In addition, we conducted a sensitivity analysis to the specification of mean constraints c r and c α for the priors ofr andα 0 . We used different values for c r and c α and compared the estimates of g t,k,j to their truth. Supplementary Figures 3ac has histograms of averaged differences D j betweenĝ t i ,k,j and g TR t i ,k,j for different specification of c r and c α . The histograms show FIGURE 4 | Simulation 1-proposed model. Panels (A-C) illustrate plots of the true baseline mean counts g TR t,k,j vs their estimatesĝ t,k,j for some selected OTUs j = 8, 34, 48. Panel (D) shows a histogram of averaged differences between g TR t,k,j andĝ t,k,j for each OTU. Panels (E-G) show plots of estimates ofα 0,j +α t,j over time for OTUs j = 8, 34, 48. Panel (H) has a scatterplot ofr t,k vs.r TR . Panel (I) has a scatterplot ofŝ vs. s TR . Dots represent posterior mean estimates and blue vertical dotted lines 95% credible intervals. Red squares represent the true values. minor change in estimates of g t i ,k,j under different specifications of c r and c α . An sensitivity analysis to the specification of the number M of basis points in the GP convolution model forα t,j was also performed. We used M = 8, 13, and 18 and examined estimates of the baseline mean counts, g t,k,j . Supplementary  Figures 3a,d, For comparison, we used the NBMM to the simulated data. Since the NBMM does not accommodate missing covariates, we used X TR to fit the NBMM. Figure 5 compares the MLEŝ β NBMM j,p of β j,p to the true values for the same covariates used in Figure 3. Dots and blue vertical lines represent the MLEs under the NBMM and their 95% confidence intervals, respectively. Comparing Figure 5 to Figure 3, the NBMM produces poor estimates. The MLEs are biased for some covariates (e.g., Figure 5A). Also, confidence intervals under the NBMM often fail to capture the true values and their interval estimates under the NBMM tend to be much wider than those under the proposed model. Normalization through observed sample total counts and inducing correlation in replicates through iid (independent and identically distributed) random effects under the NBMM may lead to poor estimation of the baseline mean abundance for the simulated data, resulting in deterioration of coefficient estimation. In addition, separate analyses of OTUs under the NBMM do not allow to strengthen estimates through combining information across OTUs. Comparing the insert plots in Figure 5 to those in Figure 3,β NBMM j,p with β TR j,p = 0 tends to more widely spread out from zero and often their confidence intervals fail to capture zero. Supplementary Figure 4 has plots of β j,p for all covariates. Supplementary Figures 4, 5 shows the comparison of the estimatesθ NBMM of overdispersion parameters under the NBMM to their true values. Note that θ NBMM is the inverse of s in our model. The NBMM underestimates s j for many OTUs, and yields poor predicted values, implying the lack of a fit.
We further examined the performance of the proposed model through additional simulation studies, Simulations 2 and 3 in Supplementary section 2. In these simulations, we studied the robustness of the model when different simulation setups are used to simulate data. In Simulation 2, we assumed no temporal dependence among OTU abundance and generated independent samples from a normal distribution forα t,j . We assumed that all β TR j,p has nonzero effects for all OTUs and simulated β j,p from a mixture of normals. The performance of our model is almost the same as in Simulation 1 (see Supplementary Figures  6-8). Interestingly, the NBMM that assumes iid random effects performs poorly for β estimation. In Simulation 3, we simulated α TR t,j from a discontinuous function. The results show that when the temporal dependence pattern is not smooth as assumed for the GP, estimates of the baseline mean counts under the proposed model are slightly deteriorated but the model produces reasonable inference on β j,p (see Supplementary Figures 9-11). A more detailed summary of the additional simulations is given in Supplementary section 2.

Ocean Microbiome Data: Model Fitting and Comparison
We specified hyperparameters similar to those in the simulations and analyzed the microbiome data in section 2.3. The MCMC simulation was run over 25,000 iterations. The first 15,000 iterations were discarded as burn-in and every other sample was kept as thinning and used for inference. Figure 6 illustrates inference on covariate effects for some selected OTUs, j = 16, 34, and 49, taxonomically belonging to Alteromonadales, Halomonas sp., and Alteromonadales in the Gamma-proteobacteria phyla, respectively. Dots and vertical solid lines represent the posterior meanβ j,p and 95% credible interval estimates, respectively. Similar to the results of the simulation study, the credible intervals for high and highest levels of the discretized covariates tend to be wider due to their low frequencies in the data. From Figure 6A, on average the medium concentration level of domoic acid (DA, X 4 ) and the concentration level of nitrate (N, X 6 ) significantly decrease the mean abundance of OTU 16 by a multiplicative factor of exp(−0.572) = 0.564 and exp(−0.260) = 0.771, respectively. One may infer that the medium concentration level of domoic acid is significantly associated with lower expected counts for the OTU compared to those with category none of the domoic acid concentration level. A similar argument can be applied to the inference on the nitrate concentration level. Interestingly, we observed statistically significant reduction in abundance from many OTUs belonging to Gammaproteobacteria including those OTUs for increasing domoic acid concentration (not shown). The resulting inference was further validated through a lab experiment. Most notably, four bacterial cultured isolates belonging to Gamma-proteobacteria (three among them are Alteromonadales) were observed to be severely retarded in growth after 2 days of exposure to increasing domoic acid of 0 to 150 µg/ml in the experiment (Sison-Mangus, unpublished data). This demonstrates that the proposed model successfully identifies important OTUs in ocean bacterial community dynamics for further investigation. More results are presented in Supplementary section 3. Supplementary Figures 12a-c illustrates the posterior estimates of baseline expected countsα 0,j +α t,j normalized by sample size factors for the OTUs. From the figure, the baseline expected counts vary over time for those OTUs and the temporal pattern of OTU j = 34 is different from those of OTUs j = 16 and 49. Histograms of the posterior mean estimatesβ j,p of β j,p , are illustrated in Supplementary Figure 13. The figure does not show clear overall tendency in the direction of association between covariates and OTU counts. Posterior inference on sample size factors r t i ,k and OTU specific overdispersion parameters s j is illustrated in Supplementary Figures 12d,e. For comparison, we fitted the NBMM to the data. Since the NBMM does not account for missing values, we use the maximum a posteriori estimates of the missing values under the proposed for the NBMM. We used the R function glmm and the algorithm produced warning messages on convergence for 32 OTUs.

DISCUSSION AND CONCLUSIONS
In this paper, we developed a Bayesian semiparametric regression model for joint analysis of microbiome data. We formulated the mean counts of OTUs as a product of factors and built models for the factors. We utilized the regularizing priors with mean constraints to avoid possible idenfiability issues, and the process convolution model to capture the temporal dependence structure in the baseline mean abundance of an OTU. The flexible model developed for baseline abundance enables joint analysis of all OTUs in the data and allows borrowing information across OTUs, across samples, and across time points. The model produces accurate estimates of the baseline mean counts and yields improved estimates of the effects of the covariates. We incorporated the Laplace distribution, a sparsity inducing shrinkage prior for the coefficients and the proposed model produces sparse estimates that is more desirable when the problem is high-dimensional and covariates are highly correlated. We compared the proposed model to a comparable frequentist model that does separate analyses for individual OTUs. The comparisons through the simulation study and real data analysis show better performance of the proposed model.
Although we focused on the analysis of NGS count data, the proposed model is quite general and can be applied for analyses of any count data. Future work will explore alternative approaches to model the effects of covariates on the mean counts. For example, one may consider a nonparametric model using linear combinations of basis functions (Kohn et al., 2001) to flexibly capture shape in the response function. In such a case, an elaborate development of the prior model may be needed to produce a robust inference since both the baseline mean counts and the covariate effects are nonparametrically modeled. Other possible extensions are to include a variable selection method such as a stochastic search variable selection (George and McCulloch, 1993) if it is reasonable to assume that some covariate effects are exactly zero, and to let coefficients vary over time if covariate effects evolve with time. For time varying coefficients, we may use the random walk process in Leybourne (1993) to induce relationship between β j,p,t−1 and β j,p,t . Considering the high dimensionality in OTU data, posterior computation may need to be carefully handled. Also, prior information may be needed to produce sensible inference due to sparsity in data.

AUTHOR CONTRIBUTIONS
JL developed the statistical model and conducted simulation studies and data analysis. She also prepared the first draft and