# A Bayesian Semiparametric Regression Model for Joint Analysis of Microbiome Data

^{1}Department of Applied Mathematics and Statistics, University of California, Santa Cruz, Santa Cruz, CA, United States^{2}Department of Ocean Sciences, University of California, Santa Cruz, Santa Cruz, CA, United States

The successional dynamics of microbial communities are influenced by the synergistic interactions of physical and biological factors. In our motivating data, ocean microbiome samples were collected from the Santa Cruz Municipal Wharf, Monterey Bay at multiple time points and then 16S ribosomal RNA (rRNA) sequenced. We develop a Bayesian semiparametric regression model to investigate how microbial abundance and succession change with covarying physical and biological factors including algal bloom and domoic acid concentration level using 16S rRNA sequencing data. A generalized linear regression model is built using the Laplace prior, a sparse inducing prior, to improve estimation of covariate effects on mean abundances of microbial species represented by operational taxonomic units (OTUs). A nonparametric prior model is used to facilitate borrowing strength across OTUs, across samples and across time points. It flexibly estimates baseline mean abundances of OTUs and provides the basis for improved quantification of covariate effects. The proposed method does not require prior normalization of OTU counts to adjust differences in sample total counts. Instead, the normalization and estimation of covariate effects on OTU abundance are simultaneously carried out for joint analysis of all OTUs. Using simulation studies and a real data analysis, we demonstrate improved inference compared to an existing method.

## 1. Introduction

Microbial communities are influenced by several factors whether they live in the host's guts or other occupied niches. Their successional dynamics could further change in response to perturbations of the host or of the surrounding environments (Turnbaugh et al., 2009; Needham and Fuhrman, 2016). Understanding how abiotic and biotic factors influence the dynamics of microbial communities is of great interest in the field of microbiome studies. Recent revolutionary advances in next-generation sequencing (NGS) technologies along with rapidly decreasing costs, have facilitated the accumulation of large datasets of 16S ribosomal RNA (rRNA) amplicon sequences across various disciplines such as medicine, biology, ecology, and environmental sciences (Woo et al., 2008). Sequencing data is usually pre-treated for quality filtering, noise removal and chimera checking through bioinformatics algorithms and the filtered sequences are clustered into Operational Taxonomic Units (OTUs), which represent similar organisms (microbial species) based on sequence homology (called OTU picking). An OTU abundance table is generated, recording counts for OTUs in samples. Further statistical data analyses are then performed using the OTU table to answer biological and ecological questions.

Analysis of huge NGS data is computationally expensive and challenging. One of the key challenges is the normalization of counts across samples. Total counts (often called library size or sequencing depth) may vastly vary across different samples due to technical reasons. Thus, observed counts are not directly comparable across samples and cannot be used as a measure of the abundance of an OTU. Normalized counts through rarefaction or relative frequencies are commonly used for easy comparison of OTU abundance across samples. However, such *ad hoc* normalization procedures have been criticized from a statistical perspective since using pre-normalized quantities may undermine the performance of downstream analysis (McMurdie and Holmes, 2014). Another challenge is high dimensionality and sparsity in OTU count data. A dataset typically includes hundreds or thousands of OTUs and a majority of them has zero or very low frequencies in most of samples. For example, Figure 1A illustrates a heatmap of OTU counts in our motivating dataset described in section 2.3. It shows that a majority of OTUs has very low counts (gray) in a sample, and the set of OTUs having large counts (blue) vary across samples. Due to such sparsity in data, borrowing strength across OTUs through joint analysis of all OTUs is crucial for improved inference. Recently, various statistical methods including Romero et al. (2014), Chen and Li (2016), Gibbons et al. (2017), and Zhang et al. (2017) have been developed for microbiome studies using NGS data. For example, Zhang et al. (2017) used a negative binomial mixed regression model to study interactions between the microbiome and host environmental/clinical factors. Random effects are used to induce correlation among samples from a group. Common to most of recent methods including Zhang et al. (2017) is separately analyzing each OTU at a time.

**Figure 1**. Ocean microbiome data. **(A)** Heatmap of OTU counts (*y*_{ti,k, j}). OTU and samples are in rows and columns, respectively. OTU counts are rescaled within a sample for better illustration. **(B)** 55 time points where ocean microbiome samples were collected are marked on the X-axis and the number of dots at a time point represents the number of replicates (*K*_{i}) at the time point.

We develop a Bayesian semiparametric generalized linear regression model to study the effects of physical and biological factors on abundance of microbes. The proposed method performs mode-based normalization through a hierarchical model, which enables direct modeling of OTU counts. Furthermore, the hierarchical model facilitates borrowing strength between OTUs, between samples, and between time points through joint analysis and improves inference on the effects of covariates **X** on OTU abundance which are the parameters of our primary interest. Specifically, a negative binomial (NB) distribution parameterized by a mean parameter μ and an overdispersion parameter *s* is assumed for OTU counts. The NB distribution flexibly accommodates overdispersion often seen in NGS data and is commonly used as a robust alternative to a Poisson distribution (Anders and Huber, 2010). The expected count μ of an OTU is decomposed as a product of factors, a baseline mean count *g* and a nonnegative function η(**X**) of covariates that describes their effects on the mean count. We use the log link function for η(**X**) and assume that change in a covariate has a multiplicative effect on mean count, where the associated coefficient quantifies the size and direction of the effect. We consider a Laplace prior for the coefficients, a shrinkage prior that is essential in a high dimensional regression setting. Shrinkage priors in regression yield sparse point estimates of the coefficients, where many of the coefficients have values close to zero and few have large values. The sparse estimates improve out-of-sample prediction and produce more interpretable models (Park and Casella, 2008). In addition, shrinkage priors such as a Laplace prior in a regression problem mitigate potential problems by multicollinearity and yield improved coefficient estimates when covariates are high-dimensional and potentially highly correlated (Polson and Scott, 2012). For baseline mean counts, we develop a nonparametric model to combine all OTUs for joint analysis. Baseline mean counts may vary across samples and OTUs. Also, as in our motivating data for which samples were taken over time, there may be temporal dependence in baseline mean counts. To tackle the problem, we further decompose the baseline count *g* into sample size factor (*r*), OTU size factor (α_{0}), and OTU and time factor (α_{t}), that is, *g* = *r* × α_{0} × α_{t}. Due to the overparametrization of the baseline mean abundance, individual factors are not identifiable. To avoid identifiability issues, we place the regularizing priors with mean constraints (Li et al., 2017) for sample size factor *r* and OTU size factor α_{0}. In addition, we model a temporal dependence structure between the baseline expected counts for an OTU through a convolutional Gaussian process (Higdon, 1998). The process convolution approach is often used as an alternative approach of the Gaussian process to construct a dependent process due to its efficient computation (Lee et al., 2005; Liang and Lee, 2014). Through simulation studies, we show that estimates of individual parameters *r*, α_{0}, and α_{t} are not fully interpretable under the proposed model, but baseline mean counts *g* are identifiable. The model also provides a posterior distribution of *g* for uncertainty quantification.

The rest of the paper is organized as follows. In section 2 we describe the proposed model and discuss the prior formulations and the resulting posterior inference. We perform simulation studies to assess the proposed model and perform comparison with an existing method that analyzes one OTU at a time. We then apply the proposed model to an ocean microbiome dataset. Section 3 presents the performance of the proposed model from the simulation experiment and the ocean microbime data. Section 4 concludes the paper with a discussion on limitations and possible extensions.

## 2. Materials and Methods

### 2.1. Bayesian Semiparametric Regression Model

Suppose that samples are taken at *n* different time points, 0 ≤ *t*_{i} ≤ *T*, *i* = 1, …, *n*, and with *K*_{i} replicates at time point *t*_{i}. We consider count *y*_{ti,k, j} of OTU *j* in replicate *k* taken at time *t*_{i}, where *i* = 1, …, *n*, *k* = 1, …, *K*_{i}, and *j* = 1, …, *J*. A sample is thus indexed by *t*_{i} and *k*. We let the total number of samples $N=\sum _{i=1}^{n}{K}_{i}$. Let **Y** = [*y*_{ti,k, j}] denote the *N*×*J* matrix of counts, where *y*_{ti,k, j} is integer-valued and nonnegative. Also, suppose that covariates ${\text{X}}_{{t}_{i}}={({X}_{{t}_{i},1},\dots ,{X}_{{t}_{i},P})}^{\prime}$ are recorded at time *t*_{i}. For example, covariates are physical and biological factors in our motivating data.

#### 2.1.1. Sampling Model

Count data by NGS methods is often modeled through a Poisson distribution. The assumption under the Poisson distribution that the variance is equal to the mean is often too restrictive to accommodate overdispersion that variation in data exceeds the mean. The negative binomial (NB) distribution is a popular and convenient alternative to address the overdispersion problem and is widely recognized as a model that provides improved inference to NGS count data (for example, see Robinson and Smyth, 2007; Anders and Huber, 2010). A NB distribution can be characterized by mean and overdispersion parameters. We suppress index *i* for simpler notation and assume a NB model for count *y*_{t,k,j} of OTU *j* in replicate *k* at time *t*,

where mean count μ_{t,k,j} > 0 and overdispersion parameter *s*_{j} > 0. The model in Equation (1) implies that count of OTU *j* in replicate *k* at time *t* has mean E(*y*_{t,k,j} ∣ μ_{t,k,j}) = μ_{t,k,j} and variance $\mathrm{\text{Var}}({y}_{t,k,j}\mid {\mu}_{t,k,j},{s}_{j})={\mu}_{t,k,j}+{\mu}_{t,k,j}^{2}{s}_{j}$. The model allows different dispersion levels across OTUs through OTU-specific overdispersion parameters *s*_{j}. In the limit as *s*_{j} → 0, the model in Equation (1) yields the Poisson distribution with mean μ_{t,k,j}. We assume a gamma distribution for a prior distribution of *s*_{j}, ${s}_{j}\stackrel{iid}{~}\mathrm{\text{Ga}}({a}_{s},{b}_{s})$, *j* = 1, …, *J*, with fixed *a*_{s} and *b*_{s}.

#### 2.1.2. Model for Regression

We next model the mean count μ_{t,k,j} of *y*_{t,k,j}. We decompose the mean count into factors, a baseline mean count and a function of covariates, μ_{t,k,j} = *g*_{t,k,j}η_{j}(**X**_{t}). Here parameter *g*_{t,k,j} denotes the baseline mean abundance of OTU *j* in sample (*t, k*) and η_{j}(**X**_{t}) is a function of covariates **X**_{t} for OTU *j* to model the covariate effects. We construct a generalized regression model by letting $log({\eta}_{j}({\text{X}}_{t}))={\text{X}}_{t}^{\prime}{\beta}_{j}$, where ${\beta}_{j}={({\beta}_{j1},\dots ,{\beta}_{jP})}^{\prime}$ is a *P*-dimensional vector of regression coefficients of OTU *j* (Lawless, 1987; McCullagh and Nelder, 1989). The coefficient β_{j, p} quantifies the effect of covariate *p* *X*_{p} on the mean abundance of OTU *j*. A vector β_{j} close to the zero vector produces a value of η_{j}(**X**_{t}) close to 1, and the mean count remains similar to the baseline mean count *g*_{t,k,j}, implying insignificant covariate effects. A negative (positive) of β_{j, p} implies a negative (positive) association between mean counts and the *p*-th covariate, and a larger value of *X*_{j,p} decreases (increases) the mean count, while holding the other covariates constant. We consider a Laplace prior on β_{j}. Specifically, we express the Laplace distribution as a scale mixture of normals and assume for *j* = 1, …, *J* and *p* = 1, …, *P*,

where *a*_{λ}, *b*_{λ}, *a*_{σ}, and *b*_{σ} are fixed. ${\sigma}_{j}^{2}$ and ϕ_{j,p} denote the global and local shrinkage parameters, respectively, for OTU *j*. After integrating ϕ_{j,p} out, the prior distribution of β_{j,p} is the Laplace distribution with location parameter 0 and scale parameter $\sqrt{{\sigma}_{j}^{2}}/{\lambda}_{j}$, that is, $p({\beta}_{j,p}\mid {\lambda}_{j}^{2},{\sigma}_{j}^{2})\propto $ $exp\left(-{\lambda}_{j}|{\beta}_{j,p}|/\sqrt{{\sigma}_{j}^{2}}\right)$. Compared to a normal distribution that is a common choice for the prior of β_{j,p}, the Laplace distribution has more concentration around zero but allows heavier tails. The regularized regression through the Laplace prior more shrinks the coefficients of insignificantly related covariates into zero and less pulls the coefficients of important covariates toward zero. Shrinkage of β estimates through the model in Equation (2) mitigates possible issues due to multicollinearity and efficiently improves estimation of β in a high dimensional setting (Polson and Scott, 2012).

#### 2.1.3. Model for Baseline Mean Count

We next build a prior probability model for the baseline mean count *g*_{t,k,j} of OTU *j* in sample (*t, k*). We assume *g*_{t,k,j} = *r*_{t,k}α_{0, j}α_{t, j} to separate sample (*r*_{t,k}), OTU (α_{0, j}), and OTU-time (α_{t, j}) factors. Sample total counts ${y}_{t,k,\xb7}=\sum _{j=1}^{J}{y}_{t,k,j}$ may greatly differ for different samples possibly due to experimental artifacts. For example, counts of an OTU even in the replicates taken at a time point may vastly differ. Sample specific size factors *r*_{t,k} account for different total counts in different samples and expected counts normalized by *r*_{t,k} are comparable across samples. Factor α_{0, j} explains variabilities in baseline mean abundances of OTUs and α_{t,j} models temporal dependence of the mean counts for an OTU, respectively. Factors α_{0, j} and α_{t,j} are not indexed by replicate *k* and account for stochastic change over time in normalized baseline expected counts of OTU *j*. Collecting all, we write the mean count as

The model for *g*_{t,k,j} in Equation (3) is overparameterized and the individual parameters are not identifiable. To avoid potential identifiability issues, many of NB models rely on some form of approximation for the baseline mean counts. For example, one may find the maximum likelihood estimates (MLEs) of baseline mean abundance under some constraints and plug in those estimates to infer the mean abundance levels μ_{ti, j} of OTUs (Witten, 2011). Plugging in MLEs is simple but may not be robust. In particular, the inference is greatly affected by a small change in a few OTUs that have large counts. Moreover, the errors introduced in the baseline mean count estimation will not be reflected in the inference. Several approaches to robustify the estimates are proposed (for example, see Anders and Huber, 2010; Witten, 2011). To circumvent the identifiability issue and provide uncertainty quantification for estimation of *g*_{t,k,j}, we take an alternative in Li et al. (2017) by imposing regularizing priors with mean constraints for *r*_{t,k} and α_{0, j}. We let the logarithm of the factors ${\stackrel{~}{r}}_{t,k}=log({r}_{t,k})$ and ${\stackrel{~}{\alpha}}_{0,j}=log({\alpha}_{0,j})$, and assume the regularizing prior distribution with mean constraints,

where ϕ(η, *v*^{2}) is the probability density function of the normal distribution with mean η and variance *v*^{2}, constraints for the mixture weights $\sum _{\ell =1}^{{L}^{r}}{\psi}_{\ell}^{r}=\sum _{\ell =1}^{{L}^{\alpha}}{\psi}_{\ell}^{\alpha}=1$ with $0<{\psi}_{\ell}^{r}<1$ and $0<{\psi}_{\ell}^{\alpha}<1$, $0<{w}_{\ell}^{r}<1$, and $0<{w}_{\ell}^{\alpha}<1$ for all ℓ. Mixture models as in Equation (4) are often used as a basis to approximate any distribution. Each component in Equation (4) is further composed of a mixture of two normals, $\mathrm{\text{N}}({\eta}_{\ell},{v}^{2})$ and $\mathrm{\text{N}}\left(\frac{(c-{w}_{\ell}{\eta}_{\ell})}{(1-{w}_{\ell})},{v}^{2}\right)$ with weights *w*_{ℓ} and 1 − *w*_{ℓ}, respectively, and the mean of the component is *c*. In consequence, the prior and posterior of $\stackrel{~}{r}$ and $\stackrel{~}{\alpha}$ under the model in Equation (4) satisfy their prespecified mean constraints *c*_{r} and *c*_{α}, respectively. Li et al. (2017) showed that the model in Equation (4) flexibly accommodates various features in a distribution such as skeweness or multi-modality while satisfying the constraints. Furthermore, the model based normalization through Equation (4) enables joint analysis of all OTUs and can further improve estimation of the covariate effects. With the regularizing priors, baseline mean counts *g*_{t,k,j} are identifiable, while *r*_{t,k}, α_{0, j}, and α_{t,j} are not directly interpretable. More importantly, the parameters of primary interest η_{j}(**X**_{t}) can be uniquely estimated and β_{j,p}'s keep their interpretation as parameters that quantify the effects of covariates on mean counts. We used an empirical approach to fix the mean constraints *c*_{r} and *c*_{α}. Sensitivity analyses were conducted to assess the robustness to the specification of *c*_{r} and *c*_{α} and show that the model provides reasonable estimates of *g*_{t,k,j} and moderate changes in the values of *c*_{r} and *c*_{α} minimally change the estimates. More details of the specification of *c*_{r} and *c*_{α} are discussed in section 3.1. We fix the numbers of mixture components, *L*^{r} and *L*^{α} and variances ${v}_{r}^{2}$ and ${v}_{\alpha}^{2}$. We let ${\eta}_{\ell}^{r}\stackrel{iid}{~}\mathrm{\text{N}}({c}_{r},{\omega}_{r}^{2})$ and ${\eta}_{\ell}^{\alpha}\stackrel{iid}{~}\mathrm{\text{N}}({c}_{\alpha},{\omega}_{\alpha}^{2})$, where ${\omega}_{r}^{2}$ and ${\omega}_{\alpha}^{2}$ are fixed. We assume ${\psi}^{r}=({\psi}_{1}^{r},\dots ,{\psi}_{{L}^{r}}^{r})~\mathrm{\text{Dir}}({d}_{r},\dots ,{d}_{r})$ and ${\psi}^{\alpha}=({\psi}_{1}^{\alpha},\dots ,{\psi}_{{L}^{\alpha}}^{\alpha})~\mathrm{\text{Dir}}({d}_{\alpha},\dots ,{d}_{\alpha})$, with fixed *d*_{r} and *d*_{α}. We let ${w}_{\ell}^{r}\stackrel{iid}{~}\mathrm{\text{Be}}({a}_{r},{b}_{r})$, ℓ = 1, …, *L*^{r} and ${w}_{\ell}^{\alpha}\stackrel{iid}{~}\mathrm{\text{Be}}({a}_{\alpha},{b}_{\alpha})$, ℓ = 1, …, *L*^{α} with fixed *a*_{r}, *b*_{r}, *a*_{α}, and *b*_{α}.

Recall that samples are collected over time points *t*_{1}, …, *t*_{n} in [0, *T*] and α_{t,j} accounts for temporal dependence in the baseline mean count for an OTU. We let ${\stackrel{~}{\alpha}}_{t,j}=log({\alpha}_{t,j})$ a function in time *t* and use a stochastic process to model temporal dependence among μ_{t,k,j}. The Gaussian process (GP) is one of the most popular stochastic models for the underlying process in spatial and spatio-temporal data (for example, see Cressie, 1992; Banerjee et al., 2014 among many others). The GP effectively represents the underlying phenomenon in a variety of applications, but it has some drawbacks such as a complex computation that requires a matrix decomposition and problematic estimation of the parameters in its covariance function, potentially leading to difficulties in exploring the posterior distribution (Lee et al., 2005; Liang and Lee, 2014). To alleviate such difficulties of GP models while still maintaining their flexibility and adaptiveness, we use a convolution approach with a kernel function developed in Higdon (1998, 2002). For each OTU, we specify the latent process **θ**_{j}(*t*) to be nonzero only at the time points *u*_{1}, …, *u*_{M} in [0, *T*]. Specifically, we consider the GP convolution model,

where {*u*_{1}, …, *u*_{M}} a set of basis points in $\left[-{t}_{1}^{\prime},T+{t}_{2}^{\prime}\right]$ with ${t}_{1}^{\prime},{t}_{2}^{\prime}>0$, and *Z*(*t* − *u*_{m}) a Gaussian kernel centered at *u*_{m}, $Z(t-{u}_{m})=\frac{1}{\sqrt{2\pi {\gamma}^{2}}}exp\{\frac{-{(t-{u}_{m})}^{2}}{2{\gamma}^{2}}\}$. The number of basis points *M*, their locations *u*_{m} and the range parameter γ can be treated as random variables by placing prior distributions, e.g., consider a gamma prior for γ. For simplicity, we fix them as follows. We first choose a value for *M* and let *u*_{m} evenly spaced over time $\left[-{t}_{1}^{\prime},T+{t}_{2}^{\prime}\right]$. Following Xiao (2015), we let the range parameter ${\gamma}^{2}={((2T+{t}_{1}^{\prime}+{t}_{2}^{\prime})/M)}^{2}$, that is, the range parameter depends on the value of *M*. Through simulations, we studied the impact of different values of *M* on the posterior inference of *g*_{t,k,j}. A discussion is included in section 3.1. Given the number of basis points *M*, we assume ${\theta}_{m,j}\mid {\tau}_{j}^{2}\stackrel{indep}{~}\mathrm{\text{N}}(0,{\tau}_{j}^{2})$ and ${\tau}_{j}^{2}\stackrel{iid}{~}\mathrm{\text{IG}}({a}_{\tau},{b}_{\tau})$, *m* = 1, …, *M* and *j* = 1, …, *J*.

We implement posterior inference on the parameters $\tilde{\theta}=({\beta}_{j},{\sigma}_{j}^{2},{\lambda}_{j}^{2},{\varphi}_{j,p},{\tilde{r}}_{t,k},{\psi}^{r},{w}_{\ell}^{r},{\eta}_{\ell}^{r},{\tilde{\alpha}}_{0,j},\text{}{\tilde{\alpha}}_{t,j},{\psi}^{\alpha},{w}_{\ell}^{\alpha},{\eta}_{\ell}^{\alpha},\theta j,{\tau}_{j}^{2},{s}_{j})$ via a Markov chain Monte Carlo (MCMC) method based on Metropolis-Hastings algorithm and Gibbs sampling. Each of the parameters is iteratively updated conditional on the currently computed values of all other parameters to simulate a sample from the posterior distribution. The parameters $\stackrel{~}{r}$ and ${\stackrel{~}{\alpha}}_{0}$ jointly determine baseline mean counts and joint updating of $\stackrel{~}{r}$ and ${\stackrel{~}{\alpha}}_{0}$ may greatly improve the mixing. In our ocean microbiome data, some discretized covariates are missing. We treat them as random variables by assuming a uniform distribution over possible categories, and impute their values in MCMC simulation. Full details of our MCMC algorithm are given in Supplementary section 1. We diagnose convergence and mixing of the described posterior MCMC simulation using trace plots and autocorrelation plots of imputed parameters. For the upcoming simulation examples and the data analysis, we found no evidence of practical convergence problems. An R package of the code used for simulations and the analysis of the ocean microbiome dataset in the following sections is available from the authors website https://users.soe.ucsc.edu/~juheelee/.

### 2.2. Simulation Experiment: Data Generation and Comparative Study

We conducted simulation studies to assess the performance of our model. We compared the model to an alternative model, the negative binomial mixed model (NBMM) in Zhang et al. (2017). We assumed a sample of *J* = 200 OTUs. We used the same time points (*t*_{i}) and numbers of replicates (*K*_{i}) of our ocean microbiome data as shown in Figure 1B. We let ${\beta}_{j,p}^{\mathrm{\text{TR}}}=0$ with probability 0.85. For ${\beta}_{j,p}^{\mathrm{\text{TR}}}\ne 0$ we simulated ${\beta}_{j,p}^{\mathrm{\text{TR}}}$ from either of N(−1.5, 0.05^{2}) or N(1.5, 0.05^{2}) with equal probability, where N(*a, b*^{2}) denotes the normal distribution with mean *a* and variance *b*^{2}. It implies that a covariate has no effect on OTU abundance with probability 0.85 or may significantly affect mean abundance with the remaining probability 0.15. To specify ${r}_{t,k}^{\mathrm{\text{TR}}}$ and ${\alpha}_{0,j}^{\mathrm{\text{TR}}}$, we did not assume any distribution and used their classical estimates from our ocean microbiome data; following Witten (2011), we first computed estimates of sample size factors ${r}_{{t}_{i},k}^{\prime}$ and OTU size factors ${\alpha}_{0,j}^{\prime}$ using the ocean microbiome data, ${r}_{{t}_{i},k}^{\prime}={y}_{{t}_{i},k,\xb7}/{y}_{\xb7\xb7\xb7}$ and ${\alpha}_{0,j}^{\prime}=\frac{1}{N}\sum _{i=1}^{n}\sum _{k=1}^{{K}_{i}}{y}_{{t}_{i},k,j}/{r}_{{t}_{i},k}^{\prime}$ where ${y}_{{t}_{i},k,\xb7}=\sum _{j=1}^{J}{y}_{{t}_{i},k,j}$ and ${y}_{\xb7\xb7\xb7}=\sum _{j=1}^{J}{y}_{\xb7\xb7,j}$. We then randomly sampled from the pool of ${r}_{t,k}^{\prime}$ and ${\alpha}_{0,j}^{\prime}$ to specify the true values. To simulate temporal dependence in OTU abundance, we let ${\stackrel{~}{\alpha}}_{{t}_{i},j}^{\mathrm{\text{TR}}}={a}_{{t}_{i},j}cos(2\pi ({\stackrel{~}{t}}_{i}-{b}_{{t}_{i},j}))+{c}_{{t}_{i},j}{({\stackrel{~}{t}}_{i}-{\stackrel{~}{t}}^{\star})}^{2}.$ Here ${\stackrel{~}{t}}_{i}$ denotes time *t*_{i} in year and ${\stackrel{~}{t}}^{\star}$ the median of ${\stackrel{~}{t}}_{i}$. We let ${a}_{t,j}\stackrel{iid}{~}\mathrm{\text{N}}(0.15,0.{1}^{2})$, ${b}_{t,j}\stackrel{iid}{~}\mathrm{\text{N}}(0,0.{5}^{2})$, and ${c}_{t,j}\stackrel{iid}{~}\mathrm{\text{N}}(0.1,0.{1}^{2})$ to have different patterns for OTUs. For some OTUs, ${\stackrel{~}{\alpha}}_{{t}_{i},j}^{\mathrm{\text{TR}}}$ are illustrated in red squares in Figures 4E–G. We generated ${s}_{j}^{\mathrm{\text{TR}}}\stackrel{iid}{~}\mathrm{\text{Ga}}(1,10)$. We used the covariate matrix of the ocean microbiome data illustrated in Figure 2 for the simulation study. For the missing covariates in the data, we generated a value of possible categories with equal probability. We finally simulated OTU counts *y*_{ti,k, j} from the negative binomial distribution ${y}_{{t}_{i},k,j}\stackrel{indep}{~}\mathrm{\text{NB}}({\mu}_{{t}_{i},k,j}^{\mathrm{\text{TR}}},{s}_{j}^{\mathrm{\text{TR}}}),$ where ${\mu}_{{t}_{i},k,j}^{\mathrm{\text{TR}}}={r}_{{t}_{i},k}^{\mathrm{\text{TR}}}{\alpha}_{0,j}^{\mathrm{\text{TR}}}exp({\stackrel{~}{\alpha}}_{{t}_{i},j}^{\mathrm{\text{TR}}}+{\text{X}}_{t}^{\mathrm{\text{TR}}}{\beta}_{j}^{\mathrm{\text{TR}}})$.

**Figure 2**. Ocean microbiome data. Bar plots of discretized covariates, concentration levels of *Alexandrium* (Ax, *X*_{1}) and *Dinophysis* (Dp, *X*_{2}), Pseudo-nitzchia (Pn, *X*_{3}), domoic acid (DA, *X*_{4}) in **(A–D)**. The values of −1, 0, 1, 2, 3, and 4 represent a missing value, none and low, medium, high, and highest concentration levels, respectively. Histograms of continuous covariates, concentration levels of ammonia (NH_{4}, *X*_{5}), nitrate (N, *X*_{6}), phosphate (P, *X*_{7}), silicate (Si, *X*_{8}), water temperature (T, *X*_{9}), concentration level of chlorophyll (Chl, *X*_{10}) in **(E–J)**. The variables are standardized to have mean 0 and variance 1 prior to analysis. A scatterplot of the concentrations of ammonia and silicate and a scatterplot of the concentrations of phosphate and silicate are shown in **(K,L)**, respectively.

For comparison, we used the negative binomial mixed model (NBMM) in Zhang et al. (2017). Similar to the proposed model, the NBMM uses a negative binomial distribution with mean μ^{NBMM} and shape parameter θ^{NBMM} to model OTU counts and assumes $\mathrm{log}({\mu}_{t,k,j}^{\text{NBMM}})=\mathrm{log}(yt,k,\xb7)+{\beta}_{0,j}^{\text{NBMM}}+t{\beta}_{j}^{\text{NBMM}}+{Z}_{t,k}{b}_{j}^{\text{NBMM}}$ where **X**_{t} and **Z**_{t,k} are the covariate matrices for fixed effects and random effects, respectively. It assumes random effects ${b}_{j}^{\mathrm{\text{NBMM}}}~\mathrm{\text{N}}(0,\text{\Psi})$. By letting the replicates at a time point share the same random effect, OTU abundances in the replicates at a time point are correlated. The NBMM normalizes OTU counts by sample total counts. That is, sample total counts *y*_{t,k, ·} are used as an offset to adjust for the variability in total counts across samples. Similar to other existing methods, the NBMM performs separate analyses of OTUs. An iterative weighted least squares algorithm is developed to produce the MLEs under the NBMM and implemented in a R function *glmm* in R package *BhGLM*.

### 2.3. Ocean Microbiome Data: Data Description and Preprocessing

We applied the proposed statistical method to ocean microbiome data. Seawater samples were collected weekly at the end of Santa Cruz Municipal Wharf (SCW), Monterey Bay (36.958 ^{o}N, −122.017 ^{o}W), with an approximate depth of 10 m. SCW is one of the ocean observing sites in Central and Northern California (CenCOOS), where harmful algal bloom species [HAB species: *Alexandrium* (Ax), *Dinophysis* (Dp), *Pseudo-nitzschia* (Pn) etc.] are monitored weekly along with nutrient measurements [ammonia (NH_{4}), silicate (Si), nitrate (N), phosphate (P)], temperature (T), domoic acid (DA) concentration, and chlorophyll (Chl). Details of phytoplankton net tow sampling of measuring phytoplankton abundance, measurement of physical (nutrients and temperature) and biological parameters (chlorophyll α and DA concentration) are described in Sison-Mangus et al. (2016). *Pseudo-nitzschia, Dinophysis*, and *Alexandrium* cells were counted with a Sedgewick rafter counter under the microscope. Data for physical and biological factors are available from the website link http://www.sccoos.org/query/?project=Harmful%20Algal%20Blooms&study[]=Santa%20Cruz%20Wharf. Among the 10 variables, the concentration levels of *Alexandrium, Dinophysis, Pseudo-nitzschia*, and domoic acid have highly right-skewed distributions and are discretized into categories based on their biological properties for our analysis. The ranges of the concentration levels for the discretization are in Supplementary Table 1 and Figures 2A–J illustrates all covariates included for analysis. The values of −1, 0, 1, 2, 3, and 4 represent missing values and the categories of None, Low, Medium, High, and Very High for the discretized variables, respectively. Due to high right skeweness, categories corresponding to high concentration levels have low frequencies. Values of the *Dinophysis* concentration level are missing at 20 time points among 55 points used for analysis. Sample correlations between the factors are relatively strong. Figures 2K,L shows scatterplots for some selected pairs of the factors.

For bacterial RNA samples, three depth-integrated (0, 5, and 10 ft) water samples were collected at a total of 55 time points between April 2014 and November 2015. Two or three samples are sequenced at each time point. The numbers of replicates at the time points are illustrated in Figure 1B. Microbial RNA in the samples was extracted for 16S rRNA sequencing. Post-processing of sequences was performed using the Quantitative Insights Into Microbial Ecology (QIIME 1.9.1) pipeline. A total of nearly 39,823 OTUs were obtained in data after removing singletons. We restricted our attention to OTUs that have greater than or equal to five counts on average. The rule leaves in the end *J* = 263 OTUs for the 150 samples for the analysis. A heatmap of the counts in the filtered data is shown in Figure 1. The primary goal of the study is to investigate the effects of physical and biological factors on abundance levels of OTUs, while accounting for baseline abundance levels of OTUs in samples.

## 3. Results

### 3.1. Simulation Experiment: Model Fitting and Comparison

To fit the proposed model for the simulated data designed in section 2.2, we specified hyperparameter values of the model as follows; for the Laplace prior of β_{j,p}, we let *a*_{λ} = *b*_{λ} = 0.5 for a gamma prior of ${\lambda}_{j}^{2}$ (with mean *a*_{λ}/*b*_{λ} and variance ${a}_{\lambda}/{b}_{\lambda}^{2}$) and *a*_{σ} = *b*_{σ} = 0.3 for an inverse gamma prior for common variances ${\sigma}_{j}^{2}$. For the regularizing priors of ${\stackrel{~}{r}}_{{t}_{i},k}$ and ${\stackrel{~}{\alpha}}_{0,j}$, we fixed *d*_{α} = *d*_{r} = 10, *a*_{r} = *b*_{r} = *a*_{α} = *b*_{α} = 1, ${\omega}_{r}^{2}={\omega}_{\alpha}^{2}=1.0$, ${v}_{r}^{2}=1$, and ${v}_{\alpha}^{2}=2.0$. We also fixed the number of mixture components for the regularizing priors *L*_{r} = 30 and *L*_{α} = 50. To specify values of the mean constraints *c*_{r} and *c*_{α}, we took an empirical approach. We used the simulated *y*_{ti,k, j}, computed estimates of *r*_{ti,k, j} and α_{j, 0} as described in section 2.2 and fixed the mean constraints at the means of the logarithm of the estimates, respectively. Note that the specified values of *c*_{r} and *c*_{α} were very different from the means of their true values. For the process convolution prior of OTU-time factor ${\stackrel{~}{\alpha}}_{{t}_{i},j}$, we chose a value of *M* such that the kernel function at a basis point is not entirely located in a place where no sample is obtained. We let the number of basis *M* = 13 and basis *u*_{m}, *m* = 1, …, *M* evenly spaced between −10 and *T*_{i} + 10. For overdispersion parameter *s*_{j} we let *a*_{s} = 1 and *b*_{s} = 2. To run MCMC simulation, we initialized the parameters by simulating with their prior distributions. We then implemented posterior inference using MCMC simulation over 25,000 iterations, discarding the first 10,000 iterations as burn-in and choosing every other sample as thinning.

Figure 3 illustrates the comparison of posterior estimates of β_{j,p} to their true values ${\beta}_{j,p}^{\mathrm{\text{TR}}}$ for some selected covariates. In the figure, dots and blue dashed lines represent posterior means ${\widehat{\beta}}_{j,p}$ of β_{j,p} and their 95% credible intervals, respectively. ${\widehat{\beta}}_{j,p}$s are around the 45 degree line (red dotted line) for most of (*j,p*) and most of the interval estimates captures the true values. It implies that the proposed model reasonably recovers ${\beta}_{j,p}^{\mathrm{\text{TR}}}$. For categories 3 and 4 of *X*_{4} in Figures 3I,J, the credible intervals tend to be wider due to their low frequencies in the data as shown in Figure 2D. The insert plot in each panel illustrates a scatter plot of ${\widehat{\beta}}_{j,p}$ for (*j,p*) with ${\beta}_{j,p}^{\mathrm{\text{TR}}}=0$. It shows that the proposed regression model with the Laplace prior effectively shrinks β_{j,p} with ${\beta}_{j,p}^{\mathrm{\text{TR}}}=0$ to zero, as is desired in our simulation setup. Supplementary Figure 1 has plots for all covariates.

**Figure 3**. Simulation 1—proposed model. Comparison of the true values ${\beta}_{j,p}^{\mathrm{\text{TR}}}$ and the posterior estimates of β_{j,p} under the proposed model for some selected covariates. Dots and blue dashed lines represent estimates of posterior means ${\widehat{\beta}}_{j,p}$ of β_{j,p} and 95% credible intervals (CIs) of β_{j,p}, respectively. The insert plot in each panel is a scatter plot of ${\widehat{\beta}}_{j,p}$ and ${\beta}_{j,p}^{\mathrm{\text{TR}}}$ for (*j,p*) with ${\beta}_{j,p}^{\mathrm{\text{TR}}}=0$.

Figures 4A–C illustrate plots of ${g}_{t,k,j}^{\mathrm{\text{TR}}}$ vs estimates of *g*_{t,k,j} with their means (black dots) and 95% credible intervals (blue vertical lines) for some selected OTUs, *j* = 8, 34, and 48. Recall that we do not attempt to recover the true values of individual *r*_{ti,k}, α_{0, j}, and α_{t,j}, but we rather aim to reasonably recover the true baseline mean counts, ${g}_{{t}_{i},k,j}^{\mathrm{\text{TR}}}={r}_{{t}_{i},k}^{\mathrm{\text{TR}}}{\alpha}_{0,j}^{\mathrm{\text{TR}}}exp({\stackrel{~}{\alpha}}_{{t}_{i},j}^{\mathrm{\text{TR}}})$. In the figure the estimates are tightly around the 45 degree line, providing evidence that reasonable estimates of baseline mean counts are obtained under the proposed model. Figure 4D has a histogram of averaged differences between baseline mean count estimates and their true values, ${D}_{j}=\sum _{i=1}^{n}\sum _{k=1}^{{K}_{i}}({\u011d}_{{t}_{i},k,j}-{g}_{{t}_{i},k,j}^{\mathrm{\text{TR}}})/N$. The averaged differences are around zero, implying that the proposed model provides reasonable estimates of baseline mean counts for most of OTUs. We further examined individual parameters. Figures 4E–G shows the comparison of estimates of ${\stackrel{~}{\alpha}}_{0,j}+{\stackrel{~}{\alpha}}_{t,j}$ to their true values over time for the same OTUs in Figures 4A–C. Black dots and blue vertical lines represent estimates of posterior means of ${\stackrel{~}{\alpha}}_{0,j}+{\stackrel{~}{\alpha}}_{t,j}$ and their 95% credible intervals, respectively. Red squares represent their true values. From the figure, the estimates of ${\stackrel{~}{\alpha}}_{0,j}+{\stackrel{~}{\alpha}}_{t,j}$ are consistently greater than their true values at all time points, but capture their overall temporal trend. Figure 4H illustrates a scatterplot of ${\stackrel{~}{r}}_{t,k}^{\mathrm{\text{TR}}}$ and their posterior estimates of ${\stackrel{~}{r}}_{t,k}$, where dots and blue vertical intervals denote estimates of posterior means and 95% credible intervals, respectively, and the gray horizontal line is at *c*_{r} used for analysis. Different from the estimates of ${\stackrel{~}{\alpha}}_{0,j}+{\stackrel{~}{\alpha}}_{t,j}$, the estimates of ${\stackrel{~}{r}}_{t,k}$ fall below the 45 degree line approximately by the same distance for all OTUs. It shows that estimates of ${\stackrel{~}{\alpha}}_{0,j}+{\stackrel{~}{\alpha}}_{t,j}$ and ${\stackrel{~}{r}}_{t,k}$ have discrepancies from their true values but in the opposite direction and the model can produce reasonable estimates of *g*_{t,k,j} as seen in Figures 4A–D. The true overdispersion parameters ${s}_{j}^{\mathrm{\text{TR}}}$ are reasonably well estimated as shown in Figure 4I. We check the posterior predictive distribution of *Y*_{t,k,j}. The posterior predicted values of *Y*_{ti,k, j} with their 95% predictive intervals for OTUs *j* = 8, 34, and 48 are compared to their observed values in Supplementary Figure 2. The figure indicates a reasonable model fit.

**Figure 4**. Simulation 1—proposed model. Panels **(A–C)** illustrate plots of the true baseline mean counts ${g}_{t,k,j}^{\mathrm{\text{TR}}}$ vs their estimates ĝ_{t,k,j} for some selected OTUs *j* = 8, 34, 48. Panel **(D)** shows a histogram of averaged differences between ${g}_{t,k,j}^{\mathrm{\text{TR}}}$ and ĝ_{t,k,j} for each OTU. Panels **(E–G)** show plots of estimates of ${\stackrel{~}{\alpha}}_{0,j}+{\stackrel{~}{\alpha}}_{t,j}$ over time for OTUs *j* = 8, 34, 48. Panel **(H)** has a scatterplot of ${\widehat{\stackrel{~}{r}}}_{t,k}$ vs. ${\stackrel{~}{r}}^{\mathrm{\text{TR}}}$. Panel **(I)** has a scatterplot of ŝ vs. *s*^{TR}. Dots represent posterior mean estimates and blue vertical dotted lines 95% credible intervals. Red squares represent the true values.

In addition, we conducted a sensitivity analysis to the specification of mean constraints *c*_{r} and *c*_{α} for the priors of $\stackrel{~}{r}$ and ${\stackrel{~}{\alpha}}_{0}$. We used different values for *c*_{r} and *c*_{α} and compared the estimates of *g*_{t,k,j} to their truth. Supplementary Figures 3a–c has histograms of averaged differences *D*_{j} between ĝ_{ti,k, j} and ${g}_{{t}_{i},k,j}^{\mathrm{\text{TR}}}$ for different specification of *c*_{r} and *c*_{α}. The histograms show minor change in estimates of *g*_{ti,k,j} under different specifications of *c*_{r} and *c*_{α}. An sensitivity analysis to the specification of the number *M* of basis points in the GP convolution model for ${\stackrel{~}{\alpha}}_{t,j}$ was also performed. We used *M* = 8, 13, and 18 and examined estimates of the baseline mean counts, *g*_{t,k,j}. Supplementary Figures 3a,d,e has histograms of averaged differences *D*_{j} for each of *M*. The results indicate that the baseline mean counts are reasonably estimated for a range of values of *M* in the simulation study.

For comparison, we used the NBMM to the simulated data. Since the NBMM does not accommodate missing covariates, we used **X**^{TR} to fit the NBMM. Figure 5 compares the MLEs ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ of β_{j,p} to the true values for the same covariates used in Figure 3. Dots and blue vertical lines represent the MLEs under the NBMM and their 95% confidence intervals, respectively. Comparing Figure 5 to Figure 3, the NBMM produces poor estimates. The MLEs are biased for some covariates (e.g., Figure 5A). Also, confidence intervals under the NBMM often fail to capture the true values and their interval estimates under the NBMM tend to be much wider than those under the proposed model. Normalization through observed sample total counts and inducing correlation in replicates through iid (independent and identically distributed) random effects under the NBMM may lead to poor estimation of the baseline mean abundance for the simulated data, resulting in deterioration of coefficient estimation. In addition, separate analyses of OTUs under the NBMM do not allow to strengthen estimates through combining information across OTUs. Comparing the insert plots in Figure 5 to those in Figure 3, ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ with ${\beta}_{j,p}^{\mathrm{\text{TR}}}=0$ tends to more widely spread out from zero and often their confidence intervals fail to capture zero. Supplementary Figure 4 has plots of β_{j,p} for all covariates. Supplementary Figures 4,5 shows the comparison of the estimates ${\widehat{\theta}}^{\mathrm{\text{NBMM}}}$ of overdispersion parameters under the NBMM to their true values. Note that θ^{NBMM} is the inverse of *s* in our model. The NBMM underestimates *s*_{j} for many OTUs, and yields poor predicted values, implying the lack of a fit.

**Figure 5**. Simulation 1—NBMM. Comparison of the true values ${\beta}_{j,p}^{\mathrm{\text{TR}}}$ and maximum likelihood estimates ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ of β_{j,p} under the negative binomial mixed model (NBMM) for some selected covariates. Dots and blue dashed lines represent ${\widehat{\beta}}^{\mathrm{\text{NBMM}}}$ and their 95% confidence intervals, respectively. The insert plot in each panel is a scatter plot of ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ and ${\beta}_{j,p}^{\mathrm{\text{TR}}}$ for (*j,p*) with ${\beta}_{j,p}^{\mathrm{\text{TR}}}=0$.

We further examined the performance of the proposed model through additional simulation studies, Simulations 2 and 3 in Supplementary section 2. In these simulations, we studied the robustness of the model when different simulation setups are used to simulate data. In Simulation 2, we assumed no temporal dependence among OTU abundance and generated independent samples from a normal distribution for ${\stackrel{~}{\alpha}}_{t,j}$. We assumed that all ${\beta}_{j,p}^{\mathrm{\text{TR}}}$ has nonzero effects for all OTUs and simulated β_{j,p} from a mixture of normals. The performance of our model is almost the same as in Simulation 1 (see Supplementary Figures 6–8). Interestingly, the NBMM that assumes iid random effects performs poorly for β estimation. In Simulation 3, we simulated ${\stackrel{~}{\alpha}}_{t,j}^{\mathrm{\text{TR}}}$ from a discontinuous function. The results show that when the temporal dependence pattern is not smooth as assumed for the GP, estimates of the baseline mean counts under the proposed model are slightly deteriorated but the model produces reasonable inference on β_{j,p} (see Supplementary Figures 9–11). A more detailed summary of the additional simulations is given in Supplementary section 2.

### 3.2. Ocean Microbiome Data: Model Fitting and Comparison

We specified hyperparameters similar to those in the simulations and analyzed the microbiome data in section 2.3. The MCMC simulation was run over 25,000 iterations. The first 15,000 iterations were discarded as burn-in and every other sample was kept as thinning and used for inference. Figure 6 illustrates inference on covariate effects for some selected OTUs, *j* = 16, 34, and 49, taxonomically belonging to *Alteromonadales, Halomonas* sp., and *Alteromonadales* in the Gamma-proteobacteria phyla, respectively. Dots and vertical solid lines represent the posterior mean ${\widehat{\beta}}_{j,p}$ and 95% credible interval estimates, respectively. Similar to the results of the simulation study, the credible intervals for high and highest levels of the discretized covariates tend to be wider due to their low frequencies in the data. From Figure 6A, on average the medium concentration level of domoic acid (DA, *X*_{4}) and the concentration level of nitrate (N, *X*_{6}) significantly decrease the mean abundance of OTU 16 by a multiplicative factor of exp(−0.572) = 0.564 and exp(−0.260) = 0.771, respectively. One may infer that the medium concentration level of domoic acid is significantly associated with lower expected counts for the OTU compared to those with category none of the domoic acid concentration level. A similar argument can be applied to the inference on the nitrate concentration level. Interestingly, we observed statistically significant reduction in abundance from many OTUs belonging to Gamma-proteobacteria including those OTUs for increasing domoic acid concentration (not shown). The resulting inference was further validated through a lab experiment. Most notably, four bacterial cultured isolates belonging to Gamma-proteobacteria (three among them are *Alteromonadales*) were observed to be severely retarded in growth after 2 days of exposure to increasing domoic acid of 0 to 150 μg/ml in the experiment (Sison-Mangus, unpublished data). This demonstrates that the proposed model successfully identifies important OTUs in ocean bacterial community dynamics for further investigation. More results are presented in Supplementary Section 3. Supplementary Figures 12a–c illustrates the posterior estimates of baseline expected counts ${\stackrel{~}{\alpha}}_{0,j}+{\stackrel{~}{\alpha}}_{t,j}$ normalized by sample size factors for the OTUs. From the figure, the baseline expected counts vary over time for those OTUs and the temporal pattern of OTU *j* = 34 is different from those of OTUs *j* = 16 and 49. Histograms of the posterior mean estimates ${\widehat{\beta}}_{j,p}$ of β_{j,p}, are illustrated in Supplementary Figure 13. The figure does not show clear overall tendency in the direction of association between covariates and OTU counts. Posterior inference on sample size factors *r*_{ti,k} and OTU specific overdispersion parameters *s*_{j} is illustrated in Supplementary Figures 12d,e.

**Figure 6**. Ocean microbiome data—proposed model. Posterior Inference on β_{j} for some selected OTUs (*j* = 16, 34, 49). Dots represent the posterior means ${\widehat{\beta}}_{j,p}$ of β_{j,p}. Each vertical line connects the lower bounds and the upper bounds of 95% credible intervals.

For comparison, we fitted the NBMM to the data. Since the NBMM does not account for missing values, we use the maximum a posteriori estimates of the missing values under the proposed for the NBMM. We used the R function *glmm* and the algorithm produced warning messages on convergence for 32 OTUs. Figure 7 illustrates ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ (dots) with their 95% confidence intervals (solid vertical lines) for OTUs *j* = 16, 34, and 49. Inference on the covariate effects is different from that under the proposed. For example, domoic acid (DA) levels do not have significant effects on the mean counts for OTU *j* = 16 and 49 from Figures 7A,C. Comparing Figures 7A,C to Figure 7, the NBMM produces wider interval estimates for β_{j,p}. Histograms of the MLEs of β_{j,p}, ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ under the NBMM are shown in Supplementary Figure 14. The histograms are much dispersed than those under the proposed model shown in Supplementary Figure 13. Estimates ${\widehat{\beta}}_{j,p}$ and ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ for all covariates are also compared in Supplementary Figure 15. From the figure, the NBMM yields extremely large or small values for ${\widehat{\beta}}_{j,p}$ for some OTUs, possibly due to the convergence problem. The insert plots show that for regions of small values of ${\widehat{\beta}}_{j,p}$, the estimates under the proposed are more shrunken toward zero than those under the NBMM, similar to the results in section 3.1. The overdispersion parameter estimates under the NBMM tend to be smaller than those under the proposed (shown in Supplementary Figure 12f), which may lead to different predictive distributions of OTU counts.

**Figure 7**. Ocean microbiome data—NBMM. Inference on β_{j} for some selected OTUs (*j* = 16, 34, 49) under the negative binomial mixed model (NBMM). Dots represent the maximum likelihood estimates ${\widehat{\beta}}_{j,p}^{\mathrm{\text{NBMM}}}$ of β_{j,p}. Each vertical line connects the lower bounds and the upper bounds of 95% confidence intervals.

## 4. Discussion and Conclusions

In this paper, we developed a Bayesian semiparametric regression model for joint analysis of microbiome data. We formulated the mean counts of OTUs as a product of factors and built models for the factors. We utilized the regularizing priors with mean constraints to avoid possible idenfiability issues, and the process convolution model to capture the temporal dependence structure in the baseline mean abundance of an OTU. The flexible model developed for baseline abundance enables joint analysis of all OTUs in the data and allows borrowing information across OTUs, across samples, and across time points. The model produces accurate estimates of the baseline mean counts and yields improved estimates of the effects of the covariates. We incorporated the Laplace distribution, a sparsity inducing shrinkage prior for the coefficients and the proposed model produces sparse estimates that is more desirable when the problem is high-dimensional and covariates are highly correlated. We compared the proposed model to a comparable frequentist model that does separate analyses for individual OTUs. The comparisons through the simulation study and real data analysis show better performance of the proposed model.

Although we focused on the analysis of NGS count data, the proposed model is quite general and can be applied for analyses of any count data. Future work will explore alternative approaches to model the effects of covariates on the mean counts. For example, one may consider a nonparametric model using linear combinations of basis functions (Kohn et al., 2001) to flexibly capture shape in the response function. In such a case, an elaborate development of the prior model may be needed to produce a robust inference since both the baseline mean counts and the covariate effects are nonparametrically modeled. Other possible extensions are to include a variable selection method such as a stochastic search variable selection (George and McCulloch, 1993) if it is reasonable to assume that some covariate effects are exactly zero, and to let coefficients vary over time if covariate effects evolve with time. For time varying coefficients, we may use the random walk process in Leybourne (1993) to induce relationship between β_{j,p,t−1} and β_{j,p,t}. Considering the high dimensionality in OTU data, posterior computation may need to be carefully handled. Also, prior information may be needed to produce sensible inference due to sparsity in data.

## Author Contributions

JL developed the statistical model and conducted simulation studies and data analysis. She also prepared the first draft and led the collaboration with MS-M for statistical analysis. MS-M provided the ocean microbiome data, participated the statistical model development, provided biological interpretation of the resulting inference and edited the manuscript.

## Funding

This work was supported by NSF grant DMS-1662427 (JL) and NOAA-ECOHAB PROGRAM (Grant No. NA17NOS4780183, ECOHAB #905) (MS-M and JL). Collection of environmental data from the Santa Cruz Municipal Wharf was supported by Cal-PReEMPT with funding from the NOAA-MERHAB (#206), ECOHAB and IOOS programs.

## Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Acknowledgments

We gratefully acknowledge Raphael Kudela who provided the environmental data in this study, Michael Kempnich and Sanjin Mehic for doing the water sampling and processing the water samples for DNA extraction.

## Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fmicb.2018.00522/full#supplementary-material

## References

Anders, S., and Huber, W. (2010). Differential expression analysis for sequence count data. *Genome Biol.* 11:R106. doi: 10.1186/gb-2010-11-10-r106

Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2014). *Hierarchical Modeling and Analysis for Spatial Data. 2nd Edn.* Boca Raton, FL: CRC Press; Chapman & Hall.

Chen, E. Z., and Li, H. (2016). A two-part mixed-effects model for analyzing longitudinal microbiome compositional data. *Bioinformatics* 32, 2611–2617. doi: 10.1093/bioinformatics/btw308

George, E. I., and McCulloch, R. E. (1993). Variable selection via gibbs sampling. *J. Am. Stat. Assoc.* 88, 881–889.

Gibbons, S. M., Kearney, S. M., Smillie, C. S., and Alm, E. J. (2017). Two dynamic regimes in the human gut microbiome. *PLoS Comput. Biol.* 13:e1005364. doi: 10.1371/journal.pcbi.1005364

Higdon, D. (1998). A process-convolution approach to modelling temperatures in the north atlantic ocean. *Environ. Ecol. Stat.* 5, 173–190.

Higdon, D. (2002). “Space and space-time modeling using process convolutions,” in *Quantitative Methods for Current Environmental Issues*, eds C. W. Anderson, V. Barnett, P. C. Chatwin, and A. H. El-Shaarawi (London: Springer), 37–56.

Kohn, R., Smith, M., and Chan, D. (2001). Nonparametric regression using linear combinations of basis functions. *Stat. Comput.* 11, 313–322. doi: 10.1023/A:1011916902934

Leybourne, S. J. (1993). Estimation and testing of time-varying coefficient regression models in the presence of linear restrictions. *J. Forecast.* 12, 49–62.

Lee, H. K., Higdon, D. M., Calder, C. A., and Holloman, C. H. (2005). Efficient models for correlated data via convolutions of intrinsic processes. *Stat. Model.* 5, 53–74. doi: 10.1191/1471082X05st085oa

Li, Q., Guindani, M., Reich, B., Bondell, H., and Vannucci, M. (2017). A bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints. *Stat. Anal. Data Mining* 10, 393–409. doi: 10.1002/sam.11350

Liang, W. W., and Lee, H. K. (2014). Sequential process convolution gaussian process models via particle learning. *Stat. Interface* 7, 465–475. doi: 10.4310/SII.2014.v7.n4.a4

McCullagh, P., and Nelder, J. A. (1989). *Generalized Linear Models, No. 37 in Monograph on Statistics and Applied Probability.* Boca Raton, FL: Chapman & Hall/CRC.

McMurdie, P. J., and Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. *PLoS Comput. Biol.* 10:e1003531. doi: 10.1371/journal.pcbi.1003531

Needham, D. M., and Fuhrman, J. A. (2016). Pronounced daily succession of phytoplankton, archaea and bacteria following a spring bloom. *Nat. Microbiol.* 1:16005. doi: 10.1038/nmicrobiol.2016.5

Park, T., and Casella, G. (2008). The bayesian lasso. *J. Am. Stat. Assoc.* 103, 681–686. doi: 10.1198/016214508000000337

Polson, N. G., and Scott, J. G. (2012). Local shrinkage rules, lévy processes and regularized regression. *J. R. Stat. Soc. Ser. B* 74, 287–311. doi: 10.1111/j.1467-9868.2011.01015.x

Robinson, M. D., and Smyth, G. K. (2007). Moderated statistical tests for assessing differences in tag abundance. *Bioinformatics* 23, 2881–2887. doi: 10.1093/bioinformatics/btm453

Romero, R., Hassan, S. S., Gajer, P., Tarca, A. L., Fadrosh, D. W., Nikita, L., et al. (2014). The composition and stability of the vaginal microbiota of normal pregnant women is different from that of non-pregnant women. *Microbiome* 2:4. doi: 10.1186/2049-2618-2-4

Sison-Mangus, M. P., Jiang, S., Kudela, R. M., and Mehic, S. (2016). Phytoplankton-associated bacterial community composition and succession during toxic diatom bloom and non-bloom events. *Front. Microbiol.* 7:1433. doi: 10.3389/fmicb.2016.01433

Turnbaugh, P. J., Hamady, M., Yatsunenko, T., Cantarel, B. L., Duncan, A., Ley, R. E., et al. (2009). A core gut microbiome in obese and lean twins. *Nature* 457, 480–484. doi: 10.1038/nature07540

Witten, D. M. (2011). Classification and clustering of sequencing data using a poisson model. *Ann. Appl. Stat.* 5, 2493–2518. doi: 10.1214/11-AOAS493

Woo, P., Lau, S., Teng, J., Tse, H., and Yuen, K.-Y. (2008). Then and now: use of 16s rDNA gene sequencing for bacterial identification and discovery of novel bacteria in clinical microbiology laboratories. *Clin. Microbiol. Infect.* 14, 908–934. doi: 10.1111/j.1469-0691.2008.02070.x

Xiao, S. (2015). *Bayesian Nonparametric Modeling for Some Classes of Temporal Point Processes*. Ph.D. thesis, University of California, Santa Cruz.

Keywords: count data, Laplace prior, metagenomics, microbiome, regularizing prior, process convolution, negative binomial model, 16S ribosomal RNA sequencing

Citation: Lee J and Sison-Mangus M (2018) A Bayesian Semiparametric Regression Model for Joint Analysis of Microbiome Data. *Front. Microbiol*. 9:522. doi: 10.3389/fmicb.2018.00522

Received: 15 November 2018; Accepted: 08 March 2018;

Published: 26 March 2018.

Edited by:

Michele Guindani, University of California, Irvine, United StatesReviewed by:

Yanxun Xu, Johns Hopkins University, United StatesMichael Pester, Deutsche Sammlung von Mikroorganismen und Zellkulturen (DSMZ), Germany

Copyright © 2018 Lee and Sison-Mangus. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Juhee Lee, juheelee@soe.ucsc.edu