Analysis of Microbiome Data in the Presence of Excess Zeros

Motivation: An important feature of microbiome count data is the presence of a large number of zeros. A common strategy to handle these excess zeros is to add a small number called pseudo-count (e.g., 1). Other strategies include using various probability models to model the excess zero counts. Although adding a pseudo-count is simple and widely used, as demonstrated in this paper, it is not ideal. On the other hand, methods that model excess zeros using a probability model often make an implicit assumption that all zeros can be explained by a common probability models. As described in this article, this is not always recommended as there are potentially three types/sources of zeros in a microbiome data. The purpose of this paper is to develop a simple methodology to identify and accomodate three different types of zeros and to test hypotheses regarding the relative abundance of taxa in two or more experimental groups. Another major contribution of this paper is to perform constrained (directional or ordered) inference when there are more than two ordered experimental groups (e.g., subjects ordered by diet or age groups or environmental exposure groups). As far as we know this is the first paper that addresses such problems in the analysis of microbiome data. Results: Using extensive simulation studies, we demonstrate that the proposed methodology not only controls the false discovery rate at a desired level of significance while competing well in terms of power with DESeq2, a popular procedure derived from RNASeq literature. As expected, the method using pseudo-counts tends to be very conservative and the classical t-test that ignores the underlying simplex structure in the data has an inflated FDR.


INTRODUCTION
Microbial count data are represented using operational taxonomic units (OTUs) from 16S rRNA studies. For each specimen (e.g. fecal sample) drawn from an ecosystem (e.g. gut), the number of occurrences of each OTU is measured and the resulting OTU table is summarized to obtain relative abundance for bacterial taxa in a specimen. These OTU counts may be summarized at any level of the bacterial phylogeny, e.g., species, genus, family, order, etc. Throughout this paper we use the generic term "taxa" to denote a particular phylogenetic classification. Since the relative abundances of taxa in a specimen sum to 1, these are compositional data and they reside in a simplex rather than the entire Euclidean space. Another important feature of these microbiome data is that not all taxa may be present in each sample, i.e., some of the OTUs may take zero values. Using such microbial compositional data, researchers are interested in understanding the interplay between microbiome, diet, genome and human health (Clemente et al., 2012;den Besten et al., 2013). Accordingly, there is an urgent need for statistical methods for analyzing these complex microbial count data. This is an active area of research and a variety of statistical and computational methods have been proposed in the literature to answer a variety of scientific questions. For a review one may refer to Li (2015) and Mandal et al. (2015). The latter described in detail various statistical parameters associated with microbial compositional data and discuss which are estimable, and hence testable, and which are not. They proposed Aitchison's log-ratio based methodology (Aitchison, 1982(Aitchison, , 1985(Aitchison, , 1986 called ANCOM for comparing the taxa abundance at the ecosystem level in two or more groups or populations. Earlier, Xia et al. (2013) also considered Aitchison's log-ratio based methodology for microbiome data and proposed a penalized likelihood based methodology to select covariates influencing microbiome expression.
Excess zeros in microbiome data present a challenge when analyzing these data, specifically when comparing two or more experimental groups. A common strategy to handle these excess zeros is to add a small number called pseudo-count (e.g., 1, cf. Xia et al., 2013;Mandal et al., 2015). Although adding a pseudocount appears to be a reasonable and a simple strategy, it is adhoc. Other strategies include modeling excess zeros using various probability models (Paulson et al., 2013;Chen and Li, 2016). However, such models often make an implicit assumption that all zeros can be explained by a common probability model. As described in this article, this is not always the case as there are potentially three different sources of zeros in microbiome data. The first major contribution of this paper is a method which identifies the three major types or sources of zeros in microbiome data. The second major contribution of this paper is to compare the mean relative abundance of taxa in two or more groups while taking into consideration the compositional structure and the type of zeros in the data. Unlike ANCOM (Mandal et al., 2015), which compares the taxa abundance in the ecosystem of two or more groups, the proposed methodology compares the abundance of taxa relative to a background value. The method is general enough that the reference background value can be a specific taxon the user is interested in or it can be some suitable background value specific to each specimen, such as the geometric mean (Aitchison's centered log-ratios). The main idea is to normalize data within each specimen so that any background values within the specimen are eliminated. This idea is analogous to what is often done in gene expression studies. If a particular taxon is used as the reference taxon or reference value , then we assume that the taxon is present in all specimens. Thus the normalizing variable is same across all specimens. From our experience, in practice this condition is not particularly stringent, especially if the researcher is interested in studying microbiome at the genus or a higher level of the phylogenetic tree. For example, in the Yatsunenko et al. (2012) study consisting of 531 samples over three geographical locations (US, Venezuela and Malawi) there exist at least one taxon (at the genus level) that is present in all samples. These data are discussed later in this manuscript. If no such taxon exists, then the proposed methodology can be implemented using the geometric mean as the reference to correct for the background abundance levels of each specimen.
In some applications researchers are interested in performing inferences regarding mean relative abundances of individual taxon in the ecosystems of more than two ordered groups. For example, one may be interested in comparing the mean relative abundances of individual taxon in subjects ordered by different levels of fat intake or levels of dietary supplements or subjects belong to different age groups etc. In all such situations the classical two-sided tests are not as informative or powerful as the constrained inference (or order restrictions) based tests (Farnan et al., 2014;Jelsema and Peddada, 2016). Since the proposed methodology converts the simplex data to Euclidean space data, constrained inference theory developed in Farnan et al. (2014) is directly applicable to the present setting. Thus the third major contribution of this paper is to perform constrained inference when there are more than two ordered experimental groups. As far as we know this is the first paper that addresses such problems in the analysis of microbiome data. Owing to the generality of Farnan et. al. methodology to (a) cross-sectional as well as repeated measures/longitudinal designs, (b) detecting trends in the relative abundances of taxa in two or more ordered experimental groups such as in time course experiments, dose-response studies or when comparing subjects at stages of disease, (c) multiple pairwise comparisons of several experimental groups against a pre-specified control group, the methodology described in this paper is therefore very broadly applicable. Thus, the proposed methodology can be used for testing a wide range of hypotheses while controlling for false discovery rate (FDR) at the desired nominal level. Extensive simulations are performed to demonstrate that the proposed methodology controls the FDR in a variety settings considered in the simulation study while enjoying higher power than some commonly used methods including those based on pseudocounts. We illustrate the methodology using the global gut data of Yatsunenko et al. (2012).

NOTATION AND PROBLEM FORMULATION
Suppose a sample of n j specimens are drawn from the j th experimental group, j = 1, 2, . . . , J. On each specimen suppose the abundance of p taxa are obtained. Here the word "taxa" could be at any level of the bacterial phylogeny, e.g., species, genus, family, order, etc., or just the counts of OTU categories themselves. Let z ijk denote the observed abundance of k th taxon, k = 1, 2, . . . , p, in the i th specimen from the j th experimental group. In vector notation we have z ij = (z ij1 , . . . , z ijp ). For simplicity of exposition throughout this paper, we shall take n j = n, j = 1, 2, . . . , J even though the methodology does not require the design to be balanced. As explained in Mandal et al. (2015), unlike most commonly encountered biological data, the basic counts of OTU categories within each specimen cannot be regarded as absolute values but only relative values as they depend upon the sampling depth corresponding to each specimen. In other words, it does not make sense to compare the expected value of the observed counts between two experimental groups. To draw any meaningful inferences regarding the taxa abundance in two or more groups one needs to "normalize" the data within each specimen. Since classical inference, such as ttests or ANOVA are not valid in the present context due to the simplex constraint, following Aitchison (1980) and Mandal et al. (2015) worked with log-ratios of relative abundances within each specimen. This is equivalent to computing log-ratios of abundances of each taxon relative to a "reference value." Thus, for the i th specimen in the j th experimental group, one may consider the following expression to normalize the data z ijk : (2.1) using some pre-specified "reference value" f ij (z ij1 , . . . , z ijp ). For example, f ij (z ij1 , ..., z ijp ) = log z ijb , where z ijb is the count corresponding to a pre-specified reference taxon b. Alternatively, using the non-zero values z ijk , k = 1, 2, ..., p, the user may , the logarithm of the geometric mean of the OTU counts within each experimental group j = 1, . . . , J (Aittchisons centered log-ratio).
Although the above normalization procedure eliminates the effect of the library size within specimen, it does not account for differences in the library sizes across specimens. To deal with this, we make another correction to the above normalization step. We make the assumption that all specimens within an experimental group are a random sample from a common population of specimens so that the observed background value for a given specimen is a random realization from a common population of all background values. Thus we have the following one-way ANOVA model describing the observed background value: where µ j is the fixed effect due to the experimental group j = 1, . . . , J and ε ij ∼ N (0, σ ε ) is a random variable that captures variation due to the sampling depth. This quantity can then be predicted by the . . , z ijp ) which can be interpreted as the best linear unbiased predictor (BLUP) in the assumed model.
Hence in place of the typical normalization (2.1), we normalize the raw abundances using the following normalized formula: n i∈j th group f ij (z ij1 , ..., z ijp ). This normalization procedure can be easily extended to the case when there are covariates present in the model. Of course, in the above formula, all logarithms are calculated under the assumption that there are no zero values. However, as mentioned earlier, this is not true with the microbiome data. We address this problem in the next section.

ZEROS
A special feature of a microbiome data matrix is that it is higly sparse, i.e., a very high proportion of data entries are zero (absent taxa). For example, at the genera level, nearly 80% of the data matrix in the Global gut data of Yatsunenko et al. (2012) are zero. Furthermore, corresponding to a given taxon, the counts may vary from 0 to the order of 10 5 across samples within an experimental group. In this section we develop a pre-processing step that not only helps us potentially understand the different types of zeros in the data but address them accordingly.

Outlier Zeros
For a given taxon k in the j th group, we declare the sample i to be an "outlier zero" if its count is zero and is declared to be an outlier by the methodology described below. In our assessment, this taxon is recorded as zero due to some extraneous reasons but not because it is below detection limits due to sampling depth. Thus, as far as taxon k is concerned, the i th sample within group j is an outlier.
We first convert the count data into continuous scale by adding a pseudo-count of 1 and normalize the data using the transformation pseudo-count (2.3). Let y ij = (y ij1 , ..., y ijp ) denote the p dimensional vector for i th observation in the j th group, then for each j, k, we model y ijk using the following mixture of normal distributions. Since our outlier detection algorithm is applied to each experimental group j and each taxon k, for simplicity of exposition, we drop the subscript j and k from the following: The main idea of our methodology is that when means of the two normal distributions N (µ 1 , σ 1 ) and N (µ 2 , σ 2 ) in the above mixture are "well separated and the left cluster, i.e. cluster corresponding to mean µ 1 , forms only a small fraction of the total number of observations of the group, i.e. π is small, then it is reasonable to assume that the left cluster is a collection of outlier observations in the group and the observed zero might be a potential outlier. On the other hand, if the two groups are not well separated then the observed zero may not be an outlier zero but zero due to other reasons. Such zeros are handled later in this section. Identification of two clusters: For a given taxon within a group, we declare that its distribution is a mixture of two "distant" normal distributions if the following two criteria are satisfied: 1. Separation: The 97.5th percentile of the first distribution does not overlap with the 2.5th percentile of the second distribution, i.e., µ 1 + 1.96σ 1 < µ 2 − 1.96σ 2 . 2. Frequency: One distribution is "c % heavier" than other, i.e., π < c for some pre-specified c.
The above determinations, along with the estimation of parameters π, µ 1 , µ 2 , σ 1 , σ 2 of the mixture (3.1) can be performed efficiently by an algorithm due to Peddada and Hwang (2002). We refer to the data cells identified by this mechanism as "outlier zeros" which are ignorable entries (replaced by NA in the data).

Structural Zeros
In many cases, because of the nature of the experimental groups, some taxa are not supposed to be present in samples obtained from some groups but may be present in others. For example, babies exposed to antibiotics may be devoid of some taxa in their fecal samples, which are present in healthy babies not exposed to antibiotics. Although, in theory the antibiotics exposed babies are expected to be completely devoid to some taxa, due to variability in the exposure and other factors, such taxa may not be 100% missing in the antibiotics exposed babies. Suppose p represents the proportion of non-zero taxa across all specimens in an experimental group. Then we expect p to be close to zero, if not exactly zero, in experimental groups where the taxon is not expected to be present. We refer to such zeros as structural zeros. For the j th taxon in the k th experimental group, letp jk = n i=1 z ijk n. Then we declare the taxon to have a structural zero value if either of the following is true.
Taxa that are identified as structural zeros in any given group are ignored from all future analyses for that group. Thus, for example, if in a study there are three experimental groups and if a particular taxon t is declared to have structural zero in Group 1 but not in Groups 2 and 3, then we automatically declare that taxon t is differentially abundant in Group 2 relative to Group 1 as well as in Group 3 relative to Group 1. We then compare the relative abundance of t between Groups 2 and 3 using the methodology developed in this paper.

Sampling Zeros
If an observed zero in the data does not qualify as an outlier zero or as a structural zero, then we declare such a zero to be sampling zero, perhaps caused by the sampling depth. In other words, these zeros are potentially due to the fact the taxon is relatively a rare taxon compared to other taxa in the specimen and due to technological (or other) reasons it was not observed. These sampling zeros are imputed by using a small pseudo-count value (e.g., 1) before analyzing the data. More generally, an imputation approach could also be applied to these left over zeros, however this is outside the scope of this manuscript.
To summarize, using the above process, we obtain a modified data set where; (a) samples with structural zeros are suitably removed from the data matrix, (b) the outlier zeros are treated as missing at random (MAR) in the sense of Rubin (1976) and the corresponding entries are replaced as "NA", and (c) the sampling zeros are imputed as 1.

ANALYSIS OF TWO OR MORE GROUPS
In rest of this paper, we work with normalized data y described in Equation (2.3) after suitably dealing with zeros as described in the previous section. For the k th taxon in the j th experimental group, for i = 1, 2 . . . , n, let µ jk = E(y ijk) and σ 2 jk = Var(y ijk ). Using the zeros corrected data, we obtain the following unconstrained estimators for µ jk and σ 2 jk , for j = 1, 2, . . . , J and k = 1, 2, . . . , p:μ In many applications, researchers are interested in comparing taxa relative abundances in two or more experimental groups.
Depending upon the scientific question, one may perform a wide range of analyses. In this section we describe four different classes of analyses one may perform. In each case the statistical parameters of interest are µ jk , j = 1, 2, . . . , J, k = 1, 2, . . . , p. Note that, by construction, within each group j, p k=1 µ jk = 0. Hence without loss of generality, we limit rest of the discussion to the first p − 1 taxa because µ jp = − p−1 k=1 µ jk .

H 1 : Two-Sided Global Hypotheses
Since the data y ijk belong to the Euclidean space, therefore for each taxon k, k = 1, 2, . . . , p − 1, we can use standard linear model based methodology to test such hypotheses on the group means µ 1k , µ 2k , . . . , µ Gk . adjusting for any covariates present in the data. If there are repeated measures or longitudinal data, then one can invoke the standard linear mixed effects models theory and test two-sided global hypotheses such as:

Vs.
µ rk = µ sk , for some r = s. The p-values obtained for each taxon k, k = 1, 2, . . . , p − 1, can be corrected for multiple testing using a suitable multiple testing correction procedure, such as Bonferroni or Benjamini-Hochberg (BH), depending upon the criterion of interest, namely, the Familywise error rate (FWER) or the false discovery rate (FDR).

H 2 : Directional Multiple Pairwise Testing
For each taxon k, k = 1, 2, . . . , p − 1, often researchers are not interested in testing the global hypotheses H 1 but are interested in pairwise comparisons among some (or all) pre-specified experimental groups. Furthermore, within each pairwise comparison, a researcher may be interested in knowing if the (relative) abundance of a taxon increased or decreased from one group to the other. For example, a researcher may be interested in testing whether there is a greater (relative) abundance of Bifidobacterium Sp. in vaginally born babies who were never exposed to antibiotics during the first four months of life, than vaginally born babies who received at least one dose of antibiotics during the first four months. To draw such directional inferences in pairwise comparisons while controlling for the overall false discovery rate, one may apply the mdFDR (mixed directional FDR) controlling procedure introduced in Guo et al. (2010). When there are no covariates present, the Guo et al. (2010) procedure is available in the software ORIOGEN 4.1. https://www.niehs.nih.gov/research/atniehs/labs/ bb/staff/peddada/.

H 3 : Directional Multiple Pairwise Testing against a Specific Experimental Group
Hypotheses H 2 deals pairwise comparisons among some prespecified subset (or all) experimental groups. However, there are instances where researchers may be interested in testing all experimental groups against one pre-specified experimental group, such as, for example the control group. In such cases the power of Guo et al. (2010) procedure can be improved by appealing to the Dunnett's type test derived in Grandhi et al. (2016). The R-code for the method is provided in Grandhi et al. (2016).

H 4 : Testing for Patterns
In some applications, a researcher may not be specifically interested in pairwise comparisons, but may be interested in detecting overall trends/patterns in the relative abundance of a taxon over multiple ordered (or partially ordered) experimental groups. Order (or partial order) among experimental groups arises when the experimental groups represent time or dose or stages of disease etc.
For example, researchers may be interested in understanding the trends in (relative) abundance of taxa across four partially ordered groups, namely, (G1) Vaginally born babies who were not exposed to any antibiotics during the first four months after birth, (G2) Vaginally born babies who were exposed to at least one dose of antibiotics during the first four months of after birth, (G3) C-Section born babies who were not exposed to any antibiotics during the first four months after birth and (G4) C-Section born babies who were exposed to at least one dose of antibiotics during the first four months of after birth. In this case, groups G1 and G4 are the extreme groups in terms of gut microbial environment. In G1 there are no interventions, and in G4 there are two interventions (C-section and antibiotics exposure). Groups G2 and G3 are intermediate groups with one intervention each (either C-Section or antibiotics exposure). Although, groups G2 and G3 are intermediate to G1 and G4, the order between G2 and G3 is uncertain and hence we have a partial ordering among the four groups.
A study design such as the one in this example can be represented using the Figure 1C, called a simple loop order, where, for each taxon, the researcher is interested in obtaining two sets of patterns, namely, pattern over G1, G2, and G4 and a pattern over G1, G3, and G4. Note that members within each set are completely ordered in terms of baby's exposure to interventions. When groups are ordered, one may be interested in identifying taxa whose mean relative abundance increases (or decreases) as we go from one extreme group (e.g., Group 1) to the other extreme group (e.g., Group 4) within each set. Such monotonic patterns, increasing or decreasing, are called the simple order ( Figure 1A). More, precisely, for each taxon, k = 1, 2, . . . , p − 1, one may be interested in testing the following hypotheses: Vs. Vs.
In some applications one may be interested in identifying taxa that have an umbrella shaped pattern as in Figure 1B.
As observed above, rather than using some arbitrary parametric functions, one can describe various patterns or trends using mathematical inequalities, called order restrictions. To determine the best pattern or trend for each taxon we adopt the strategy in Peddada et al. (2003), where a similar problem was considered for time-course gene expression data. For each taxon, we test the null hypothesis that there is no change in mean relative abundance (in log scale) over all the experimental groups against the alternative hypothesis which is the union of all patterns of interest. For each pattern we construct a suitable order restricted test and the final test statistic is taken to be the maximum of all test statistics. The null distribution of the test statistic is derived using the residual bootstrap based procedure developed in Farnan et al. (2014) which is implemented in the package called constrained linear mixed effects (CLME), an R code developed by Casey Jelsema and is described in Jelsema and Peddada (2016). The R code allows for modeling covariates as well as longitudinal/repeated measurements data. Since there are FIGURE 1 | Illustration of hypotheses H 1a and H 2a testing for trends amongst groups.
Frontiers in Microbiology | www.frontiersin.org a large number of taxa, we perform multiple testing corrections using the BH procedure to control for the overall FDR. As in Peddada et al. (2003), if for a taxon, the null hypothesis is rejected at the desired level of significance (FDR ≤ α), then we assign the pattern with largest value of the test statistic. Thus, we are essentially adopting the ORIOGEN methodology developed in Peddada et al. (2003) to the present context.

NUMERICAL RESULTS
We evaluate the performance of our proposed methodology, which we refer to as ANCOM-II, using two distinct simulation studies. The first is inspired by a real data set collected by Yatsunenko et al. (2012). This setup also allows for all three kinds of zeros described in the paper. The second is based on a negative binomial distribution, which is commonly used to model OTU count data of microbiome studies. The results of the proposed method are obtained by filtering outlier zeros at a threshold of c = 0.15. We compare the proposed with methodology with three other methods, namely, DESeq2 (Love, Huber and Anders 2014), t-test based on sample proportions (Prop-T) and t-test based on data transformed via (2.3) after adding a pseudo-count of 1 to each entry (Pseudo-C). Note that a comparison between ANCOM-II and the Pseudo-C method provides numerical results on how our assessment of zeros impacts the analysis. We also provide a user friendly R code in the supplementary materials to implement the proposed methodology described in this section.

Simulation Study Based on Real Data
This simulation study is based on the OTU count data (at the genus level) corresponding to the US group provided in Yatsunenko et al. (2012). We constructed two groups, namely, cases and controls (J = 2). Each group consisting of 175 subjects and 200 taxa. Among these 200 taxa, 100 are taken to be differentially abundant. As detailed below, our simulation study allows for all three forms of zeros discussed in the paper.
Step 1 Generate a simple random sample of 175 subjects from the US group in Yatsunenko et al. (2012) data. Process the data as described in Section 2 by taking the genus Bifidobacterium as the reference taxon for the transformation (2.3). This provides us with a 175 × 661 data matrix. Let m = (m 1 , ..., m 200 ) denote the vector of 200 column means which are highest in magnitude obtained after normalization of (2.3).
Step 5 Back transform the above continuous scale data to the count scale by inverting the transformation (2.3) and rounding the observations. Specifically, using the transformation z ijk = e y ijk z ijb i z ijb 1/n here z ijb represents the counts of "Bifidobacterium" taxa in the subset of the global gut data described in Step 1. In the above steps, all values between (0,1) are rounded to zero counts. Thus, although we are generating continuous random variables, with a positive probability we generate zeros. Recall that in Step 2 samples are generated from a mixture of two independent normal distributions. The observations corresponding to zero counts are induced by the first component of the mixture distribution. Since the two components are independently generated, the zero observations are not dependent on the taxa itself (assuming that the true distribution of the taxa is given by the second component). Thus, these zeros, by design, represent observations that are missing at random. On the other hand, the zeros obtained in Step 3 are from a single distribution, and are zero because z ijk with values between 0 and 1 are set to 0.
Step 6 Apply the three methods on the above simulated count data. Repeat Steps 1 through 6 and estimate the false discovery rate (FDR) and power of each method.
The left and right panels of Figure 2 provides the estimated FDR and power of the four methods, respectively. Here the shift parameter of Steps 2 and 3 is set to δ = 0.5. In this setting, on average (red dot), our proposed method, DESeq2 and Pseudo-C appear to control the FDR at the nominal level of 0.05. However, in terms of power our method appears to outperform the rest. In Figure 3, we further examine the effect of a varying shift parameter δ. We compare the powers of the four methods for 100 distinct values of δ ∈ (0, 0.5). Once again we note that the proposed method ANCOM-II, tends to have larger power than the others. Specifically, a comparison between ANCOM-II and the Pseudo-C method emphasizes the importance of identifying the various sources of zeros and dealing with them accordingly, rather than using a constant pseudo-count for all observed zeros.

Simulation based on Negative Binomial Distribution
In this section we investigate the performance of the four methods by generating data according to negative binomial (NB) distribution as follows. For j = 1, 2, k = 1, ..., 200, we generate, z ijk ∼ NB(µ jk , s jk ), i = 1, .., 100 ( 5.1) where µ jk , s jk are the mean and dispersion parameters of the negative binomial distribution respectively, in all cases we set s jk = µ 2 jk . The control samples are generated for j = 1 and k = 1, .., 200 by choosing µ jk from a uniform distribution over (1,1500). The case samples are generated by shifting the mean of the first one hundred taxa. Thus,for j = 2, k = 1, .., 100 set µ jk = µ 1k + 5k. The remaining k = 101, ..., 200 micorbes for group j = 2 are generated with the same mean parameters as the control samples. Furthermore we induce additional zeros in the data set by multiplying the previously generated counts with independent Bernoulli random variables w ijk = 0 with probability 1 − π jk where π jk is chosen uniformly between (0.8,1). This simulation experiment is repeated 100 times and the FDR and power comparison results are reported in Figure 4. From these simulation results we note that only ANCOM-II and Pseudo-C have estimated FDR at or below the nominal level of 0.05. Furthermore, between the two methods, ANCOM-II enjoys higher power. DESeq2 and Prop-T have unacceptably high estimated FDR.

ANALYSIS OF GLOBAL HUMAN GUT MICROBIOME DATA
We illustrate ANCOM-II using global human gut microbiome data of Yatsunenko et al. (2012). The data consists of microbial taxa counts obtained from 317 subjects from US, 99 from Venezuela and 114 from Malawi. We used Bifidobacterium as the reference taxon because it was present in all samples.
Let S i denote the set of genera with i countries having structural zeros. According to our method, by taking c = 0.15 we found that out of 661 genera, 262 belong to S 0 , 86 belong to S 1 , 95 belong to S 2 and 218 belong to S 3 . Depending upon the set a genus belongs to, the method tests suitable hypotheses as outlined below (the corresponding R code is provided in the supplementary materials).
Hypotheses 1. For genera j ∈ S 0 we test the following hypothesis H 0j : µ US,j = µ Venezuela,j = µ Malawi,j , against H aj : µ US,j ≤ µ Venezuela,j ≤ µ Malawi,j ∪ µ US,j ≤ µ Venezuela,j ≥ µ Malawi,j ∪ µ US,j ≥ µ Venezuela,j ≤ µ Malawi,j ∪ µ US,j ≥ µ Venezuela,j ≥ µ Malawi,j Hypotheses 2a. For genera j ∈ S 1 , when a taxon is structurally zero in Malawi data we test the following hypothesis H 0j : µ US,j = µ Venezuela,j against H aj : µ US,j ≤ µ Venezuela,j ∪ µ US,j ≥ µ Venezuela,j  Frontiers in Microbiology | www.frontiersin.org developed in this paper, called ANCOM -II, is a general procedure that is not only applicable to cross-sectional as well as longitudinal designs, but in each case it can be used for detecting trends and patterns in a taxon over two or more groups. Our simulation study suggests that the methodology controls the overall false discovery rate while maintaining high power. In addition, since the methodology is based on residual bootstrap, it does not make any major distributional assumptions. For testing non-directional alternative hypotheses (hypothesis H 1 ), ANCOM-II can be implemented using the Rcode accompanying this paper. If no covariates are present and if there are no repeated measurements, then using residuals calculated in Equation (