Variance Component Selection With Applications to Microbiome Taxonomic Data

High-throughput sequencing technology has enabled population-based studies of the role of the human microbiome in disease etiology and exposure response. Microbiome data are summarized as counts or composition of the bacterial taxa at different taxonomic levels. An important problem is to identify the bacterial taxa that are associated with a response. One method is to test the association of specific taxon with phenotypes in a linear mixed effect model, which incorporates phylogenetic information among bacterial communities. Another type of approaches consider all taxa in a joint model and achieves selection via penalization method, which ignores phylogenetic information. In this paper, we consider regression analysis by treating bacterial taxa at different level as multiple random effects. For each taxon, a kernel matrix is calculated based on distance measures in the phylogenetic tree and acts as one variance component in the joint model. Then taxonomic selection is achieved by the lasso (least absolute shrinkage and selection operator) penalty on variance components. Our method integrates biological information into the variable selection problem and greatly improves selection accuracies. Simulation studies demonstrate the superiority of our methods versus existing methods, for example, group-lasso. Finally, we apply our method to a longitudinal microbiome study of Human Immunodeficiency Virus (HIV) infected patients. We implement our method using the high performance computing language Julia. Software and detailed documentation are freely available at https://github.com/JingZhai63/VCselection.


UNIFRAC DISTANCE KERNEL
For each individual i we get n i repeated microbiome profile measurements. Consider two microbiome communities A and B and suppose that we have a rooted phylogenetic tree with m branches(m is the total number of OTUs). Let b i denote the length of branch i (i = 1, 2, · · · , m) and p A i , p B i are the taxa proportions descending from the branch i for community A and B respectively. T is the total reads of m OTUs and T i is total reads of ith OTU from both community. Unweighted Unifrac (Lozupone and Knight, 2005) Weighted Unifrac (Lozupone et al., 2007) VAW Unifrac (Chang et al., 2011) Generalized Unifrac (Chen et al., 2012) We define a kernel matrices by transforming the distance matrices through equation S1 to measure similarities between the microbiome composition among subjects.
is the pairwise distance matrice, I is the identity matrice and 1 is a vector of 1's.
The N denotes the total number of samples.

MICROBIOME COUNT SIMULATION
The microbiome count data measured at multiple time points are expected to be correlated in longitudinal study. We adopt the two-part zero-inflated Beta regression model with random effects proposed by Chen and Li (2016) to simulate correlated microbiome count data. The ZIBR model considers each bacterial taxon separately. We illustrate the simulation procedure using one taxon as example: Step 1. Given the real pulmonary microbiome dataset, we first compute the mean reads for each bacterial taxon, e.g., OT U u , u = 1, 2, ..., 2964.
Step 2. We first simulate the absence/presence of OT U u in ZIBR model's logistic component S2, where treatment is a binary covariate, treatment = 0 for time = 0 and treatment = 1 for time > 0. p it is the probability of OT U u present in the subject i at time t.
Step 3. For the present OT U u , we then simulate its abundance, namely the proportion of overall taxon reads for subject i. The abundance follow the Beta distribution with mean µ it and dispersion φ. The µ it depend on the treatment and time through the logit link function S3 where a i and b i is the random effect in the two part logistic model to account for the correlation of repeated measurement.
Step 4. The ZIBR model return the proportion instead of reads for each bacterial taxon, the simulated microbiome counts are obtained by multiplying the proportion with mean reads from step 1.

FALSE DISCOVERY RATE AND TRUE POSITIVE RATE OF VC-LASSO
In Figure S1 and Figure S2, we evaluate VC-lasso using false discovery rate (FDR) and true positive rate (TPR) by equation S4. FDR of VC-lasso in both simulation scenario 1 and scenario 2 are in reasonable range. FDR decreases with the number of true variance components decrease. For example, FDR for the model with 2 non-zero variance components is lower than the model with 15 non-zero variance components. TPR increases when the effect sizes increase as shown in Figure S1- Figure

AKAIKE INFORMATION CRITERION AND BAYESIAN INFORMATION CRITERION
In Figure S3- Figure S4, we show the performance of VC-lasso when optimal tuning parameter λ is selected by Akaike Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC). Specifically, Figure S3 show variable selection performance under different sample sizes while Figure S4 show variable selection performance under different number of non-zero variance components. Compare to the cross-validation, AIC and BIC procedures lead to slightly higher g-Measure under sample size n = 20 and effect size greater than 5. When sample sizes are 50, 100, cross-validation perform better. Cross-validation is also better with less non-zero variance components exist. Although we suggest cross-validation should be used to choose tuning parameter, the performance of AIC and BIC are comparable to cross-validation when sample sizes are limited.

MODELS WITH LARGER NUMBER OF VARIANCE COMPONENTS
In this supplementary simulation, we aim to evaluate the performance of VC-lasso when the number of variance components in the model is large, for example, L = 100. We randomly group 998 non-zero OTUs into 100 clusters. The maximum of OTUs in each clusters is 18 and the minimum is 4. We fix sample size at n = 50 and responses are simulated when σ 2 d = 0. The true model has five non-zero variance components. As shown in Figure S5, VC-lasso works well in large scale variance components models even with 100 clusters. VC-lasso is also very efficient. It took only 11.9 seconds to run one simulation replicate using five fold cross-validation on a grid of 11 values for the regularization parameter λ. The simulation was run on High Performance Computing (HPC) cluster using 1 node, 28 cores, and 168GB RAM in total. This cluster uses Intel Haswell V3 28 core processors with 192GB RAM per node.  Figure S5: Estimated g-Measure of VC-lasso with 5 non-zero and 95 zero variance components using cross-sectional design. The 998 OTUs are randomly aggregate to 100 clusters with maximum 18 OTUs and minimum 4 OTUs in one cluster. Responses are simulated with 5 true variance components in 100 clusters. Sample size is set to n = 50 and σ 2 d = 0.