Gut and oral microbiota associations with viral mitigation behaviors during the COVID-19 pandemic

Imposition of social and health behavior mitigations are important control measures in response to the coronavirus disease 2019 (COVID-19) pandemic caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). Although postulated that these measures may impact the human microbiota including losses in diversity from heightened hygiene and social distancing measures, this hypothesis remains to be tested. Other impacts on the microbiota and host mental and physical health status associations from these measures are also not well-studied. Here we examine changes in stool and oral microbiota by analyzing 16S rRNA gene sequence taxonomic profiles from the same individuals during pre-pandemic (before March 2020) and early pandemic (May-November 2020) phases. During the early pandemic phase, individuals were also surveyed using questionnaires to report health histories, anxiety, depression, sleep and other lifestyle behaviors in a cohort of predominantly Caucasian adults (mean age = 61.5 years) with the majority reporting at least one underlying co-morbidity. We identified changes in microbiota (stool n = 288; oral n = 89) between pre-pandemic and early pandemic time points from the same subject and associated these differences with questionnaire responses using linear statistical models and hierarchical clustering of microbiota composition coupled to logistic regression. While a trend in loss of diversity was identified between pre-pandemic and early pandemic time points it was not statistically significant. Paired difference analyses between individuals identified fewer significant changes between pre-pandemic and early pandemic microbiota in those who reported fewer comorbidities. Cluster transition analyses of stool and saliva microbiota determined most individuals remained in the same cluster assignments from the pre-pandemic to early pandemic period. Individuals with microbiota that shifted in composition, causing them to depart a pre-pandemic cluster, reported more health issues and pandemic-associated worries. Collectively, our study identified that stool and saliva microbiota from the pre-pandemic to early pandemic periods largely exhibited ecological stability (especially stool microbiota) with most associations in loss of diversity or changes in composition related to more reported health issues and pandemic-associated worries. Longitudinal observational cohorts are necessary to monitor the microbiome in response to pandemics and changes in public health measures.


Cluster Influencers Analysis
The purpose of the cluster influencers analysis is to identify which taxa are responsible for a cluster's separation from its peer clusters. Clustering depends on the distance metric (e.g., Euclidean, Manhattan, Morisita-Horn, Jaccard, etc.) used to compare samples and the clustering algorithm (e.g., Ward's minimum variance, completelinkage, etc.) used to join (or split) members together (or apart) into clusters. For compositional data, taxonomic categorical counts for each sample are first normalized (to sum to 1) to produce relative abundance profiles, then a distance metric is calculated pairwise between these profiles. Since the same dataset may yield different clusters depending on metric and clustering algorithm, the separation between two clusters cannot be simply attributed to differences in their taxonomic abundance, as this was not the criteria used that separated them.
To identify the taxonomic contributors to cluster separation, an analysis of variance (ANOVA) approach is taken by using the R 2 statistic between two clusters to measure cluster separation. Here we define R 2 as the sum of squares between (SSB) / sum of squares total (SST). To determine whether a taxon of interest contributed to the separation between two clusters, we first calculate the reduced-R 2 , which is the R 2 when the taxon of interest is removed from all samples, then we compare it against the full-R 2 which includes all taxa. If the log-ratio of the reduced-R 2 /full-R 2 is negative, then that taxon contributed to the two clusters separating from each other. If the log-ratio is positive, then that taxon contributed "noise" to the two clusters and can be ignored. This is done for all taxa of interest, usually the top 15-35 taxa in abundance. Taxa that contributed to cluster separation are called "Cluster Differentiators". Taxa that are consistently cluster differentiator against other clusters, are called "Cluster Unifiers". A set of cluster unifiers is considered the defining characteristic of that cluster, or a microbiome "type" or configuration.
Since hierarchical clustering was used to generate our clusters, we also use an iterative tree cutting algorithm to generate cluster cuts from cuts k = 2 to 10, i.e. (k-1) groups of clusters, with cluster identifiers in each grouping labeled cl = 1 to k. Cluster unifiers are calculated for each cut and summarized in a hierarchically structured table. See Supplemental Figure 5 and Figure 6 for the cluster unifiers for stool and saliva, respectively.

Cluster Transition Analyses
The goal of cluster transition analysis is to understand how a cohort's microbiota has changed between their pre-and early pandemic sample, given their pre-pandemic microbiota composition and factors, such as demographics and questionnaire responses. The assumption that is made when modeling with clustering is that a cluster represents a particular state or configuration of the microbiota from a sample that cannot be meaningfully reduced further. While there may be an infinite number of possible compositions that can be brought together in a cluster, each variation is not considered consequential when considering that cluster as whole. In other words, the members of the cluster may be treated homogenously if they clustered together, and should be considered different in state of configuration, if they cluster separately.
To identify the possible microbiota states that may exist in a cohort, both the pre-and early pandemic samples were clustered together. Pairwise sample distancing was performed with the Manhattan distance metric, followed by hierarchical clustering with Ward's minimum variance criterion. The tree resultant from hierarchical clustering was cut iteratively from k = 2 to 7. At each cut, a k x k contingency table was computed with the rows and the columns containing the pre-and early pandemic cluster identifiers, respectively. For example, at k = 2, a contingency table would be a 2 x 2 matrix. Please examine Figure 3, "Cluster Transition", for examples of stool and saliva sample "contingency tables" drawn as a scatter plot, cut at k = 6 and k = 4, respectively.
When a sample's preand early pandemic sample are in the same cluster, they are considered to have not changed clusters, i.e., state or configuration. From these initial data structures, conditional and joint probabilities may be calculated to quantify whether some pre-pandemic clusters have a stronger probability of transitioning towards one early pandemic cluster versus another. To incorporate questionnaire responses and demographic data, two models were fit: "Departers" and "Arrivers". In the "Departers" model, the members of each prepandemic cluster were split into two groups, the "remainers" and the "departers". The pre-and early pandemic samples of the remainers share the sample cluster id. The departers pre-and early pandemic belong in different clusters. For each pre-pandemic cluster, a logistic regression was fit with the departers/remainers as a response to subject demographics and questionnaire responses. See Supplemental Figure 4. "Cluster Transition Departers". In contrast, for the "Arrivers" model, the members of each early pandemic cluster were examined.
Remainers were subjects that had both pre-and early pandemic samples in the same cluster, whereas the "arrivers" had a sample from a different pre-pandemic cluster. Logistic regression was then computed with the arrivers/remainers as a response to subject demographics, questionnaire responses, and pre-pandemic cluster id. Thus, the "Arrivers" analysis is conditional on the subjects' pre-pandemic state. See Supplemental Figure 3.
Since the association between model predictors (demographics and questionnaire responses) depends on the cut k, the logistic regressions are calculated across all the clusters (identified 1 to k) generated at each cut. Pvalues are accumulated for each predictor across all the cuts, and the cut with the most significant p-value is considered the best. If variable-to-cluster associations from multiple cuts of k are equally significant, then the cut k with the lowest value is considered the best association/cut combination, by Occam's razor.

P-values
P-values have not been adjusted for multiple testing. Adjustments for multiple testing assume every test is independent. Our analyses on diversity, distance, and abundance frequently overlap in their associations as the same taxa are contributing to the three different analyses, thus correcting for each test would over penalize the significance of each calculated association. With respect to factors included in the model, some demographic variables may be considered nuisance variables, such as age, ethnicity, sex, and BMI (for cross-sectional analyses), since these may not be of direct interest, but necessary to control for. Each of the taxa considered in the abundance analyses, 25, should be considered separate tests, but the reader should be more discerning towards accepting statistically significant associations with taxa with lower abundance. Excluding demographic variables, the number of questionnaire responses included in the models were 21. A summary of noteworthy associations that were included in the manuscript have been included into Supplemental Table 1. "Associations Comparisons".

R Analytical Code
The R code used for the "Cluster Influencers Analysis" and "Cluster Transition Analyses" has been provided on GitHub at: https://github.com/CMM-Release/MB-COVID The software was written to be executed from the Linux command line and is part of a larger suite of analytical tools. Experience with R programming, cluster analyses, ANOVA, and multinomial logistic regression is suggested. This table summarizes the significant associations found in the stool and saliva analyses organized so that the various metrics and methods may be compared. Each association includes a description, coefficient, and p-value. The left and right columns contain the associations for the "Early Pandemic Cross-Sectional" analyses and "Pre / Early Pandemic Paired" analyses, respectively. The top and bottom analyses are split into "Stool" and "Saliva" analyses, respectively. Within each sample type by analyses grouping, the sub-analyses "Diversity", "Distance", and