Inter-Individual Diversity Scaling Analysis of the Human Virome With Classic Diversity-Area Relationship (DAR) Modeling

The human virome is a critical component of the human microbiome, and it is believed to hold the richest diversity within human microbiomes. Yet, the inter-individual scaling (changes) of the human virome has not been formally investigated to the best of our knowledge. Here we fill the gap by applying diversity-area relationship (DAR) modeling (a recent extension to the classic species-area law in biodiversity and biogeography research) for analyzing four large datasets of the human virome with three DAR profiles: DAR scaling (z)—measuring the inter-individual heterogeneity in virome diversity, MAD (maximal accrual diversity: Dmax) and LGD ratio (ratio of local diversity to global diversity)—measuring the percentage of individual to population level diversity. Our analyses suggest: (i) The diversity scaling parameter (z) is rather resilient against the diseases as indicated by the lack of significant differences between the healthy and diseased treatments. (ii) The potential maximal accrual diversity (Dmax) is less resilient and may vary between the healthy and diseased groups or between different body sites. (iii) The LGD ratio of bacterial communities is much smaller than for viral communities, and relates to the comparatively greater heterogeneity between local vs. global diversity levels found for bacterial-biomes.


INTRODUCTION
Viruses spread over almost every ecosystem and are believed to be the most abundant biological entity on the earth (Breitbart and Rohwer, 2005;Edwards and Rohwer, 2005;Cobiaìn Güemes et al., 2016;Berliner et al., 2018). Viruses also parasitize in different parts of human body, including blood, cerebrospinal fluid, nasal cavity, oral cavity, skin, vagina, lungs and gastrointestinal (GI) tract (Nakamura et al., 2009;Ly et al., 2014;Wylie et al., 2014;Hannigan et al., 2015;Santiago-Rodriguez et al., 2015;Columpsi et al., 2016;Abbas et al., 2017;Thannesberger et al., 2017;Pannaraj et al., 2018;Ghose et al., 2019). The human virome includes endogenous retroviruses, eukaryotic viruses that infect human cells, bacteriophages that infect bacteria, and viruses that infect archaea (Zhao et al., 2017;Santiago-Rodriguez and Hollister, 2019). In addition to the well-known human health effects of the human immunodeficiency virus (HIV), Ebola virus, and influenza virus, changes in the composition and diversity of virome have been found in many diseases, such as cystic fibrosis, periodontal disease, urinary tract infections, and inflammatory bowel disease (Willner et al., 2009;Ly et al., 2014;Norman et al., 2015;Santiago-Rodriguez et al., 2015;Carding et al., 2017). Bacteriophages can also affect human health by influencing the composition of the bacterial communities in the body (Barr et al., 2013;Carding et al., 2017;Galtier et al., 2017;Hannigan et al., 2018). In addition, phages were briefly used to treat bacterial infections in humans prior to the inventions of antibiotics; today phage therapy is still confined to laboratory and animal models (Debarbieux et al., 2013). In summary, the influences of viruses on human health are not limited to the their pathogenicity. There are complex interactions and even co-evolution between viruses and hosts (Parker, 2016).
But why have the studies of human virome, a vital part of human microbiomes, been behind those of bacteria and fungi? Different from prokaryotes and eukaryotes, the virome does not encode commonly conservative genes, and is highly diversified in genes (Carding et al., 2017). However, with the development of high-throughput sequencing technology, metagenomic sequencing and the application of many effective bioinformatics approaches, human virome studies have stepped into an era of rapid development (Reyes et al., 2012;Garmaeva et al., 2019). The Global Virome Project (GVP) launched recently (Carroll et al., 2018) is aimed to identify and characterize most of the currently unknown viruses in major wildlife populations, including rodents, non-human primates and bats. A number of studies have shown that human gut virome is stable in individuals over a certain period of time, but shows high heterogeneity among individuals (Minot et al., 2011(Minot et al., , 2013Carding et al., 2017;Clooney et al., 2019;Shkoporov et al., 2019).
Here we apply the diversity-area relationship (DAR) model, an extension of the classical species-area relationship (SAR) model, to measure the diversity changes (scaling) of human viral communities from multiple perspectives (Watson, 1835;Arrhenius, 1921;Preston, 1960Preston, , 1962Ma, 2018Ma, , 2019. Specifically, four diversity-scaling profiles were used to characterize the spatial (inter-individual) heterogeneity, community similarity, potential diversity, and the ratio of local diversity to global accrual diversity of human virome in health and disease.

Datasets Description
We collected four human virome datasets from NCBI (Table 1)

Bioinformatics Pipeline for Bacterial Sequences
All raw sequences were processed by QIIME2 v2018-06 (Bolyen et al., 2018) pipeline to get the OTU (operational taxonomic unit) tables. Sequences were denoised by the DADA2 plugin and taxonomic classification was performed using the Greengenes database and QIIME feature-classifier classify-sklearn plugin.

DAR (Diversity-Area Relationship) Analysis
DAR analysis is implemented through the DAR model, involving the PL (power law) model and the PLEC (power law with exponential cutoff) model (Ma, 2018(Ma, , 2019). The DAR model extended the classical ecological power law of species-area relationship (SAR), by replacing the species richness   (number of species or OTUs) in the classic law with more general community diversity metrics measured in Hill Numbers (Chao et al., 2012(Chao et al., , 2014a. The relationship between diversity and area conform to the power law function: where q D represents the diversity of measured by Hill numbers when order is q, A is area, and c and z are parameters of the power law scale model. In addition, Ma extended the general power law to an exponential cutoff power law scale model (Ma, 2018(Ma, , 2019, and the PLEC model was initially applied to SAR modeling (Plotkin et al., 2000;Ulrich and Buszko, 2003;Tjørve, 2009): Frontiers in Genetics | www.frontiersin.org    where d is the third parameter of the power law equation, and exp(dA) is the exponential decay term. After log-linear transformation of the above power law equation, which can be used to estimate DAR model parameters: When no natural order exists among samples, or the order (permutation) information is not available, the choice of an arbitrary permutation may be problematic. To avoid this potential bias, we re-sampled all the virome and bacterial datasets, that is, we randomly selected 100 permutations after permutation of all the samples in each dataset to fit the DAR model, and finally used the average parameters of the 100 models as the final DAR model. Meanwhile, the goodness-of-fitting can be judged based on the linear correlation coefficient R and p-value. Ma (2018) also defined four diversity-scaling profiles based on the DAR model: (i) DAR profile: The relationship between diversity scaling parameter (z) and diversity order (q) of DAR-PL model was defined as DAR profile.
(ii) PDO (pair-wise diversity overlap) profile: The relationship between parameter g and diversity order (q) of DAR-PL model was defined as PDO profile. Parameter g measures the diversity overlap of two areas, where z is the parameters in equation (1).
(iii) MAD (maximum accrual diversity) profile: The MAD can be derived from the following equation based on the parameter of DAR-PLEC model: where A max =−z/d , which represents the area when the diversity is maximized. The relationship between MAD and the diversity order (q) was defined as MAD Profile: (iv) LGD (the ratio of local diversity to global accrual diversity) profile: Li and Ma (2019) defined the ratio of sample diversity to global diversity as LGD: LGD = c/D max (7) where c is the parameter of Equations (1) at q-th diversity order, and D max corresponding to the diversity order can be obtained by equation (6). The relationship between LGD and the diversity order was defined as LGD profile. To test whether there are significant differences in DAR parameters between different subsets in the four virus datasets, we performed a permutation (randomization) test (Ma, 2018;Li and Ma, 2019;Ma and Li, 2019). Figure 1 displayed the workflow of the VirusSeeker pipeline and subsequent steps for DAR analysis using the viral OTU table.

Fitting DAR Models
After an initial bioinformatics analysis of four datasets, we fitted the DAR-PL and DAR-PLEC models to the three datasets of virome datasets and one bacteria-virome dataset. Table 2 shows the changes in the number of viral sequences during data processing, and Table 3 lists all parameters related to model fitting. Table 4 gives the results of the permutation test for three virome datasets. Although the four datasets cover different disease types, sampling sites, and microbial types, the results show that all datasets confirmed to the DAR-PL and DAR-PLEC models, with p-value < 0.05. After 100 re-sampling for the four datasets, data-1, data-2, data-4 and the CD group of data-3 fully followed the DAR-PL pattern at the species richness level (q = 0), but some groups were not fitted successfully in the higher diversity order. At higher diversity orders, some model fittings failures occurred.

DAR Profiles
With the exception of a few groups, the patterns of the four DAR profiles across the four datasets are consistent in general ( Table 3). Data-1 and Data-3 were the comparison between the healthy group and the disease group. Although the disease types were different, the variation patterns of the DAR parameters are similar in general. The results in Table 4 show that there was no  significant difference in the diversity scaling parameter (z) and the pair-wise diversity overlap (g) among three virome datasets. However, the estimators of the maximal accrual diversity (D max ) between the Control group and the UC group and between the CD group and the UC group presented significant difference when q = 0. In addition, the results of Data-4 show that bacterial community and virome community had different LGD ratios. And based on Table 3, we also have the following findings:   Figure 2 shows the downward trend of parameter z in Data-1. But the parameter z of the Blood-Control-LTR group in Data-1, the Village C and Village D group in Data-2 were the largest when q = 0, and had the minimum value when q = 1, and then enlarge with the increase of diversity order (q > 0). When the diversity order q = 0, the difference in species number among individuals in each group was measured; When q > 0, the DAR profile measured the difference in the diversity of dominant species and the number of each species among individuals in each group. With the increase of the diversity order, the proportion of dominant species in the calculation process increased. These suggested that at the lower level of diversity, the higher heterogeneity within the group, indicating the greater difference between individuals within the group. It also should be noted that there was no significant difference in the parameter z among all groups, even between the healthy and diseased groups. (ii) The PDO profile: In all datasets, the pair-wise diversity overlap (g) of more than half of the groups grew with the increase of the order of diversity. Although the g varies among individuals with different tissue types, different health conditions and different living environments, there is no significant difference. And the larger the g was, the higher the degree of overlap was. Thus, the difference between the individuals within the group at the low diversity order was greater than that at the high diversity order (q = 0 − 1), Figure 3 shows the variation of g in Data-2. (iii) The MAD Profile: The estimators of the MAD of four datasets reduced with the increase of the diversity order. When q = 0, that is, at the species richness level, D max represents the estimated maximum number of species, and the A max represents the number of individuals required to reach the maximum number of species. In Data-1, the MADs in the alveolar samples of donors and recipients are very similar, but with some differences between the control group and the PGD group at the species richness level. Simultaneously, there was no significant difference in MAD between the bacteria group and the virome group in Data-4. As for Data-3, the MAD of the control group and the CD group were significantly smaller than the UC group at q = 0 (Figure 4). (iv) The LGD profile: In addition to the Blood-PGD-LTR group in Data-1, the Village B and the Village D group in Data-2, Data-3 and the virome group in Data-4, the LGD ratios in other groups increased with the rise of the diversity order. In the first three datasets, the LGD ratios of the same sample type are relatively close, such as Village C vs. Village D (0.374 vs. 0.371, q = 0); Lung-control-LTR vs. Lung-control-OD (0.098 vs. 0.123, q = 0). According to Data-4, we can see that bacteria and virome have completely different LGD ratios (0.078 vs. 0.763, q = 0). Figure 5 displays the growth of LGD in Data-4 and the differences between the bacteria group and the virome group in LGD.

CONCLUSIONS AND DISCUSSION
To the best of our knowledge, this should be the first application of DAR analysis to the human virome, which offers a powerful tool to investigate the spatial (inter-individual) variability (heterogeneity) of virome diversity, and potential virome diversity globally (in a whole population). With the MAD Profile, D max estimates the maximum number of viral species in the community when diversity order q = 0. It can be seen that, in Data-1, which includes blood samples and alveolar samples of the donors and the lung transplant recipients, the maximum number of species in blood serum samples was lower than that in bronchoalveolar lavage samples, and this may indicate that the species diversities of different tissues is diverse (Ghose et al., 2019). At the same time, there was no statistically significant difference between the four profiles in the bronchoalveolar lavage samples of the donors and the recipients, perhaps because the virome of donor was transferred to the recipient along with the organ transplant (Mitchell and Glanville, 2018;Abbas et al., 2019). Data-2 compared the virome of one urban and three villages children. And we did not observe significant differences in the scaling parameter z and the pair-wise diversity overlap g among four locations, especially Village C and Village D. Thus, differences in living environment and diet do not seem to have a significant impact on the spatial heterogeneity of individual gut virome. For data-3, we can clearly see that when q = 0, the scaling parameter z of the control group is greater than that of the two disease groups CD and UC, and the pair-wise diversity overlap g of the control group is much smaller than that of the CD group and UC group. The changes of these two profiles suggest that there was a large difference among individuals in the Control group, while the CD and UC groups show a smaller heterogeneity among individuals. In addition, the D max of the UC group was significantly greater than that of the Control and CD group, and the virome diversity of the CD and UC group was significantly increased. These findings are also consistent with existing research showing increased virome diversity in IBD patients compared to healthy individuals (Lepage et al., 2008;Norman et al., 2015;Wang et al., 2015). In IBD patients, the diversity of viruses increased while the diversity of bacteria decreased (Norman et al., 2015;Clooney et al., 2019;Fernandes et al., 2019;Zuo et al., 2019). We also compared bacteria and virome in stool samples from hunter-gatherer Hadza people living in Tanzania. Based on the MAD profiles, it was revealed that the virome community and the bacteria community have similar maximum number of species. However, according to the LGD profiles, the ratios of local diversity to global diversity are quite different between virome and bacterial community. The differences may be due to limitations in virome analysis methods (including biochemical and bioinformatics methods), as well as to the limited virome library available (Aggarwala et al., 2017;Santiago-Rodriguez and Hollister, 2019). Although this study is the first to compare the diversity scaling parameters between the healthy and diseased treatments, due to the very limited number of datasets, we are refrained from inferring general conclusions about possible effects of diseases on the inter-individual diversity scaling. We hope that future studies will allow us to draw more general conclusions on this important topic. As a very preliminary step, our analysis seems to suggest that the disease effects on the inter-individual diversity scaling (or the inter-individual heterogeneity of diversity) may be insignificant. Nevertheless, our study demonstrates an important quantitative tool (i.e., DAR) for analyzing the inter-individual heterogeneity of virome diversity, which is of obvious significance for investigating the ecology of human virome.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent for participation was not provided by the participants' legal guardians/next of kin because the datasets we used are published data.