Clustering Species With Residual Covariance Matrix in Joint Species Distribution Models

Modeling species distributions over space and time is one of the major research topics in both ecology and conservation biology. Joint Species Distribution models (JSDMs) have recently been introduced as a tool to better model community data, by inferring a residual covariance matrix between species, after accounting for species' response to the environment. However, these models are computationally demanding, even when latent factors, a common tool for dimension reduction, are used. To address this issue, Taylor-Rodriguez et al. (2017) proposed to use a Dirichlet process, a Bayesian nonparametric prior, to further reduce model dimension by clustering species in the residual covariance matrix. Here, we built on this approach to include a prior knowledge on the potential number of clusters, and instead used a Pitman–Yor process to address some critical limitations of the Dirichlet process. We therefore propose a framework that includes prior knowledge in the residual covariance matrix, providing a tool to analyze clusters of species that share the same residual associations with respect to other species. We applied our methodology to a case study of plant communities in a protected area of the French Alps (the Bauges Regional Park), and demonstrated that our extensions improve dimension reduction and reveal additional information from the residual covariance matrix, notably showing how the estimated clusters are compatible with plant traits, endorsing their importance in shaping communities.

Modeling species distributions over space and time is one of the major research topics in both ecology and conservation biology. Joint Species Distribution models (JSDMs) have recently been introduced as a tool to better model community data, by inferring a residual covariance matrix between species, after accounting for species' response to the environment. However, these models are computationally demanding, even when latent factors, a common tool for dimension reduction, are used. To address this issue, Taylor-Rodriguez et al. (2017) proposed to use a Dirichlet process, a Bayesian nonparametric prior, to further reduce model dimension by clustering species in the residual covariance matrix. Here, we built on this approach to include a prior knowledge on the potential number of clusters, and instead used a Pitman-Yor process to address some critical limitations of the Dirichlet process. We therefore propose a framework that includes prior knowledge in the residual covariance matrix, providing a tool to analyze clusters of species that share the same residual associations with respect to other species. We applied our methodology to a case study of plant communities in a protected area of the French Alps (the Bauges Regional Park), and demonstrated that our extensions improve dimension reduction and reveal additional information from the residual covariance matrix, notably showing how the estimated clusters are compatible with plant traits, endorsing their importance in shaping communities.

INTRODUCTION
Understanding and predicting the distribution of species across space and time is one of the central questions in ecology (Thuiller et al., 2013). As such, species distribution models (SDMs) are essential tools to investigate how species respond to environment (Guisan and Thuiller, 2005;Elith and Leathwick, 2009;Guisan et al., 2017). The main principle is to relate individual species observations to a set of environmental predictors. The estimated relationship between species and the environment allows to infer the environmental niche of the species and then to predict its distribution for new environmental conditions, either in space or time, or in both (Guisan and Thuiller, 2005;Merow et al., 2014;Guisan et al., 2017). While SDMs could be used to study species assemblages (a technique commonly called stacked SDM (sSDM), see Ferrier and Guisan, 2006;Calabrese et al., 2014), they were meant to model and predict the distribution of individual species. To model species assemblages, recent statistical advances yield to Joint Species Distribution Models (JSDMs) (Pollock et al., 2014;Warton et al., 2015;Clark et al., 2017;Ovaskainen et al., 2017b), which are multivariate extensions of generalized linear regression models (GLM) (other approaches can be found in Harris, 2015;Vanhatalo et al., 2020). In JSDMs, the regression coefficients are related to the response of species to the environment, as in SDMs, while the correlation among the residuals describe the pairwise-species dependencies not explained by the environment.
Since JSDMs were created to deal with community data, they are gaining popularity with the ever-increasing developments of novel methods for community assessment, such as environmental DNA (eDNA) metabarcoding (Taberlet et al., 2012). However, their application to large datasets still faces strong limitations such as computational costs and the interpretation of the residual covariance matrix. Indeed, JSDMs are computationally expensive because the number of estimated parameters in the residual correlation matrix grows quadratically with the number of species. There are several approaches to address this problem. For JSDMs that are based on the multivariate probit model, computational reduction can be achieved by efficient parallel sampling (Chen et al., 2018;Pichler and Hartig, 2020) to fit a full covariance matrix in a frequentist framework. Another common solution relies on dimension reduction through latent variable models (LVM) (Warton et al., 2015), where the effective dimension of the model is reduced by a low-rank approximation of the residual covariance matrix (Warton et al., 2015;Hui, 2016;Ovaskainen et al., 2016;Taylor-Rodriguez et al., 2017). While the approximation with low rank values could capture the residual associations in the covariance matrix for a large number of species and significantly improve convergence and computational time (Warton et al., 2015), their wide applications to large dataset is still prohibited (Pichler and Hartig, 2020).
A growing number of species is not only a problem from a computational viewpoint but it also makes the interpretation of the residual correlations challenging. For example, for 100 species, 4,950 pairwise residual correlations are estimated, which represent species associations patterns that are not explained by the environmental covariates and can depend on many factors: model misspecification, missing covariates, and less likely, biotic interactions (Poggiato et al., 2021). Moreover, the symmetric constraint of any covariance matrix impedes to detect any asymmetric dependence between species (e.g., hierarchical competition, predator-prey, Dormann et al., 2018;Poggiato et al., 2021). Therefore, the complexity of the pattern increases with the dimension of the problem and blurs the interpretation of the residual covariance matrix inferred by JSDMs.
To improve such an interpretability of the residual covariance matrix recent works proposed to reduce the number of nonzero residual correlations between species. This is usually done by applying sparsity inducing regularization (e.g., L 1 , elastic net) to the correlation matrix (e.g., elastic-net, Pichler and Hartig, 2020) or its inverse, the precision matrix (Chiquet et al., 2019). However, latent factor JSDMs usually fail to produce sparse matrices (Pichler and Hartig, 2020). We believe that providing additional assumption on the structure of the residual covariance matrix could be a promising avenue. For instance, we might consider block-wise structure of the covariance matrix, such that residual associations would vary between the blocks, instead of individual observations (Moscone et al., 2017). In the JSDM case, it means that we can consider groups of species that share the same association patterns with respect to the other ones. In this case, the model would capture the main associations between (and within) groups of species instead of the species level ones.
Incorporating expert-based knowledge about this block structure of the covariance matrix would further improve this model. Interestingly, most JSDMs are implemented within a Bayesian framework, that naturally allows the incorporation of a prior knowledge, but few ecological studies have actually exploited this feature (Banner et al., 2020). Choosing the prior knowledge that we want to give to the residual covariance matrix is tricky, but feasible. For instance, in a species-rich foodweb, there are a fair amount of species that share the same preys and predators with others, forming what is usually called trophic groups . If they are known, or inferred with a specific approach like a stochastic block model (Lee and Wilkinson, 2019), the number of trophic groups can be used as a prior to reduce the residual correlation matrix. In a similar way, plant functional groups have been designed to group species that share the same traits, respond the same way to environment, and interact the same way with species from other groups (Boulangeat et al., 2012). We believe that the prior knowledge on the number of groups of interest (e.g., trophic or functional) can be used as a prior for the block structure of the residual covariance matrix, which could help to reduce the dimension, and the same time, might help the interpretability of the residual covariance matrix.
Recently, Taylor-Rodriguez et al. (2017) proposed a dimension reduction method that combines a latent variable approach with an additional clustering of the variance-covariance matrix using a Dirichlet process prior. That allows to further reduce the effective dimension and improve the computational efficiency of the model, but in the proposed model, clustering was mainly a tool for dimension reduction, without focusing on further cluster interpretation. That paper also did not address prior information that could be used with the Dirichlet process to inform the number of desired species groups.
Here, we build on this recent work to propose a novel framework that allows for a clustering of residual associations that makes use of prior information. In doing so, we addressed the following questions: (1) Can prior knowledge, combined with dimension reduction on the structure of the residual covariance matrix, improve model inference in JSDM? (2) Can estimated clusters be interpreted in ways that help us understand species communities?
In the following, we first describe the model and our extension that improves clustering properties by incorporating prior knowledge on the number of species that share residual associations. We then introduce Pitman-Yor process, a more flexible Bayesian nonparametric prior, which is less sensitive to miscalibration than the Dirichlet process. We hypothesized that species within the same cluster have similar functional strategies. As a show case, we investigated this hypothesis within the scope of the case study of Bauges National Park.

Statistical Model
We provide a formal description of our model, which is an extension of the model in Taylor-Rodriguez et al. (2017) developed to reduce the dimensionality of the inference in JSDMs, in the particular framework of Generalized Joint Attribute Modeling (GJAM) . GJAM allows to model many types of species observations (presence-absence, counts, biomass, and others) altogether. We present the model and its application for presence-absence data, but since our approach is an extension of GJAM, it is valid for most responses.
To study species distributions, we model a response variable y i with respect to a set of p environmental covariates x i = {x iℓ } p ℓ=1 , at each site i = 1, . . . , n, where ℓ = 1, . . . , p represents the ℓth environmental covariate. The response variable Y i ∈ {0, 1} is a vector where each element y ij contains the observation for species j = 1, . . . , S at site i. JSDMs model the response variable using what is commonly called the multivariate probit model (Chib and Greenberg, 1998), where for species j at site i the probability of presence is modeled through a latent normal variable z ij as Pr(y ij = 1) = Pr(z ij > 0). In dimension reduction approach suggested by Taylor-Rodriguez et al. (2017) z i is modeled as: where B is the S × p matrix of regression coefficients and x is the p × n matrix of measured covariates and w i iid ∼ N(0, I r ) are the latent factors, is the S × r matrix of factor loadings. The number of factors r is supposed to be comparably smaller than S (r ≪ S). Here, latent normal variable z i has residual covariance matrix * = T + σ 2 ǫ I S , which have dimension S × r + 1 and is less than S(S + 1)/2 in general case (see details Taylor-Rodriguez et al., 2017). In this model, the number of latent factors r is fixed, and can be chosen to maximize the goodness-of-fit or some informative criteria like DIC, BIC (Gelman et al., 2004) or using cross-validation. While choosing the number of latent factors r, it is important to verify that the matrix has a full column rank and the model is well-identified (Geweke and Singleton, 1980;Taylor-Rodriguez et al., 2017).
If the residual covariance matrix * represents the cooccurrence pattern not explained by the environment, latent factor models can provide further insights in this residual correlation. Indeed, the low-rank matrix would represent the main axes of variation of the residual co-occurrence pattern. Moreover, latent factors w i could represent missing environmental predictors at site i, and rows of matrix (λ j ) encode the response of species j to these missing predictors. Therefore, latent factors can highlight both the main axis of covariation and a common (or opposite) response to unmeasured covariates (see Chapter 7 of Ovaskainen and Abrego, 2020).
A further dimension reduction proposed by Taylor-Rodriguez et al. (2017), that finds common rows in , is described in the next section.

Clustering in the Residual Covariance Matrix
Latent factors allow to model a "tall and skinny" S × r matrix instead of a "tall and wide" S × S matrix . Further dimension reduction proposed in Taylor-Rodriguez et al. (2017) is based on the reduction of this "tall and skinny" matrix to a "short and skinny" one. To do so, the authors find common rows in , exploiting the clustering properties of the Dirichlet process (DP), a prior distribution used routinely in Bayesian nonparametric statistics. By finding common rows in , the DP creates clusters (i.e., groups) of species that share the same values of j . Therefore, only the N < S unique values of the rows of * are estimated, that will then be repeated for species in the same clusters to obtain and then * . As a consequence, the model no more estimates the "tall and skinny" , but only the "short and skinny" * matrix. Species in the same cluster will have the same value of the corresponding rows of , and consequently these species will also have the same rows and columns in * = T + σ 2 ǫ I S . In other words, species in the same cluster are similar in their residual covariance matrix. Therefore, we will say hereafter that we cluster species depending on their associations with respect to other species. This approach allows to reduce dimension of the model and infer groups of species with the same residual correlation structure.
In the model proposed by Taylor-Rodriguez et al. (2017), clustering was only a tool for dimension reduction, and the paper did not focus on the further interpretation of the clusters that we just discussed. However, the clustering resulting from the Dirichlet process prior depends on the prior specification (De Blasi et al., 2015) for which we offer two extensions. In the first extension, we provide a flexible method to incorporate prior information on the number of clusters that allows clusters to better represent the underlying data. For the second extension, we introduce another Bayesian nonparametric prior called the Pitman-Yor process, which overcomes some limitations of the Dirichlet process and is more suitable for ecological applications.

Dirichlet Process Formulations: DP and DP c
We describe here the original Dirichlet process formulation proposed by Taylor-Rodriguez et al. (2017), as well as an extension of it which allows the introduction of a prior knowledge in a flexible way, respectively denoted as DP and DP c (calibrated Dirichlet process model).
The Dirichlet process is used in Bayesian statistics as a prior distribution over distributions. In other words, sampling from the Dirichlet process provides a distribution that has the important feature of being discrete, thus clustering samples naturally (Ferguson, 1973). This process is parameterized by the base distribution H, which is the mean of the Dirichlet process and the concentration parameter α that regulates how the distribution drawn from the Dirichlet process is concentrated around its mean (base distribution) (see the formal definition in Supplementary Material, section 4).
We denote by λ j , j = 1, . . . , S the rows of the matrix in (1). The original DP model uses the Dirichlet process as follows: where G is the probability distribution drawn from the Dirichlet process prior. Taylor-Rodriguez et al. (2017) chose the rdimensional normal distribution as the base distribution H, and used a fixed concentration parameter α. By the properties of the Dirichlet process, the distribution G is almost surely discrete, so that there will be repeated values in the sampled rows λ j (i.e., there is non-zero probability that two rows collide). The unique values of λ j form the N ≤ S matrix * . We hereafter call a "cluster" (group) the subset of species whose λ j coincide.
The main advantage of the Dirichlet process (and other more flexible Bayesian nonparametric priors) is that it does not pre-specify the exact number of clusters. Dirichlet process prior induce prior distribution on the number of clusters. We may fix features of the induced prior number of clusters through the concentration parameter α, that regulates the clustering properties of the Dirichlet process (see details in Supplementary Material, section 4).
The Dirichlet process is the most widespread nonparametric prior due to its computational ease, but it has several limitations. The main one is precisely that clustering properties are regulated by only one parameter, α. As pointed out in De Blasi et al. (2015), this concentration parameter has a strong effect on the posterior distribution of the number of clusters. Indeed, the prior distribution on the number of clusters of a Dirichlet process prior is highly peaked. As a consequence one would require a high sample size to counterbalance such a strong prior weight, resulting in a low posterior probability to have a number of clusters far from the prior mean.
To overcome this limitation, we added a hierarchical layer for the α parameter to let the model choose values for α, thus providing flexibility to the posterior number of clusters. We chose a Gamma distribution as a hyperprior for α, so that α ∼ Ga(ν 1 , ν 2 ), where ν 1 , ν 2 are hyperparameters. We implemented a within-Gibbs Metropolis-Hasting step (see for details Supplementary Material, section 6), to sample from the posterior distribution of this parameter (Robert and Casella, 2004). As in Taylor-Rodriguez et al. (2017) we use the Dirichlet multinomial process for approximating the Dirichlet process (Muliere and Secchi, 2003). We hereafter refer to this model as DP c (calibrated Dirichlet Process). By conveniently choosing the hyperparameters of the Gamma distribution, we can calibrate the expected value of the prior distribution on the number of clusters induced by the DP. Indeed, the expected number of clusters for the Dirichlet multinomial process is (Lijoi et al., 2020a, Example 3): where (x) n = x(x + 1) . . . (x + n − 1) denotes the increasing factorial coefficient, for any real number x and integer n.
By further sampling from parameter α and from Ga(ν 1 , ν 2 ) and using (4) we can ultimately determine the values of the hyperparameters (ν 1 , ν 2 ) that guarantee that the prior expected number of clusters K S matches our prior knowledge on the number of clusters K * , i.e., E[K S ] = K * (see details in Supplementary Material, section 5.1). In our case, an ecologically-driven prior knowledge is used to specify the prior belief on the number of clusters K * . The ground truth on the value of K * is hard to be known, but due to the larger prior variance provided by the hierarchical modeling of α the prior is not fully informative, and allows the inclusion of an eventually uncertain prior knowledge too. A sensitivity analysis has to be carried to confirm such a choice. As a side note, notice that in the model proposed by Taylor-Rodriguez et al. (2017) one may suitably select the parameter N in order to fix the prior mean on the number of groups, but this could lead to an extremely peaky prior distribution (Supplementary Figure 4), which may not result in a flexible model. While providing more flexibility to the clustering properties of the model, the Dirichlet Process is still limited by its dependence on a single parameter α. We therefore propose another extension of the model, by introducing the Pitman-Yor process.

Pitman-Yor Process Formulation: PY c
The Pitman-Yor (PY) process is a flexible generalization of the Dirichlet process (see the full description in Supplementary Material, section 4). Indeed the Pitman-Yor process is characterized by the base measure H, the concentration α and, importantly, by a discount parameter σ ∈ [0, 1). When σ = 0, the Pitman-Yor process is anything but the Dirichlet Process. The parameter σ influences the variance of the prior number of clusters, and a high value of σ leads to high variance for the distribution of the prior number of groups. As a consequence, the posterior distribution is less constrained by the prior, and the resulting clustering is more flexible. Denote by K S the prior number of clusters for S samples. Another property of Pitman-Yor process is that the number K S grows more rapidly with the number of species S than for the Dirichlet process (Pitman, 2006). For the Pitman-Yor process the number K S follows a power-law, i.e., E[K S ] grows as S σ when S − → ∞, while for the Dirichlet process it grows logarithmically as log(S) when S − → ∞ (Pitman, 2006). Moreover, the cluster size distribution also shows power-law under Pitman-Yor process (Pitman and Yor, 1997). For many real applications, this power-law property is a more natural assumption than in the Dirichlet process, where we generally have a small number of clusters with a high number of observations, and a large number of clusters with only a few observations.
We therefore considered a Pitman-Yor process as a prior for the rows of , similarly to the DP and DP c models.
where H is the base measure as in (3), α is the concentration parameter, and σ is the discount parameter. In our model we used the finite-dimensional Pitman-Yor multinomial process proposed by Lijoi et al. (2020b), which approximates the Pitman-Yor process and allows tractable computation (more details in Supplementary Material, section 4). We assumed parameters α and σ as fixed following (Lijoi et al., 2020b), and that the Pitman-Yor multinomial process is flexible enough and does not require a prior distribution on hyperparameters. We use the prior distribution on the number of clusters for the Pitman-Yor multinomial process to compute the prior expected number of clusters E[K S ] and variance V[K S ] of this prior distribution. We can set E[K S ] = K * and specify the variance V[K S ] to reflect the desired level of uncertainty and then solve the system of equations numerically. However, the solution could be computationally challenging for some values of parameters. In addition, certain pairs of expectation and standard deviation are not easily attainable (see Figure 6 in Bystrova et al., 2021). Here, we firstly fix parameter σ , choosing the distribution variability, and then find the parameter α, such that E[K S ] = K * (see details in Supplementary Material, section 5.2). We refer to this model as PYc (calibrated Pitman-Yor process model), see comparison with DPc model in Table 1.

Clustering Analysis
We summarize the posterior distribution of the clusters to obtain a clustering (i.e., partition of species) for each model DP, DP c , and PY c (the procedure is described in Supplementary Material, section 7). Notice that there is a difference between the posterior expected number of clusters and the number of clusters of the estimated clustering, obtained by the algorithm described in Supplementary Material (section 7) for posterior inference on partition space. The former describes the distribution of the number of clusters in Markov chain Monte Carlo (MCMC) samples (Robert and Casella, 2004), while the latter represents the number of clusters in the single partition that best represents the posterior distribution of the clusters in the MCMC samples. Generally, even if one has certain prior knowledge on the number of groups, it is possible that there is no information on the cluster composition. In our case study, we have a prior expected number of clusters and we also have a cluster composition. For this reason, we could also assess the composition of obtained cluster in-depth. To do so, we measured the distance between the clusters obtained by the models and the clusters we used as a prior belief. We used the adjusted Rand index (ARI) (Hubert and Arabie, 1985), which is the corrected for chance proportion of the number of agreements (species clustered similarly) in all possible pairs of species divided by the total number of all possible pairs. This value is between 0 and 1, where 1 corresponds to exactly the same cluster composition. Additionally, we checked how the choice of the prior number of groups affects the posterior distribution (sensitivity to the prior, Supplementary Material, section 8.3).

Study Site and Species Information
We illustrated our methods with data on plant species in Bauges Natural Regional Park (France) available from the Alpine Botanical Conservatory (CBNA) and previously analyzed by Thuiller et al. (2018). We included as covariates the first two principal components of the environmental variables presented by Thuiller et al. (2018), including a quadratic term (using orthogonal polynomials to reduce correlation among the covariates). We considered presence-absence for the 111 most abundant species (present at least in 2% of sites) across the 1-139 plots. For details on the data processing steps (see Supplementary Material, section 1). We considered the 16 plant functional groups (PFGs) that were built in Thuiller et al. (2018). These PFGs have been obtained through hierarchical clustering, in order to build groups of species that have a similar functional role: they have a similar tolerance of abiotic conditions, dispersal abilities, resistance to disturbance (grazing and mowing), response to competition for light (whether they germinate and grow under specific light conditions), competitive effects (estimated by the height of the species) and demographic characteristics (life-form, longevity, age of maturity). We refer to Thuiller et al. (2018) for a complete description of PFGs and how they were classified. The number of these groups were used to specify the DP c and PY c priors, by fixing E[K S ] = 16. Note that the number of groups, but not their composition, was used for prior specification.

Implementation and Specification of the Models
We applied our DP c and PY c models together with the original DP model in dimension reduction with the default settings on the Bauges plant data. We fitted the models using Bayesian inference via MCMC using a Gibbs sampling scheme. For the original DP model, we used the R package GJAM (Taylor-Rodriguez et al., 2017). We implemented the DP c and PY c models in R by extending the original GJAM R package. In particular, we implemented an additional adaptive Metropolis-Hasting step (for DP c ) and the multi-step algorithm proposed by Lijoi et al. (2020b) to sample from PY c (see details in Supplementary Material, section 6). Our code (an extension of the GJAM package) can be found at the first author's Github repository (https://github.com/dbystrova/GJAM_clust). The prior on the number of clusters was set using the number of plant functional groups (PFGs) as described above (see Supplementary Material, section 5 for the calibration method and for the importance of such step). For the sake of comparison, we used the same default non informative priors suggested by Clark et al. (2017) for all other parameters of the three models.
Convergence was assessed through the calculation of Gelman-Rubin diagnostics (Gelman and Rubin, 1992) or visual inspection of the trace plots.
For the dimension reduction regime in GJAM model, the number of latent factors in the first step of dimension reduction needed to be specified. The number of latent factors was chosen using the deviance information criterion (DIC) (Spiegelhalter et al., 2002) (see details in Supplementary Material, section 2). Model fit was evaluated at the species level. Prediction performances were not the main objective of the paper, as we do not expect the residual covariance matrix to impact predictive mean values (Norberg et al., 2019;Wilkinson et al., 2019;Pichler and Hartig, 2020). However, we did check that the model fitted well the data both on training and test set. The dataset was randomly partitioned into a training and a testing dataset, using 70/30% ratio. We fitted models on the training dataset and then predicted species occurrences on the testing data, comparing the predicted and the actual occurrences, similarly to Norberg et al. (2019) and Wilkinson et al. (2019) (cross-validation is not a doable task due to the computational costs of the models). For each species, we measured the predictive performances by calculating the area under the receiver operating characteristic curve (AUC) on both training (AUC in ) and testing datasets (AUC out ).

Ecological Representation of the Clusters
We hypothesized that species within the same clusters might have similar functional strategies as measured by distance in trait space. We considered the following traits: Landolt nutrient indicator, Landolt light indicator, height (in the logarithmic scale), specific leaf area (SLA), leaf dry matter content (LDMC), leaf carbon concentration (LCC), and leaf nitrogen concentration (LNC) (Brun et al., 2019). All traits presented here were available for at least 70% of the species. For a more intuitive understanding, we assigned traits with a similar role in the community assembly process (Boulangeat et al., 2012) into four categories: competitive effect (height, SLA, LDMC, LCC, LNC), tolerance to abiotic and biotic conditions (Landolt nutrient indicator, Landolt light indicator), interaction via light resources (height, SLA, Landolt light indicator), and interaction via soil resources (LNC, Landolt nutrient indicator). Specifically, we calculated the following species-specific ratio for each species, each category of traits and each clustering method (including the PFGs) to measure whether species within the same cluster share a similar range of functional traits: Species grouped-trait ratio = mean(distance to other species) within cluster mean(distance to other species) all species In accordance with our hypothesis, we expected these distributions of species grouped-trait ratios to be close to zero, however not exactly zero, as exact zero would indicate the singleton clusters. This specifically indicates that the species were closer to within-cluster species than to species in the other clusters, thus fitted clusters could represent similar functional strategies. However, in our interpretation, we also penalized the clustering method when the number of singleton clusters increase as they do not serve the aim for clustering functionally similar species.

Prediction Evaluation
Supplementary Table 2 provides the predictive performances (both in-sample and out-of-sample) of the models, that are all very similar across models. The data are well-explained (mean value of AUC in is around 0.755), and the performances do not drop on the test dataset (AUC out around 0.745).

Clustering Properties of the Models
The posterior distribution of the number of clusters of the DP model with a mean value of 35 was substantially lower than the prior mean of 56 (Supplementary Table 1, Figure 1). Larger variances for the DP c and PY c models reduced prior weight and thus posterior distributions remained closer to their prior mean of 16, yielding a posterior mean near 20 (Figure 1). Thus, the posterior cluster estimate from the DP model estimated more clusters than did the DP c and PY c models (18, and 20, respectively), which were closer to the number of PFG groups (16); see point estimates in Figure 1. Figure 2 provides the ARI as similarity measure between clusters estimated by each model and the PFGs. The posterior cluster estimates for all models are distant from the PFGs as the value of the ARI measure between each of the clusters and PFGs is close to zero. The DP is however more distant from PFGs than DP c and PY c . The DP c and PY c models yielded cluster estimates similar to one another (Figure 2). Pairwise similarities with a random partition in the Supplementary Material (see Supplementary Figure 5) show that PFGs are closer to the estimated clusters than a random partition.
We have tested sensitivity to prior specification for the PY c and DP c models, specifying prior at lower (8) and larger (56) values (Figure 3). PY c model which has a larger variance for the prior distribution of the number of clusters, appeared to be less sensitive to prior specification than DP c .

Clusters Interpretation
The clusters estimated by DP c and PY c represent functional strategies (Figure 4), particularly for traits related to tolerance. The resulting clustering of the DP model contains many singleton clusters (i.e., clusters with one species), which have zero grouped-trait ratio, thus imply lower overall grouped-trait ratio for DP model (Figure 4). Supplementary Figures 7-9 show the residual covariance matrix inferred by the models.

DISCUSSION
Understanding what are the main environmental drivers of species distributions and biodiversity is one of the main goad of ecology. This task requires to consider a large number of species with as a consequent high computational cost of the models employed, whose feasibility depends on dimension reduction and the inclusion of an expert-based prior knowledge. Here, we presented an extension of the dimension reduction approach for joint species distribution models proposed by Taylor-Rodriguez  et al. (2017), by incorporating prior knowledge and by providing a more flexible clustering method. While reducing the dimension of the model, we provide a tool to create groups of species that share the same associations with respect to other species. For studies where a specific residual covariance structure is desirable our approach brings new flexibility to JSDMs. Our application shows a case where residual covariance is structured in agreement with functional traits, suggesting that these traits determine presence-absence beyond what is explained by the mean structure of the model.

Clustering Properties of the Models
The results of our case study confirm the importance of carefully choosing the prior in the DP model proposed by Taylor-Rodriguez et al. (2017). The DP specifies greater prior weight on the number of clusters than can be desirable in some applications. For this application where we specified a prior mean that was far from the posterior (i.e., using the default settings of Taylor-Rodriguez et al., 2017), the posterior distribution of the number of clusters of the DP model moved far from the prior, but without the full flexibility we offer here (Figure 1). Large prior variances on the DP c and PY c models make them less informative. In this application where FIGURE 4 | Distribution of species grouped-trait ratio for different trait categories and for all clustering methods. The reference curve is the distribution of species grouped-trait ratio of PFGs (DP,DP c , PY c ).
we specified a prior mean close to the posterior, we found prior-posterior agreement.
By tuning the parameter N, the DP model would have also recovered the desired number of clusters. However, due its very peaky prior distribution (i.e., strong prior weight), it has less flexibility to move far from the prior when sample size is limited (sample size in our case, number of plots is n = 1, 139, number of species is S = 111). For this reasons we did not test for the ability of this model in the case study. Finally, the fact that the posterior distribution of the number of clusters of PY c for different prior choices stay close, confirms that the prior on the number of clusters for the models (E[K S ] = 16) is well-chosen.

The Importance of Prior Elicitation
Including prior knowledge is an appealing feature of Bayesian statistics, which is however often unused, or worse, misused (Banner et al., 2020). Expert-based prior knowledge on species interactions has always been available, and it is now getting more and more accessible . While cooccurrence networks should not be interpreted as interaction networks, we claim that this prior knowledge can help to separate the effect of the environment from the one of biotic interactions, to improve inference of interaction networks, but also to account for biotic interactions in predictive distribution models. In our case, prior knowledge does not concern particular species-specific interactions, but informs the model on the number of groups of species that share the same associations with other species. Although these associations should not be confounded with interactions, we suggest that our model DP c is a first attempt to include prior information when building co-occurrence networks. Since time-series contain much more informations on biotic interactions than snapshot data, we could further extend our model to cluster the autoregressive coefficients of dynamic JSDMs (Ovaskainen et al., 2017a;Clark et al., 2020), in order to truly include a prior knowledge on the structure of the interaction network.

Clustering Species in JSDMs Framework
Thanks to new sampling techniques (e.g., eDNA, Taberlet et al., 2012), community data are becoming more and more available. Learning the structure of a co-occurrence network from data with a large number of species is particularly demanding, since for a given number of nodes S, there exists 2 S possible networks. Moreover, even in case a correct inference is possible, it is not an easy task to visualize, and then summarize and analyze a large network. Clustering species allows to zoom out from the species level, focusing on a broader scale, easier to model, visualize, and describe (Ohlmann et al., 2019). Indeed, our model both reduce the dimension of the problem and enable a better understanding of the ecosystem under study, showing how these clusters are strongly linked to functional traits. We emphasize that our method is conceptually different from applying a clustering method (e.g., hierarchical clustering) on the inferred residual correlation matrix, because in that case species in the same cluster do not exactly share the same residual associations and the dimensions of the model would not be reduced. Finally, we notice that since we cluster the residual correlation matrix, we do not filter out indirect associations (Harris, 2016;Popovic et al., 2019). To do so, our model should be extended to cluster the residual partial correlation matrix (i.e., the inverse of the residual correlation matrix) to truly represent a co-occurrence network, and not a network of marginal correlations (that represent both direct and indirect associations).

The Role of Functional Traits to Shape Community Assemblages
With our case study, we show how the proposed clustering methodology could facilitate the description and provide better insights of the residual covariance matrix. Firstly, while we acknowledge that such a residual covariance matrix should not be interpreted as a species interaction network, we believe that we can still attribute a certain ecological meaning to the residual associations between species. Indeed, species within the same inferred clusters share similar competitive abilities, similar tolerance level to abiotic and biotic conditions and interact in the same way even when we consider ecological processes at different levels such as interactions for light and soil resources (Figure 4). Moreover, species within the same clusters tend to be positively correlated (Supplementary Figures 8, 9). For example, with both clustering methods (DP c and PY c ), we observed that Sorbus aria (i.e., mountainous tree) and Hieracium murorum form a cluster. The latter being a mountainous understory herbaceous species it needs the shade provided by the former: therefore, the two are positively correlated in the residual covariance matrix. Another example is given by five species (Lonicera xylosteum, Corylus avellana, Mercurialis perennis, Hedera helix, Fraxinus excelsior) from different life forms that are always grouped together with both clustering methods (DP c and PY c ) for that they are all found mainly at forest edges and the herbaceous understory species, Mercurialis perennis, benefiting from the shade of the trees. In addition, the same cluster is negatively correlated with the big cluster no. 5 (built with PY c method, Supplementary Figure 9) that is mainly grouping lowland to subalpine species that are shade-intolerant but can tolerate nutrient poor soils. In sum, the groups of species that we build represent those species that tend to co-occur together more than expected by the lens of observed environmental variables. Notice that this might also be an indication that species within the same clusters also happen to be in similar habitats, suggesting missing environmental variables. Notably, the fact that species within the same clusters tent to show similar values of the Landolt nutrient indicator suggests that soil properties might explain some of the residual correlations (Supplementary Figure 3).
Moreover, we believe that these results will also have practical advantages. The PFG building framework (Boulangeat et al., 2012) allows to group species according to their functional strategies in the aim of reducing the botanical complexity in dynamic vegetation models. As shown here, our models provide clusters that could represent similarity in tolerance to abiotic and biotic conditions and their competitive ability at least as much as PFGs. Hence, the obtained clusters in our case can be considered as a valuable alternative to the PFG building framework, that requires the availability of many functional traits for most species.
Despite these improvements and advantages, we also acknowledge some possible pitfalls. Notably, missing covariates have always the potential to drive the patterns in the residual covariance matrix. The fact that our clusters performed well in representing the traits related to tolerance to abiotic conditions might be an indication of such a problem. Among these traits, Landolt nutrient indicator represents soil nutritive requirements of plants and was quite well represented by the clusters (Supplementary Figure 3). Having in mind that we were not able to include soil data among the covariates for this case study due to data availability, it is possible that the residual co-occurrence patterns are also driven by the soil properties. Another way forward in the framework would also be including habitat and soil information as covariates to further investigate if we can retrieve different patterns in the residual covariance matrix that are more directly related to biotic interactions.

CONCLUSION
We propose a statistical framework that allows an additional but ecologically meaningful dimension reduction of joint species distribution models and includes prior knowledge in the residual covariance matrix, providing a tool to infer clusters of species that share the same residual associations with respect to other species. The case study shows that the obtained clusters of species are ecologically meaningful, and correlated with functional traits. Therefore, our model can also be seen as an alternative way to build functional groups without having to measure all necessary traits.

DATA AVAILABILITY STATEMENT
The datasets analyzed for this study can be found at the first author's Github repository (https://github.com/dbystrova/ GJAMclust).