Genetic architecture and polygenic risk score prediction of degenerative suspensory ligament desmitis (DSLD) in the Peruvian Horse

Introduction: Spontaneous rupture of tendons and ligaments is common in several species including humans. In horses, degenerative suspensory ligament desmitis (DSLD) is an important acquired idiopathic disease of a major energy-storing tendon-like structure. DSLD risk is increased in several breeds, including the Peruvian Horse. Affected horses have often been used for breeding before the disease is apparent. Breed predisposition suggests a substantial genetic contribution, but heritability and genetic architecture of DSLD have not been determined. Methods: To identify genomic regions associated with DSLD, we recruited a reference population of 183 Peruvian Horses, phenotyped as DSLD cases or controls, and undertook a genome-wide association study (GWAS), a regional window variance analysis using local genomic partitioning, a signatures of selection (SOS) analysis, and polygenic risk score (PRS) prediction of DSLD risk. We also estimated trait heritability from pedigrees. Results: Heritability was estimated in a population of 1,927 Peruvian horses at 0.22 ± 0.08. After establishing a permutation-based threshold for genome-wide significance, 151 DSLD risk single nucleotide polymorphisms (SNPs) were identified by GWAS. Multiple regions of enriched local heritability were identified across the genome, with strong enrichment signals on chromosomes 1, 2, 6, 10, 13, 16, 18, 22, and the X chromosome. With SOS analysis, there were 66 genes with a selection signature in DSLD cases that was not present in the control group that included the TGFB3 gene. Pathways enriched in DSLD cases included proteoglycan metabolism, extracellular matrix homeostasis, and signal transduction pathways that included the hedgehog signaling pathway. The best PRS predictive performance was obtained when we fitted 1% of top SNPs using a Bayesian Ridge Regression model which achieved the highest mean of R2 on both the probit and logit liability scales, indicating a strong predictive performance. Discussion: We conclude that within-breed GWAS of DSLD in the Peruvian Horse has further confirmed that moderate heritability and a polygenic architecture underlies the trait and identified multiple DSLD SNP associations in novel tendinopathy candidate genes influencing disease risk. Pathways enriched with DSLD risk variants include ones that influence glycosaminoglycan metabolism, extracellular matrix homeostasis, signal transduction pathways.


Introduction
Spontaneous rupture of tendons and ligaments in response to trauma or chronic degeneration is a common injury shared across species.In humans, rotator cuff and Achilles' tendon injuries are common diseases that often lead to chronic tendon degeneration (Thomopoulos et al., 2015).In horses, degenerative suspensory ligament (SL) desmitis (DSLD) is an idiopathic, devastating disease of an essential energy-storing tendon-like structure that prevents hyperextension of the fetlock joint (Mero and Pool, 2002).Typically, a multi-limb disease, horses affected with DSLD experience progressive hyperextension of their fetlocks because of degeneration and rupture of the SL and its distal branches, resulting in lameness and a decreased quality of life (Mero and Scarlett, 2005).Histologically, collagen disruption, accumulation of interfibrillar proteoglycans in ligament matrix, and chondroid metaplasia are key pathological features in affected horses (Halper et al., 2006;Plaas et al., 2011).Age at diagnosis is in the range of ~5-10 years and often results in euthanasia due to the life-limiting lameness associated with dropped fetlocks (Figure 1).The Peruvian Horse (Peruvian Paso), Paso Fino, Warmblood, Morgan, and Akhal-Teke breeds are predisposed to DSLD, whilst ponies and draft breeds have reduced disease risk (Mero and Scarlett, 2005).In some Peruvian Horse families, the incidence may be as high as 40%, suggesting familial association (Mero and Scarlett, 2005).The late onset of the disease means that DSLD-affected horses have often been used for breeding before clinical signs of tendon/ligament injury (TLI) develop, causing economic loss.Clinically, it is recommended not to use affected horses for breeding.
Presently, there is little understood about the mechanism leading to DSLD.Strong breed disposition suggests a substantial genetic contribution to risk of DSLD.However, the genetic architecture of DSLD is unclear.To date, no estimate of DSLD heritability has been reported in any horse breed.Currently, there is no genetic test available that could assess a horse's risk of developing DSLD.
In the Peruvian Horse, as in other horse breeds, domestication and breed development has generated selective pressures on the genome to enable horses to work in agriculture, and transport.More recently, traits such as morphology and performance have been considered during selection for breeding (Petersen et al., 2013;Gouveia et al., 2014).These genetic differentiation events have been evolutionarily generated by natural and artificial selection.An unintended consequence of breed development is the increased incidence of disease within individual breeds or breed groups.Many spontaneous diseases in horses closely mimic heritable disorders seen in humans but occur in a model where reduced genetic diversity within a breed can generate long stretches of linkage disequilibrium (LD).In this regard equine DSLD is an important spontaneous model of chronic human TLI that is often related to disturbances in matrix homeostasis.
During past decades, attempts have been made to discover the genetic background of DSLD in the Peruvian Horse.These studies involved investigation of the molecular pathology of DSLD, disturbances to signaling pathways, a genome-wide association study (GWAS) with low density markers, and case-control differential gene expression analysis (Mero and Pool, 2002;Strong, 2005;Young et al., 2018;Haythorn et al., 2020).Whether a simple or polygenic architecture explains the genetic contribution to this important equine disease remains unclear.Earlier work using low-resolution GWAS in the Peruvian Horse identified candidate loci on chromosomes 6, 7, 11, 14, 26 that did not meet genome-wide significance (Strong, 2005;Metzger and Distl, 2020).Improved understanding of the genetic contribution to DSLD is clearly needed.Initial observational studies of breed predisposition suggest that DSLD-associated genetic variants are enriched in the Peruvian Horse through linkage to desirable phenotypes.From an evolutionary point of view, selection for a desired phenotype through careful breeding results in an increased frequency of haplotypes containing the gene(s) and functional allele(s) conferring the phenotype at a rate greater than expected under a null model of neutral evolution (Cutter and Payseur, 2013).GWAS and detection of signatures of selection (SOS) are two common genetic analysis approaches for case control association between a disease phenotype and genetic markers, typically single nucleotide polymorphisms (SNPs) (Patron et al., 2019;Momen et al., 2022).
Consequently, we recruited a reference population of Peruvian Horses phenotyped as DSLD cases or controls, undertook a GWAS and SOS analysis, and used the reference population to undertake polygenic risk score (PRS) prediction of disease risk.Discovery of strong DSLD candidate loci and genes influencing disease risk would represent a significant advance.Furthermore, we estimated heritability using a population-based pedigree to assess narrowsense heritability.Identification of genomic regions that contribute to the genetic risk of DSLD will permit development of a genetic screening test to assess risk of DSLD in Peruvian Horses.Additionally, gene mutations that influence risk of DSLD in the Peruvian Horse represent important candidate genes for risk of human and canine spontaneous TLI and rupture.We confirmed moderate heritability and a complex genetic architecture for DSLD in the Peruvian Horse and show that PRS prediction using Bayesian ridge regression (BRR) is highly accurate at predicting risk of DSLD in this breed.

Recruitment and phenotyping
Client-owned Peruvian Horses were recruited at the UW Madison School of Veterinary Medicine and Texas A&M College of Veterinary Medicine and through online advertising.Hair bulb samples pulled from the tail or mane, nasal swabs, or EDTA blood samples were collected from 183 Peruvian Horses for case control binary GWAS.The data set consisted of 80 cases and 103 controls.All owners gave informed consent to participate in the study.All procedures were performed in accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health and the American Veterinary Medical Association and with approval from the Animal Care Committee of the University of Wisconsin-Madison (Protocols V1070, V5463) and Texas A&M University (Protocol AUP IACUC 2018-0443 CA).Preparation of the manuscript conformed with the ARRIVE guidelines.If available, a pedigree was collected from each horse to confirm purebred status.DSLD cases were diagnosed with information from the veterinary records, such as physical exam findings, lameness exam findings and photographs.Physical examination consisted of palpation of soft tissue structures in all four distal limbs for pain, swelling, heat, asymmetry or scarring.Range-ofmotion and resistance to manipulation was assessed.Fetlock flexion after gait evaluation at the walk and trot on hard and soft surfaces was evaluated.As DSLD develops, the SL progressively thickens.Over time, the SL mechanically weakens and ruptures, resulting in a classic sign of dropped fetlocks in multiple limbs.In the severe case, obvious thickening and dropping of the fetlocks is evident compared with the normal standing posture (Figure 1).In horses with DSLD, the SL progressively weakens causing hyperextension of the fetlock, hock, and stifle.In some cases, ultrasound examination further confirmed the disease-status.B-mode tendon ultrasound examination of the SL using a linear 12 MHz transducer to include both the medial and lateral branches (Mero and Scarlett, 2005) can provide additional confirmation of DSLD.If necessary, sedation with xylazine or detomidine/butorphanol was given to the horse to facilitate the examination.Control horses were normal on physical exam and ≥15 years, as onset of DSLD in horses in this age range is unlikely (Mero and Scarlett, 2005).If a control horse developed DSLD through follow-up contact with the owner, its phenotype was updated.Medical records were also reviewed for the presence of other diseases that could be associated with development of tendon laxity, although this did not lead to inclusion or exclusion of a subject horse.

DNA isolation and SNP genotyping
DNA was isolated from buffy coat, hair bulbs obtained from the mane or tail, or nasal swabs (Performagene PG-100, DNA Genotek, Ottawa, Canada).Blood was collected in EDTA-coated tubes.For genotyping, samples underwent DNA isolation using the Gentra Puregene kit (Qiagen, Valencia, CA, United States).DNA quantity and purity were assessed using a Qubit 4 Fluorometer (Thermo Scientific, Waltham, MA, United States) and a Nanodrop Lite Spectrophotometer (Thermo Scientific, Waltham, MA, United States).
Isolated DNA samples were stored at −20 °C until genotyping.SNP genotyping was performed using the Axiom Equine Genotyping Array (Axiom MNEC670K, Thermo Scientific, Waltham, MA, United States) which includes a total of 670,796 SNPs.Genomic coordinates based on the latest version of genome assembly, EquCab3.0SNP collection (https://www.ncbi.nlm.nih.gov/assembly/GCF_002863925.1/), was used throughout the study.

FIGURE 1
Degenerative suspensory ligament desmitis (DSLD) is a crippling, painful equine disease.(A) A Peruvian Horse that is severely affected with DSLD and a (B) phenotype-negative control Peruvian Horse.As the disease develops, the suspensory ligament (SL) progressive thickens.Over time, the SL mechanically weakens and ruptures, resulting in a classic sign of dropped fetlocks.In the severe case, obvious thickening and dropping of the fetlocks is evident (inset A) compared with the normal standing posture (inset B).DSLD is typically more evident in the pelvic limbs verses the thoracic limbs, although in some Peruvian Horses DSLD develops in all four limbs.Reproduced from Momen et al., 2022.

Imputation missing genotypes and SNP filtering
Missing genotypes were imputed using the Beagle software, version 5.4 (Browning & Browning, 2007).The software uses a hidden Markov model (HMM) to construct a tree of haplotypes and summarize it in a direct acyclic graph by joining nodes of the tree based on haplotype similarity to infer missing markers.Quality control was performed using PLINK v1.9 (Chang et al., 2015).SNPs were removed from the dataset if they had minor allele frequency (MAF) < 0.05, SNP genotyping call rate <95%, individual horse call rate of <90%, or did not conform to Hardy-Weinberg proportions at P < 1E-06.After quality control, 177,662 SNPs were removed, and 447,630 SNPs remained for analysis.

Heritability estimation from pedigrees
The Peruvian Horse dataset included a total of 1,947 individuals of which there were 54 horses with a case phenotype and 116 horses with control phenotype.Among these, there were 607 individuals with progeny acting as sires and 963 individual females with progeny and the rest consisting of individuals without any recorded progeny.The dataset consisted of 70 full-sibling groups, with an average family size of 2.19 individuals per group.The CFC software tool (Sargolzaei et al., 2006) was used to analyze the structure of the pedigree.Inbreeding was determined by the pedigree relationship coefficient (F), which was computed from the diagonal elements of the numerator relationship matrix F j A jj − 1, where A is the pedigree relationship matrix.The threshold for inbreeding was 1.0.A probit Bayesian linear mixed model, using the MCMCglmm package (Hadfield, 2010), was then used to generate a posterior distribution of heritability for DSLD in the Peruvian Horse.The MCMC chain was run for a total of 1,000,000 iterations plus a burn-in of 20,000 samples and a thinning interval of 5, meaning there were 100,000 posterior probabilities sampled of the variance components.

Genome-wide association study
A univariate logistic linear mixed regression model in R as implemented by the 'gaston' package (Perdry and Dandine-Roulland, 2020) was used for the association analysis.Each SNP was regressed using the Wald test and sex was used as a covariate.P-values were examined to assess the significance of SNP associations with DSLD.A genomic relationship matrix (GRM) as formulated by VanRaden (2008), was used to account for population stratification and relatedness among individuals: Where, where X is an n by m matrix of centered genotypes and p i is the minor allele frequency for allele i.
Two different P-value thresholds were considered.One threshold was determined through a Bonferroni correction (p < 0.05/total number of SNPs).As an alternative approach, we established genome-wide significance thresholds using 95% confidence intervals (CI) derived from the empirical distribution of P-values obtained under the null hypothesis of no association (Karlsson et al., 2013).To construct this distribution, we performed 500 permutations of the phenotypes and reran the GWAS each time.Genome-wide significance was defined as associations surpassing the upper 5% empirical CI, corresponding to a P-value threshold of ≤7.39E−05.A Quantile-Quantile plot was made to compare expected null distribution of the test statistic with the observed genome-wide based distribution.A list of GWAS genes was built using the EquCab3.0 genome assembly and the Ensembl genome browser.Regions of the reference genome were scanned ±50 kb upstream and downstream from the positions of SNPs that crossed the Bonferroni significance threshold.Associated genes were then investigated for relevance to tendon homeostasis using PubMed and the search term "tendon".

Regional window variance using local genomic partitioning
Regional variance analysis enhances the power to detect QTLs by effectively capturing the combined contribution of multiple marker effects within a specific region.This approach enables the identification of genetic variants that may have modest effects individually but collectively contribute to the trait's variation as well as rare variants whose effects are difficult to capture because of lack of statistical power (Oppong et al., 2022).Consequently, there is a benefit to be gained in terms of improving heritability estimates and uncovering genetic variants involved in the control of traits by fitting genome-wide analytical models that adequately capture the combined effects of rare genetic variants (Shirali et al., 2016).
We partitioned the genome of 31 + X chromosomes of the horse into 4,769 windows with the size of 90 SNPs.On average we assumed each window covered ~0.5 Mb of the genome.To determine the optimal window size, we used the longest chromosome in the horse genome (ECA1) as a reference, which has a length of 188.3 Mb.We calculated the SNP density on this chromosome by dividing the total number of distributed SNPs by its length.The result was approximately 190 SNPs per Mb.Based on this, we selected a window size of 0.5 Mb, which corresponds to ~90 SNPs.Then we ran a logistic linear mixed model with two variance components by considering the following mixed model: Where Y is the vector of DSLD case-control status as the binary phenotype, X is a design matrix of fixed effects, and β is a vector of fixed effects, Z w and Z r are the design matrices for the local window (w) and the rest of genome (r) random effects, respectively, with distributions and covariance structures of Here, G w and G r were the relationship matrices calculated using markers that were in the window and all SNPs out of that given window.We then selected the top 5% of windows with the highest heritability and searched for DSLD candidate genes influencing disease risk in each window through the UCSC genome browser using the EquCab3.0 reference genome.Candidate genes were then investigated for relevance to tendon homeostasis using PubMed and the search term "tendon".

Signatures of selection (SOS) analysis
Evidence of signatures of positive selection across the genome of case and control groups was investigated through five complementary statistics designed to detect signatures of selection, including nucleotide diversity (Δπ), number of segregating sites by length (nSL), a statistical test based on a measure of haplotype homozygosity (H12), and integrated haplotype score (iHS).The different statistics were combined using the decorrelated composite of multiple signals method (DCMS) (Ma et al., 2015).This method combines signals of multiple tests and considers potential correlations among the different tests to increase resolution and reduce the proportion of false positives.Nucleotide diversity (π) was calculated with vcftools (Danecek et al., 2011) and the other statistics were calculated using selscan and normalized using the norm script as implemented in selscan (Szpiech & Hernandez, 2014).For each statistic within the case and the control groups, we computed the P-value of the DCMS statistic using fractional ranks using the stat_to_pvalue () function in the MINOTAUR R package (Verity et al., 2017) for all of the SNPs.Then, using the covNAMcd function (alpha = 0.75, nsamp = 50,000) from the rrcovNA R package to calculate an s × s correlation matrix (i.e., the minimum covariance determinant estimator of multivariate location and scatter) between the included statistics (where s represents the number of statistics to estimate the DCMS values).This matrix was used as input in the DCMS function of the MINOTAUR R package to calculate genome wide DCMS values.Once the DCMS values were generated, they were fitted to a normal distribution using the robust linear model (rlm) function of the MASS R package in model = rlm (dcms ~1), in which the dcms object is a vector containing the raw DCMS values.The outputs of the fitted model (i.e., Mu [mean] and SD [standard deviation]) were used as input in the pnorm R function to calculate P-values for the DCMS statistics: dcms_pvalues = pnorm (q = dcms, mean = Mu, sd = SD, lower.tail= FALSE).SHAPEIT2 (Loh et al., 2016) was used for haplotype phasing of autosomes, separately for case and control groups.A list of SOS regions was developed by using the EquCab3.0 genome assembly on the Ensemble genome browser.Regions of the reference genome were scanned ±50 kb upstream and downstream from the SNPs exhibiting a positive selection signature.The analysis aimed to identify genes within the candidate regions that exhibited selection signatures specifically in the cases as the target cohort, rather than in both cases and controls.Candidate genes influencing disease risk were then investigated for relevance to tendon homeostasis using PubMed and the search term "tendon".

Pathway enrichment analysis
In the next step, the genes identified in the top 5% of genomic regions identified from the local genomic variance analyses using GWAS, regional heritability, and SOS analyses underwent functional analysis using genes with biological relevance to tendon homeostasis.This step aimed to reduce potential bias and increase the specificity of our analysis, by screening out unrelated genes based on prior knowledge, functional annotations, and available literature on DSLD and related biological processes.By doing this, we aimed to focus our analysis on genes that were more likely to be directly involved in the condition.
The gene lists derived from the GWAS, SOS and local window variance data were used for pathway enrichment analysis, using g: Profiler (https://biit.cs.ut.ee/gprofiler/), to identify Reactome pathways that are enriched in the experiment.The false discovery rate was set at 0.05 (Raudvere et al., 2019).Pathway enrichment analysis results were visualized and interpreted in Cytoscape using its EnrichmentMap plugin (http://www.baderlab.org/Software/EnrichmentMap) (Shannon et al., 2003).

Principal component analysis (PCA)
We assessed the genetic diversity within the Peruvian horse population using PCA.This analysis enabled us to investigate the variation present within the population and gain insight into its genetic structure.To accomplish this, the genotypic information was also used to compute a GRM between all individuals in case and control groups (VanRaden, 2008).By performing eigen decomposition of the GRM using the base eigen () function in R (R Core Team, 2021), the eigenvectors and eigen values were obtained and the eigenvectors were normalized.Finally, PCs were computed by multiplying eigenvectors by the square root of the associated eigenvalues (Bryant & Yarnold, 1995;Momen et al., 2022).To review the results, we plotted the projection of the individuals on the first two PCs, with colors corresponding to their group assignment.

Polygenic risk score prediction of DSLD risk 2.10.1 Machine learning models
Four different machine learning models: weighted subspace random forest (RF), gradient boosting machine (GBM), least absolute shrinkage and selection operator (LASSO), and elastic net (EN) were used to predict DSLD polygenic risk scores.A weighted RF model was used because the weighted form of the RF model can achieve high accuracy in classifying high dimensional data (Banfield et al., 2006), such as datasets with thousands of SNPs.Because such data often contain many uninformative features for classification, random sampling often does not include informative features in selected subspaces.So, the Weighted Subspace Random Forest algorithm (wsrf) (Xu et al., 2012) and the wsrf R package was used.The gradient boosting algorithm was the second machinelearning algorithm used.It has been shown that this algorithm has a similar or higher predictive accuracy than traditional methods, in both classification and regression problems (Friedman, 2002).The boosting algorithm has been previously used in genome-wide prediction and disease susceptibility studies in animal and plant breeding (Momen et al., 2018;Montesinos-López et al., 2022).A GBM model which combines predictions from an ensemble of treebased classifiers for outcome prediction (Chen and Guestrin, 2016) to generate the final predictions was the algorithm used as a classifier and the R package gbm (Greenwell et al., 2019) was used for implementation.Tuning of the hyperparameters was performed using a 10-fold cross validation grid search technique.Model training and optimization of tuning parameters used the caret R package (Kuhn, 2015).
The third model was the LASSO approach (Tibshhirani and VanRaden, 1996) which is used for efficient feature selection based on the assumption of linear dependency between input features and output values.In a general form, the lasso estimator uses the ℓ1 penalized least squares criterion to obtain a sparse solution to the following optimization problem: , is the ℓ2 -norm (quadratic) loss function (i.e., residual sum of squares), x T i is the i-th row of X, and the β 1 p j 1 β j is the ℓ1 -norm penalty on β, which induces sparsity in the solution, and λ≥ 0 is a tuning parameter.The ℓ1 penalty enables the lasso to simultaneously regularize the least squares fit and shrinks some components of βLasso to zero for some suitably chosen λ.The elastic net (EN) is an extension of the lasso that is robust to extreme correlations among the predictors (Friedman et al., 2010;Ogutu et al., 2012), for example, to overcome the instability of the lasso solution when SNPs as the predictors are in high linkage disequilibrium.In the EN model, there are two L1 and L2 penalties and the balance between them is controlled by a parameter (α).
The glmnet function from the "glmnet" R-package was used for fitting LASSO and EN models.The cv.glmnet () function in this package was used to obtain optimum values for α and λ using a cross validation procedure.

Bayesian regression prediction models
Four Bayesian regression models included Bayesian ridge regression (BRR), Bayes B (BB), Bayes C (BC) and Bayesian Lasso (BL) models were fitted and compared in terms of their prediction accuracy.We assumed that there is a genomic variable predictor, i.e., G g ij with i 1, . . ., n, j 1, . . ., p g .The phenotypic vector y y i was defined as either y i 0 for phenotype-negative controls or y i 1 for DSLD cases.A probit link function as P(y i 1|G i ) Φ(η i ) was used to estimate the model parameters.Where, Φ is a standard normal cumulative distribution function and η i is a linear predictor that has the following form: Where, µ is an intercept or population mean, g ij is the genotype of the i-th individual at the j-th marker, and β j is the j-th marker effect.The probit link implemented used a latent normally distributed variable l i η i + ε i and a measurement model y i 0 if l i < γ, and 1 otherwise, where γ is a threshold parameter; ε i is an independent normal model residual with mean zero and with variance set equal to one.A standard Bayesian linear model was used for prediction as follows: p θ g y, ω g ∝ p y θ g p θ g ω g Here, p(θ g |y, ω g ) is the conditional posterior density of the genomic parameters (θ g μ, σ 2 e , β ), including the residual variance (σ 2 e ), which was assigned a scaled-inverse χ 2 prior density, µ was assigned a flat prior density, and the marker effects (β) were assigned independent and identically distributed informative priors, depending on the model, and ω g is the genomic hyperparameter that indexes the prior density of marker effects which for the different models is: A) BRR assumes the same genetic variance for all markers; i.e., β i ~N(0, σ 2 β ).The prior distribution for marker genetic variance is the following scaled inverted chi-squared distribution, σ 2 β |υ β S β ~υβ S β χ −2 υ β , with hyper-parameters υ β (degrees of freedom) and S β (scale parameter) (Meuwissen et al., 2001).
In Bayesian LASSO, the regression parameter β j is assumed to follow a double exponential (DE) prior distribution regression (Park & Casella, 2008).In this context, , where τ 2 i |λ 2 ~Exp(λ 2 ), the shrinkage factor λ is further assigned with a hyper prior of a Gamma distribution Gamma (λ 2 |s, r), and so it can be estimated as other model parameters.Under this approach, the marginal prior distribution for the marker effect is given by β i |λ ~Double − Exp .(0,λ).This double-exponential distribution presents higher mass at zero, but it does not necessarily set coefficients exactly to zero.The shape (s) and rate (r) parameters of the Gamma prior was specified to s = 1.1 and r (Perez and de los Campos, 2014), where MS x represents the sum of the variances of genotype values of each SNP, and R 2 = 0.5.
We used the BGLR package (Perez and de los Campos, 2014) to fit the Bayesian regression models.A total of 200,000 iterations, plus 20,000 of burn-in samples were considered to create posterior distributions and infer the model parameters.Global convergence was checked by visual inspection of trace plots.

Accuracy of DSLD risk PRS prediction
To find out the optimum subset of the top SNPs, based on the GWAS results, for prediction of the DSLD genetic risk score, first we selected 0.5% (2,238), 1.0% (4,476 SNPs), 2.0% (8,952) and 3.0% (13,428 SNPs) and evaluated performance of all models.Overall, the best performance was when 1.00% of top SNPs was used for prediction.
Five-fold cross validation was used to investigate accuracy of DSLD risk prediction.We used the coefficients of determination (R 2 ) on the probit and logit liability scale to assess predictive performance of our models.R 2 on the liability scale can be obtained by transforming R 2 on the observed scale from linear regression, using the Robertson transformation (Dempster and Lerner, 1950) as: Where, R 2 o is on the observed scale, Z is the height of a normal density curve at the point according to the population prevalence of the disease, and K is the mean proportion of cases in the sample.In probit or logit models, the above formula can be directly obtained as the proportion of variance explained by linear predictors in relation to the total variance on probit liability scale as: where var( bprobit g i ) is the variance due to the explanatory variable (genetic variance) and residual variance is defined as var(e) = 1.Also, on the logit scale, R 2 can be obtained by residual variance of var(e) = π 2 3 3.92 These formulas allowed us to quantify the proportion of the total variance in the liability scale that can be attributed to the genetic factors captured by the linear predictors in the probit and logit models (Lee et al., 2012).So, by analyzing the power of our models' R 2 values to the total variance in the liability scale, we gain insights into the predictive power of our models and their ability to explain the underlying genetic variation.
Furthermore, to validate the predictive performance of the models, a separate validation set comprising 10 Peruvian Horses with known DSLD case (n = 3) and control (n = 7) phenotypes was utilized.Ensemble prediction was applied, incorporating all eight prediction models (BRR, Bayes B, Bayes C, BL, RF, GB, LASSO, EN).Finally, tuning of the posterior probability threshold for PRS (Polygenic Risk Score) prediction as a DSLD case was carried out using the validation set.

Clinical findings in the study population
Pituitary pars intermedia dysfunction (PPID) was identified in two DSLD cases and two phenotype-negative controls, and equine metabolic syndrome (EMS) was identified in four DSLD cases and three controls from medical records (Table 1).No control horses were reassigned as cases.

Heritability
Heritability analysis included 1,947 Peruvian Horses and was conducted using pedigree information.There were 499 (25%) horses considered inbred based on the F coefficient.The mean F coefficient was 1.72% and showed a range from 0.012% to 28.2%.The mean F in the inbred horses was 6.1%.The sample population included 688 founder horses and 1,259 nonfounder horses.There were 376 horses with no progeny and 1,570 horses with at least one progeny.The posterior density of the estimated genetic variance σ 2 g , and DSLD heritability (h 2 DSLD ) are represented in Figure 2. The posterior mean ± SD of DSLD heritability was 0.22 ± 0.08 with the highest (posterior) density interval (HPD) of lower and upper limits 0.081 and 0.419 respectively at 0.95 percent credible interval.The genetic variance component had a posterior mean and standard error of 0.683 ± 0.341, with the HPD interval's boundary of 0.134-1.371.

Genome-wide association study and regional window variance
The study population consisted of 80 cases and 103 controls.There were 7 and 10 stallions, 46 and 45 mares, and 26 and 48 geldings in the case and control groups respectively.The neuter status of one male horse in the case group was unknown.
In our GWAS analysis we considered two cut-off thresholds, a Bonferroni corrected P-value threshold at P < 1E-7 and a permutation-based threshold at P < 7.39E-5.In total, 3 and 151 SNPs passed these thresholds respectively.The three SNPs that exceeded the Bonferroni significance threshold were located on chromosomes 4, 10, and 11.Candidate loci with significant SNPs that passed the Bonferroni threshold contained the NOG, AHR, and   , 2, 6, 10, 13, 16, 18, 22, and the X chromosome.
UBE3D genes.We identified a total of 200 DSLD candidate genes based on the less stringent permutation threshold and after we screened the gene list.After conducting a thorough functional investigation, 17 of them were considered functionally or biologically related to DSLD (Table 2).In this analysis λ = 0.99, indicating absence of systematic biases or population structure that could lead to false-positive associations.As shown in Figure 3A, only SNPs that were significantly associated with DSLD reside outside of the normal distribution line.Multiple SNP associations were identified across the genome (Figure 3B).
Multiple regions of enriched local heritability were also identified across the genome, with strong enrichment signals on chromosomes 1,2,6,10,13,16,18,22, and the X chromosome (Figure 3C).In this analysis, we selected the top 5% of windows (238 windows) with the highest genetic variance and searched for the DSLD related genes in each window through the UCSC genome browser.In total 953 DSLD candidate genes were identified of which 39 genes were relevant to tendon homeostasis (Table 2).

Signatures of selection and principal component analysis
As a prerequisite for selection signature analysis, we performed a PCA analysis (Supplementary File S1).The results showed that the Peruvian Horses clustered together and distributed along these two vectors based on their genomic similarity with a small difference between DSLD case and control groups of horses (Supplementary Figure S1A).Our PCA analysis showed that the first principal component captured 15.9% and the second explained 5.8% of total variance (Supplementary Figure S1B).
The SOS analysis in case-control groups showed that there were 115 genes in candidate loci exhibiting a selection signature in DSLD cases and 123 genes in control populations.Of these genes, 49 were shared between case and control groups.In the DSLD case group, candidate genes included the candidate tendinopathy gene TGFB3 that was not shared with the DSLD phenotype-negative control group (Figure 4 and Table 2).

Pathway enrichment analysis
We combined 60 candidate genes obtained from the three analyses (20 genes from GWAS, 39 genes from window based local variance analysis, and one gene from SOS analysis), for Reactome pathway enrichment analysis.Finally, 33 pathways from the Reactome data base which were most related to tendon homeostasis, were identified (Figure 5).SNP associations with DSLD showed enrichment for pathways including proteoglycan metabolism, extracellular matrix homeostasis, and signal transduction pathways that included the hedgehog signaling pathway (Figure 5).

Polygenic risk score prediction of DSLD risk
Predictive performance of both classification machine learning and Bayesian regression models using a five fold cross validation is represented in Table 3. Machine learning performance was assessed using the top 1% of GWAS SNPs (4,476 SNPs).We considered sex as the only covariate in the predictive model.A chi-squared test with indicates the significance cutoff of (P-value = 0.0001).In the DSLD case group, candidate genes included the candidate tendinopathy gene TGFB3 that was not shared with the DSLD phenotype-negative control group.Additionally, the candidate tendinopathy gene GLI2 exhibited a large selection signature in DSLD cases relative to DSLD controls.
Yates' continuity correction indicated a weakly significant association between sex and disease status in our sample population (χ-squared = 4.55, df = 1, p = 0.0329).The coefficients of determination (R 2 ) were estimated for the predictive performance of four machine learning classifiers and four Bayesian models on both the probit liability and logit liability scale.On the probit liability scale, the Bayesian Ridge Regression (BRR) model achieved the highest mean R 2 of 0.702, followed closely by the Bayes B and Bayes C models with mean R 2 values of 0.699.These models demonstrated relatively strong predictive power in explaining the variation in the data.On the other hand, the Gradient Boosting (GB) model had the lowest mean R 2 value of 0.506, indicating relatively weaker predictive performance compared to the other models.
Similar patterns were observed on the logit liability scale, where the BRR model exhibited the highest mean R 2 of 0.718, followed by Bayes B and Bayes C with mean R 2 values of 0.717 and 0.716, respectively.The GB model again showed the lowest mean R 2 of 0.467, suggesting comparatively lower predictive accuracy.Among the machine learning classifiers, Random Forest (RF) had a mean R 2 of 0.632, while LASSO and Elastic Net (EN) achieved mean R 2 values of 0.562 and 0.682, respectively on the probit liability scale.
When PRS prediction of a separate validation set of Peruvian Horses was performed using ensemble risk prediction, four DSLD control horses were incorrectly predicted using a posterior probability threshold of 0.5 for classification as a case (Supplementary Table S1).Tuning of the threshold (Supplementary Figure S4) identified an optimal threshold of 0.55.With this adjusted threshold, correct classification was achieved for 9 out of 10 horses (Supplementary Table S1).One DSLD control horse was still predicted to have the genetic risk of a case.

Discussion
DSLD is a debilitating condition characterized by systemic deposition of proteoglycan in connective tissues that may yield insight into human TLI associated with similar matrix disturbances.We undertook a within-breed GWAS of DSLD in the Peruvian Horse using a linear mixed model to discover candidate loci and genes that influence risk of the disease.Our  analysis showed that the disease has moderate heritability of 0.22 in this breed.Specific environmental risk factors for DSLD are poorly understood.Our results also show that DSLD has a polygenic architecture with risk loci spread across the autosomal genome.Novel TLI genes and pathways were highlighted from this research.PPID and EMS were occasionally identified with similar frequency in the horses in both the case and phenotypenegative control group.
Our GWAS analysis identified 151 DSLD-associated SNPs, suggesting DSLD is a complex polygenic disease.Environmental risk factors account for the remaining risk.Candidate loci with significant SNPs that passed the Bonferroni threshold contained the AHR, NOG and UBE3D genes.The AHR gene regulates transcription via the aryl hydrocarbon receptor.A role in tendon biology has not been defined, but it is possible that this gene may have a role in extracellular matrix degradation during aging (Salminen, 2022).NOG is a 222 amino acid secreted protein known for binding and inactivating BMP4 and other proteins in the transforming growth factor beta (TGF) superfamily.NOG is known to play an important role in tendon development and homeostasis (Schweitzer et al., 2001), including development of heterotopic ossification with tendon aging (Dai et al., 2020).BMP2, another member of the TGF superfamily, has been previously identified within cellular foci of fibroblasts in DSLD-affected SL (Young et al., 2018).BMPER and BMP2R were also identified as candidate genes in this analysis.DSLD is associated with an atypical accumulation of proteoglycans, such as aggrecan, within diseased SL tissue (Plaas et al., 2011).It is conceivable that NOG may influence aggrecan homeostasis in SL tissue matrix through BMP-SMAD1/5 signaling (Wang et al., 2012).UBE3D is a ubiquitin-conjugating enzyme that plays an important role the ubiquitin proteosome system, regulating protein degradation.It is possible that functional variation in this protein may contribute to the pathogenesis of DSLD through protein degradation (Huang et al., 2015).
A much larger number of SNPs passed the permutation threshold used in this study.Bonferroni correction is widely considered too conservative and may propagate Type II error (false negatives) (Nakagawa, 2004).Because groups of SNPs are inherited together in a haplotype block because of linkage disequilibrium, association testing of individual SNPs is not independent.In this larger set of SNP associations, additional DSLD risk SNPs were identified in genes that could influence tendon homeostasis.Increased CXCL1 expression has been found in chronic tendinopathy (Kendal et al., 2020).GLI2 and GLI3 were also identified as candidate DSLD genes.GLI3 has been linked to mechanotransduction responses during tendon healing (Freedman et al., 2022).In humans, increased expression of MYL1 has been identified in traumatic rotator cuff tears in female patients, whereas MYL2 is highly expressed in degenerative tears in male patients (Rai et al., 2022).A mutation in B4GALT7 has been associated with dwarfism and development of tendon laxity in Friesian horses.B4GALT7 is one of the enzymes that synthesizes the tetrasaccharide linker between protein and glycosaminoglycan moieties of proteoglycans in extracellular matrix (Leegwater et al., 2016).A mutation in TNXB has also been associated with connective tissue laxity that is part of an Ehler-Danlos Syndromelike phenotype (Brisset et al., 2020).Tenascin-X is a matrix glycoprotein that is thought to have an important role in collagen fibrillogenesis (Brisset et al., 2020).Collagen assembly is also regulated by MEP1A (Broder et al., 2013), another DSLD-candidate gene identified in this research.
Through our SOS analysis, we found 66 candidate genes that exhibited a selection signature in DSLD cases that was not present in the control group, further supporting the hypothesis that DSLD has a polygenic architecture.Candidate genes include 25 non-coding RNA sequences, suggesting that regulatory SNPs may play an important role in the genetic contribution to DSLD (Giral et al., 2018).We also found GLI2 exhibited a large selection signature in DSLD cases relative to phenotype-negative control horses.A candidate genomic region from SNP GWAS that also contains a positive selection signature is more likely to contain the causal genetic variant, particularly for diseases with a simple mode of inheritance, but not for complex traits (Kemper et al., 2014).Development of tendinopathy likely represents a failure to repair or remodel extracellular matrix after repetitive micro-injury.In this regard, poor healing has been associated with loss of TGFB receptors from diseased matrix (Fenwick et al., 2001) and downregulation of TGFB3 is found with aging, particularly in tendons exposed to mechanical overload (Kinitz et al., 2021).The Indian hedgehog signaling pathway, which includes the transcription factors GLI1/ GLI2/GLI3, is known to modulate matrix responses to load and healing in tendon injury, particularly at bone attachment sites (Liu et al., 2022).
Pathway analysis of candidate genes identified by GWAS, local variance analysis and SOS identified enrichment of pathways associated with glycosaminoglycan metabolism and extracellular matrix homeostasis.Additionally, signal transduction, particularly the hedgehog signaling pathway also showed enrichment.Glycosaminoglycans have a key role in extracellular matrix composition of tendons and disturbed metabolism of the extracellular matrix of tendon; accumulation of aggrecan in SL tissue and disturbance to decorin glycosylation are key features of DSLD (Kim et al., 2010;Plaas et al., 2011;Haythorn et al., 2020).Mechanotransduction has a key role in tendon and ligament homeostasis and genes that have regulatory effects on mechanotransduction were a key finding in a previous categorical GWAS of DSLD in multiple breeds of horse (Momen et al., 2022).
The Peruvian Horse is a breed with a small effective population (Momen et al., 2022), enabling detection of significant associations and accurate PRS predictions with a relatively small sample size.In the cross-validation experiment, the models we studied demonstrated moderate predictive performance on both the probit and logit liability scales.It is important to consider both the mean R 2 values and the corresponding SD, representing the variability of R 2 values across the cross-validation folds.The low SD values indicate more stable and consistent predictive performance.We identified the BRR model as the best performing single model with a clinically relevant predictive accuracy in the reference population with an R 2 that exceeds 0.7 using the top GWAS SNPs and sex as the only covariate in the predictive model.When we used ensemble prediction in a validation set of 10 independent Peruvian Horses, all horses were predicted accurately except one after tuning of the posterior probability threshold.
These results fit with an earlier observation that PCA analysis using top DSLD GWAS SNPs reflecting breed categorical risk causes the population within the Peruvian Horse breed to form two distinct clusters in contrast to a single breed cluster when all SNPs are considered in the analysis (Momen et al., 2022).Because DSLD is an acquired disease that often develops after horses have reached breeding age and been used for breeding, accurate PRS prediction of DSLD risk is an important advance clinically, as it enables screening of horses for selection for breeding at a young age.
There were several limitations to this work.The sample size in our study population was relatively small at 183 horses.The age was not available for all the cases.In the future, increasing the sample size may help detect additional associations, improve the accuracy of PRS prediction of disease risk, and enable SNP estimation of DSLD heritability.Consideration of athletic activity in prediction models may also be useful in the future.Further validation of PRS prediction is needed by predicting a larger independent test set of horses and evaluating prediction accuracy.It would also be important to follow predicted horses over time to confirm whether young horses predicted as cases develop DSLD later in life.Combining both genotype and pedigree data to estimate heritability could also be considered.In our regional window variance analysis, the choice of the threshold for selecting the top windows with the highest heritability was a subjective decision.A higher or lower threshold than 5% may have yielded different results.More work is also needed to further investigate the key pathways involved in the pathogenesis.RNA-Seq analysis of tendon tissue from DSLD case and control horses will likely help confirm key candidate genes and pathways.Further investigation of candidate genetic variants using whole genome sequencing is also warranted.By including all genes related to DSLD from GWAS, SOS and WIN, and shortening the gene list by selecting biologically compelling genes, we aimed to capture a set of genes that are biologically relevant to the phenotype of interest and increase the power of our pathway analysis to detect meaningful associations.Additionally, we assumed this approach would help reduce the impact of false negatives in our analysis compared with inclusion of all associated genes from our discovery analyses.
In conclusion, our within-breed GWAS of DSLD in the Peruvian Horse has further confirmed moderate heritability and a polygenic architecture underlies the trait and identified multiple DSLD SNP associations.Pathways enriched with DSLD risk variants include pathways that influence glycosaminoglycan metabolism, extracellular matrix homeostasis, signal transduction, interleukin signaling, and apoptosis.PRS prediction using an ensemble prediction pipeline shows clinical promise as a genetic risk test for DSLD.

FIGURE 2
FIGURE 2Posterior densities for genetic variance component and heritability of degenerative suspensory ligament desmitis (DSLD) in the Peruvian Horse.The red line denotes the mean (SD) of the heritability distribution.The heritability estimate was 0.22 ± 0.08 and the mean (SD) of the genetic variance was 0.68 ± 0.34.
FIGURE 3 (A) Quantile-quantile plot comparing the expected P-value distribution to the observed P-value distribution.(B) Manhattan plot of -log 10 (P-value).A linear mixed model GWAS analyzed the association between 447,630 SNPs and the DSLD disease phenotype.The solid red line denotes the Bonferroni corrected significance threshold of ≤1.1E-07.The dotted red line denotes the permutation significance threshold of ≤7.39E−05.There were 3 SNPs that passed the Bonferroni corrected P-value threshold and 151 SNPs that passed the permutation threshold.(C) Enriched heritability windows were evident in multiple regions across the genome on chromosomes 1, 2, 6, 10, 13, 16, 18, 22, and the X chromosome.

FIGURE 4
FIGURE 4 Mirrored Manhattan plot for signatures of selection results in the case and control sub-populations.-log10 P-value of the decorrelated composite of multiple signals (DCMS) values are shown in the upper panel for DSLD cases and in the lower panel for the control group.The red line in the figureindicates the significance cutoff of (P-value = 0.0001).In the DSLD case group, candidate genes included the candidate tendinopathy gene TGFB3 that was not shared with the DSLD phenotype-negative control group.Additionally, the candidate tendinopathy gene GLI2 exhibited a large selection signature in DSLD cases relative to DSLD controls.

FIGURE 5
FIGURE 5Network of the Reactome pathways.Each node (circle) represents a gene set characterized by a particular reactome pathway.Node fill indicates the enrichment score (FDR q-value).The thickness of blue lines (edges) indicates the number of shared genes (overlap) between two connected nodes.Nodes with high overlap are clustered together, forming groups characterized by similar terms and pathways.

TABLE 1
Clinical features of the Peruvian Horse study population.

TABLE 2
Degenerative suspensory ligament desmitis candidate genes in case and control Peruvian Horses identified by genome-wide association study, signatures of selection analysis, and enriched local heritability based on the top 5% of windows consisting of 90 SNPs (~0.5 Mb).

TABLE 2 (
Continued) Degenerative suspensory ligament desmitis candidate genes in case and control Peruvian Horses identified by genome-wide association study, signatures of selection analysis, and enriched local heritability based on the top 5% of windows consisting of 90 SNPs (~0.5 Mb).Candidate genes were identified through analysis of significant SNPs with ±50 kb flanking regions using the EquCab3.0 reference genome.Chr, chromosome.*Significance of association met the Bonferroni threshold.GWAS, genome-wide association study; SOS, signatures of selection; WIN, window analysis of enriched local heritability.

TABLE 3
The estimated coefficients of determination (R 2 ) for each model on the probit liability scale and the logit liability scale for polygenic risk score prediction of risk of degenerative suspensory ligament desmitis in the Peruvian Horse.