Combining Landscape Genomics and Ecological Modelling to Investigate Local Adaptation of Indigenous Ugandan Cattle to East Coast Fever

East Coast fever (ECF) is a fatal sickness affecting cattle populations of eastern, central, and southern Africa. The disease is transmitted by the tick Rhipicephalus appendiculatus, and caused by the protozoan Theileria parva parva, which invades host lymphocytes and promotes their clonal expansion. Importantly, indigenous cattle show tolerance to infection in ECF-endemically stable areas. Here, the putative genetic bases underlying ECF-tolerance were investigated using molecular data and epidemiological information from 823 indigenous cattle from Uganda. Vector distribution and host infection risk were estimated over the study area and subsequently tested as triggers of local adaptation by means of landscape genomics analysis. We identified 41 and seven candidate adaptive loci for tick resistance and infection tolerance, respectively. Among the genes associated with the candidate adaptive loci are PRKG1 and SLA2. PRKG1 was already described as associated with tick resistance in indigenous South African cattle, due to its role into inflammatory response. SLA2 is part of the regulatory pathways involved into lymphocytes' proliferation. Additionally, local ancestry analysis suggested the zebuine origin of the genomic region candidate for tick resistance.

Supplementary Text

Supplementary Text 1. MODIS collection 5 data.
The 72 ten-days annual NDVI periods used in the present study are based on the MODIS collection 5 (C5) data, whose processing ended in March 2017 (Budde, 2018;personal communication). C5 data are no longer distributed, so that the link http://earlywarning.usgs.gov/fews/product/116 originally used for the download has expired. Upon permission from the US Geological Survey (USGS), we make available the C5 NDVI data from the Dryad Digital Repository (Heneghan et al., 2011): doi:10.5061/dryad.sf5j2bf.

Supplementary Text 2. Statistical significance of genotype-environment associations.
Statistical significance of genotype-environment associations was evaluated by means of a likelihood-ratio (LR) test. Null and alternative models (NM and AM, respectively) were compared for each genotype.
Given a specific genotype, NM was specified as: where β 0 represents the model intercept, β s the regression coefficient related to the s-th population structure predictor, and x si the i-th observation of the s-th population structure predictor.
AM was specified as: where β Z is the regression coefficient for the environmental variable Z, and z i is the i-th observation of Z.
This way, NM is nested within AM (i.e. NM=AM when β Z =0). Then, a LR test was performed for each genotype to test if the inclusion of Z (i.e. in turn ψ R or γ) led to a significantly improved explanation of the genotype spatial distribution. Particularly, as SAMβADA returns log-likelihood (LogL) values, the LR test was specified as:

= −2(LogL NM − LogL AM)
Under the null hypothesis that NM and AM have the same log-likelihoods (i.e. D=0), D follows a χ 2 distribution with degrees of freedom equal to the difference in the number of parameters between AM and NM. Here, p-values were derived from a χ 2 distribution with one degree of freedom (AM having one parameter more than NM; see figure hereafter). P-values estimate was performed with the R function pchisq, by setting the appropriate value for degrees of freedom. The option lower=FALSE was used to compute the probability of obtaining the observed (or more extreme) D values under the null hypothesis.
Expected distribution of the likelihood-ratio test statistic (D) under the null hypothesis that NM and AM have the same log-likelihoods (i.e. D=0). Here, D values are distributed following a 2 distribution with one degree of freedom.

Supplementary Text 3. Additional SAMβADA analysis (K=3 and K=16 corrections).
In order to test SAMβADA results obtained with the K=4 correction (see main text), we also performed genotype-environment associations tests with alternative population structure predictors. Particularly, an additional ADMIXTURE analysis was performed on PSD for K values from two to 30 (Supplementary Figure 4) (Kopelman et al., 2015), and the membership coefficients from the K=3/K=16 cluster solutions were independently used as population structure predictors: K=3 was chosen to account for the fundamental gene pools present in cattle (i.e. African Bos taurus taurus, European B. t. taurus and B. t. indicus), and K=16 as the cluster solution showing the lowest CV error (see figure hereafter). In both the cases, a PCA was performed on the membership coefficients to obtain synthetic and orthogonal population structure predictors following the same procedure described in main text. When correcting for K=3, the first and second principal components (accounting for 63% and 37% of the variance within membership coefficients, respectively) were retained. The principal components from one to seven (accounting for 70% of the original variance within membership coefficients) were used when correcting for K=16 (Jolliffe, 2002). Results of additional SAMβADA analysis for the K=3/K=16 corrections are reported in Supplementary Figure 12.
Cross-validation (CV) errors associated with the cluster solutions (K) tested for partitioning PSD.  Figure 4) and literature findings (Decker et al., 2014).

Supplementary Text 4. Selection of reference populations for local ancestry analyses.
Four taurine/zebuine references combinations were tested: (i) Muturu-Tharparkar; (ii) Muturu-Lohani; (iii) Hereford-Tharparkar; (iv) Hereford-Lohani.  (Table 3), and (iv) the Hereford-Lohani comparison (Table 4).  ) and γ c , as derived from beta regression analysis. Column headings, acronyms and codes for statistical significance are the same described in Table 1. Upper table: association between Tharparkar ancestry at window 13 (chromosome 26) and ψ Rc , as derived from beta regression analysis. Lower table: association between Tharparkar ancestry at window 145 (chromosome 13) and γ c , as derived from beta regression analysis. Column headings, acronyms and codes for statistical significance are the same described in Table 1.

Supplementary Figure 2. Selection of the annual period with NDVI values best explaining
Syncerus caffer records.
Seventy-two models were tested with Maxlike (Royle et al., 2012), one for each "eMODIS" annual period (or composite; see X-axis in the plot). Model performances were evaluated through the Akaike information criterion (AIC). Here, ΔAIC values (Y-axis) are plotted representing the differences between the AIC values of the tested models and the AIC value of the best model. The model testing composite 21 (ea21stm; April 6-15) resulted the best one displaying a ΔAIC=0. The horizontal dotted line corresponds to ΔAIC=2, which is a suggested threshold to discriminate models with a performance comparable to the best one (Muscarella et al., 2014). ΔAIC values of the best model and those <2 are highlighted in red.

Supplementary Figure 3. Outlier detection in the infection risk model predictors.
Predictors of the infection risk model were checked for the presence of outliers potentially influencing model parameters estimates. Each predictor was individually checked for the uninfected (0) and infected (1) animals through boxplot visualization (A). Outliers were defined as those values lying outside 1.5 times the interquartile range above the upper quartile and below the lower quartile.
After inspection, ψ R , cattle density (Cd) and ψ S were transformed on the log 10 scale to reduce the potential leverage effect caused by the skewness of their distributions (B). Independent Mann-Whitney-Wilcoxon tests were performed to investigate significant differences between groups (H 0 : μ 0 = μ 1 , α=0.05). A significant difference between the means of infected and uninfected animals for BIO 5 (p-value=5.203E−05) and log 10 (ψ S ) (p-value=0.0234) was identified.

Supplementary Figure 4. Additional ADMIXTURE analysis: results.
This figure can be found in the Dryad Digital Repository (doi:10.5061/dryad.sf5j2bf); it depicts the results of the additional ADMIXTURE analysis described in Supplementary Text 3.

Supplementary Figure 5. Correlations between bioclimatic variables and principal components.
Absolute Pearson's product-moment correlation coefficients between first (upper plot), second (inbetween plot) and third (lower plot) principal components and the original bioclimatic variables used to perform PCA. Filled and empty circles indicate positive and negative correlations, respectively. In order to ease interpretation of principal components, a vertical dashed line is positioned in correspondence of bioclimatic variable showing the highest correlation.

Supplementary Figure 6. Candidate Rhipicephalus appendiculatus distribution models and model selection.
Twelve distribution models (X-axis) were tested to describe R. appendiculatus distribution in Uganda, and their performances evaluated based on the Bayesian information Criterion (BIC; Yaxis). The model including first (PC 1 ), second (PC 2 ) and third (PC 3 ) principal components showed the lowest BIC value (filled circle in the plot), and was therefore retained to predict ψ R over Uganda. R. appendiculatus occurrence probability (ψ R ) as derived from the selected distribution model (B) (see Supplementary Figure 6). Lower (A) and upper (C) bounds of the 95% confidence intervals around predicted ψ R are reported. Colour key corresponds to predicted ψ R , with increasing values from blue to red tones. Sampled farms are represented as circles, and coloured according to the ψ R value estimated at their geographical location.

Supplementary Figure 8. Candidate Syncerus caffer distribution models and model selection.
Thirty-one distribution models (X-axis) were tested to describe S. caffer distribution, and their performances evaluated based on the Bayesian information Criterion (BIC; Y-axis). The model including altitude (alt), annual precipitation (BIO 12 ), NDVI, and distance from water (Wd) showed the lowest BIC value (filled circle in the plot), and was therefore retained to represent ψ S . Model including BIO 12 +Wd failed to converge, and therefore it does not present any associated BIC.  Figure 11. Quantile-Quantile plots of the genotype-environment association studies (K=4 correction).

Quantile-Quantile plots of the genotype-environment association studies regarding ψ R (A) and γ (B).
Each point is relative to a single likelihood ratio test. In particular, X-axis reports the sorted Dstatistics as derived from a χ 2 distribution with one degree of freedom (i.e. the quantiles of the expected distribution of D values); Y-axis reports the quantiles of the observed distribution of D values. Observed D values from both the genotype-environment association studies were divided for their respective genomic inflation factor (λ) prior plotting, in order to correct for overdispersion possibly due to unconsidered sources of bias. Genomic inflation factors (as reported in the bottom right corner of each plot) and Quantile-Quantile plots were calculated and produced with the function qq.chisq as embedded in the snpStats R package (Clayton, 2015). Overall, observed Dstatistics from the ψ R study suggest a higher divergence from the expectation then D-statistics from γ association study.

Genotype-R. appendiculatus occurrence probability association study (K=3 correction)
Manhattan plot of the genotype-R. appendiculatus occurrence probability (ψ R ) association study for the K=3 correction (refer to Supplementary Text 3 for an explanation). X-axis reports SNPs chromosomal positions on B. taurus chromosomes. On the Y-axis, the test statistic p-values (p) of the associations with ψ R are shown for each genotype, after the Benjamini-Hochberg (BH) correction, and on the -log10 scale. Nominal significance threshold (α BH =0.05) is displayed as a black dashed line, and significant p-values are represented in green. One hundred and twenty-nine SNPs resulted significantly associated with ψ R . Thirty-eight of them resulted significantly associated also when the K=4 correction was used. In particular, Locus BTA-113604 (with genotype AA) resulted from both K=3 and K=4 analyses, by confirming the association found with the K=4 correction (see Table 5A, main text).

Genotype-R. appendiculatus occurrence probability association study (K=16 correction)
Manhattan plot of the genotype-R. appendiculatus occurrence probability (ψ R ) association study for the K=16 correction (refer to Supplementary Text 3 for an explanation). X-axis reports SNPs chromosomal positions on B. taurus chromosomes. On the Y-axis, the test statistic p-values (p) of the associations with ψ R are shown for each genotype, after the Benjamini-Hochberg (BH) correction, and on the -log10 scale. Nominal significance threshold (α BH =0.05) is displayed as a black dashed line, and significant p-values are represented in green. Three SNPs resulted significantly associated; one of these (ARS-BFGL-NGS-18933, genotype AA, chromosome 29) was present also when the K=4 correction was applied, being in linkage with OPCML (see Table 5A, main text). The other associated markers (ARS-BFGL-NGS-103237 and ARS-BFGL-BAC-6188, both with genotype AA) were not present in the set of SNPs resulting from the K=4 correction.

Genotype-T. p. parva infection risk association study (K=3 correction)
Manhattan plot of the genotype-T. p. parva infection risk (γ) association study for the K=3 correction (refer to Supplementary Text 3 for an explanation). X-axis reports SNPs chromosomal positions on B. taurus chromosomes. On the Y-axis, the test statistic p-values (p) of the associations with γ are shown for each genotype, after the Benjamini-Hochberg (BH) correction, and on the -log10 scale. Nominal significance threshold (α BH =0.05) is displayed as a black dashed line, and significant p-values are represented in green. Eight SNPs resulted significantly associated with γ; seven of them resulted significantly associated also when the K=4 correction was used (see Table 5B, main text), by confirming findings obtained with the K=4 correction.