Signatures of Selection for Environmental Adaptation and Zebu × Taurine Hybrid Fitness in East African Shorthorn Zebu

The East African Shorthorn Zebu (EASZ) cattle are ancient hybrid between Asian zebu × African taurine cattle preferred by local farmers due to their adaptability to the African environment. The genetic controls of these adaptabilities are not clearly understood yet. Here, we genotyped 92 EASZ samples from Kenya (KEASZ) with more than 770,000 SNPs and sequenced the genome of a pool of 10 KEASZ. We observe an even admixed autosomal zebu × taurine genomic structure in the population. A total of 101 and 165 candidate regions of positive selection, based on genome-wide SNP analyses (meta-SS, Rsb, iHS, and ΔAF) and pooled heterozygosity (Hp) full genome sequence analysis, are identified, in which 35 regions are shared between them. A total of 142 functional variants, one novel, have been detected within these regions, in which 30 and 26 were classified as of zebu and African taurine origins, respectively. High density genome-wide SNP analysis of zebu × taurine admixed cattle populations from Uganda and Nigeria show that 25 of these regions are shared between KEASZ and Uganda cattle, and seven regions are shared across the KEASZ, Uganda, and Nigeria cattle. The identification of common candidate regions allows us to fine map 18 regions. These regions intersect with genes and QTL associated with reproduction and environmental stress (e.g., immunity and heat stress) suggesting that the genome of the zebu × taurine admixed cattle has been uniquely selected to maximize hybrid fitness both in terms of reproduction and survivability.


INTRODUCTION
Domestic cattle are classified, based on phenotypes, into humpless taurine (Bos taurus) and humped zebu (B. indicus) cattle. They originated from different auroch ancestral populations B. primigenius primigenius and B. p. namadicus at two separate domestication centers; the Near East and the Indus Valley-Northern part of the Indian subcontinent, respectively (Epstein, 1971;Loftus et al., 1994;Troy et al., 2001;Chen et al., 2010). These two cattle types are considered as distinct species or subspecies with mitochondrial DNA (Loftus et al., 1994) and microsatellite (MacHugh et al., 1997) analyses suggesting a common ancestry about 200,000 and 700,000 years ago, respectively.
Cattle populations of zebu × African taurine ancestries are common in Africa and in particularly on the eastern part of Africa, Southern Africa, and along the Sahelian belt. According to their phenotypes, they are either referred to as African zebu, e.g., East African Shorthorn Zebu (EASZ) in Kenya, Adamawa Gudali in Nigeria, and Karamojong zebu in Uganda, African sanga, e.g., Ankole in Uganda, or African zenga e.g., Nganda in Uganda (Epstein, 1971;Rege, 1999;Rege et al., 2001). These cattle originated from the earlier migration of B. taurus and B. indicus to the continent from their putative centers of domestication followed by subsequent hybridization between them (Loftus et al., 1994;Chen et al., 2010;Gifford-Gonzalez and Hanotte, 2011). An African auroch influence in their genome has been postulated (Decker et al., 2014) but this remains until now speculative. The cattle colonization of the African continent started first with the arrival of taurine cattle, ∼7,000 years ago (Gifford-Gonzalez and Hanotte, 2011), followed by the zebu type in two waves, ∼4,000 and ∼1,300 years ago, according to pictorial, archeological and genetic evidences (Epstein, 1971;Hanotte et al., 2002). The later was through the Horn of Africa and it was likely linked to the development of the Swahili civilization (Hanotte et al., 2002). Zebu × taurine hybrids are therefore of ancient origin on the African continent. Given the sole presence of the taurine type mtDNA in African cattle, a malemediated introgression of zebu cattle to the native African taurine has been proposed (Loftus et al., 1994;Bradley et al., 1996). Microsatellite analysis (Hanotte et al., 2002) and more recently genome-wide single nucleotide polymorphism (SNP) analyses (Decker et al., 2014) show that the indicine ancestry peaks in the Horn of Africa gradually declining toward the western and the southern parts of the continent.
The small EASZ is the main type of African zebu cattle populating East Africa (Rege et al., 2001). As for other indigenous cattle populations, EASZ are more preferred by the local farmers over the pure exotic highly productive taurine breeds due to their superior adaptability to their local environment, which is characterized by a warm climate (20-23 • C), high humidity (60-80%) and high pathogenes challenges (e.g., Theileria parva, Ehrlichia ruminantium, and Haemonchus placei; de Clare Bronsvoort et al., 2013). These cattle show a degree of resistance to Rhipicephalus appendiculatus tick, the vector of East Coast Fever (ECF) protozoan parasite T. parva (Latif et al., 1991;Latif and Pegram, 1992), as well as tolerance to poor forage and water scarcity (Western and Finch, 1986). A mortality rate of ∼16% has been observed mainly attributed to ECF, haemonchosis, and heartwater in an EASZ population of western Kenya under the traditional management system (de Clare Bronsvoort et al., 2013;Thumbi et al., 2014). The genetic structure of this cattle population has recently been investigated by Mbole-Kariuki et al. (2014) using mid-density genome-wide SNP data. This study showed that the EASZ is a stabilized admixed population of zebu and African taurine ancestries, with an average genome proportions of 0.84 ± 0.009 and 0.16 ± 0.009, respectively. However, some of the EASZ animal showed recent European taurine introgression likely following artificial insemination program aiming to improve indigenous cattle productivity . This European introgression has been linked to increased vulnerability of EASZ to infectious diseases (Murray et al., 2013).
The genomes of several livestock species have now been intensively explored for signatures of positive selection, e.g., chicken (Rubin et al., 2010;Zhang et al., 2012), pigs (Rubin et al., 2012), sheep (Kijas et al., 2012), and cattle (Gautier et al., 2009;Gautier and Naves, 2011;Kemper et al., 2014;. In cattle, the main genomic tool used for this purpose is the commercially available genome-wide SNP chip (e.g., Illumina BovineSNP50 BeadChip and Illumina BovineHD BeadChip). Although the lower-density SNP chip has been widely used to detect signatures of selection on the genome of tropically adapted cattle (Gautier et al., 2009;Gautier and Naves, 2011;Flori et al., 2012Flori et al., , 2014 and commercial dairy and beef breeds Qanbari et al., 2011;Khayatzadeh et al., 2016), it has been associated with two main drawbacks; (i) the limited coverage of the bovine genome with an average markers gap of 49.4 kb, (ii) the SNP ascertainment bias toward European taurine breeds (Matukumalli et al., 2009). These two issues were partly solved upon the development of the higherdensity SNP chip (Illumina BovineHD BeadChip; Rincon et al., 2011). This recently developed tool has shown its usefulness in detecting signatures of selection in dairy and beef cattle breeds (Utsunomiya et al., 2013;Kemper et al., 2014;Perez O'Brien et al., 2014;Xu et al., 2015;Chen et al., 2016) and in tropical-adapted cattle (Porto-Neto et al., 2014;Xu et al., 2015). More specifically, Utsunomiya et al. (2013) have reported for the first time in cattle the use of a composite mapping index "Meta-analysis of Selection Signals (meta-SS), " using the Z-transformation method "Stouffer's method" (Stouffer et al., 1949), to define footprints of positive selection related to meat and milk production traits.
Full genome sequencing has also been used in livestock species to detect signatures of positive selection, e.g., in chicken (Rubin et al., 2010), pig (Rubin et al., 2012;Frantz et al., 2015), sheep (Liu et al., 2016), and cattle (Liao et al., 2013;Choi et al., 2015). In Gir cattle, assessing the pooled heterozygosity of sliding windows has defined footprints of selection on genes associated with heat tolerance, and innate and adaptive immunity (Liao et al., 2013). In Fleckvieh cattle, genes associated with coat color and sensory perception (e.g., olfaction and taste) have also shown signatures of positive selection . The PPP1R12A gene involved in intramuscular fat content was also considered as a candidate of positive selection upon analyzing the full genome re-sequence of Hanwoo cattle (Choi et al., 2015).
Recently, we explored the genome of an indigenous EASZ from western Kenya population using genome-wide SNP data from the Illumina BovineSNP50 BeadChip v.1, and identified 24 candidate genome regions harboring signatures of positive selection . However, given the restricted genome coverage of this tool (Matukumalli et al., 2009), these likely represent only a subset of the positively selected regions. We now analyze the autosomes of the same population for signatures of positive selection using two highly informative genome-wide dataset; genome-wide SNP genotypes obtained from the Illumina BovineHD BeadChip and full genome resequencing data. Furthermore, we include zebu × taurine populations from Uganda and Nigeria in our SNP genotyping analysis to assess the presence of commonly selected regions across these populations. Our aim is to provide a complete picture of the genome landscape of positive selection footprints in EASZ and in particular attend to untangle, at genome level, selection for the environmental challenges and the two components of animals fitness; reproduction and survivability. The identified regions were further explored to define possible causative variants responsible for the associated signatures of selection.

Cattle Populations, SNP Genotyping, and Quality Control
Details of the cattle populations studied are presented at Table 1. They include 92 non-European introgressed small EASZ from the Western and Nyanza provinces of Kenya (KEASZ) as well as cattle from Uganda, Nigeria, Guinea, Europe, and India. They were genotyped for 777,962 SNPs mapped to the UMD3.1 bovine reference genome using the Illumina BovineHD Genotyping BeadChip.
Quality control (QC) analyses for 735,297 autosomal SNPs were conducted through the check. marker function implemented in the GenABEL package (Aulchenko et al., 2007) for R software version 2.15.1 (R development Core Team, 2012). SNPs with a minor allele frequency (MAF) <0.05 (n = 68,731) or call rate <95% (n = 18,667) were filtered out from the entire dataset. These include 1,712 SNPs that failed both criteria. In total, 649,611 SNPs were therefore retained. The ancestral allelic state for 373,005 SNPs [mean gap size = 6.7 kb and standard deviation (SD) = 12.1 kb] has been reported previously by Utsunomiya et al. (2013) following genotyping of three non-cattle Bovinae species: two B. gaurus (gaur), six Bubalus bubalis (water buffalo), and two B. grunniens (yak), with the fixed allele in the three species considered as ancestral. Only these SNPs were included in the downstream analyses.
Additional QC criteria included a minimum sample call rate of 95% and a maximum pairwise identity-by-state (IBS) of 95%, with the lower call rate animal eliminated from the high IBS pair. One ZS sample was excluded for having a low call rate, whilst 15 samples (two GIR, one NEL, four HOL, four JER, two AG, one AZ, and one WD) were excluded following the IBS criterion.

Principle Component Analysis (PCA)
PCA were conducted using the prcomp function implemented in GenABEL package for R software. These analyses were carried out in three levels; (i) all cattle populations, (ii) all cattle populations excluding the European taurine cattle, and (iii) only the KEASZ, Uganda (UGN) and Nigeria (NGR) cattle.

Estimation of Asian Zebu Autosomal Ancestry Proportion in African Zebu Cattle (Admixture Analysis)
Admixture analysis using ADMIXTURE 1.23 software (Alexander et al., 2009) with cross-validation and 200 bootstraps for K = 3 was conducted on the whole dataset to determine the European taurine, Asian zebu, and African taurine ancestries at genome-wide level and for each autosome separately. The output files were graphically displayed by the ggplot2 package (Wickham, 2009) for R software.
the combined reference cattle populations (NEL, GIR, NDM, MT, HOL, JER) using the rehh package (Gautier and Vitalis, 2012) for R software. The standardized Rsb-values were normally distributed ( Supplementary Figure 1), so a Z-test was applied to identify statistically significant SNPs under selection on KEASZ, UGN, and NGR cattle populations. One-sided upper-tail Pvalues were derived as 1-Φ(Rsb) from the Gaussian cumulative density function Φ. Candidate regions were defined as having five adjacent SNPs, not separated by more than 500 kb, passing the threshold of −log 10 P = 4. iHS analyses (Voight et al., 2006) were conducted on KEASZ, combined UGN cattle and combined NGR cattle populations using the rehh package for the R software. This statistic was calculated for SNPs that passed the QC criteria and exhibited a within-population MAF of at least 0.05, since the algorithm of iHS has a limited power to calculate the statistic for fixed SNPs. As with Rsb, the standardized iHS-values followed a normal distribution ( Supplementary Figure 1), so a two-tailed Z-test was applied to identify statistically significant SNPs under selection with either an unusual extended haplotype of ancestral or derived alleles relative to the genome. Two-sided P-values were derived as 1-2|Φ(iHS)-0.5| from the Gaussian cumulative density function Φ. Candidate regions were defined as in Rsb.
As a prerequisite to the Rsb and iHS analyses, fastPHASE 1.4 (Scheet and Stephens, 2006) was used to phase the genotyped SNPs into the corresponding haplotypes using K10 and T10 criteria. Population label information was used to estimate the phased haplotype background.

Inter-Population Change in SNP Allele Frequency ( AF)
AF analysis investigates absolute allele frequency difference between two populations (Carneiro et al., 2014). In this analysis, the mean frequency of the first allele was estimated for KEASZ, combined UGN cattle and combined NGR cattle populations, separately (AF pop1 ). Likewise, the mean frequency of the first allele for each SNP was calculated for the combined reference cattle populations (AF pop2 ). The standardized values of the AF (AF pop1 -AF pop2 ) were normally distributed (Supplementary Figure 1), therefore a Z-test was applied to identify statistically significant SNPs showing higher allele frequency in the first population. Two-sided P-values were derived as 1-2|Φ (standardized AF)-0.5| from the Gaussian cumulative density function Φ. Candidate regions were defined as in iHS and Rsb.

Meta-Analysis of Selection Signals (Meta-SS)
The Stouffer method was used to combine the P-values obtained from Rsb, iHS, and AF, for each analyzed set of populations, i.e., KEASZ, UGN, and NGR, in meta-SS analyses (Whitlock, 2005;Utsunomiya et al., 2013). Each value, for every SNP in each test, was transformed to a Z-score, F −1 (1-P-value), where F −1 is the inverse cumulative distribution function for a standard normal distribution. Then, the SNP-specific Z-scores were combined together according to the following equation: Z i = (Z Rsb + Z iHS + Z AF )/ √ k, where i and k are numbers of SNPs and tests, respectively. The resulting Z-scores were referred back to the standard normal distribution to obtain combined P-values [P-value = 1-F(Z i )], where Fis the cumulative distribution function for a standard normal distribution. Candidate regions were defined as having five adjacent SNPs not separated by more than 500 kb passing the threshold of −log 10 P = 4.

KEASZ Whole Genome Sequencing Analysis
A single pool of 10 unrelated KEASZ DNA samples were sequenced using an ABI SOLiD 4 genetic analyser (Supplementary Table 1). SOLiD 2 × 50 bp mate-paired libraries were constructed and sequenced in the Deep Seq facility at the University of Nottingham according to the manufacturer's instructions. Reads were mapped to the UMD3.1 bovine reference genome assembly (Elsik et al., 2009) using the LifeScope Genomic Analysis software 2.5.1 re-sequencing mapping pipeline (http://www.lifetechnologies.com/lifescope). SNPs and indels (insertions/deletions) were called using the diBayes package implemented in LifeScope. A minimum coverage of two uniquely mapped reads and two non-reference allele counts were required to call a variant. Additionally, a minimum read mapping quality of 20 (MAPQ20) and base quality of 20 were also implemented.
Pooled heterozygosity Hp of SNPs detected in the pooled 10 KEASZ full genome SOLiD sequences were calculated on 100 kb sliding windows with 10 kb incremental steps. Window sizes were extended by the number of uncovered bases to improve the accuracy of the calculation and consistency across windows. For each SNP in the window, the number of reads for the most and least frequent allele was counted (n MAJ and n MIN , respectively).
Hp-values were calculated using the following formula: Hp = 2 n MAJ n MIN /( n MAJ + n MIN ) 2 (Rubin et al., 2010). The autosomal Hp-values were Z-transformed (ZHp = Hpmean Hp/SD Hp) and a ZHp ≤ −4 was applied as a threshold to specify windows carrying selective sweep as in Liao et al. (2013). Overlapping candidate windows were merged into a single region.

KEASZ Exome Enrichment and Sequencing
The Agilent SureSelect XT target enrichment kit (cat no. G7530-90004) was used for bovine exome sequence enrichment. It covers a total of ∼45 Mb of the bovine sequence that composed of the coding regions in the UMD3.1 reference genome (coding regions from Refseq and Ensembl, no UTR "untranslated regions") and microRNA. The exomes of a further 10 KEASZ samples (Supplementary Table 1) were sequenced in the Deep Seq facility at the University of Nottingham using an ABI SOLiD 5500 genetic analyser. Using the LifeScope Enrichment Sequencing Pipeline, the generated 75 bp reads were mapped to the UMD3.1 bovine reference genome. Supplementary Table  2 summarizes the number of aligned reads, average depth of coverage, and percentage of reference exome for each sample. The same variants calling criteria used in the KEASZ whole genome sequence were implemented for the exome analysis except that reads with MAPQ ≥ 30 and base quality of 30 were used.

African Taurine (N'Dama and Muturu) Genome Sequence Analysis
To infer about the origin of the variants identified in the KEASZ selected regions, we analyzed 10 N'Dama cattle full genome sequences from Guinea available at the GenBank with the Bioproject accession number PRJNA312138 (Kim et al., 2017), and 10 Muturu cattle genome sequences from South East Nigeria. For the later, genomic DNA was extracted using Macherey-Nagel NucleoSpin R Tissue DNA extraction kit, according to manufacturer's protocol, from 10 ear tissue samples. One hundred and fifty bp paired-end libraries were constructed and sequenced using Illumina Hiseq2500 platform (Illumina, USA). Sequence reads were mapped to the UMD3.1 bovine reference genome assembly using the bwa-mem option of Burrows-Wheeler Alignment tool (BWA) version 0.7.5a (Li and Durbin, 2010;Supplementary

Candidate Selected Regions Characterization
Protein-coding and RNA genes mapped within the candidate regions, based on the UMD3.1 bovine reference genome annotation, were processed using the functional annotation tool implemented in DAVID Bioinformatics resources 6.7 to determine the over-represented (enriched) functional terms (Huang da et al., 2009a,b). An enrichment score of 1.3, which is equivalent to the Fisher exact test P = 0.05, was used as a threshold to define the significantly enriched functional terms in comparison to the whole bovine reference genome background. The list of genes mapped on the UMD3.1 reference bovine genome was obtained from the Ensembl Genes 86 database (Flicek et al., 2013) using the BioMart tool (Kinsella et al., 2011). The bovine Quantitative Trait Loci (QTL) and their UMD3.1 genome coordinates were downloaded from the cattle QTL database (http://www.animalgenome.org/cgi-bin/QTLdb/BT/index). The intersectBed function from the BedTools software was used to overlap these QTL with the identified KEASZ candidate regions (Quinlan and Hall, 2010). Candidate regions lacking annotated genes (protein-coding, RNA, and pseudogenes) based on UMD3.1 bovine reference genome were also identified. The intersectBed function from the BedTools software was also used to overlap these candidate regions with transcription factors binding sites identified previously on the bovine reference genome by Bickhart and Liu (2013).
Variants (SNPs and indels) in the genes within the genomewide SNP and Hp overlapping candidate regions were annotated using the variant effect predictor tool (McLaren et al., 2016) based on Ensembl variants 86 databases. Comparisons with the previously discovered bovine variants listed in the dbSNP database (Sherry et al., 2001) classified these variants into KEASZ-specific (novel) and general bovine variants. The identified missense, splice region SNPs, frameshift and splice region indels were cross-checked with variants identified in the KEASZ exome data and with N'Dama and Muturu full genome sequences.
Candidate genes within the overlapping genome-wide SNP and Hp candidate regions were selected based on their biological function, e.g., immunity, reproduction, fertility, heat tolerance and anatomical development. Variants (SNP and indels) within these genes were defined and the biological effect of the missense SNPs were predicted by the online tool PolyPhen-2 (Adzhubei et al., 2010).

Estimation of Excess-Deficiency in Asian Zebu Ancestry at Candidate Regions
LAMP software version 2.4 (Sankararaman et al., 2008) was used to estimate the Asian zebu and African taurine ancestry proportions of the high-density genotyped SNPs in the KEASZ samples. The genome-wide autosomal zebu ancestry proportion of 70% was obtained from the admixture proportions α of the ADMIXTURE analyses. An estimated number of 500 generations was set for the beginning of the zebu-taurine admixture in light of our current knowledge of zebu arrival on the continent, assuming a generation time of 6 years (Keightley and Eyre-Walker, 2000). A uniform recombination rate of 1 cM = 1 Mb was assumed. The average excess/deficiency in Asian zebu ancestry at each SNP ( AZ) was calculated by subtracting the average estimated Asian zebu ancestry of the SNP from the average estimated Asian zebu ancestry of all SNPs. The median AZ for the significant SNPs of KEASZ meta-SS analysis within candidate regions was considered.

The Admixed Genome Structure of African Cattle Populations
The PCA plot on all cattle populations using the filtered autosomal SNPs dataset ( Figure 1A) reveals the already described triangle-like 2-dimentional global organization of cattle genetic diversity (Gautier et al., 2010). The first component, which accounts for 20.6% of the total variation, separates the European (HOL and JER) and African (NDM and MT) taurine populations from the Asian zebu cattle (NEL and GIR). The second component, which accounts for 4.9% of the total variation, separates the African taurine cattle from the other cattle populations. All the East (KEASZ and UGN) and West (NGR) African cattle populations analyzed are positioned between the African taurine and Asian zebu cattle supporting zebu × taurine genome admixture.
The PCA conducted on Asian zebu (NEL and GIR), African taurine (NDM and MT), East African (KEASZ and UGN) and West African (NGR) cattle populations ( Figure 1B) shows genetic differentiation within the East African cattle (KEASZ and UGN) cattle and between the East and West African (NGR) cattle. The first component, which explains 13.9% of the total variation, separates Asian zebu from African taurine. Whilst, the second component, which explains 2.6% of the total variation, separate West African zebu cattle (NGR) and the East African (KEASZ and UGN) cattle. A PCA on KEASZ, UGN, and NGR cattle only, separates first the East (KEASZ and UGN) and West (NGR) cattle populations (PC1 = 2.99%), while PC2 (2%) reveal a much higher genetic heterogeneity within and across East African cattle populations compared to the West African ones (Supplementary Figure 2).
The admixture analysis indicates a relatively even admixed genome of Asian zebu and African taurine ancestries across KEASZ animals with an estimated genetic proportion of 0.7 ± 0.01 SD and 0.30 ± 0.01 SD, respectively. The same even admixed proportion is present across the Karamajong zebu cattle from Uganda, while the Serere zebu has an estimated 0.74 ± 0.04 SD and 0.24 ± 0.04 SD zebu × taurine genetic proportions, with minor European taurine ancestry (0.02 ± 0.01 SD). The genome of the Ankole cattle has a larger African taurine genetic proportion (0.48 ± 0.01 SD). Nganda cattle has an admixed genome of 0.57 ± 0.04 SD Asian zebu and 0.36 ± 0.03 SD African taurine ancestries, it also carries low level of European taurine ancestry (0.07 ± 0.05 SD) (Figure 2). This pattern of admixture is also observed in the cattle of Nigeria but with higher average African taurine ancestry (0.35 ± 0.01 SD) and lower zebu ancestry (0.65 ± 0.01 SD) than the East African cattle, except Ankole. No European introgression was detected in the West African zebu cattle analyzed.
Variation in Asian zebu ancestry is observed among the autosomes of the East (KEASZ and UGN) and West African (

Signatures of Positive Selection
High Density Genome-Wide SNP Analysis The iHS, Rsb, and AF analyses conducted on KEASZ reveal one, 19 and six autosomal candidate regions harboring signatures of positive selection, respectively (Figures 3A-C, Supplementary Table 5). As the tests follow normal distributions (Supplementary Figure 1) and their genome-wide average Pvalues are weakly correlated (Pearson correlation coefficient r ≤ 0.228; Supplementary Table 6), the P-values of each SNP for the three tests were combined in a meta-SS analysis revealing 98 candidate regions ( Figure 3D, Supplementary

Comparison with African Cattle from Uganda and Nigeria
A total of 32 KEASZ candidate regions are shared with the UGN cattle populations. We classify them as East African candidate regions. Twenty-two regions are also shared across the three comparisons and they are classified as East and West African candidate regions ( Table 3). The cross-validation of candidate regions between the three sets of zebu × taurine admixed cattle populations (KEASZ, UGN, and NGR) allow us to fine map KEASZ, East African shorthorn zebu from western Kenya; AO, Ankole; KR, Karamojong zebu; NG, Nganda; ZS, Serere zebu; NGR, West African zebu cattle from Nigeria. +: Substantial increase in Asian zebu ancestry (≥mean + one SD). -: Substantial decrease in Asian zebu ancestry (≤mean -one SD).
the size of 18 candidate regions. As indicated in Table 4, this approach allows us to narrow the size of these candidate regions to 94-894 kb, with the largest reduction in size (∼1.9 Mb) observed for a candidate region on BTA 12 ( Table 4).
Ten of the East African candidate regions intersect with regions under positive selection identified previously (Gautier et al., 2009;Gautier and Naves, 2011;Larkin et al., 2012;Kemper et al., 2014;Xu et al., 2015). Nine of these regions were identified in tropical-adapted cattle populations, such as Creole, Borgou, and Guzerat. Whilst four were found to be under positive selection in commercial cattle breeds, e.g., Holstein, Shorthorn, and Charolais.
For the 22 shared East and West African candidate regions, five were found to be under selection in tropical-adapted cattle (e.g., Gir, Creole, and Borgou) and four in commercial breeds (e.g., Angus, Holstein, and Shorthorn; Table 3).

Functional Characterization of High Density Genome-Wide SNP Candidate Regions
The 101 KEASZ candidate regions included 1,024 genes based on the UMD3.1 bovine reference genome annotation (Supplementary Table 9). These genes cluster into 110 functional clusters following DAVID functional term clusters enrichment analysis. Six of these clusters are significantly enriched relative to the bovine genome as indicated in Table 5. Candidate regions    by Bickhart and Liu (2013) are overlapping with any of these gene desert regions.

KEASZ Full Genome Sequencing Analysis
The 10 pooled KEASZ full genome sequences generated a total of 615,413,240 reads with MAPQ ≥ 20 (0.1% probability of incorrect alignment) mapped on the UMD3.1 bovine reference autosomes. These MAPQ20 reads covered ∼97% of the reference autosomes with an average of 11 times depth of coverage. SNP calling using LifeScope diBayes package identified a total of 10,466,699 SNPs (8,114,664 heterozygotes and 2,352,035 homozygotes). Regions with signatures of selective sweep were defined by assessing the pooled SNPs heterozygosity Hp of 100 kb windows, as in Liao et al. (2013), incremented by 10 kb. Supplementary Figure 7 shows the distribution of SNPs in the 100 kb autosomal windows with a mean of 297 SNPs per window. The mean Hpvalue is 0.42 (SD = 0.025). Out of the total 250,930 windows, 1,825 (∼0.73%) have a ZHp score of ≤ −4 merged into 165 autosomal candidate sweep regions (Figure 4, Supplementary Table 11). The largest region, ∼2 Mb in size, is on BTA 7 (51.4-53.4 Mb). This region contains windows with the lowest ZHp-value (ZHp = −16.6).
Based on the annotated UMD3.1 bovine reference genome, 518 genes are found within 133 of these sweep regions (Supplementary Table 12). DAVID analyses were conducted in two levels including: (i) genes within KEASZ Hp candidate regions and (ii) genes within all KEASZ candidate regions from the SNPs and Hp analyses combined together. The first DAVID analysis identifies 57 functional term clusters with six significantly enriched clusters (Table 6). Whilst, the second one defines 148 clusters with six significantly enriched clusters ( Table 6).
A total of 32 candidate regions are gene deserts (Supplementary Table 10). Five of these regions were present in UGN cattle and one in NGR cattle (genome-wide HD SNPs analysis). None of transcription factors binding sites, identified by Bickhart and Liu (2013), are overlapping with the gene desert regions identified here.

Overlapping Candidate Sweep Regions between Genome-Wide HD SNP and Hp Full Genome Sequence Analyses
Among the 165 candidate Hp sweep regions, 35 regions overlap with the genome-wide HD SNP candidate regions of KEASZ. These include 25 regions also revealed by the SNP meta-SS analysis in UGN cattle populations, in which seven regions were also shared between the East (KEASZ and UGN) and West African (NGR) cattle populations (Table 7). Also, our genome sequence analysis reveals 101 candidate regions not previously identified to be under positive selection in other studied cattle populations (Supplementary Table 11).
Within the 35 overlapping candidate regions, 185 genes are identified (Supplementary Table 13). DAVID analysis indicates 23 functional clusters with two significantly enriched functional clusters: response to hormone stimulus and signaling pathway (enrichment score = 2.07), and transcription regulation (enrichment score = 1.3). Also worth mentioning is a functional cluster associated with the immune system development and regulation, although it does not reach the 1.3 threshold (enrichment score = 1.24).
For the 25 East African overlapping sweep regions, 103 genes are found (Supplementary Table 13). They are grouped into four functional clusters: GTPase regulator activity (enrichment score = 1.17), protein complexes assembly (enrichment score = 0.76), regulation of transcription (enrichment score = 0.45), and nucleotides and ribonucleotides binding (enrichment score = 0.13), but none are significant. A total of 24 genes are within the seven overlapping regions across the East (KEASZ and UGN) and West African (NGR) populations (Supplementary Table 13). These genes are grouped into a single cluster associated with ion binding (enrichment score = 0.14).
A total of 11,915 SNPs in 148 genes and 484 indels in 96 genes are identified within the 35 overlapping genome-wide SNP and Hp candidate regions. These variants are either located on the coding region (missense, synonymous SNPs, and frameshift indels), or non-coding regions (intronic, splice regions, and 3 ′ and 5 ′ UTR SNPs and indels; Supplementary Table 14). A total of 261 SNPs in 50 genes and eight indels in seven genes have not been reported yet in the dbSNP database "novel variants." Among all the variants, 88 SNPs in 49 genes (one novel) are missense, 50 SNPs in 37 genes (none novel) are on splice regions, two indels in two genes (none novel) are frameshift and two indels in two genes (none novel) are on splice regions (Supplementary  Table 14).
Seventy-five missense SNPs in 44 genes were also identified in the KEASZ exome data spanning the overlapping regions. Twenty-three of these SNPs, in 13 genes, were not present in African taurine (Muturu and N'Dama), they are likely of zebu origin in KEASZ. Whilst, 15 missense SNPs in 12 genes were also identified in the two African taurine cattle populations examined and therefore they may be of taurine origin. Among the 50 splice regions SNPs, 44 in 32 genes are identified in the KEASZ exome data, in which seven SNPs in six genes may be of zebu origin and  ten SNPs in nine genes may be of taurine origin (Supplementary  Table 15). Only a single indel classified as frameshift and splice region indel has been identified in the KEASZ exome data. This indel has also been found in the genome of N'Dama and Muturu cattle and may be of taurine origin (Supplementary  Table 15).
Twelve genes within the overlapping genome-wide SNP and Hp candidate regions were selected as examples of interesting candidates of positive selection in KEASZ following their biological roles (Table 8). These genes have functional roles linked to traits for adaptation to the African environment and/or reproductive fitness, e.g., immunological-related traits (e.g., disease challenges) and reproduction-related traits (e.g., fertility).
Polymorphisms were identified in all these genes with the exception of the HSPB9 gene, which was found monomorphic. A total of 668 SNPs and 32 indels were detected (Supplementary  Table 16). In particular, four missense variants on three genes, one splice region SNP on one gene and one frameshift indel on one gene were identified (Table 9), in which all have been reported previously in the dbSNP database (Sherry et al., 2001). Both of the two missense variants in RXFP2 are considered as of probable zebu origin. Whilst the splice region SNP on SPATA24 is considered as of probable taurine origin ( Table 9, Supplementary Table  15).

Estimation of Excess-Deficiency in Asian Zebu Ancestry
LAMP software 2.4 (Sankararaman et al., 2008) estimated the mean Asian zebu ancestry proportion for all autosomal SNPs in KEASZ to be 0.76 (SD = 0.14). Based on this estimation, the mean AZ for autosomal SNPs is 0 (SD = 0.14). The majority of the candidate sweep regions show high zebu ancestry proportion, but similar to the mean Asian zebu ancestry proportion (within  Gautier et al., 2009;Chen et al., 2010;Flori et al., 2014 a ∆AZ = estimated excess/deficiency of the Asian zebu proportion. b The candidate regions were cross-referenced with the ones obtained previously on tropical-adapted cattle and commercial breeds. Bold (deviation by more than ±1 standard deviation from the autosomal mean ∆AZ). Commercial breeds studies. NA: No SNPs passed-log 10 (P-value) = 4 of EASZ meta-SS analysis. *Specific to East African cattle populations (KEASZ and Uganda). **Shared between East (KEASZ and Uganda) and West (Nigeria) African populations.
one SD from the mean). For the 32 East African candidate regions (Table 3), eight regions reveal substantial AZ, a single region shows deficiency and seven show excesses (more than ± one SD from the mean AZ). Moreover, six East and West African candidate regions (Table 3) demonstrate substantial AZ. Two of these regions show deficiencies and four show excesses of Asian zebu ancestry. When the genome-wide SNP and Hp analyses overlapping KEASZ candidate regions were considered, 13 regions demonstrate substantial AZ. One region, which is shared between East and West African cattle, show deficiency in Asian zebu ancestry. Whilst 12 regions show excess in Asian zebu ancestry, in which six are specific to East African cattle and one is shared between East and West African cattle ( Table 7).

DISCUSSION
In this study, we unravel the autosomal zebu × taurine admixed genome structure of African indigenous cattle populations from the eastern (Kenya and Uganda) and western (Nigeria) part of the African continent using genome-wide high density SNP data. Also, for the first time, both genome-wide SNP and full genome sequence data have been utilized to identify candidate signatures of positive selection in the genome of an African cattle, the EASZ from Kenya, based on meta-analysis of selection signals (meta-SS) and pooled heterozygosity (Hp) analysis. These regions were then further characterized to identify candidate causative variants and to assess their probable zebu or African taurine origins.

Genetic Structure of East and West African Cattle Populations
Archaeological and genetic evidences so far indicate that the history of African zebu cattle started with the introgression of Asian zebu to the native African taurine populations ∼4,000 and 1,300 years ago (Epstein, 1971;Hanotte et al., 2002). The coordinates of the African cattle samples (KEASZ, UGN, and NGR) in the PCA plots (Figures 1A,B) indicate zebu × taurine admixture level in their genome. This zebu introgression appears even across animals within population on the autosomes of KEASZ and also in West African cattle from Nigeria as well as in Ankole and Karamojong zebu cattle from Uganda (Figure 2). At the contrary, uneven European taurine introgression, likely of recent origin following ongoing crossbreeding of indigenous African cattle populations with exotic cattle breeds to improve their productivity (Mwai et al., 2015), is observed in the genome of Nganda and Serere zebu cattle from Uganda (Figure 2). Interestingly, we also observe positive significant correlation in Asian zebu ancestry autosomal proportion between the KEASZ and the other East and West African cattle populations examined. This is in agreement with the known history of zebu cattle on the continent (Hanotte et al., 2002) and it supports a common ancestry for the African cattle examined here. Our results also support an East African origin for the West African zebu cattle as previously showed in Hanotte et al. (2002) with migration of admixed zebu × taurine cattle populations.

Candidate Genomics Regions under Positive Selection
This study is the first to our knowledge that exhaustively investigated the genome of an indigenous African cattle population for signatures of positive selection using both high density genome-wide SNP and full genome sequence information. The outputs of these analyses represent a follow up of our previous identifications of signatures of selection on KEASZ genome using the lower density Illumina BovineSNP50 BeadChip v.1 . Here, we have validated 14 candidate regions out of 24 regions previously identified to be targeted by positive selection in KEASZ (Supplementary Table  17; . The remaining 10 regions might have been false positives. Indeed, in our previous study, we used a genomic tool characterized by high European taurine ascertainment bias and low genome coverage. This issue has been addressed in this study by using high-density HD SNP array and full genome sequence data. Moreover, we previously selected candidate regions based on only two SNPs above the significant threshold. Here we have been using much more stringent criteria with a minimum of five SNPs above the threshold to define a candidate region under positive selection (see Materials and Methods Section).
We also used a different strategy to increase the power of detecting genomic signatures of positive selection. First, instead of analyzing regions defined by each genome-wide SNP analyses (Rsb, iHS, and AF) separately, a composite statistical approach "meta-SS analysis" was conducted at the autosomal level as in Utsunomiya et al. (2013) to combine the SNP-specific P-values for each test into a single index.
Simulation data have shown that combining the signals from different tests into a single statistic increase the power of defining genomic regions under selection (Grossman et al., 2010). Most importantly, coalescent simulations and accurate calibrated demographic models, which are lacking in African cattle, are not required by meta-SS in comparison to the original Composite of Multiple Signals (CMS) method proposed by Grossman et al. (2010). Secondly, the reference cattle populations were pooled into a single population as in . As suggested by Gautier and Naves (2011), this pooling approach increases the haplotype diversity in the reference populations and it breakdown population-specific linkage disequilibrium (LD) which may result from genetic drift. Thirdly, information from both genome-wide SNP genotypes and full genome sequence were used to adequately cover the KEASZ genome, as well as, to address any breed ascertainment bias associated with the commercially available SNP arrays (Matukumalli et al., 2009). Overlapping candidate regions from these two approaches further support the identification of the candidate regions.

The Candidate Regions: A Result of Positive Selection or Genetic Drift?
Both selection and genetic drift may have shaped the genome of the KEASZ and the other populations examined here. Distinction between the two is difficult, and fixation or near fixation of allele through genetic drift may lead to false positive candidate regions for signatures of positive selection (Qanbari and Simianer, 2014). Comparison of the results between different cattle populations, the unraveling of the zebu or taurine origin of the selected regions and the function of genes within selected regions may here provide further information.
In this study, we observe a small number of candidate regions for positive selection shared between KEASZ HD SNP and/or genome sequencing information and the East and West African cattle population examined here (Tables 3, 7). While it may be argued that these overlapping candidate regions for positive selection are a consequence of common genetic backgrounds, the low number of shared candidate regions as well as the identification of the same regions in tropical-adapted cattle populations with different population histories, e.g., Creole cattle from Guadeloupe (Gautier and Naves, 2011), zebu × taurine admixed cattle from West Africa (Gautier et al., 2009;Flori et al., 2014;Xu et al., 2015), Brahman (Ramey et al., 2013;Xu et al., 2015) and Gir (Liao et al., 2013;Perez O'Brien et al., 2014) cattle (Tables 3, 7, and Supplementary Table 11) strongly supports the role of selection rather genetic drift for these shared regions. It underlines the importance that these genome regions may play a role in the adaptive traits of African tropical-adapted admixed cattle.
Examining the probable ancestral origin of the candidate regions reveals a subset of candidate regions with substantial AZ (Tables 3, 7). Most of these regions show excesses of Asian zebu ancestry, e.g.,  Mb, indicating that the indicine haplotypes are more likely to be under selection in the African admixed cattle populations than the taurine. This is perhaps not surprising considering the predominant zebu genomic background in EASZ . However, we also observed candidate regions showing substantial excess of African taurine ancestry, e.g., BTA 3: 76.1-76.4 Mb and BTA 19: 3.3-3.8 Mb (Table 3). These are present in chromosomes with overall low level of African taurine ancestry adding further support for selective pressure rather than genetic drift for their presence ( Table 2).

Biological Functions of the Genes Present in Signatures of Selection Regions in EASZ
Examining the potential biological functions under positive selection reveal several different significantly enriched biological pathways likely under selection in KEASZ given their importance for a cattle population living in a challenging tropical environment (e.g., bovine adaptive and innate immunity, response to hormone stimuli, intermediate filaments, and keratins pathways). Additionally, genes and QTL related to regulation of bovine immunity, fertility and reproduction, anatomical development, and heat stress have also been found within the identified KEASZ candidate regions (Supplementary  Tables 9, 12, 18).
Indeed, innate and adaptive immune genes may be expected to be primary targets of selection in African cattle that are exposed to a diversity of pathogens and associated physiological stresses in their surrounding environment, e.g., endoparasites, haemoparasites, and bacteria (de Clare Bronsvoort et al., 2013;Murray et al., 2013;Thumbi et al., 2014). Examples of candidate genes related to this category are: C-C chemokine receptor type 7 precursor (CCR7) and granulocyte macrophagecolony stimulating factor (CSF3). Upon binding to two chemoattractants: CCL19 and CCL21, CCR7 is involved in maturating dendritic cells and hence activate T lymphocytes (Marsland et al., 2005;Forster et al., 2008). This receptor has also demonstrated a role in regulating innate immunity by attracting macrophages to sites of infection (van Zwam et al., 2010). The multifunctional cytokine (CSF3) acts as a positive regulator for macrophages to induce their antimicrobial effects (Grabstein et al., 1986;Tarr, 1996). Interestingly, several trypanotolerance QTL, identified by Hanotte et al. (2003), overlap with the identified KEASZ candidate regions (Supplementary  Table 19). These QTL might indicate an undocumented level of trypanotolerance in KEASZ, as it has already been shown in other East African cattle populations (e.g., Orma Boran, Sheko and Mursi cattle; Dolan, 1987;Mwangi et al., 1993;Bahbahani and Hanotte, 2015).
Our results also show that besides tolerance to disease challenges, fertility and reproduction traits have been also selected in the African zebu × taurine admixed cattle. Examples of these candidate genes are: the retinoic acid receptor α subunit (RARA), relaxin/insulin-like family peptide receptor 2 (RXFP2), spermatogenesis associated 24 (SPATA24), and sperm associated antigen 7 (SPAG7). The retinoic acid receptor, which is expressed in sertoli cells in the seminiferous tubules, plays a role in maintaining retinoic acid signal to induce spermatogonia differentiation (Wolgemuth and Chung, 2007). RXFP2, which has been found to be under selection in admixed Creole cattle (Gautier and Naves, 2011) and Gir cattle (Liao et al., 2013), plays a role in testicular descent development (Gorlov et al., 2002;Agoulnik, 2007;Park et al., 2008;Feng et al., 2009). Candidate genes such as spermatogenesis-associated 24 (SPATA24) and sperm-associated antigen 7 (SPAG7) can also be classified into the cattle fertility and sexual reproduction category due to their role in spermatogenesis. Such signals may be the legacy of selection for fertility in hybrid populations between two cattle lineages (zebu and taurine), with a common ancestry perhaps as old as half a million years ago (MacHugh et al., 1997). This requires further investigation. QTL related to bovine reproduction and fertility, e.g., sperm motility and calving ease, are also target of positive selection in KEASZ.
An important gene category identified within the candidate regions is the heat stress category. Genes and QTL within this category might be targeted by natural selection to adapt to the tropical environmental condition (Hansen, 2004). Two heat shock protein genes (HSPA9 and HSPB9) and two members of the DnaJ family (DNAJC8 and DNAJC18) are mapped within three of the identified candidate regions (Table 8). Heat shock proteins have critical roles in maintaining protein folding under heat stress (Parsell and Lindquist, 1994;Coleman et al., 1995). The members of DnaJ family act as cofactors for other heat shock proteins (Hsp70) to maintain protein folding (Kampinga and Craig, 2010).
Genes and QTL related to anatomical development were also identified within the KEASZ candidate regions (e.g., LEMD3, LOX, and RXFP2). These genes are important to maintain optimum growth and development. LEMD3 and LOX are associated with the development of different organs, such as heart (LEMD3), lung and blood vessels (LOX) (Maki et al., 2005;Ishimura et al., 2008). The candidate region harboring LEMD3 has also been found to be under selection in Brahman cattle (Ramey et al., 2013). The role of RXFP2 in testicular descent development (Gorlov et al., 2002;Agoulnik, 2007;Park et al., 2008;Feng et al., 2009) can classify this gene into the anatomical development, fertility and reproduction as well as the heat tolerance categories. This gene has also been associated with the horn phenotype in sheep (Johnston et al., 2011;Kijas et al., 2012) and a study by Johnston et al. (2013) has demonstrated an association between variants of this gene and reproductive success and survival rate in Soay sheep from St. Kilda.
In addition, various production traits QTL, e.g., marbling score, milk fat percentage, and milk fat yield have also been found within the KEASZ candidate regions. Given that several of the identified KEASZ candidate regions overlap with regions under positive selection in commercial dairy and beef cattle breeds (Tables 3, 7, Supplementary Table S11; Larkin et al., 2012;Kemper et al., 2014;, this may illustrate possible human selection for some production traits, e.g., milk yield, has taken place in KEASZ at least in the past. In parallel to candidate regions harboring genes, the 39 gene desert candidate regions on KEASZ are targets of further research to define their biological roles. Although these regions do not contain any of the bovine transcription factors binding sites identified by Bickhart and Liu (2013), they may still harbor unannotated regulatory elements and/or genes targeted by positive selection. These regions may also be transcribed to generate long non-coding RNA transcripts (≥200 nucleotides), which could be further validated by RNA sequencing. This type of RNA molecules has critical roles in regulating the expression of neighboring genes at transcriptional and posttranscriptional levels (Mercer et al., 2009;Wang and Chang, 2011).

Putative Causative Variants in Candidate Regions
Several genomic variants, in which some are unique so far to KEASZ, have been identified within the KEASZ overlapping genome-wide SNP and Hp candidate regions. A subset of these variants, such as missense SNPs, framshift indels, and splice region SNPs/indels, can be considered as primary target of positive selection due to their functional roles on the corresponding genes. Interestingly, within the selected candidate genes four missense SNPs, a splice region SNP and a frameshift indel have been identified. These variants need to be confirmed in larger sample size of tropical-adapted (e.g., EASZ) and nontropical-adapted (e.g., European taurine) cattle to further support their role as causative variants under selection. Two of the missense variants are predicted to show non-benign effects on their genes calling for their effect to be validated, e.g., through gene editing approaches (Carlson et al., 2012;Wang et al., 2013). Moreover, variants on the other genes within the candidate regions or in non-coding regions following their possible roles in regulating gene expression may also be targets of selection.
Based on available sequences of KEASZ and African taurine cattle (N'Dama and Muturu), we were able to infer the possible ancestral origin of some of these functional variants (Asian zebu or African taurine origin) indicating the role of the zebu × taurine admixture as a selective force shaping the genome of KEASZ. However, this origin assignment was only partially successful due to the low number of sequenced samples and the unavailability of a full indicine de novo reference genome.

CONCLUSION
We reported here for the first time extensive exploring for candidate regions and putative causative variants under selection on the genome of an indigenous African cattle (East African shorthorn zebu) using both genome-wide SNP data and full genome sequence data. The possible ancestral origins of some of these variants have been inferred using exome data of EASZ from Kenya and full genome sequence of African taurine cattle. In this study we have defined three selective forces on the KEASZ genome; the external environmental pressures, the internal admixed genome pressure and possible ancient human selection for production traits. Our results can be considered as the first milestone in conserving the adaptive genetic resources of the indigenous African cattle and to further improve their breeding strategy. This will enhance the productivity of the indigenous African cattle populations and at the same time retain their African environment adaptability. Although in this study we have used two genomic tools and cattle populations from East and West of Africa, these results need to be further validated in larger number of cattle populations with different ancestries and environments.

DATA ACCESSIBILITY
The pooled full genome sequence and the 10 exome sequences of the East African shorthorn zebu, in addition to the Muturu full genome sequences, are publicly available from GenBank with the Bioproject accession number PRJNA386202. All identified SNPs are deposited in dbSNP under handel (UON_LAB_A100). The high density SNP genotyping data available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.38jp6.

ETHICS STATEMENT
Standard techniques were used to collect blood. The procedure was reviewed and approved by the University of Edinburgh Ethics Committee (reference number OS 03-06) and also by the Institute Animal Care and Use Committee of the International Livestock Research Institute, Nairobi.

AUTHOR CONTRIBUTIONS
HB and OH conceived, designed the experiment. HB and OH performed the experiment. HB, AT, and FA analysed the data. MB and DW helped in the bioinformatic analysis. TS, MW, HH, ON, AT, GA, CM, MM, and CV contributed in data. HB and OH wrote the manuscript. All authors have agreed on the contents of the manuscript.

FUNDING
We would like to extend our sincere gratitude to the Wellcome Trust (grant reference 07995) for financially supporting this project. USDA-ARS Animal Genome Improvement Laboratory provided funding through project 1265-31000-098-00.

ACKNOWLEDGMENTS
We would like to thank Geneseek veterinary diagnostics for providing invaluable technical assistance through the genotyping of the samples. Finally, we wish to acknowledge the grass root farmers of Western Kenya who participated fully and made this project a success. We would like to acknowledge the support of SRUL01/13 from Kuwait University for using their computer facility.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fgene. 2017.00068/full#supplementary-material