# WHY LIVESTOCK GENOMICS FOR DEVELOPING COUNTRIES OFFERS OPPORTUNITIES FOR SUCCESS

EDITED BY: Farai Catherine Muchadeyi, Eveline M. Ibeagha-Awemu, Johann Sölkner, Ardeshir Nejati Javaremi, Gustavo Augusto Gutierrez Reynoso, Joram Mwashigadi Mwacharo and Max F. Rothschild

PUBLISHED IN: Frontiers in Genetics

### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88963-944-1 DOI 10.3389/978-2-88963-944-1



Topic Editors:

Farai Catherine Muchadeyi, Agricultural Research Council of South Africa (ARC-SA), South Africa

Eveline M. Ibeagha-Awemu, Agriculture and Agri-Food Canada (AAFC), Canada

Johann Sölkner, University of Natural Resources and Life Sciences Vienna, Austria

Ardeshir Nejati Javaremi, University of Tehran, Iran

Gustavo Augusto Gutierrez Reynoso, National Agrarian University, Peru

Joram Mwashigadi Mwacharo, International Center for Agriculture Research in the Dry Areas (ICARDA), Ethiopia

Max F. Rothschild, Iowa State University, United States

Image: Shutterstock.com/Cameron Watson

Citation: Muchadeyi, F. C., Ibeagha-Awemu, E. M., Sölkner, J., Javaremi, A. N., Reynoso, G. A. G., Mwacharo, J. M., Rothschild, M. F., eds. (2020). Why Livestock Genomics for Developing Countries offers Opportunities for Success. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88963-944-1

# Table of Contents

*Editorial: Why Livestock Genomics for Developing Countries Offers Opportunities for Success*

Farai C. Muchadeyi, Eveline M. Ibeagha-Awemu, Ardeshir N. Javaremi, Gustavo A. Gutierrez Reynoso, Joram M. Mwacharo, Max F. Rothschild and Johann Sölkner

*Genetic Diversity and Population Structure of Ethiopian Sheep Populations Revealed by High-Density SNP Markers*

Zewdu Edea, Tadelle Dessie, Hailu Dadi, Kyoung-Tag Do and Kwan-Suk Kim

*Insertion/Deletion Within the KDM6A Gene Is Significantly Associated With Litter Size in Goat*

Yang Cui, Hailong Yan, Ke Wang, Han Xu, Xuelian Zhang, Haijing Zhu, Jinwang Liu, Lei Qu, Xianyong Lan and Chuanying Pan

Magretha D. Pierce, Kennedy Dzama and Farai C. Muchadeyi

*Corrigendum: Genetic Diversity of Seven Cattle Breeds Inferred Using Copy Number Variations*

Magretha D. Pierce, Kennedy Dzama and Farai C. Muchadeyi

*Functional Partitioning of Genomic Variance and Genome-Wide Association Study for Carcass Traits in Korean Hanwoo Cattle Using Imputed Sequence Level SNP Data*

Mohammad S. A. Bhuiyan, Dajeong Lim, Mina Park, Soohyun Lee, Yeongkuk Kim, Cedric Gondro, Byoungho Park and Seunghwan Lee

*Genomics for Ruminants in Developing Countries: From Principles to Practice*

Vincent Ducrocq, Denis Laloe, Marimuthu Swaminathan, Xavier Rognon, Michèle Tixier-Boichard and Tatiana Zerjal

*Whole-Genome Resequencing of Red Junglefowl and Indigenous Village Chicken Reveal New Insights on the Genome Dynamics of the Species*

Raman A. Lawal, Raed M. Al-Atiyat, Riyadh S. Aljumaah, Pradeepa Silva, Joram M. Mwacharo and Olivier Hanotte

*Incorporating Prior Knowledge of Principal Components in Genomic Prediction*

Sayed M. Hosseini-Vardanjani, Mohammad M. Shariati, Hossein Moradi Shahrebabak and Mojtaba Tahmoorespur

*Genome-Wide Characterization of Selection Signatures and Runs of Homozygosity in Ugandan Goat Breeds*

Robert B. Onzima, Maulik R. Upadhyay, Harmen P. Doekes, Luiz F. Brito, Mirte Bosse, Egbert Kanis, Martien A. M. Groenen and Richard P. M. A. Crooijmans

Lin Ma, Meng Zhang, Yunyun Jin, Sarantsetseg Erdenee, Linyong Hu, Hong Chen, Yong Cai and Xianyong Lan

*Microsatellite-Based Genetic Structure and Diversity of Local Arabian Sheep Breeds*

Raed M. Al-Atiyat, Riyadh S. Aljumaah, Mohammad A. Alshaikh and Alaeldein M. Abudabos

Raphael Mrode, Julie M. K. Ojango, A. M. Okeyo and Joram M. Mwacharo

*Detection of Selection Signatures Among Brazilian, Sri Lankan, and Egyptian Chicken Populations Under Different Environmental Conditions*

Muhammed Walugembe, Francesca Bertolini, Chandraratne Mahinda B. Dematawewa, Matheus P. Reis, Ahmed R. Elbeltagy, Carl J. Schmidt, Susan J. Lamont and Max F. Rothschild

*Livestock Genomics for Developing Countries – African Examples in Practice*

Karen Marshall, John P. Gibson, Okeyo Mwai, Joram M. Mwacharo, Aynalem Haile, Tesfaye Getachew, Raphael Mrode and Stephen J. Kemp

*Leveraging Available Resources and Stakeholder Involvement for Improved Productivity of African Livestock in the Era of Genomic Breeding*

Eveline M. Ibeagha-Awemu, Sunday O. Peters, Martha N. Bemji, Matthew A. Adeleke and Duy N. Do

*Performance Evaluation of Highly Admixed Tanzanian Smallholder Dairy Cattle Using SNP Derived Kinship Matrix*

Fidalis D. N. Mujibi, James Rao, Morris Agaba, Devotha Nyambo, Evans K. Cheruiyot, Absolomon Kihara, Yi Zhang and Raphael Mrode

*Natural Selection Footprints Among African Chicken Breeds and Village Ecotypes*

Ahmed R. Elbeltagy, Francesca Bertolini, Damarius S. Fleming, Angelica Van Goor, Chris M. Ashwell, Carl J. Schmidt, Donald R. Kugonza, Susan J. Lamont and Max F. Rothschild

*Genome Analysis Reveals Genetic Admixture and Signature of Selection for Productivity and Environmental Traits in Iraqi Cattle*

Akil Alshawi, Abdulameer Essa, Sahar Al-Bayatti and Olivier Hanotte

# Editorial: Why Livestock Genomics for Developing Countries Offers Opportunities for Success

Farai C. Muchadeyi<sup>1</sup>\*, Eveline M. Ibeagha-Awemu<sup>2</sup>, Ardeshir N. Javaremi<sup>3</sup>, Gustavo A. Gutierrez Reynoso<sup>4</sup>, Joram M. Mwacharo<sup>5</sup>, Max F. Rothschild<sup>6</sup> and Johann Sölkner<sup>7</sup>

<sup>1</sup> Agricultural Research Council-Biotechnology Platform, Pretoria, South Africa, <sup>2</sup> Agriculture and Agri-Food Canada (AAFC), Sherbrooke, QC, Canada, <sup>3</sup> Department of Animal Science, University of Tehran, Tehran, Iran, <sup>4</sup> Programa de Mejoramiento Animal, Universidad Nacional Agraria La Molina, Lima, Peru, <sup>5</sup> International Center for Agricultural Research in the Dry Areas (ICARDA), Addis Ababa, Ethiopia, <sup>6</sup> Department of Animal Science, Iowa State University, Ames, IA, United States, <sup>7</sup> Department of Sustainable Agricultural Systems, Division of Livestock Sciences, University of Natural Resources and Life Sciences, Vienna, Austria

Keywords: indigenous livestock, genetic adaptation, genomics, production systems, developing countries

### **Editorial on the Research Topic**

### **Why Livestock Genomics for Developing Countries Offers Opportunities for Success**

In the developing world, rural farmers rely on local breeds to sustain their livelihoods and to buffer the effects of adverse environments and resource shortages. Local breeds appear to be adapted to numerous environmental stressors, including worsening droughts characterized by extreme temperatures and debilitating disease challenges, which typify low-input production systems. Breeding and genetics research programs are striving to develop robust animals that are adapted to local conditions and can produce at optimal and sustainable levels under constrained environments. Elucidating the intertwined relationship between production environments and the genetics of animals, with the aim of establishing selection priorities and developing suitable improvement strategies, is critical. Previously, livestock improvement programs have failed to realize expected gains because of a lack of performance data, pedigree records and funding, compounded by factors such as uncontrolled livestock breeding on communal pastures. Advances in livestock genomics have facilitated the generation of "big data" in genetics through the advent of whole genome/transcriptome sequencing, genome assemblies and genome-wide SNP genotyping. Despite the room for genetic gains in local breeds and the anticipated higher impact of genomics-assisted breeding and selection, developing countries still lag behind in the uptake of genomic technologies. This Research Topic addresses the need for livestock genomics for developing countries through review articles, original research articles and considerations of future opportunities.

The Research Topic yielded 23 articles, comprising five reviews and 18 original research articles, covering the major livestock species kept in developing countries: cattle (seven papers), sheep (five papers), goats (three papers), and chickens (three papers). The manuscripts cover a broad range of genomic applications such as genomic selection/assisted breeding, genome-wide association analysis, diversity studies with a particular emphasis on adaptive genetic variation and signatures of selection analysis, and some elements of functional genomics using RNA sequencing and differential gene expression profiling. Although a broad range of genomic applications is covered, there is a bias toward genomic diversity studies. This bias points to the limited use of other genomic applications, owing to the constraints on data collection and funding that characterize most developing countries and that are highlighted in some of the review articles.

### Edited and reviewed by:

Guilherme J. M. Rosa, University of Wisconsin-Madison, United States

> \*Correspondence: Farai C. Muchadeyi muchadeyif@arc.agric.za

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 18 April 2020 Accepted: 26 May 2020 Published: 26 June 2020

### Citation:

Muchadeyi FC, Ibeagha-Awemu EM, Javaremi AN, Gutierrez Reynoso GA, Mwacharo JM, Rothschild MF and Sölkner J (2020) Editorial: Why Livestock Genomics for Developing Countries Offers Opportunities for Success. Front. Genet. 11:626. doi: 10.3389/fgene.2020.00626


The reviews provide an overview of the current and potential applications of genomics in developing countries, the opportunities offered by supporting technologies such as reproductive technologies, and the challenges and possible solutions of applying genomics in a developing country context. According to Mrode et al., genotypic data can provide solutions for parent verification, breed composition determination and genetic evaluation for smallholder farmers. The review by Mrode et al. also highlights the major problem of small reference populations, which could be overcome by cross-regional genomic prediction programs that pool data from multiple countries. The review by Ducrocq et al. explores challenges facing developing countries, including limited genotyping capacity, poor data management, multiple breeding goals arising from exposure to unfavorable conditions such as heat and diseases, the need for special attention to fitness traits, and limited expertise to drive genomics programs. Marshall et al. present case studies from Africa on the application of livestock genomics, including the identification and development of unique breeds in the region. This review also looks at the role of genomic studies of African livestock in understanding the genetics of particular diseases and at the potential of technologies such as gene editing in disease management. The review by Van Marle-Köster and Visser highlights the benefits of the dual system in South Africa, where a highly developed commercial sector using the most recent technologies exists alongside a smallholder and communal sector, and shows how resources can be harnessed to advance both sectors. This review also highlights the importance of national animal recording schemes and government funding to ensure progress in driving the application of genomics across the two sectors. In line with this, Ibeagha-Awemu et al. discuss the importance of leveraging available resources and stakeholder involvement for coordinated improvement of livestock production in Africa. The review further highlights in-depth approaches that can enable the application of genomic technologies for rapid improvement of livestock traits of economic importance in the era of genomic breeding.

The first set of original research papers presents case studies of genomic selection and genome-wide association analysis. Hosseini-Vardanjani et al. evaluate the gain in accuracy of genomic evaluations using multi-breed reference populations and demonstrate the utility of incorporating prior knowledge of principal components in genomic prediction, as well as the potential of a multi-breed reference population to enhance prediction accuracies. In the absence of conventional genetic evaluations and selection, Mujibi et al. and Cheruiyot et al. use genomic data to understand breed composition and relate it to production performance. Genome-wide association studies are often challenging because they need very large numbers of experimental units with good phenotypes. A genome-wide association study (GWAS) by Xu et al. highlights the potential for different genetic mechanisms for litter size among sheep breeds. Nazari-Ghadikolaei et al. identify candidate genes for coat color and mohair traits in Iranian Markhoz goats through a GWAS. Bhuiyan et al. use imputed sequence-level SNP data in a GWAS to identify variants in genic and exon regions significantly associated with carcass traits in Korean Hanwoo cattle.

The second set of original research articles describes common applications of genomics in smallholder livestock systems of developing countries, such as analysis of the level of admixture and investigation of signatures of selection in cattle (Chagunda et al.; Alshawi et al.), sheep (Ahbara et al.; Al-Atiyat et al.; Edea et al.), goats (Onzima et al.; Cui et al.) and native chickens (Elbeltagy et al.; Walugembe et al.; Lawal et al.). Finally, Pierce et al. investigate copy number variations (CNVs), which have recently gained prominence as a genomic tool, to ascertain genetic diversity and population structure in South African cattle.

Only one study focuses on functional genomics, using RNA-Seq and differential gene expression analysis to investigate genetic and molecular mechanisms underlying traits of importance in sheep (Ma et al.). This probably reflects the complexities of setting up transcriptome experiments in largely uncontrolled smallholder farming systems of developing countries.

Overall, the topic demonstrates the utility of genomics in diverse applications across species and geographical regions of developing countries, and the opportunities that exist for the future.

# AUTHOR CONTRIBUTIONS

FM and JS initiated the Research Topic and invited MR, EI-A, AJ, GG, and JM as topic co-editors. FM drafted the Research Topic editorial. All authors participated in the editorial process of this Research Topic, revised, and approved the final draft of the editorial.

# ACKNOWLEDGMENTS

The introductory section of the editorial was adapted from Muchadeyi (2019).

# REFERENCES

Muchadeyi, F. C. (2019). "Application of genomics to resolve livestock production and adaptation issues in developing countries," in Abstract, 37th International Society of Animal Genetics Conference Proceeding, 61. Available online at: https://www.isag.us/Docs/Proceedings/ISAG2019\_Proceedings.pdf.

**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Muchadeyi, Ibeagha-Awemu, Javaremi, Gutierrez Reynoso, Mwacharo, Rothschild and Sölkner. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genetic Diversity and Population Structure of Ethiopian Sheep Populations Revealed by High-Density SNP Markers

Zewdu Edea<sup>1</sup>, Tadelle Dessie<sup>2</sup>, Hailu Dadi<sup>3</sup>, Kyoung-Tag Do<sup>4</sup> and Kwan-Suk Kim<sup>1</sup>\*

<sup>1</sup> Department of Animal Science, Chungbuk National University, Cheongju, South Korea, <sup>2</sup> International Livestock Research Institute, Addis Ababa, Ethiopia, <sup>3</sup> Ethiopian Biotechnology Institute, Addis Ababa, Ethiopia, <sup>4</sup> Department of Animal Biotechnology, Faculty of Biotechnology, Jeju National University, Jeju, South Korea

### Edited by:

Max F. Rothschild, Iowa State University, United States

### Reviewed by:

Steffen Weigend, Friedrich Loeffler Institut, Germany

Tosso Leeb, University of Bern, Switzerland

> \*Correspondence: Kwan-Suk Kim kwanskim@chungbuk.ac.kr

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 04 October 2017 Accepted: 05 December 2017 Published: 22 December 2017

### Citation:

Edea Z, Dessie T, Dadi H, Do K-T and Kim K-S (2017) Genetic Diversity and Population Structure of Ethiopian Sheep Populations Revealed by High-Density SNP Markers. Front. Genet. 8:218. doi: 10.3389/fgene.2017.00218

Sheep in Ethiopia are adapted to a wide range of environments, including extreme habitats. Elucidating their genetic diversity is critical for improving breeding strategies and mapping quantitative trait loci associated with productivity. To this end, the present study investigated the genetic diversity and population structure of five Ethiopian sheep populations exhibiting distinct phenotypes and sampled from distinct production environments, including arid lowlands and highlands. To investigate the genetic relationships in greater detail and infer population structure of Ethiopian sheep breeds at the continental and global levels, we analyzed genotypic data of selected sheep breeds from the Ovine SNP50K HapMap dataset. All Ethiopian sheep samples were genotyped with the Ovine Infinium HD SNP BeadChip (600K). Mean genetic diversity ranged from 0.29 in Arsi-Bale to 0.32 in Menz sheep, while estimates of genetic differentiation among populations ranged from 0.02 to 0.07, indicating low to moderate differentiation. An analysis of molecular variance revealed that 94.62 and 5.38% of the genetic variation was attributable to differences within and among populations, respectively. Our population structure analysis revealed clustering of the five Ethiopian sheep populations according to tail phenotype and geographic origin: short fat-tailed (very cool high-altitude), long fat-tailed (mid to high-altitude), and fat-rumped (arid low-altitude), with clear evidence of admixture between long fat-tailed populations. North African sheep breeds showed higher levels of within-breed diversity, but were less differentiated than breeds from Eastern and Southern Africa. When African breeds were grouped according to geographic origin (North, South, and East), statistically significant differences were detected among groups (regions). A comparison of population structure between Ethiopian and global sheep breeds showed that fat-tailed breeds from Eastern and Southern Africa clustered together, suggesting that these breeds were introduced to the African continent via the Horn and migrated further south.

Keywords: Ethiopian sheep, high-density chip, genetic diversity, population structure, SNP

# INTRODUCTION


In the Horn of Africa, and especially in Ethiopia where the economy is predominantly agriculture-based, sheep and their products play a critical role in the livelihood of millions of farmers and pastoralists (Wilson, 2011). Sheep serve as a source of income, mutton, and manure; provide an economic buffer in the event of crop failures; and fulfill many other sociocultural functions. In some areas such as the cool alpine and arid lowlands where crop production is not a viable economic option, sheep production is the sole option for livelihood (Tibbo, 2006). Sheep are also important for the national economy; indigenous populations have evolved in diverse and harsh environments in Ethiopia where they face disease and parasite burdens, feed shortage, and extreme temperatures. Consequently, these animals likely harbor gene variants uniquely adapted to specific environmental conditions that may not be present in commercial breeds. The economic and agricultural value of sheep is expected to increase as a result of climate change (Seo, 2008; Seo et al., 2010); genetic characterization of local breeds adapted to extreme environments using modern genomic tools can ensure the breeding of hardy sheep populations (Boettcher et al., 2015; Yang et al., 2016).

Given its proximity to the Arabian Peninsula, Ethiopia is considered a genetic corridor for the introduction of livestock species, including sheep, to the African continent (Hanotte et al., 2002; Muigai and Hanotte, 2013). Extensive hybridization has occurred between sheep breeds introduced at various times via different routes, making the Horn of Africa in general and Ethiopia in particular an excellent resource for the study of genetic diversity in domestic livestock breeds. The ecological, climatological, ethnic, and cultural diversity of Ethiopia is reflected in its large sheep population (25.5 million head) (Leta and Mesele, 2014), which can be phenotypically classified into 14 native populations (Gizaw et al., 2007) in addition to populations distributed along the northern, southwestern, and western borders of the country that have yet to be described. These local populations are mainly named after the geographic location or ethnic group/community rearing them, or based on phenotypic characteristics; for instance, the 14 Ethiopian sheep populations are broadly categorized according to their tail phenotypes as thin-tailed (one breed), fat-tailed (11 populations), and fat-rumped (two populations) (Gizaw et al., 2007). The short fat-tailed population mainly inhabits the sub-alpine regions; long fat-tailed sheep are predominant in mid- to high-altitude environments; and fat-rumped sheep are distributed in dry lowland areas (Gizaw et al., 2007).

Characterizing genetic diversity is a key aspect of developing sustainable breed improvement strategies (Groeneveld et al., 2010) and understanding adaptation to extreme environments (Boettcher et al., 2015). Although several studies have investigated the origin of African sheep breeds, many breeds and populations have yet to be fully characterized. The genetic diversity and population structure of Ethiopian sheep populations have been examined using non-recombinant (mitochondrial DNA) and selection-neutral markers (Gizaw et al., 2007; Helen, 2015). However, the microsatellite-based studies provided only a limited global picture because they included only local Ethiopian sheep breeds. In general, at the African continent level, there have been far fewer studies on sheep diversity and population structure using genome-wide nuclear markers than using non-recombinant markers (Muigai, 2003a; Bruford and Townsend, 2006; Aswani, 2007; Horsburgh and Rhines, 2010; Helen, 2015). It was recently reported that genetic diversity estimated using microsatellites was not correlated with genome-wide single nucleotide polymorphism (SNP) diversity estimates, with larger genetic differentiation values obtained by the former approach (Ciani et al., 2013; Fischer et al., 2017). On the other hand, the large number of genome-wide SNP markers makes them superior to microsatellites for inferring population structure (Glover et al., 2010; Gärke et al., 2012). The recently developed genome-wide high-density ovine SNP array has provided a tool for investigating genetic diversity at a high resolution, inferring population history, and mapping genomic regions subject to selection and adaptation (Kijas et al., 2009; Yang et al., 2016; Zhao et al., 2017). Despite the richness of Ethiopia's sheep genetic resources, only one population was represented in previous genome-wide global sheep analyses (Kijas et al., 2009). Therefore, the extent of genetic variation and patterns of admixture are not known for most Ethiopian sheep populations. Additionally, polymorphisms in the Ovine HD chip in non-reference African/Ethiopian sheep populations have not been identified or validated.

The present study provides the first analysis of high density (∼600K) ovine SNPs in Ethiopian sheep breeds. We sampled and genotyped five Ethiopian sheep populations adapted to diverse agro-ecologies using the Infinium HD SNP BeadChip (600K). A detailed understanding of the genetic landscape of national populations requires sampling of representative breeds from wider geographic regions, particularly from a center of domestication and along migration routes (Zhao et al., 2017). To establish historical patterns of admixture and the genetic relatedness of Ethiopian sheep breeds on a broader geographic scale, we compared these breeds with 12 others extracted from Ovine SNP50K HapMap datasets as well as one from Morocco, the data for which was generated by the NextGen Consortium. Two North African sheep breeds (Egyptian Barki and Moroccan) were not previously analyzed (Kijas et al., 2012) but were included here to examine their genetic influence on Ethiopian/East African sheep genetic composition.

# MATERIALS AND METHODS

# Breeds/Populations and Samples

Nasal samples were collected using Performagene LIVESTOCK's nasal swab DNA collection kit (DNA Genotek, Kanata, ON, Canada) from a total of 72 animals representing five Ethiopian sheep populations: Arsi-Bale, Horro, Menz, Adilo, and Blackhead Somali. Three of these (Horro, Arsi-Bale, and Adilo) are long fat-tailed hairy sheep; Menz is a short fat-tailed coarse-wool sheep; and Blackhead Somali belongs to the fat-rumped group (Gizaw et al., 2008). Both female and male animals were randomly sampled from multiple flocks.

Blackhead Somali sheep (also known as Blackhead Ogaden or Berbera Blackhead) exist at low altitudes (500–1000 m above sea level [a.s.l.]) and are well adapted to arid and semi-arid environments characterized by high ambient temperature, low precipitation (200–400 mm), and recurrent drought (Wilson, 1991). The breed is distinguished by the absence of horns in both sexes, a black head and neck with a white body and limbs, and a fat rump (Wilson, 2011). It is reared across the Horn of Africa (Ethiopia, Djibouti, Somalia, Kenya, and Sudan) under a mobile pastoral management system that involves heavy heat stress, long walks in search of pasture and water, long watering intervals, and few health management practices. In contrast, three of the populations (Horro, Arsi-Bale, and Menz) are reared under sedentary farming systems. Horro sheep are mainly distributed throughout western and southwestern parts of the country, inhabiting mid- to high-altitude (1400–2000 m a.s.l.) areas with a mean precipitation of 1000–2000 mm. Horro sheep are characterized by a larger body size and higher twinning rate than other indigenous breeds (Gizaw et al., 2013). Arsi-Bale is the predominant breed in the eastern and south-central parts of Ethiopia, spreading from the Central Great Rift Valley to the Bale mountains (>3000 m a.s.l.). Menz sheep have a relatively small body size with an average live weight of 20.1 ± 0.3 kg, are raised for meat and coarse wool production, and are well adapted to cool highland areas (2500–3000 m a.s.l.) (Haile et al., 2002; Gizaw et al., 2008; Getachew et al., 2015). The Adilo (Wolaita) sheep breed is distributed in southern Ethiopia and is characterized by a long fat tail and large body size (Melesse et al., 2013). Phenotypic descriptions and environmental variables of the study sheep populations are summarized in **Table 1**.

To compare the genetic relationship between sheep breeds in Ethiopia and those on other continents and investigate historical patterns of admixture, we also used genotype data of 228 animals representing 12 breeds from North Africa, Middle East, South Africa, Europe, and Asia from the Ovine HapMap project (International Sheep Genomics Consortium<sup>1</sup> ). We also included Moroccan sheep data generated by the NextGen Consortium<sup>2</sup> . Details regarding sample sizes, breeds, and geographic origins are summarized in **Table 2**.

# Genotyping, Quality Control, and Markers Screening

Ethiopian sheep samples were genotyped with the Ovine Infinium HD BeadChip (Illumina, San Diego, CA, United States) by GeneSeek/Neogen (Lincoln, NE, United States). Among the 606,006 SNPs, 577,401 were autosomal, 1291 were unmapped to any ovine chromosome (OAR), and 27,314 were located on the X chromosome.

TABLE 1 | Summary of phenotypic characteristics and environmental variables of the study sheep populations.

TABLE 2 | Diversity indices in 18 sheep breeds estimated from different single nucleotide polymorphism (SNP) datasets.

HE, expected heterozygosity; HO, observed heterozygosity; PI\_HAT, average relatedness; SW, South-west.

<sup>1</sup>http://www.sheephapmap.org/download.php

<sup>2</sup>http://projects.ensembl.org/nextgen/

Autosomal SNPs with call rates <90% or minor allele frequency (MAF) <0.01 were filtered out, leaving 497,294 SNPs with average and median gaps of 4.92 and 3.58 kb, respectively. Additionally, 11 samples with call rates ≤ 85% were excluded from further analysis. To test for potential effects of ascertainment bias on diversity index estimates, the 497,294 SNPs were subjected to linkage disequilibrium (LD) pruning using the parameter (50 5 0.20), yielding 80,602 SNPs.
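As a rough illustration of the SNP-level filter described above, the following Python sketch applies the two thresholds (call rate ≥ 90%, MAF ≥ 0.01) to toy data. The genotype coding (0, 1, or 2 copies of one allele, with `None` for a missing call), the function names, and the SNP names are assumptions made for illustration only; they do not reproduce the software used in the study.

```python
def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1, or 2 copies of one allele."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def passes_qc(genotypes, min_call_rate=0.90, min_maf=0.01):
    """Keep a SNP only if call rate >= 90% and MAF >= 0.01."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


# Hypothetical toy data: one list of genotypes per SNP.
snps = {
    "snp_ok": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1],
    "snp_monomorphic": [0] * 10,
    "snp_low_call": [0, 1, None, None, 2, 1, 0, 1, 2, 0],
}
kept = [name for name, g in snps.items() if passes_qc(g)]
```

In this toy example only `snp_ok` survives: the monomorphic SNP fails the MAF threshold and the third SNP has a call rate of 80%. The same call-rate function could be applied per animal to mimic the sample-level exclusion (call rate ≤ 85%).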

Genotypic data for the 600K and 50K platforms were merged using SNP and Variation Suite v.8.5.0 (Golden Helix, Bozeman, MT, United States<sup>3</sup>). A total of 41,752 SNPs overlapping between the two platforms were filtered according to quality control criteria; SNPs with call rates <90% or MAF <0.01 were removed, leaving 40,770 SNPs for subsequent analyses. A total of 6163 SNPs remained for population structure analysis after the 40,770 SNPs were pruned within each population based on LD using the parameter (50 5 0.80).
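The pruning step can be sketched as a greedy windowed procedure. The triplets (50 5 0.20) and (50 5 0.80) are read here as a window of 50 SNPs, a step of 5 SNPs, and an r² threshold; that reading is an assumption based on the common windowed-pruning convention, and the rule for which SNP of a correlated pair is dropped (here, the later one) is also an assumption rather than the documented behaviour of the software used in the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, window=50, step=5, r2_max=0.20):
    """Return indices of SNPs retained after greedy pairwise pruning.

    genotypes: list of SNPs in map order, each a list of 0/1/2 calls.
    """
    keep = [True] * len(genotypes)
    start = 0
    while start < len(genotypes):
        end = min(start + window, len(genotypes))
        for i in range(start, end):
            if not keep[i]:
                continue
            for j in range(i + 1, end):
                if keep[j] and r_squared(genotypes[i], genotypes[j]) > r2_max:
                    keep[j] = False  # drop the later SNP of the pair
        start += step
    return [i for i, flag in enumerate(keep) if flag]


# Hypothetical toy data: SNP 1 duplicates SNP 0, SNP 2 is uncorrelated.
toy = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [1, 0, 1, 1, 2, 1],
]
retained = ld_prune(toy)
```

A stricter threshold (0.20) removes more SNPs than a lenient one (0.80), which is consistent with the larger relative reduction reported for the 600K dataset.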

# Statistical Analysis

### Genetic Diversity

Minor allele frequency and deviation from Hardy–Weinberg equilibrium (HWE) were estimated per SNP for each of the five Ethiopian sheep populations using SNP and Variation Suite v.8.5.0. Alleles were categorized into bins based on their frequency: fixed alleles (MAF = 0.00), rare alleles (>0.00–<0.05), intermediate alleles (≥0.05–<0.10), and common alleles (≥0.10 and ≤0.5). Diversity indices were estimated from three datasets: (i) 497,294 SNPs that passed the quality control threshold of MAF ≥ 0.01 and call rate ≥ 90%; (ii) 80,602 SNPs that remained after pruning the 497,294 SNPs based on LD using the parameter (50 5 0.20) in SNP and Variation Suite v.8.5.0; and (iii) 40,770 SNPs common to the 600K and 50K platforms.
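The frequency bins defined above translate directly into a small classification function. The sketch below is illustrative only; the function name and the example MAF values are hypothetical.

```python
from collections import Counter


def maf_category(maf):
    """Assign a MAF value to one of the four bins used in the text."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf == 0.0:
        return "fixed"
    if maf < 0.05:
        return "rare"
    if maf < 0.10:
        return "intermediate"
    return "common"


# Hypothetical MAF values for five SNPs in one population.
example_mafs = [0.0, 0.02, 0.07, 0.30, 0.50]
bin_counts = Counter(maf_category(m) for m in example_mafs)
```

Dividing each count by the number of SNPs gives the percentage of SNPs per bin for a population.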

<sup>3</sup>www.goldenhelix.com

To estimate within-population genetic diversity, we calculated observed heterozygosity (HO), expected heterozygosity (HE), and inbreeding coefficients for the three datasets using PLINK (Purcell et al., 2007). Animal relatedness was estimated as the proportion of gene identity-by-descent between sample pairs within the breed/population as an average relatedness (PI\_HAT) value using the same software.
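For orientation, the three within-population statistics can be computed from a genotype matrix as follows. This is a minimal sketch under simplifying assumptions (no missing calls, genotypes coded 0/1/2, and a population-level F = 1 − HO/HE); it is not PLINK's estimator, which works per individual, and the function name is hypothetical.

```python
def heterozygosity(genotype_matrix):
    """Return (HO, HE, F) averaged over SNPs for one population.

    genotype_matrix: list of SNPs, each a list of 0/1/2 genotype calls.
    """
    ho_sum = 0.0
    he_sum = 0.0
    for snp in genotype_matrix:
        n = len(snp)
        p = sum(snp) / (2 * n)                      # allele frequency
        ho_sum += sum(1 for g in snp if g == 1) / n  # observed heterozygotes
        he_sum += 2 * p * (1 - p)                    # expected under HWE
    m = len(genotype_matrix)
    ho, he = ho_sum / m, he_sum / m
    f = 1 - ho / he if he > 0 else float("nan")
    return ho, he, f


# Hypothetical toy population: four animals, two SNPs.
ho, he, f = heterozygosity([[0, 1, 1, 2], [0, 0, 2, 2]])
```

With this definition, F is negative whenever HO exceeds HE, that is, when heterozygotes are in excess relative to Hardy–Weinberg expectations.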

### Genetic Relationships and Population Structure

Pairwise genetic differentiation (fixation index, FST) (Weir and Cockerham, 1984) and Reynolds' genetic distances (Reynolds et al., 1983) between all pairs of sheep populations were calculated using Arlequin v.3.5.2 (Excoffier and Lischer, 2010). The significance of genetic differences was determined from 10,000 permutation tests. Analysis of molecular variance (AMOVA) with 10,000 permutations was carried out using the same software. Using Reynolds' genetic distance, a neighbor-net tree was constructed using SPLITTREE4 v.14.5 (Huson and Bryant, 2006).
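The relationship between the two measures can be illustrated with a short sketch. Reynolds' distance is obtained from FST as D = −ln(1 − FST). The FST function below is a simple heterozygosity-based ratio estimator chosen for brevity; it is an assumption for illustration and not the Weir and Cockerham (1984) estimator used in the study, and the allele frequencies are hypothetical.

```python
import math


def simple_fst(freqs_a, freqs_b):
    """Heterozygosity-based FST between two populations.

    freqs_a, freqs_b: per-SNP frequencies of the same allele.
    Computed as sum(HT - HS) / sum(HT) over SNPs.
    """
    num = 0.0
    den = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        hs = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2  # within populations
        p_bar = (pa + pb) / 2
        ht = 2 * p_bar * (1 - p_bar)                       # pooled
        num += ht - hs
        den += ht
    return num / den if den > 0 else 0.0


def reynolds_distance(fst):
    """Reynolds' coancestry distance, D = -ln(1 - FST)."""
    return -math.log(1.0 - fst)


fst_example = simple_fst([0.4], [0.6])     # hypothetical frequencies
distance_example = reynolds_distance(fst_example)
```

For small FST values the two quantities are nearly identical (an FST of 0.02 gives a distance of about 0.0202), and they diverge as differentiation increases.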

Population genetic structure was assessed using STRUCTURE v.2.3.4 software (Pritchard et al., 2000) using a Bayesian model based on 80,602 SNPs in the five Ethiopian sheep populations and 18 breeds and on 6163 SNPs overlapping between OvineSNP50 and 600K and remaining after pruning based on LD. An admixture ancestry model with correlated allele frequencies was generated for a putative number of subpopulations (K) ranging from 2 to 18. Five runs of 20,000 Markov chain Monte Carlo iterations after a burn-in period of 10,000 iterations were carried out for each K-value. The STRUCTURE output was analyzed in HARVESTER (Earl, 2012). The most likely number of clusters was identified by the ΔK method (Evanno et al., 2005). Population structure was separately inferred by principal component analysis (PCA) based on 497,294 SNPs for the five Ethiopian sheep populations and 40,770 SNPs for all breeds using SNP and Variation Suite v.8.5.0.
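The cluster-number criterion of Evanno et al. (2005) can be sketched as follows. The statistic is the absolute second-order difference of the log-likelihood across successive K values divided by the standard deviation of the replicate runs at K; here it is evaluated from the per-K mean of ln P(D), which is one common way of computing it and is an assumption about implementation detail. The log-likelihood values are invented for illustration.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: delta K} for every K that has both neighbours.

    lnp_by_k: dict mapping K to a list of ln P(D) values from replicate runs.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_by_k[k])
            if sd > 0:
                second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
                delta[k] = second_diff / sd
    return delta


# Hypothetical ln P(D) values, two replicate runs per K.
runs = {
    1: [-1000.0, -1002.0],
    2: [-800.0, -802.0],
    3: [-790.0, -792.0],
    4: [-785.0, -787.0],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
```

Because the statistic needs both neighbours, it is undefined for the smallest and largest K tested, so the range of K explored should extend beyond the values considered plausible.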

# RESULTS AND DISCUSSION

## Intra-population Genetic Variability

The mean MAFs for Arsi-Bale, Horro, Adilo, Menz, and Blackhead Somali sheep were 0.19 ± 0.16, 0.21 ± 0.16, 0.20 ± 0.16, 0.21 ± 0.16, and 0.20 ± 0.16, respectively, with an overall mean of 0.20 across populations. For all genotyped animals, the mean MAF ranged from 0.21 for OAR 11, 12, 14, 2, and 24 to 0.23 for OAR 23. These were lower than the reported average value (0.255 ± 0.136) for New Zealand sheep breeds based on an analysis of 517,902 SNPs and those reported for Corriedale and Merino sheep (0.27), but were higher than the value observed in Creole sheep (Grasso et al., 2014).

Minor allele frequency distribution for different categories is shown in **Figure 1**. The percentage of fixed SNPs (MAF = 0.00) varied from 16.60% in Horro to 24.60% in Arsi-Bale sheep, with an overall mean of 8.10% across populations, which is lower than that reported for Creole (27%) but higher than those in Merino (3%) and Corriedale (4%) breeds (Grasso et al., 2014). In total, 45,723 fixed SNPs were shared by the five Ethiopian sheep populations; the common SNPs (≥0.10 and ≤0.5) accounted for 71.03% of the total and ranged from 58.03% in Adilo to 66.56% in Horro sheep. On average, highly polymorphic SNPs (MAF ≥ 0.30) accounted for 32.69% of total SNPs and ranged from 31.54% in Adilo to 33.40% in Blackhead Somali sheep. The levels of polymorphic SNPs (80.52%, MAF > 0.01) observed in Ethiopian sheep populations were lower than those observed in Merino (89.4%) and Corriedale (86%) sheep, but were higher than the 69% reported in Creole sheep based on a 50K chip analysis (Grasso et al., 2014). The observed difference between the current and previous studies may be explained by a difference in genotyping platforms and ascertainment bias.

The number of breed-specific SNPs detected in each breed is given in **Supplementary Table S1**. The highest number of breed-specific SNPs (68,265) was detected in the Menz sheep with frequency ranging from 0.04 to 0.50 and mean of 0.15. The lowest number of breed-specific SNPs (14,870) was observed in the Arsi-Bale sheep with frequency ranging from 0.062 to 0.50 and mean of 0.09. Breed-specific SNPs have been detected and used for breed assignment and product traceability in several livestock animals including pigs (Ramos et al., 2011), cattle (Negrini et al., 2009; Ripoli et al., 2013), and sheep (Grasso et al., 2014; Heaton et al., 2014). The population-specific SNPs identified in our Ethiopian sheep populations could be used in a similar manner once they have been validated.

On average, 23,649 (4.19%) loci in the five Ethiopian sheep populations deviated significantly from HWE (P < 0.05), with the largest number observed in Adilo (26,348) followed by Blackhead Somali (26,056) sheep. Deviation from HWE can result from inbreeding or genetic substructure within populations (i.e., the Wahlund effect) (Robertson and Hill, 1984; Hart and Clark, 1997; Choi et al., 2009).
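A per-locus HWE check can be illustrated with the classical chi-square goodness-of-fit test on genotype counts. This is a sketch under the assumption of a one-degree-of-freedom chi-square test; the software used in the study may apply a different (for example, exact) test, and the genotype counts below are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions at one biallelic locus.

    Returns (chi2, p_value) with one degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: perfect HWE proportions vs. a heterozygote deficit.
chi2_balanced, p_balanced = hwe_chi_square(25, 50, 25)
chi2_deficit, p_deficit = hwe_chi_square(40, 20, 40)
```

The second example shows the heterozygote deficit expected under inbreeding or hidden substructure, which yields a large chi-square value and a P-value well below 0.05.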

The PI\_HAT estimated based on 497,294 loci between pairs of individuals was 0.09, 0.03, 0.09, 0.08, and 0.09% for Arsi-Bale, Horro, Adilo, Menz, and Blackhead Somali sheep, respectively, and 0.01% across populations (**Table 2**). HO over all loci (497,294 SNPs) varied from 0.30 in Arsi-Bale, Horro, and Adilo sheep to 0.33 in Menz. The average gene diversity or HE across the five populations was 0.30 and ranged from 0.29 (Arsi-Bale) to 0.32 (Menz). In all populations, HO was higher than or equal to HE, except in Horro sheep. The levels of within-breed genetic variation for Ethiopian sheep populations were within the range reported for New Zealand sheep breeds (0.249–0.383) analyzed using a 600K SNP chip (Brito et al., 2017).



Sample size and the population in which SNPs are detected affect population parameter estimates (Lachance and Tishkoff, 2013; McTavish and Hillis, 2015). Variability is often overestimated in individuals from which the genotyping panel is developed (Rosenblum and Novembre, 2007). We also investigated the effect of ascertainment bias on genetic diversity parameters using loci pruned based on LD. The HE of the unpruned dataset (0.33) was reduced (0.26) after pruning SNPs with high LD within each breed (**Supplementary Table S2**). Removing SNPs in high LD minimizes the effects of ascertainment bias and reduces heterozygosity (Kijas et al., 2012; Edea et al., 2015). In both datasets, estimated inbreeding coefficients (F) were negative in all populations, except in Horro sheep (F = 0.00–0.02). Overall inbreeding in all populations was estimated as 0.06. The most inbred individual was an Adilo sheep (F = 0.30), whereas the most outbred individual was a Blackhead Somali sheep (F = –0.33).

## Population Divergence and Relationships

Analysis of molecular variance based on 497,294 autosomal SNPs revealed variations of 5.38% (P < 0.0001) and 94.62% among and within populations, respectively. The large within-population variation observed in Ethiopian indigenous sheep populations can be exploited through appropriate breeding strategies to improve productivity. When an analysis was performed for sheep populations grouped based on tail phenotype (long fat-tailed, short fat-tailed, and fat-rumped), among-groups variance was 3.33%, with 93.82% within individuals (**Table 3**). Further analysis of populations grouped according to ecological distributions (high- vs. lowland) revealed that 1.28% of the variance was among groups, 4.70% (P < 0.0001) among populations within groups, 1.61% among individuals within populations, and 92.47% within populations.

When we previously grouped Ethiopian cattle populations based on their ecological distribution (high- vs. lowland), the estimated among-group variation was 0.42% (Edea et al., 2013), which is lower than the value observed here. The variability among Ethiopian sheep populations was higher than the value of 3.64% reported among five Moroccan sheep breeds based on microsatellite markers (Gaouar et al., 2016).

FST values and Reynolds' genetic distances among the five Ethiopian sheep populations were estimated using 497,294 SNPs (**Table 4**). The overall FST value among the five populations was low (0.053) but significant (P < 0.0001). FST for all pairs of populations also differed significantly from zero (P < 0.001) and ranged from 0.02 to 0.07, with the closest pairwise value (0.02) observed between Arsi-Bale and Horro sheep. Menz sheep were more distantly related to other Ethiopian sheep populations (FST = 0.05–0.07).

The average FST among Ethiopian sheep populations was higher than the values reported for Ethiopian cattle (0.01) and goats (0.0245) (Edea et al., 2013; Mekuriaw, 2016), but similar to the mean value of 0.046 obtained using microsatellite markers (Gizaw et al., 2007) and higher than the values in Moroccan [3.6%; (Gaouar et al., 2016)], Algerian [3.8%; (Gaouar et al., 2015)], and Tunisian [3%; (Sassi-Zaidy et al., 2014)] sheep breeds.

# Population Structure

To illustrate relationships among individuals and among Ethiopian sheep populations, PCA was performed using 497,294 SNPs. PC1 and PC2 accounted for 26.71 and 25.20% of the variation, respectively, and clustered the five sheep populations according to their tail phenotypes: long fat-tailed (Arsi-Bale, Horro, and Adilo), short fat-tailed (Menz), and fat-rumped (Blackhead Somali). These clustering patterns corresponded with their geographic distribution. PC1 segregated long fat-tailed and fat-rumped populations from the short fat-tailed Menz sheep, whereas PC2 separated lowland fat-rumped Blackhead Somali sheep from highland fat-tailed populations (**Figure 2**). Menz sheep formed a tight cluster, whereas outliers were detected in the other populations. The unique genetic background of Menz sheep was corroborated by the STRUCTURE analysis results. At K = 2, the three long fat-tailed sheep populations formed a single group while Menz sheep formed an independent cluster with some admixture from the other populations. Blackhead Somali sheep shared the genetic background of the long fat-tailed populations (**Figure 3**). At K = 3, Blackhead Somali sheep tended to segregate, yet shared about 35% of their genome with long fat-tailed populations. The PCA and STRUCTURE analysis revealed clear signatures of admixture among Ethiopian sheep populations, particularly among long fat-tailed breeds, as well as genetic introgression from short fat-tailed Menz into other populations.

TABLE 4 | Pairwise genetic differentiation (FST) (below diagonal) and Reynolds' genetic distance (above diagonal) among five Ethiopian sheep populations based on an analysis of 497,294 SNPs.


Grouping of populations according to tail phenotype and ecology is in line with the previous microsatellite-based analysis (Gizaw et al., 2008). Morphological variation analysis also grouped Ethiopian sheep populations according to their tail phenotype (long fat-tailed, short fat-tailed, and fat-rumped) and ecological distribution [sub-alpine, wet highland, and arid lowlands (Gizaw et al., 2008)]. These results further support the independent introduction of fat-tailed and fat-rumped sheep into Africa. Accordingly, it was thought that fat-tailed sheep were introduced into Africa during the third wave of migration following thin-tailed hair sheep and thin-tailed wool sheep, whereas fat-rumped sheep entered much later (Epstein, 1971; Ryder, 1984).

As indicated by our genetic distance, PCA, and STRUCTURE results, the Menz sheep showed greater genetic differentiation and were clearly separated from the rest of the populations. Differences in allele frequencies between Menz sheep and other populations might have been due to selection for ecological adaptation, differences in migration histories, and geographical isolation. Menz sheep evolved in the cool sub-alpine climate of the highlands (2500–3000 m a.s.l.) and are kept for meat and coarse wool production (Wilson, 1991; Tibbo, 2006); they are one of the most primitive coarse-wool breeds imported from Arabia via the Bab-el-Mandeb Strait (Wilson, 1991). It is thought that fat-tailed coarse-wooled sheep were introduced to Africa after thin-tailed breeds about 3,000 years ago (Wilson, 2011), so adequate time has elapsed for adaptive evolution to take place. Furthermore, historical data show that the Amhara ethnic group of Ethiopia has inhabited altitudes above 2500 m for at least 5 ky (Alkorta-Aranburu et al., 2012). The Menz sheep migrated to new areas and have co-existed with humans for centuries under such extreme environments. On the other hand, fat-rumped Blackhead Somali sheep are well adapted to semi-arid to extremely arid lowlands with high temperatures and sparse and erratic precipitation (Wilson, 2011). The breed is kept for meat production and selected for higher fat deposition on the rump as a source of energy-dense food during prolonged dry spells (Muigai and Hanotte, 2013).

The low genetic differentiation between the two long fat-tailed populations (Arsi-Bale and Horro) was further supported by our population STRUCTURE analysis results. Arsi-Bale and Horro sheep populations are predominantly maintained by the Ethiopian Oromo ethnic group. In addition to geographic isolation, ethnic, cultural, and religious differences may act as barriers to gene flow that shape population genetic structure (Madrigal et al., 2001). The chances of animal exchange are greater within the same ethnic group or tribe than between any two different ethnic groups or tribes (Gizaw et al., 2007). Arsi-Bale and Horro sheep both inhabit highland environments and face common selective pressures, which may have shaped their genomes in a similar manner. We previously reported that Arsi and Horro cattle had the lowest level of genetic differentiation among examined breeds (Edea et al., 2013); our current results imply that sheep dispersal accompanied that of cattle.

# Genetic Diversity of Ethiopian Sheep Populations and Their Relationships to Global Sheep Breeds

### Genetic Diversity and Relationships

To compare genetic diversity and trace historical patterns of Ethiopian sheep population structure on a broader geographic scale, we analyzed 41,752 SNPs that overlap between the OvineSNP50 and 600K chips. Polymorphic (MAF > 0.01) and highly polymorphic (MAF > 0.30) SNPs accounted for 92 and 37% of SNPs in Ethiopian sheep populations, respectively. These values were lower than those observed for Australian Merino (96 and 45%, respectively), but higher than those for Dorset Horn (89 and 34%, respectively). Using the OvineSNP50 chip, highly polymorphic (MAF > 0.30) SNPs accounted for 50% of the total in Merino and Corriedale sheep and for 36% of the total in Creole sheep (Grasso et al., 2014). The relatively high levels of genomic variability observed in Merino sheep may be partly ascribed to ascertainment bias, as these breeds were used in SNP discovery of the OvineSNP50 chip (Kijas et al., 2012). Despite their small sample size, Ethiopian sheep populations show moderate genetic variability relative to southern African Namaqua, Indian Garole, and Dorset Horn (**Table 2**). However, Ethiopian sheep populations show slightly lower levels of genetic diversity than the presumed ancestral breeds of the Near East (Afshari; HE = 0.34) and northern Africa (HE = 0.35–0.37). Breeds from or close to domestication centers are expected to retain higher allelic diversity than those that migrated farther away (Canon et al., 2006; Peter et al., 2007). The higher diversity estimates in North African as compared to East African breeds can be further explained by the fact that these populations reflect a high degree of admixture between fat- and thin-tailed sheep, as demonstrated by our STRUCTURE analysis. Given its close proximity to the Near East and the Mediterranean Sea, North Africa served as a gateway for early livestock introduction to the African continent and is considered a secondary hotspot of genetic variation (Gautier, 2002).

Pairwise FST (**Figure 4** and **Supplementary Table S3**) and Reynolds' genetic distances (**Supplementary Table S4**) were calculated between the 18 sheep breeds/populations. The lowest differentiation was in Ethiopian populations (Arsi-Bale and Horro; FST = 0.02) and in North African breeds (Egyptian Barki and Moroccan sheep; FST = 0.02). Pairwise genetic differentiation comparisons revealed that the highest FST value (FST = 0.33) was obtained between the Dorset Horn and Namaqua Afrikaner. Within African sheep breeds, the highest differentiation (mean of 0.21) was observed between Ethiopian and Namaqua Afrikaner breeds. The low within-breed genetic diversity in Namaqua Afrikaner and high genetic differentiation between this breed and other East African sheep populations were likely due to genetic drift, which is consistent with the significantly smaller population size of Namaqua Afrikaner (Qwabe et al., 2013). Ethiopian and North African sheep breeds showed moderate genetic differentiation (FST = 0.08–0.09), while a higher value was detected between East African and Middle Eastern breeds (FST = 0.12). It is well documented that the Nile River Valley served as a genetic corridor for human and livestock gene flow between the northern and southern parts of the continent across sub-Saharan Africa (Krings et al., 1999; Horsburgh et al., 2013).

Analysis of molecular variance for the 18 global populations grouped according to geographical regions (Africa, Asia, and western) revealed that 3.68% (P < 0.0001) of the variance was among groups, 10.64% among populations within groups, and 85.69% within populations. The FST value was 0.1431 (P < 0.0001), which showed that 14.31% of the total genetic variation was due to population differences. The variation observed among the geographic regions in this study was lower than the reported value of 5.8% (Kijas et al., 2012). To assess genetic differences among the geographic regions within the African continent, we further ran AMOVA by grouping African sheep breeds according to their geographic distribution (North, East, and South). Results indicated that 8.23% (P = 0.01) of the variation was among groups and 4.20% among populations within groups. The FST value was 0.1243, which revealed that 12.43% of the total genetic difference was attributed to population differences, and the remaining 87.57% was accounted for by variation within populations.
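The link between the AMOVA variance components and the reported FST values can be checked with simple arithmetic: in a hierarchical AMOVA, FST is the share of total variance attributable to differences among groups plus differences among populations within groups. The sketch below uses the percentages reported above; the function name is hypothetical.

```python
def fst_from_amova(among_groups, among_pops_within_groups, within_pops):
    """FST as the fraction of total variance that lies between populations.

    Arguments are variance components (or percentages of total variance).
    """
    total = among_groups + among_pops_within_groups + within_pops
    return (among_groups + among_pops_within_groups) / total


# Percentages reported in the text for the two groupings.
fst_global = fst_from_amova(3.68, 10.64, 85.69)
fst_africa = fst_from_amova(8.23, 4.20, 87.57)
```

The African grouping reproduces the reported 0.1243, and the global grouping gives approximately 0.143, matching the reported 0.1431 up to rounding of the published percentages.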

## Phylogenetic Cluster Analysis

A Neighbor-Net network constructed using 40,770 SNPs clustered the study populations according to their geographic region (**Figure 5**), with close clustering of breeds or populations within a region. Among Ethiopian sheep, the two highland and fat-tailed sheep (Arsi-Bale and Horro) were closely clustered. Despite the observed phenotypic differences, fat-rumped Blackhead Somali sheep were more closely associated with fat-tailed Red Maasai sheep than with fat-tailed Ethiopian sheep populations. These populations are reared under mobile pastoral and agro-pastoral systems, and there is a high chance of inter-population mating in Kenya (Wilson, 1991). The African Dorper, a composite breed developed from Dorset Horn and Blackhead Persian (Kovács et al., 2008), was closer to Dorset Horn than to Blackhead Somali, which is a strain of Blackhead Persian sheep.

The Middle Eastern breeds (Afshari and Awassi) formed another group with Egyptian Barki sheep. In the phylogenetic tree, the Moroccan sheep occupied an intermediate position. The Brazilian Creole clustered with the Iberian populations (**Figure 5**). Long branches were noted for Namaqua Afrikaner, Dorset Horn, and Indian Garole, possibly due to small effective population size, which concurs with previous reports (Kijas et al., 2009; Spangler et al., 2017). These results are supported by population structure and admixture analyses. Despite the observed significant effect of ascertainment bias on genetic diversity, we did not detect any differences in the phylogenetic tree results for 40,770 and 6163 loci subjected to LD pruning (data not shown). In agreement with our results, it has been demonstrated that increasing the number of loci does not improve the reliability of the phylogenetic tree (Litt and Luty, 1989).

### Population Structure Analyses

Principal component analysis was carried out using 40,770 SNPs overlapping between OvineSNP50 and Ovine HD SNPs and the 6163 SNPs left after LD pruning (**Figure 6** and **Supplementary Figure S1**). PC1 accounted for 21.14% of the total variation and separated the African breeds, except Moroccan sheep, from the Western breeds. Menz and Namaqua Afrikaner were closer to the rest of the East African population but remained as a separate cluster. Eastern and Southern African breeds were separated from the Middle Eastern and North African breeds by PC2. Admixed populations should fall between their two ancestral populations, and the proportion of ancestry inherited from each can be linearly estimated (McVean, 2009). Accordingly, the African composite Dorper was positioned between Dorset Horn and East African populations, while Egyptian Barki sheep were proximal to the Middle Eastern Awassi breed. These results were consistent for 40,770 SNPs and the 6163 SNPs remaining after pruning based on LD, revealing a lack of strong ascertainment bias (**Supplementary Figure S1**).
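The linear relationship noted by McVean (2009) can be illustrated with a one-line interpolation along a principal component. The sketch assumes that each population is summarised by its mean coordinate on the PC separating the two source populations; the function name and the coordinate values are hypothetical and are not taken from the study's PCA.

```python
def ancestry_fraction(pc_admixed, pc_source_a, pc_source_b):
    """Estimated fraction of ancestry from source A along one PC axis.

    Each argument is the mean PC coordinate of a population. The estimate
    is the relative position of the admixed population between B and A,
    clamped to the interval [0, 1].
    """
    if pc_source_a == pc_source_b:
        raise ValueError("source populations must differ on this PC")
    alpha = (pc_admixed - pc_source_b) / (pc_source_a - pc_source_b)
    return min(1.0, max(0.0, alpha))


# Hypothetical mean PC1 coordinates: source A at 1.0, source B at -1.0.
midpoint_estimate = ancestry_fraction(0.0, 1.0, -1.0)
closer_to_a_estimate = ancestry_fraction(0.5, 1.0, -1.0)
```

Such estimates are only indicative, since drift after admixture and unequal sample sizes can shift a population's position on the PC axes.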

The results of the structural analysis for hypothetical populations ranging from 2 to 10 are shown in **Figure 7**. At K = 2 and K = 3, Eastern and Southern African sheep formed one group and Dorset Horn was an independent cluster, which was supported by the PCA results. At K = 3, thin-tailed Indian Garole was separated from the other breeds. From K = 4–10, Namaqua Afrikaner sheep clearly segregated from East African populations, which was well supported by the phylogenetic results. Northern Africa is mostly populated by fat-tailed sheep (Muigai and Hanotte, 2013), but our STRUCTURE analysis revealed substantial signatures of admixture in the genomes of North African populations as compared to their Eastern and Southern African counterparts. This is in accordance with the historical introduction of sheep into Africa and their dispersion across the continent through the Nile Valley; for instance, thin-tailed sheep spread into the Western Sahara via northern Africa (Muigai and Hanotte, 2013), which may have left its genomic legacy in today's North African sheep populations.

The low genetic background of Asiatic and Iberian thin-tailed sheep detected in fat-tailed East and South African breeds is consistent with the distinct histories and non-overlapping geographic distributions of these populations (Muigai, 2003b), and supports the predominance of fat-tailed sheep in the eastern and southern parts of Africa (Muigai and Hanotte, 2013). Archeological evidence traces the first fat-tailed sheep to the Eastern Ethiopian highlands (Clark and Williams, 1978). Moreover, analyses of autosomal markers and the Y chromosome have revealed the distinct evolutionary histories of thin- and fat-tailed African sheep breeds (Muigai, 2003a; Aswani, 2007).

At K = 8, we observed a divergence of the African Dorper from the East African populations, which was also well supported by our PCA and Neighbor-Net network results. At K = 6–10, Menz sheep shared 20–22% of its genome with Middle Eastern fat-tailed sheep, whereas this value did not exceed 1% in the remaining Ethiopian sheep populations. The influence of Middle Eastern fat-tailed sheep detected in Menz can be explained by the fact that within Menz and adjacent areas, cross-breeding between Menz and Awassi populations has been ongoing for more than three decades (Gizaw and Getachew, 2009). At the optimum K-value of 10, Red Maasai shared between 8 and 10% of its genome with African Dorper. It is well known that the Dorper breed was introduced into Kenya in the 1960s and was indiscriminately crossed with local breeds including Red Maasai to increase meat production in local sheep populations (Verbeek et al., 2007). Similarly, Blackhead Somali, which is a strain of Blackhead Persian sheep, was used as a maternal line in the development of African Dorper (Wilson, 1991). The sizeable genetic admixture between Iberian and North African breeds, particularly with Moroccan sheep, was clearly illustrated at K = 5–9. This finding mirrors historical human and livestock movements between Northern Africa and the Iberian Peninsula (Boone and Benco, 1999; Botigué et al., 2013); archeological and DNA evidence demonstrates the influence of North African domestic livestock species on indigenous populations of the Iberian Peninsula (Beja-Pereira et al., 2002; Anderung et al., 2005).

The close clustering of East African sheep populations and distinct separation from their northern counterparts was well demonstrated by our phylogenetic, PCA, and STRUCTURE analyses. This result coincides with the evidence that fat-tailed sheep were introduced into Africa via two independent routes: the Horn of Africa and northern Africa from the Middle East (Ryder, 1984). The lowest genetic differentiation obtained for the two Ethiopian sheep populations (Arsi-Bale and Horro; FST = 0.02) was also well supported by population STRUCTURE and Neighbor-Net network analyses. We suggest that this could be due to gene flow and similarity of production environments (Gizaw et al., 2007). On the other hand, the unique genetic composition of short fat-tailed Menz sheep is consistent with its distinct phenotypes, population histories, and ecological distribution (Gizaw et al., 2007).

# CONCLUSION

Our high-density genome-wide SNP analyses revealed that Ethiopian sheep populations are roughly clustered according to their geographic distribution and tail phenotype. The genetic diversity and structure of Ethiopian sheep populations can be explained by historical events and selection for ecological adaptation. The high-density SNP data generated in this study can be used to identify genes and pathways relevant for physiological adaptation to extreme environments and variation in phenotypic traits. The close clustering of Eastern African breeds and their separation from North African breeds provide evidence that fat-tailed sheep were introduced to the continent via the Horn of Africa and migrated further southwards. Additional genome-wide analyses of thin-tailed sheep breeds from Eastern and Western Africa and fat-tailed breeds from the Arabian Peninsula can clarify the evolutionary history of sheep on the African continent and provide new insight into the genomic landscape of African sheep breeds.

# DATA ACCESSIBILITY

Genotypic data of 72 animals representing five Ethiopian sheep populations are deposited and available at https://www.animalgenome.org/repository/pub/KORE2017.1122/.

# ETHICS STATEMENT

Local regulations were observed. This research used nasal swab DNA collection kits, which neither injure the animals nor cause them pain.


# AUTHOR CONTRIBUTIONS

ZE and K-SK conceived the study, analyzed the data, and wrote the manuscript. TD provided logistical support for field data collection and facilitated sample export. HD and K-TD revised the manuscript. All authors read and approved the final manuscript.

# FUNDING

This study was supported by a grant from the National Research Foundation of Korea (No. NRF-2017R1A2B1008883). The authors thank the International Livestock Research Center, Addis Ababa, Ethiopia, for providing logistical support; and the International Sheep Genomics Consortium for providing OvineSNP50 chip data. This study used Moroccan sheep data generated by the NextGen Consortium, which was funded by the European Union Seventh Framework Program (FP7/2010-2014) under grant agreement no. 244356.

# ACKNOWLEDGMENT

We would like to thank the reviewers for their useful comments and suggestions.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2017.00218/full#supplementary-material

FIGURE S1 | Results of PC 1 and 2 from the dataset of 6163 SNP markers.

TABLE S1 | The number of private alleles detected in the comparison of each sheep population.

TABLE S2 | Diversity indices in 5 Ethiopian sheep populations estimated from 80,602 SNPs obtained after LD pruning.

TABLE S3 | Genetic differentiation (FST) between the study sheep breeds based on analysis of 40,770 SNPs.

TABLE S4 | Reynolds' genetic distance between the 18 sheep breeds based on analysis of 40,770 SNPs.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Edea, Dessie, Dadi, Do and Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Insertion/Deletion Within the KDM6A Gene Is Significantly Associated With Litter Size in Goat

Yang Cui<sup>1†</sup>, Hailong Yan<sup>1,2,3†</sup>, Ke Wang<sup>1</sup>, Han Xu<sup>1</sup>, Xuelian Zhang<sup>1</sup>, Haijing Zhu<sup>2,3</sup>, Jinwang Liu<sup>2,3</sup>, Lei Qu<sup>2,3</sup>, Xianyong Lan<sup>1</sup>\* and Chuanying Pan<sup>1</sup>\*

<sup>1</sup> College of Animal Science and Technology, Northwest A&F University, Yangling, China, <sup>2</sup> Shaanxi Provincial Engineering and Technology Research Center of Cashmere Goats, Yulin University, Yulin, China, <sup>3</sup> Life Science Research Center, Yulin University, Yulin, China

### Edited by:

Farai Catherine Muchadeyi, Agricultural Research Council of South Africa (ARC-SA), South Africa

### Reviewed by:

Kieran G. Meade, Teagasc, The Irish Agriculture and Food Development Authority, Ireland

Fabyano Fonseca Silva, Universidade Federal de Viçosa, Brazil

### \*Correspondence:

Xianyong Lan lan342@126.com

Chuanying Pan panyu1980@126.com

†These authors have contributed equally to this work.

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 14 November 2017 Accepted: 05 March 2018 Published: 20 March 2018

### Citation:

Cui Y, Yan H, Wang K, Xu H, Zhang X, Zhu H, Liu J, Qu L, Lan X and Pan C (2018) Insertion/Deletion Within the KDM6A Gene Is Significantly Associated With Litter Size in Goat. Front. Genet. 9:91. doi: 10.3389/fgene.2018.00091

A previous whole-genome association analysis identified lysine demethylase 6A (KDM6A), which encodes a type of histone demethylase, as a candidate gene associated with goat fecundity. Knockout of the KDM6A gene in mice disrupts gametophyte development, suggesting that it has a critical role in reproduction. In this study, goat KDM6A mRNA expression profiles were determined, insertion/deletion (indel) variants in the gene were identified, the effects of these indel variants on KDM6A gene expression were assessed, and their association with first-born litter size was analyzed in 2,326 healthy female Shaanbei white cashmere goats. KDM6A mRNA was expressed in all tissues tested (heart, liver, spleen, lung, kidney, muscle, brain, skin and testis); the expression levels in testes at different developmental stages [1-week-old (wk), 2, 3 wk, 1-month-old (mo), 1.5 and 2 mo] indicated a potential association with the mitosis-to-meiosis transition, implying that KDM6A may have an essential role in goat fertility. Meanwhile, two novel intronic indels of 16 bp and 5 bp were identified. Statistical analysis revealed that only the 16 bp indel was associated with first-born litter size (P < 0.01), and the average first-born litter size of individuals with the insertion/insertion genotype was higher than that of individuals with the deletion/deletion genotype (P < 0.05). There was also a significant difference in the genotype distributions of the 16 bp indel between mothers of single-lamb and multi-lamb litters in the studied goat population (P = 0.001). Consistently, the 16 bp indel also had a significant effect on KDM6A gene expression. Additionally, there was no significant linkage disequilibrium (LD) between these two indel loci, consistent with the association analysis results. Together, these findings suggest that the 16 bp indel in KDM6A may be useful for marker-assisted selection (MAS) of goats.

Keywords: cashmere goat, KDM6A gene, meiosis, mitosis, insertion/deletion (indel), litter size

# INTRODUCTION

Improvements in female fertility are of critical importance for the goat industry. Litter size is one of the most important factors limiting female fertility, and increasing it has therefore received considerable attention (Naicy et al., 2016; Yang et al., 2017). However, litter size is a trait with low heritability in many livestock animals, including pigs (Córdoba et al., 2015) and goats (Shaat and Mäki-Tanila, 2009); therefore, traditional direct selection is ineffective. At present, marker-assisted selection (MAS), based on relevant genetic variants, is used extensively to improve traits with low heritability, such as those associated with growth and reproduction (Sharma et al., 2013; An et al., 2015; Tomas et al., 2016). To facilitate the application of MAS to litter size in the goat industry, critical genetic variants that confer a phenotypic advantage should be verified.

Currently, whole-genome sequencing and genome-wide association studies (GWAS) are used to explore genetic variants strongly associated with production traits (Lai et al., 2016; Mota et al., 2017; Wu et al., 2018); however, numerous potential genes identified by GWAS have not been fully verified. To address this problem, methods which combine GWAS analysis results and MAS to screen for critical genetic variations in large livestock populations have been developed. Previously, Hubert et al. (2014) used whole-genome re-sequencing to reveal that genomic variations within the bovine transmembrane protein 95 (TMEM95) gene are associated with male reproductive performance. A genome scan in a French dairy goat population also found that variants in diacylglycerol o-acyltransferase 1 (DGAT1) were associated with a notable decrease in milk fat content (Martin et al., 2017). These results demonstrate the feasibility of using combined methods to screen for important genetic variations.

In 2016, a study using whole-genome analysis to compare high and low fecundity groups of the Chinese Laoshan dairy goat identified several genes as potentially critical for fecundity, including lysine demethylase 6A (KDM6A), androgen receptor (AR), and anti-Mullerian hormone receptor type 2 (AMHR2) (Lai et al., 2016). Among these genes, KDM6A encodes a protein that demethylates tri- and dimethylated lysine 27 of histone H3, and can affect gametophyte development. Importantly, numerous studies have verified that the KDM6A gene is vital for animal reproduction. In rodents, knock-out of the KDM6A gene disrupted primordial germ cell development (Mansour et al., 2012). In female mice, the Rhox cluster of genes, which contains reproduction-related homeobox genes, is also regulated by KDM6A (Berletch et al., 2013). Furthermore, KDM6A regulates maturation of the mouse oocyte (Xu et al., 2017). Overall, based on whole-genome analysis and rodent studies, there is strong evidence that KDM6A has crucial roles in modulation of goat fecundity.

To date, goat KDM6A gene expression profiles and DNA polymorphisms are largely unexplored. Therefore, in this study, the tissue expression profiles of the KDM6A gene were investigated, two novel indel variants in this gene identified and the relationship between these loci and first-born litter size evaluated in a large Shaanbei white cashmere goat population. Moreover, the relationship between the identified indel loci and KDM6A expression levels was assessed. Our findings provide a basis for further research about the underlying causal mutation and suggest hypotheses for further study leading to the application of MAS to goat breeding.

# MATERIALS AND METHODS

All experiments in this study involving animals were approved by the Faculty Animal Policy and Welfare Committee of Northwest A&F University (protocol number NWAFAC1008). Moreover, the care and use of experimental animals completely conformed with local animal welfare laws, guidelines, and policies.

# Sample Collection

For DNA experiments, a total of 2,326 adult female Shaanbei white cashmere goats were randomly selected from a large population. These goats all received the same diet and were kept under standard conditions after weaning. Among these goats, 1,811 animals had records of first-born litter size (Wang et al., 2017; Yang et al., 2017). In addition to these female goats, we collected samples from a total of 18 male goats at six different developmental stages for RNA experiments. Nine tissues (heart, liver, spleen, lung, kidney, testis, brain, skin, and muscle) were harvested from 1-week-old (wk) and 2-month-old (mo) male goats (n = 3 per group). Moreover, testis tissue samples were also collected at 2, 3 wk, 1, and 1.5 mo (n = 3 per group). All tissues were immediately frozen in liquid nitrogen and stored at −80 °C.

# Isolation of DNA

Genomic DNA was isolated from ear tissues using the method published by Lan et al. (2007). The quality of genomic DNA samples was assayed using a Nanodrop 2000 Spectrophotometer. DNA samples were each diluted to a working concentration of 10 ng/µL and stored at −20 °C.

# Primer Design, PCR Amplification, and Indel Genotyping

Five primer pairs for amplification of indel loci in introns were designed using Primer Premier software (version 5.0) based on the goat KDM6A gene sequence (NW\_017189516.1) and the NCBI SNP database (https://www.ncbi.nlm.nih.gov/snp; **Table 1**). Assays were performed by touch-down PCR in a 13 µL volume containing 6.5 µL 2× mix, 0.3 µL each of forward and reverse primers, 0.8 µL genomic DNA (10 ng/µL), and 5.4 µL ddH<sub>2</sub>O. The PCR protocol was as follows: initial denaturation for 5 min at 95 °C; 18 cycles of denaturation for 30 s at 94 °C, annealing for 30 s at 68 °C (with a decrease of 1 °C per cycle), and extension for 30 s at 72 °C; another 25 cycles of 30 s at 94 °C, 30 s at 50 °C, and 2 min at 72 °C; and a final extension for 10 min at 72 °C, with subsequent cooling to 4 °C. Indel polymorphisms in goat KDM6A were genotyped by separating PCR products (5 µL) by agarose gel electrophoresis.
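The touch-down annealing schedule described above can be written out cycle by cycle. The following is a minimal sketch, assuming that the first touch-down cycle anneals at 68 °C and each later cycle is 1 °C cooler; the function name and parameter names are hypothetical and are not part of the original protocol.

```python
def touchdown_annealing_schedule(start_c=68.0, step_c=1.0, touchdown_cycles=18,
                                 final_c=50.0, final_cycles=25):
    """Return the annealing temperature (deg C) for every PCR cycle.

    Assumption: the first touch-down cycle uses start_c and each
    subsequent touch-down cycle is step_c degrees cooler.
    """
    # Touch-down phase: 68, 67, ..., one value per cycle
    touchdown = [start_c - step_c * i for i in range(touchdown_cycles)]
    # Fixed-temperature phase that follows the touch-down phase
    fixed = [final_c] * final_cycles
    return touchdown + fixed


schedule = touchdown_annealing_schedule()
print(len(schedule), schedule[0], schedule[17], schedule[18])
```

Under this reading, the last touch-down cycle anneals at 51 °C, one step above the 50 °C used for the remaining 25 cycles.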

# Total RNA Isolation and Synthesis of cDNA

Total RNA was extracted from tissue samples using TRIzol total RNA extraction reagent (Takara, Dalian, China), according to the manufacturer's instructions. The integrity of total RNA was evaluated by 1% agarose gel electrophoresis with 6× loading buffer (Takara, Dalian, China). The quantity and quality of total RNA were estimated using a Nanodrop 2000 Spectrophotometer, with the OD<sub>260 nm</sub>/OD<sub>280 nm</sub> ratio expected to be between 1.8 and 2.0 and the OD<sub>260 nm</sub>/OD<sub>230 nm</sub> ratio expected to be no less than 1.7 (Zhang et al., 2017). Samples were then stored at −80 °C. The PrimeScript™ RT Reagent kit (Takara, Dalian, China) was used to synthesize first-strand cDNA, according to the manufacturer's protocol. The resultant cDNA was stored at −20 °C.

**Abbreviations:** KDM6A, lysine-specific demethylase 6A; indel, insertion/deletion; SNPs, single nucleotide polymorphisms; LD, linkage disequilibrium; MAS, marker-assisted selection; Ho, homozygosity; He, heterozygosity; PIC, polymorphism information content; HWE, Hardy-Weinberg equilibrium; II, insertion/insertion; ID, insertion/deletion; DD, deletion/deletion; wk, week-old; mo, month-old; MSL, Mothers of single lamb; MML, Mothers of multi-lamb.

# Analysis of KDM6A mRNA Expression Profiles by Quantitative Real-Time PCR

TABLE 1 | PCR primers used for detecting indel loci and qPCR analysis of goat KDM6A gene.

KDM6A gene expression profiles were analyzed by qPCR using cDNA from 1 wk and 2 mo male goat tissue samples. Expression profiles in testes at different time points (1, 2, 3 wk, 1, 1.5, and 2 mo) were also evaluated. qPCR primers were designed to span different exons to ensure amplification of cDNA (**Table 1**). qPCR reactions (12 µL) contained 6 µL 2× SYBR Premix ExTaq (Takara, Dalian, China), 0.5 µL of each primer, and 5 µL cDNA (1/100 dilution). PCR amplification was performed as follows: 95 °C for 5 min followed by 40 cycles of 94 °C for 30 s, 60 °C for 30 s, and 72 °C for 30 s (Yu et al., 2017). The expression levels of RPL19 (ribosomal protein L19), GAPDH (glyceraldehyde-3-phosphate dehydrogenase) and ACTB (β-actin) in all isolated tissues were tested. The reference gene for each tissue was selected with the GeNorm program based on M-values (the reference gene with the lowest M-value is considered the most stable; Vandesompele et al., 2002). After calculation, RPL19 was used as the reference gene in lung, muscle, brain and skin, and ACTB was used as the reference gene for evaluation of relative gene expression in heart, liver, spleen, kidney and testis. Previous studies also used ACTB to determine gene expression in goat testis (Yao et al., 2014; Deng et al., 2017b). The results were determined using the 2<sup>−ΔΔCt</sup> method (Livak and Schmittgen, 2001).
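The relative quantification step can be illustrated with a short calculation. This is a minimal sketch of the 2<sup>−ΔΔCt</sup> formula of Livak and Schmittgen (2001), assuming equal amplification efficiency for the target and reference genes; the function name and the Ct values are hypothetical and are not data from this study.

```python
def relative_expression(ct_target, ct_reference,
                        ct_target_calibrator, ct_reference_calibrator):
    """Fold change of a target gene by the 2^(-ddCt) method.

    ct_target / ct_reference: Ct values in the sample of interest.
    ct_*_calibrator: Ct values in the calibrator (baseline) sample.
    """
    # Normalize the target gene to the reference gene in each sample
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = ct_target_calibrator - ct_reference_calibrator
    # Compare the sample with the calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: the target amplifies 2 cycles earlier,
# relative to the reference gene, than it does in the calibrator.
fold_change = relative_expression(24.0, 18.0, 26.0, 18.0)
print(fold_change)
```

A ΔΔCt of −2 corresponds to a four-fold higher relative expression in the sample than in the calibrator.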

# Statistical Analysis

To explore the genetic structure of the indel variants in the investigated goat population, genetic diversity indices were calculated. The genotype and allele frequencies reflect the genetic composition of the indel variants in the tested goat population. Nei's methods were used to calculate population genetic diversity indices, including homozygosity (Ho), heterozygosity (He; Ho + He = 1) and polymorphism information content (PIC) (Nei and Roychoudhury, 1974). Ho and He are measures of the genetic variation of a population, and PIC is an indicator of polymorphism. Based on PIC values, loci were classified as having high genetic diversity (PIC > 0.5), medium genetic diversity (0.25 < PIC < 0.5) or low genetic diversity (PIC < 0.25) (Botstein et al., 1980). The χ<sup>2</sup> test, implemented in the SHEsis online platform (http://analysis.bio-x.cn), was used to evaluate HWE (Li et al., 2009; Chen et al., 2013).
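For a biallelic indel, the diversity indices and the HWE test can be computed directly from allele frequencies and genotype counts. The sketch below uses the standard biallelic formulas (Ho = p² + q², He = 1 − Ho, PIC = He − 2p²q²) and a one-degree-of-freedom χ² goodness-of-fit test; the function names are hypothetical, and this is not the SHEsis implementation used by the authors.

```python
import math


def diversity_indices(p_insertion):
    """Return (Ho, He, PIC) for a biallelic locus from the I-allele frequency."""
    p = p_insertion
    q = 1.0 - p
    ho = p * p + q * q               # expected homozygosity
    he = 1.0 - ho                    # expected heterozygosity (equals 2pq)
    pic = he - 2.0 * p * p * q * q   # polymorphism information content
    return ho, he, pic


def classify_pic(pic):
    """Diversity class using the thresholds quoted in the text."""
    if pic > 0.5:
        return "high"
    if pic > 0.25:
        return "medium"
    return "low"


def hwe_chi_square(n_ii, n_id, n_dd):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium.

    Assumes both alleles are present, so no expected count is zero.
    """
    n = n_ii + n_id + n_dd
    p = (2 * n_ii + n_id) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ii, n_id, n_dd)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


print(diversity_indices(0.941))
print(hwe_chi_square(25, 50, 25))
```

With the allele frequencies reported in the Results for the two loci ("I" = 0.941 and "I" = 0.278), these formulas give PIC values that round to 0.105 and 0.321, matching the reported values.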

Linkage disequilibrium (LD) is the nonrandom association of alleles at linked loci. Many genetic variants are correlated with each other because of LD; thus, LD plays a crucial role in mapping genes associated with complex diseases or traits (Pritchard and Przeworski, 2001; Hazelett et al., 2016). Here, to determine whether the two indels identified in KDM6A are linked, the LD structure, as measured by D' and r<sup>2</sup>, was estimated with the SHEsis online platform (http://analysis.bio-x.cn; Li et al., 2009). The r<sup>2</sup> value was used as a pairwise measure of LD (Marty et al., 2010; Huang et al., 2015). A value of r<sup>2</sup> = 0 indicates perfect linkage equilibrium, r<sup>2</sup> > 0.33 indicates sufficiently strong LD, and r<sup>2</sup> = 1 indicates complete LD (Ren et al., 2014).
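The two LD measures can be illustrated from haplotype and allele frequencies. This is a minimal sketch using the standard definitions of D, D' and r², assuming the haplotype frequency is already known; SHEsis estimates haplotype frequencies from unphased genotypes, a step not reproduced here, and the function name is hypothetical.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2).
    p_a, p_b: frequencies of allele A and allele B; both must lie strictly
    between 0 and 1.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # r^2: squared correlation between alleles at the two loci
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    # D': D scaled by its maximum possible magnitude given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max
    return d, d_prime, r2


print(pairwise_ld(0.5, 0.5, 0.5))   # alleles always co-occur: complete LD
print(pairwise_ld(0.25, 0.5, 0.5))  # independent alleles: linkage equilibrium
```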

Associations between indels and first-born litter size were analyzed using a general linear model: $Y_{ij} = \mu + \mathrm{HYS}_i + G_j + e_{ij}$, where $Y_{ij}$ is the phenotypic value of litter size, $\mu$ is the overall population mean, $\mathrm{HYS}_i$ is the fixed effect of the herd-year-season, $G_j$ is the fixed effect of genotype, and $e_{ij}$ is the random error (Yang et al., 2017). The litter size data used in this study were for first-born litters only; thus, lambing year and parity were not included in the general linear model. The analysis was performed with SPSS 19.0 software by one-way ANOVA, and means were compared using Tukey's multiple comparison test.
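The genotype comparison at the core of this analysis can be sketched as a one-way ANOVA F statistic. The example below is a simplified illustration that considers the genotype effect only and omits the herd-year-season term of the model; the litter size values are hypothetical, and the function name is not from the original analysis, which was run in SPSS.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    # Variation of group means around the grand mean
    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    # Variation of observations around their own group mean
    ss_within = sum((value - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for value in group)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical first-born litter sizes grouped by genotype (II, ID, DD)
litter_size_by_genotype = {
    "II": [2, 2, 1, 2, 2, 1],
    "ID": [1, 2, 1, 2, 1, 1],
    "DD": [1, 1, 1, 2, 1, 1],
}
f_stat, df_between, df_within = one_way_anova_f(list(litter_size_by_genotype.values()))
print(f_stat, df_between, df_within)
```

A large F relative to the F distribution with the returned degrees of freedom indicates that mean litter size differs among genotypes; pairwise comparisons such as Tukey's test would then identify which genotypes differ.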

# RESULTS

# mRNA Expression Profile of Goat KDM6A

Goat KDM6A mRNA expression profiles were investigated in different tissues at 1 wk (**Figure 1A**) and 2 mo (**Figure 1B**). KDM6A was found to be expressed in all tissues tested at both developmental stages. Notably, the expression levels of KDM6A in heart, liver and spleen tissues were significantly higher at 1 wk than at 2 mo (P < 0.05). In contrast, the expression levels of KDM6A were significantly lower in lung, muscle, brain, and skin at 1 wk than at 2 mo (P < 0.05; **Figure 1C**).

# Goat KDM6A Gene Expression Profiles in Testis Tissues

This study focused on the reproductive system; thus, the expression levels of KDM6A at different testis developmental stages (1, 2, 3 wk, 1, 1.5, and 2 mo) were explored. In testis tissues, KDM6A mRNA expression levels at 2 and 3 wk were significantly lower than those at 1, 1.5, and 2 mo (P < 0.05; **Figure 2A**). In a previous study of the Liaoning cashmere goat (the paternal breed of the Shaanbei white cashmere goat), spermatogonia were found to be actively mitotic from the postnatal period, with primary spermatocytes, which initiate meiosis, first appearing at 1 mo (Zhan, 2015). Thus, we divided testis development into two phases: birth to 1 mo, referred to as the mitosis period, and 1–2 mo, referred to as the meiosis period. KDM6A mRNA expression levels were significantly higher in the meiosis period than in the mitosis period (P < 0.05; **Figure 2B**). Together, these findings suggest that KDM6A may have an important role in fertility. To explore potential DNA markers for improvement of goat fertility, we next focused on the identification of polymorphisms in KDM6A.

# Identification of Genetic Variants That Regulate KDM6A Expression

In this study, two novel indel variants were detected in intron 17 of goat KDM6A: a 16 bp indel (NW\_017189516.1:g.138431\_138446delAATGTATAGCTTAAAA; rs636691921) and a 5 bp indel (NW\_017189516.1:g.138708\_138712delTTAAT; rs653321281). These indels were detected using primer pairs 3 and 4, respectively. The PCR products separated by agarose gel electrophoresis and the sequence diagrams of these two novel indels are presented in **Figure 3** and **Supplement Figure 1**.

Several previous studies have reported that intronic variants can affect gene transcription (Ren et al., 2011); therefore, KDM6A expression levels at different developmental periods were compared in animals with the same genotype. However, no individuals with the DD or ID genotype at the 16 bp locus were found in the mitosis period; thus, for this locus, we only compared the KDM6A expression levels of II genotype carriers between the mitosis and meiosis periods. The results demonstrated that, for the II genotype at the 16 bp locus, KDM6A expression levels were significantly higher in the meiosis period (P < 0.01). Furthermore, KDM6A expression was significantly higher during the meiosis period in animals with the II and DD genotypes at the 5 bp locus (P < 0.01; **Figure 4**). Together, these results indicate that the two indel loci could affect the expression levels of KDM6A and may influence the reproductive phenotype of goats. Therefore, the relationship between these indel loci and goat reproductive traits was further investigated in a large goat population.

# Genetic Parameters and LD of the Identified Indel Loci

The genotype and allele frequencies, as well as other genetic parameters, of the KDM6A indel loci were calculated to determine the genotype distribution among Shaanbei white cashmere goats (**Table 2**). The data indicated that the "I" allele (0.941) of the 16 bp indel was more frequent than the "D" allele (0.059). For the 5 bp indel, analysis of 615 individuals indicated that the "I" allele was less frequent (0.278) than the "D" allele (0.722). Additionally, the χ<sup>2</sup> test indicated that the 5 bp indel genotype frequencies were in agreement with HWE (P > 0.05) in the Shaanbei white cashmere goat population; however, the 16 bp indel did not conform to HWE (P < 0.05; **Table 2**). Based on PIC values, the 16 bp locus had low genetic diversity (PIC = 0.105), and the 5 bp locus had medium genetic diversity (PIC = 0.321). Moreover, we analyzed the LD between these two indel loci; no strong LD was detected between them (r<sup>2</sup> = 0.047; **Table 3**; **Figure 5**).

FIGURE 3 | The electrophoresis diagrams and sequence diagrams of goat KDM6A gene indel loci. (A) 16 bp indel locus. (B) 5 bp indel locus.

# Analyses of Associations Between Indel Variations and First-Born Litter Size

Next, the associations between the KDM6A indel loci and the reproductive performance of female goats (first-born litter size) were investigated. The results showed no relationship between the 5 bp indel and first-born litter size in subsets of different sizes (n = 300–600 individuals) randomly selected from the whole population (P > 0.05; **Table 4**). Notably, the 16 bp locus was consistently associated with first-born litter size (P < 0.01) in subsets of 300 to 600 individuals and in the full set of 1,811 individuals, with animals carrying the II genotype having a larger first-born litter size than those carrying the DD genotype (**Table 4**).

TABLE 2 | Genetic parameters of the 16 and 5 bp loci within KDM6A gene in Shaanbei white cashmere goat.

TABLE 3 | Estimated values of linkage disequilibrium analysis for two indels in KDM6A gene in studied populations.

Furthermore, we investigated the genotype distributions of these two indel loci in groups of goats with first-born single-lamb and multi-lamb litters, using the same test groups described above (n = 100–600 individuals; **Tables 5**, **6**). The results demonstrate that only the 16 bp indel had different genotype distributions between the two groups of goats with different litter types (P < 0.01). These results were consistent with those of the association analyses; therefore, we tested the 16 bp indel in a total of 1,811 individuals. The results indicated a significant difference in genotype distributions between groups with first-born single-lamb and multi-lamb litters at the 16 bp indel (P = 0.001; **Table 5**). There was no LD between the two indel loci, consistent with the results of the association analysis (**Figure 5**).
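A comparison of genotype distributions between the two litter-type groups can be sketched as a χ² test of independence on a genotype-by-group contingency table. This is an illustration only: the text does not state which test produced the reported P values, so the choice of test is an assumption, the counts are hypothetical, and the function names are not from the original analysis.

```python
import math


def chi_square_independence(table):
    """Return (chi2, degrees_of_freedom) for a contingency table.

    table: list of rows, e.g. one row per genotype (II, ID, DD) and
    one column per group (single-lamb mothers, multi-lamb mothers).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


def p_value_two_dof(chi2):
    """Upper-tail probability of a chi-square statistic with exactly 2 df."""
    return math.exp(-chi2 / 2.0)


# Hypothetical genotype counts: rows II, ID, DD; columns MSL, MML
counts = [[400, 250], [60, 20], [10, 2]]
chi2, dof = chi_square_independence(counts)
print(chi2, dof, p_value_two_dof(chi2))
```

A three-genotype by two-group table has two degrees of freedom, for which the upper-tail probability has the closed form exp(−χ²/2) used above.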

# Influence of the 16 bp Indel on KDM6A Expression During the Meiosis Period

Based on the results of the association analyses, we hypothesized that the 16 bp indel can influence the goat reproductive phenotype. This effect may be attributable to the influence of genotype at this locus on KDM6A mRNA expression levels. Therefore, we tested KDM6A mRNA expression levels in testis tissue from animals with the three genotypes at each of the two indel loci during the meiosis period. At the 16 bp indel locus, individuals with the II genotype had significantly higher levels of KDM6A mRNA expression than those with the ID and DD genotypes (P < 0.05; **Figure 6**); however, there were no significant differences in KDM6A expression levels among genotypes at the 5 bp indel locus (P > 0.05; **Figure 6**).

# DISCUSSION

Previously, Lai et al. (2016) determined, using deep sequencing analysis, that variants of the KDM6A gene were closely related to fecundity in Laoshan dairy goats. Several studies have also explored the role of KDM6A in reproductive biology (Yap et al., 2011), and their findings suggest that this gene has an essential role in reproduction. However, there are no previous reports of goat KDM6A tissue expression profiles, and the relationship between KDM6A gene variants and first-born litter size in a large Shaanbei white cashmere goat population (n = 2,326) required further investigation.

First, we determined the expression profiles of the goat KDM6A gene, and the results demonstrated that it was widely expressed in various organs. As the KDM6A gene is reported to be associated with spermatogenesis (Teperek et al., 2016), we next determined its expression patterns at different developmental stages in testis. Interestingly, the mRNA expression levels of KDM6A at later developmental stages (1, 1.5, and 2 mo) were higher than those at earlier stages (1, 2, and 3 wk). A study of the Liaoning cashmere goat reported that spermatogonia gradually proliferate via mitotic division from birth and that primary spermatocytes, which initiate meiosis, develop from 1 mo (Zhan, 2015). Therefore, we classified the 1, 2, and 3 wk samples as the mitosis period and the 1, 1.5, and 2 mo samples as the meiosis period. Our results demonstrate that KDM6A mRNA expression levels during the mitosis period were lower than those in the meiosis period (P < 0.05), suggesting that KDM6A may be associated with the mitosis-to-meiosis transition in the Shaanbei white cashmere goat. Additionally, previous reports indicate that KDM6A regulates oocyte meiosis resumption in female mice, and that abnormal expression of this gene causes aberrant H3K27me3, leading to disruption of oocyte maturation (Xu et al., 2017). Together, these data indicate that the KDM6A gene may have an essential role in meiosis resumption in both male and female animals.

Data represent means ± SE. Cells with different letters (a, b) differ significantly (P < 0.05). Bold values indicate P < 0.05.

In addition to the KDM6A gene, deep sequencing analyses of the Laoshan dairy goat have also identified genetic variants in male sex differentiation genes, including AR and AMHR2, that are closely associated with female fecundity (Lai et al., 2016). Furthermore, with the development of modern, intensive breeding conditions, male livestock are far fewer than females (Wang et al., 2017). We aimed to explore the genetic variation in goat KDM6A in order to use the identified polymorphisms as molecular markers for MAS in goat breeding. Therefore, we performed further analysis of KDM6A genetic effects in the female Shaanbei white cashmere goat population.

TABLE 5 | The 16 bp locus genotype distribution between mothers of single lamb and multi-lamb litters in Shaanbei white cashmere goats.

MSL, Mothers of single lamb; MML, Mothers of multi-lamb (≥2). Bold values mean P < 0.05.

TABLE 6 | The 5 bp locus genotype distribution between mothers of single lamb and multi-lamb litters in Shaanbei white cashmere goats.

MSL, Mothers of single lamb; MML, Mothers of multi-lamb (≥2). Bold values mean P < 0.05.

Currently, natural genetic variations are divided into three forms: SNPs (single nucleotide polymorphisms), indels and SVs (larger structural variants; Julienne et al., 2010). Unlike other genetic variations, indels can be directly detected by simple PCR amplification and agarose gel electrophoresis, making them convenient and practical markers (Naicy et al., 2016). Therefore, indel variants in the KDM6A gene were identified and their associations with first-born litter size were investigated in a large commercial population of 2,326 Shaanbei white cashmere goats. Two novel indel loci (16 and 5 bp) were identified in putative intron 17 sequences, and each had three genotypes (II, ID, and DD). The 5 bp indel was in HWE (P > 0.05); however, the 16 bp locus was not (P < 0.05), because of the low number of observed DD genotypes. One possible reason for this is rapid, powerful, and effective selection, which could affect the allelic balance of the indel locus (Zhao et al., 2013; Wang et al., 2015). Therefore, our data suggest that selection pressure may have been stronger on the 16 bp indel locus than on the 5 bp indel locus in the investigated goat population.

To analyze the association between indel loci and first-born litter size, we developed a novel strategy. Initially, the two indel loci were analyzed in the same groups of 100–600 individuals, which were selected randomly from the whole population. When an indel locus shows a significant association with the phenotype in every investigated subset, the site can be considered to be truly associated with the tested trait, especially in a large population. This strategy improves the credibility of the test. Using groups of 300–600 individuals, there was no relationship between first-born litter size and the 5 bp indel locus (P > 0.05; **Table 4**). Interestingly, the 16 bp locus was consistently associated with first-born litter size in the same test groups (P < 0.01). Based on these data, we performed further analysis of the 16 bp indel among all individuals and found that the association with first-born litter size was retained (P < 0.01), with the II genotype associated with larger litter size relative to the other genotypes, suggesting that the "I" allele of the KDM6A gene positively affects fecundity in this breed. Next, we adopted the same strategy to compare genotype distributions at these two indel loci between females with first-born single-lamb and multi-lamb litters. The results indicate that the 16 bp indel was strongly associated with goat first-born litter size. Compared with direct analysis in the tested population (Deng et al., 2017a), this new strategy may provide more detailed and reliable association results. Additionally, the 16 bp locus had a significant effect on KDM6A gene expression, further implying considerable potential for the application of this locus. Moreover, linkage analysis demonstrated no LD between the two analyzed indel loci, consistent with the different results of the association analyses.

The association analysis based on the large experimental population revealed that the 16 bp indel located in intron 17 of KDM6A was strongly associated with litter size in goats, which was consistent with the previous whole-genome analysis of Laoshan dairy goats (Lai et al., 2016). Since Ren et al. (2011) reported that intronic variations could affect gene expression levels, the relationship between the 16 bp indel and the expression of KDM6A was evaluated in the current study. Our results showed that the intronic 16 bp indel was significantly associated with the expression of KDM6A. According to previous investigations, intronic variations can affect the interaction between transcription factors and host genes (Van Laere et al., 2003; Fushan et al., 2009; Soldner et al., 2016). Therefore, transcription factor binding sites in the 16 bp indel sequence were predicted using the online software Genomatix MatInspector (http://www.genomatix.de; Cartharius et al., 2005). The bioinformatics analysis showed that the transcription factor myocyte-specific enhancer factor 2 (MEF2) could bind to the sequence lacking the 16 bp nucleotides (**Figure 7**). This finding raises the possibility that MEF2 influences goat litter size. However, in mouse, MEF2 was expressed in the testis throughout development but was absent in the ovary (Daems et al., 2014), which suggests that the impact of the 16 bp indel on litter size might not be caused by MEF2. In addition, some intronic variations could be in perfect LD with known phenotype-associated mutations (Nakaoka et al., 2016). For example, a 40 bp indel variant residing in the mouse double minute 2 homolog (MDM2) gene promoter is in complete LD with a SNP (rs2279744), and the SNP locus has been demonstrated to be associated with susceptibility to several cancers. Through linkage with the SNP locus, this indel locus had a positive association with the risk of colon cancer (Gansmo et al., 2016). Whether the 16 bp intronic indel influences phenotype through linkage with causal mutations needs further study.

# CONCLUSION

In this study, goat KDM6A mRNA was expressed in all tissues tested (heart, liver, spleen, lung, kidney, muscle, brain, skin, and testis), and the expression levels in testis increased significantly through the mitosis-to-meiosis transition. Meanwhile, two novel intronic indels of 16 and 5 bp were identified, and only the 16 bp indel was significantly associated with first-born litter size (P < 0.01). Additionally, the 16 bp indel had a significant effect on KDM6A gene expression. These findings provide a basis for further research on the underlying causal mutation and on the application of MAS to goat breeding.

# ETHICS STATEMENT

On the basis of the experimental animal management measures of Shaanxi Province (016000291szfbgt-2011-000001), all experimental procedures were approved by the Review Committee for the Use of Animal Subjects of Northwest A&F University.

Animal experimentation, including sample collection, was performed in agreement with the ethical commission's guidelines.

# AUTHOR CONTRIBUTIONS

YC, XL, and CP conceived the idea and wrote the manuscript. KW, HX, JL, HZ, and LQ collected the goat samples and isolated genomic DNA. YC, HY, and XZ performed the experiments. YC, HY, and HX analyzed the data. All authors approved the final version of the manuscript for submission.

# FUNDING

This work was funded by the National Natural Science Foundation of China (No.31760650;

# REFERENCES


No.31172184) and Provincial Key Projects of Shaanxi (2014KTDZ02-01).

# ACKNOWLEDGMENTS

We greatly thanked the staffs of Shaanbei white cashmere goat breeding farm, Shaanxi province, P.R. China for their collecting samples.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00091/full#supplementary-material

Supplement Figure 1 | The original image of the electrophoresis diagrams. (A) 16 bp indel locus. (B) 5 bp indel locus.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Cui, Yan, Wang, Xu, Zhang, Zhu, Liu, Qu, Lan and Pan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome-Wide Association Studies Identify Candidate Genes for Coat Color and Mohair Traits in the Iranian Markhoz Goat

Anahit Nazari-Ghadikolaei<sup>1</sup> , Hassan Mehrabani-Yeganeh<sup>1</sup> \*, Seyed R. Miarei-Aashtiani<sup>1</sup> , Elizabeth A. Staiger<sup>2</sup> , Amir Rashidi<sup>3</sup> and Heather J. Huson<sup>2</sup> \*

<sup>1</sup> Department of Animal Science, College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran, <sup>2</sup> Department of Animal Science, Cornell University, Ithaca, NY, United States, <sup>3</sup> Department of Animal Science, Faculty of Agriculture Engineering, University of Kurdistan, Sanandaj, Iran

### Edited by:

Max F. Rothschild, Iowa State University, United States

### Reviewed by:

Tosso Leeb, Universität Bern, Switzerland Brian Kirkpatrick, University of Wisconsin–Madison, United States

### \*Correspondence:

Hassan Mehrabani-Yeganeh hmehrbani@ut.ac.ir Heather J. Huson hjh3@cornell.edu

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 27 November 2017 Accepted: 16 March 2018 Published: 04 April 2018

### Citation:

Nazari-Ghadikolaei A, Mehrabani-Yeganeh H, Miarei-Aashtiani SR, Staiger EA, Rashidi A and Huson HJ (2018) Genome-Wide Association Studies Identify Candidate Genes for Coat Color and Mohair Traits in the Iranian Markhoz Goat. Front. Genet. 9:105. doi: 10.3389/fgene.2018.00105

The Markhoz goat provides an opportunity to study the genetics underlying coat color and mohair traits of an Angora type goat using genome-wide association studies (GWAS). This indigenous Iranian breed is valued for its quality mohair used in ceremonial garments and has the distinction of exhibiting an array of coat colors including black, brown, and white. Here, we performed 16 GWAS for different fleece (mohair) traits and coat color in 228 Markhoz goats sampled from the Markhoz Goat Research Station in Sanandaj, Kurdistan province, located in western Iran using the Illumina Caprine 50K beadchip. The Efficient Mixed Model Linear analysis was used to identify genomic regions with potential candidate genes contributing to coat color and mohair characteristics while correcting for population structure. Significant associations to coat color were found within or near the ASIP, ITCH, AHCY, and RALY genes on chromosome 13 for black and brown coat color and the KIT and PDGFRA genes on chromosome 6 for white coat color. Individual mohair traits were analyzed for genetic association along with principal components that allowed for a broader perspective of combined traits reflecting overall mohair quality and volume. A multitude of markers demonstrated significant association to mohair traits highlighting potential candidate genes of POU1F1 on chromosome 1 for mohair quality, MREG on chromosome 2 for mohair volume, DUOX1 on chromosome 10 for yearling fleece weight, and ADGRV1 on chromosome 7 for grease percentage. Variation in allele frequencies and haplotypes were identified for coat color and differentiated common markers associated with both brown and black coat color. This demonstrates the potential for genetic markers to be used in future breeding programs to improve selection for coat color and mohair traits. Putative candidate genes, both novel and previously identified in other species or breeds, require further investigation to confirm phenotypic causality and potential epistatic relationships.

Keywords: genome-wide association study, coat color, mohair, fleece, Markhoz, goat, Angora, wattles

# INTRODUCTION

fgene-09-00105 March 30, 2018 Time: 16:17 # 2

Throughout history, goats have played a vital role in the livelihood of humans, being a main source of meat, milk, fiber, and hides, especially in harsh environmental conditions. Archeological evidence supports goat domestication around 10,000 years ago in the Zagros Mountain region in Iran (Zeder and Hesse, 2000) with the UN Food and Agriculture Organization (FAO) estimating 25 million goats in the country today (FAOSTAT, 2008). Many goat breeds, including the Iranian Markhoz breed, have adapted to climates with extremely high temperatures, low rainfall, and low humidity. Yet it is the mohair or long, silky hair of the Markhoz goat that makes this indigenous breed unique. Mohair is generally associated with the popular Angora breed and as such, the Markhoz goat is oftentimes considered an Angora goat. The Markhoz goat is the only mohair-producing breed within the Kurdistan province in the west of Iran. A prominent feature of this breed is the coat color variation, which can be dark to light brown, black, gray, or white. This coat color variation is unique among Angora goats, which are predominantly selected for a white coat color (Rashidi et al., 2006). The mohair, particularly the brown color from the Markhoz goats, is often used to make clothing for important cultural ceremonies, especially weddings. Unfortunately, due to a reduction in population size, a genetic bottleneck has been observed in the Markhoz population, reducing diversity (Rashidi et al., 2015). Cross breeding and inbreeding are additional concerns for all local breeds, as inbreeding is likely increasing as population sizes diminish and innate characteristics of the indigenous breeds may be lost during admixture with other breeds (Hanotte et al., 2010).

While the Markhoz goat is indigenous to Iran, Angora goats are now found predominantly in South Africa, the United States, and Argentina, with smaller herds in other countries like Turkey, Australia, and New Zealand (Mohair South Africa, 2017). Mohair production is unique to these animals in that they are the only single coated breeds in which the primary and secondary hair follicles produce the same fiber (Mohair South Africa, 2017). Adult Angora goats are typically shorn twice a year in the aforementioned major production countries, providing 2–2.5 kg of mohair per goat (Mohair South Africa, 2017). Mohair is a type of fiber collected from Markhoz and Angora goats that is exceptionally soft, has a high luster, and is used worldwide in the textile and clothing industries. The softness of mohair and other quality traits are determined based on the hair's diameter, kemp, and medullated and greasy fiber content. Kemp is an undesirable fiber characteristic that causes irregular dyeing properties and a coarse appearance due to the medullated hair fibers having a core of air-filled cells and coarse medulla. The less kemp, the better the mohair quality and true fiber percentage of the fleece. Increased greasy fiber percentage or kemp percentage is unfavorable for industry purposes, yet both qualities are important adaptive mechanisms which protect the mohair against humidity and environmental contamination such as dirt and vegetation (Mohair South Africa, 2017). A softer hair fiber has a smaller diameter with less kemp, and less medullated and greasy fiber content. True fiber is considered pure fiber without kemp, while fiber efficiency equates to clean fleece without any grease content.

There are few genome-wide studies for production and disease related traits in goats as compared to other mammals (Zidi et al., 2014; Becker et al., 2015; Lan et al., 2015; Reber et al., 2015; Martin P. et al., 2016; Martin P.M. et al., 2016; Menzi et al., 2016). To date, goats are not included in a Quantitative Trait Loci (QTL) database (Hu et al., 2016). However, linkage analysis has identified QTL regions for coefficient of variation of fiber diameter, kemp fiber, discontinuous medullated fiber, staple length, fleece weight, fiber diameter, and comfort factor spinning fineness in Angora goats and for fleece yield in Cashmere goats (Cano et al., 2007, 2009; Visser et al., 2011; Roldan et al., 2014). The keratin (KRT) and keratin associated protein (KRTAP) family genes located on chromosomes 1 and 5 were highlighted as candidate genes potentially responsible for diameter and kemp traits in Angora goats and have been previously confirmed in sheep (Parsons et al., 1994; Cano et al., 2007). More recently, whole-genome sequences of two Chinese breeds of cashmere goats were compared for signatures of selection, identifying genes and biological pathways potentially related to cashmere production (Li et al., 2017). Gene editing of the fibroblast growth factor 5 (FGF5) gene in goat embryos resulted in an increased number of secondary hair follicles and longer fiber length, which would suggest greater cashmere production (Wang et al., 2016).

The vast majority of Markhoz goats have a brown coat color, likely due to the higher value of this color fiber for cultural events. Coat color pigmentation is a polygenic trait with genes often having epistatic interactions (Sturm et al., 2001). Well-known genes involved in coat color include melanocyte-stimulating hormone receptor (MC1R) and agouti signaling protein (ASIP or Agouti), which have a consistent effect across many species. The MC1R gene plays a key role in melanin synthesis and the concentration of eumelanin or pheomelanin, which can lead to the black/brown or red/yellow phenotype, respectively. There are several studies that have investigated MC1R in cattle and sheep for coat color patterns (Klungland et al., 1995; Joerg et al., 1996; Våge et al., 1999; Fontanesi et al., 2009a; Switonski et al., 2013). Similarly, mutations in the MC1R gene have been associated with various coat color patterns in the Girgentana, Maltese, Derivata di Siria, Murciano-Granadina, Camosciata delle Alpi and Saanen goats as well (Fontanesi et al., 2009a). ASIP has an epistatic effect with MC1R and can reduce MC1R activity to generate more pheomelanin by preventing cAMP production. Yellow or pheomelanin pigmentation is a result of a dominant allele (A) at the ASIP locus, while the recessive allele (a) produces eumelanin, resulting in the black/brown phenotype (Adalsteinsson et al., 1994). In Saanen goats, the dominant AWt (white/tan) allele seems to be responsible for the white coat color (Martin P.M. et al., 2016). In sheep, a gene duplication in the ASIP gene is responsible for white and black phenotypes (Norris and Whan, 2008). Another gene for coat color is proto-oncogene receptor tyrosine kinase (KIT), which plays a key role for different white color patterns in the pig, cat, cow, mouse, horse, rabbit, dog, and camel (Geissler et al., 1988; Marklund et al., 1998; Pielberg et al., 2002; Haase et al., 2009, 2015; Fontanesi et al., 2010a,b, 2014; Wong et al., 2013; David et al., 2014; Durig et al., 2017; Holl et al., 2017).

The aim of this study was to characterize mohair and coat color phenotypes in Markhoz goats, and identify candidate genes likely influencing these characteristics by exploring the entire genome. Traits included true fiber percentage, grease percentage, kemp percentage, efficiency, diameter, staple length, mature fleece weight and yearling fleece weight. Additionally, we performed principal component analysis (PCA) of the mohair traits to combine qualities into a single variable. The use of PCA generated two new traits with PC1 reflecting fiber quality and PC2 reflecting fiber volume. Candidate genes were identified through genome-wide association of genetically characterized goats using the Illumina Caprine 50K beadchip (Tosser-Klopp et al., 2014). Genome-wide association studies (GWAS) were conducted with significant associations identifying putative candidate genes including ASIP and KIT genes for brown/black and white coat color, respectively. These studies support the established roles of well-known genes such as ASIP and KIT in coat color, and identify novel genes such as melanoregulin (MREG), which potentially influence mohair quality. Additionally, specific genetic markers and haplotypes are identified which show potential for use in genetic selection schemes for coat color. This provides a foundation to further investigate the biological pathways and causative mutations influencing industry-valued qualities of mohair and the biological implication to animal adaptation.

# MATERIALS AND METHODS

# Animals and Phenotypes

Coat color and seven fleece traits from a total of 228 Markhoz goats (44 males and 184 females) were sampled at the Markhoz goat Research Station in Sanandaj, Kurdistan province, located in western Iran. All animal procedures were approved by the Cornell University Institutional Animal Care and Use Committee prior to sampling (protocol #2014-0121), and were conducted in a manner to minimize animal stress and handling. Goats ranged in age from 1 to 7 years old. Fleece traits included diameter, kemp percentage, staple length, true fiber percentage, efficiency, grease percentage, and fleece weight. True fiber is considered pure fiber without kemp. Fiber efficiency is equivalent to the clean fleece with no grease and can be calculated by subtracting clean fleece weight from greasy fleece weight and dividing this total by the greasy fleece weight. Coat color was classified into three different categories: brown (n = 168), black (n = 26), or white (n = 26; **Figure 1**). Coat colors have been routinely recorded by the research station since 1992. Photographs were taken of each animal at sample collection according to the USDA\_AGIN Goat Sample Protocol (USDA, 2016).

# Statistical Analysis

All animals were assessed for data quality and completeness. Statistical analysis was performed using proc GLM in SAS studio university edition (SAS Institute Inc., Cary, NC, United States) for each fleece trait to determine their relationship to variables such as sex, age (from 1 to 7 years), dam's age (2 to 8 years), type of birth (single, twin, or triplet) and color (black, brown, or white) for subsequent inclusion as covariates in the GWAS. The specific number of animals used in each GWAS is denoted in **Table 1** and varies based on the number of animals with quality trait information and the GWAS model chosen. We also performed least square means and used the Tukey method for comparing means of males and females for yearling fleece weight and comparing coat colors (black, brown, and white) for fiber volume. PCA was performed on seven of the fiber traits for animals with complete records across all fiber traits using JMP PRO 12 (SAS Institute Inc., Cary, NC, United States). The correlation matrix was applied due to the wide variation in quantitative measures of each trait. PCA was used to generate single quantitative variables combining the seven mohair traits with principal components retained for interpretation and analysis if the eigenvalue score was greater than 1.0.
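As an illustration of the PCA step described above, the following is a minimal sketch (not the JMP procedure itself) of a PCA on the correlation matrix with components retained when their eigenvalue exceeds 1.0. The function name `pca_on_correlation`, the simulated data, and the scaling of loadings as eigenvectors multiplied by the square root of the eigenvalue are assumptions made for this example.

```python
import numpy as np


def pca_on_correlation(traits, min_eigenvalue=1.0):
    """PCA of a trait table (rows = animals, columns = traits) via its correlation matrix."""
    X = np.asarray(traits, dtype=float)
    # Standardize each trait so that traits measured on different scales contribute equally.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    # Retain only components whose eigenvalue exceeds the threshold.
    keep = eigvals > min_eigenvalue
    # Loadings here are trait-component correlations (an assumed convention).
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    scores = Z @ eigvecs[:, keep]
    return eigvals, explained, loadings, scores


# Simulated placeholder data: 138 animals x 7 traits (not the study data).
rng = np.random.default_rng(0)
demo = rng.normal(size=(138, 7))
demo[:, 1] += 0.8 * demo[:, 0]
demo[:, 4] += 0.6 * demo[:, 3]
eigvals, explained, loadings, scores = pca_on_correlation(demo)
print(np.round(explained, 3))
```

Working from the correlation matrix is equivalent to standardizing each trait first, which is why it is the usual choice when traits such as fiber diameter and fleece weight are recorded in different units.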

# Genotyping and Quality Control

Whole blood (5 ml) was obtained via the jugular vein into vacutainers with the anticoagulant K2EDTA for subsequent DNA extraction. Genomic DNA was extracted following a standard Phenol-Chloroform extraction protocol (Sambrook and Russell, 2006). All samples were genotyped on the Illumina Caprine 50K beadchip (Illumina, Inc., San Diego, CA, United States) at VHL Genetics (VHL Genetics, Wageningen, Netherlands). The initial 53,347 SNPs were assessed for quality and removed if they had a call rate less than 0.9 (n = 624) or a minor allele frequency less than 0.03 (n = 2540). An additional 419 SNPs, unassigned to a chromosome position, were removed, leaving 49,764 SNPs for analysis. Five samples with a genotyping call rate less than 0.9 were subsequently removed. To evaluate population structure and relatedness, we used an identity-by-state (IBS) similarity matrix to calculate genome-wide identity-by-descent (IBD) estimates. Three animals were removed due to having an estimated IBS score greater than 0.90, denoting substantial relatedness. Genotype quality control was conducted using Golden Helix SVS v8.3.4 (Golden Helix, Bozeman, MT, United States).
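The SNP filters described above can be sketched in a few lines. This is an illustrative example only, not the Golden Helix SVS procedure: it assumes genotypes coded as 0/1/2 copies of one allele with `NaN` for missing calls, and the function name `snp_qc` and the toy matrix are hypothetical. The final lines simply re-check the SNP counts reported in the text.

```python
import numpy as np


def snp_qc(genotypes, min_call_rate=0.9, min_maf=0.03):
    """Flag SNPs (columns) passing call-rate and minor-allele-frequency filters."""
    G = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(G)).sum(axis=0)
    call_rate = n_called / G.shape[0]
    # Frequency of the counted allele among called genotypes (two alleles per animal).
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = np.nansum(G, axis=0) / (2.0 * n_called)
        maf = np.minimum(freq, 1.0 - freq)
        keep = (call_rate >= min_call_rate) & (maf >= min_maf)
    return keep, call_rate, maf


# Toy matrix: 10 animals x 4 SNPs (assumed 0/1/2 coding, NaN = missing call).
demo = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [2, np.nan, 0, 0],
    [1, 1, 0, 0],
    [0, np.nan, 0, 0],
    [1, 2, 0, 0],
    [2, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 2, 0, 0],
    [1, 1, 0, 0],
], dtype=float)
keep, call_rate, maf = snp_qc(demo)
print(keep)  # SNP 1 fails on call rate, SNP 2 is monomorphic

# Consistency check of the counts reported in the text.
remaining = 53_347 - 624 - 2_540 - 419
print(remaining)
```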

# Genome-Wide Association Studies

TABLE 1 | Genome-wide association studies conducted for coat color and mohair traits in Markhoz goats.

The parameters for each trait association including the model of best fit, number of individuals assessed, and covariates used are stated.

Multiple genome-wide tests were performed for coat color and mohair traits (**Table 1**) using Golden Helix SVS v8.3.4 (Golden Helix, Bozeman, MT, United States). Two hundred and twenty individuals were included in the coat color GWAS and 138 individuals were included in the mohair trait GWAS, including 179 females and 41 males, and 115 females and 23 males, respectively. Quantitative or case-control associations were used in an Efficient Mixed Model Linear analysis (EMMAX) (Kang et al., 2010) to correct for remaining population structure and relatedness by including a genomic relationship matrix as a random effect in the model. Coat color GWAS were performed in a case-control study design comparing the identified coat color to all other coat colors combined, including brown compared to black and white, black compared to brown and white, and white compared to black and brown. Additional GWAS were evaluated with smaller sample sizes to compare single coat colors to one another (e.g., brown compared to black). Variation in brown coat color was considered but small sample size and insufficient differentiation between color variations precluded an association analysis. Association studies with the covariates of sex (male or female), age (1–7 years old), dam's age (2–8 years old), and type of birth (single or twins) were considered in additive, dominant, and recessive inheritance models. Quantile–quantile (QQ) plots were used to determine the model of best fit for each trait.
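The mixed-model association described in this section can be summarized schematically as follows. This is a generic sketch of an EMMAX-style single-SNP model rather than the exact specification fitted in SVS, and all symbols are notation introduced here for illustration.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{s}\,\alpha + \mathbf{u} + \boldsymbol{\varepsilon},
\qquad
\mathbf{u} \sim N\!\left(\mathbf{0},\, \sigma_g^{2}\mathbf{K}\right),
\qquad
\boldsymbol{\varepsilon} \sim N\!\left(\mathbf{0},\, \sigma_e^{2}\mathbf{I}\right)
```

Here $\mathbf{y}$ is the phenotype vector (a quantitative measure or a 0/1 case-control code), $\mathbf{X}\boldsymbol{\beta}$ holds the fixed covariates such as sex or age, $\mathbf{s}$ is the genotype code for the SNP being tested with effect $\alpha$, and $\mathbf{K}$ is the genomic relationship matrix. Under the usual conventions, $\mathbf{s}$ counts minor alleles (0, 1, 2) for the additive model, equals 1 for carriers of at least one minor allele for the dominant model, and equals 1 only for minor-allele homozygotes for the recessive model.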

Quantitative measures were used in the GWAS for true fiber, grease percentages, and yearling fleece weight. For the remaining mohair traits in which no statistically significant loci were observed using a quantitative variable, a case/control model was then applied using a threshold based on the median or quartile values to compare representatives demonstrating the greatest degree of phenotypic variation within the group. For traits that did not surpass an adjusted Bonferroni significance cutoff or an adjusted false discovery rate (FDR) significance cutoff of 0.05, we investigated significance using adaptive permutation for the model of best fit using PLINK v1.9 (Chang et al., 2015). Adaptive permutation evaluates the genomic dataset more quickly in that it discards SNPs which are not demonstrating association from further permutations, while continuing to analyze associated SNPs to the set threshold. Adaptive permutation output provides both the number of permutations achieved and the corresponding P-value. Parameters used in the adaptive permutation testing included a minimum of five permutations performed but no more than 1,000,000 to determine significance using a confidence interval of 0.0001, alpha threshold of 0, intercept interval for pruning of 1, and slope interval of 0.001 for pruning (Che et al., 2014). Linkage disequilibrium (LD) structure and haplotypes were examined between associated markers using HAPLOVIEW v4.2 to assist in candidate gene identification (Barrett et al., 2005). Haplotype blocks were defined using the algorithm from Gabriel et al. (2002). Putative candidate gene(s) within one million base pairs upstream or downstream or within LD blocks of significantly associated SNPs were identified based on the GCF\_001704415.1 (ARS1) assembly in Genome Data Viewer on National Center for Biotechnology Information (NCBI) (Bickhart et al., 2017).
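To make the two multiple-testing cutoffs mentioned above concrete, the following minimal sketch shows the standard Bonferroni threshold and the standard Benjamini-Hochberg step-up FDR procedure. It is an assumption that these textbook forms correspond to the "adjusted" cutoffs applied in the study; the function names and the example P-values are hypothetical.

```python
import numpy as np


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test P-value cutoff that controls the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of tests declared significant at FDR level alpha (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest P-value with alpha * i / m.
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    significant = np.zeros(m, dtype=bool)
    if below.any():
        # Reject every hypothesis up to the largest rank that meets its threshold.
        k = np.nonzero(below)[0].max()
        significant[order[:k + 1]] = True
    return significant


# With the 49,764 SNPs retained after quality control:
threshold = bonferroni_threshold(49_764)
print(f"{threshold:.3e}")

# Hypothetical P-values, only to show the step-up behaviour.
mask = benjamini_hochberg([0.001, 0.008, 0.039, 0.035, 0.9])
print(mask)
```

In the toy example the third-ranked P-value (0.035) misses its own rank threshold but is still declared significant, because a higher-ranked value (0.039) meets its threshold; this is the property that distinguishes the step-up procedure from a simple per-rank comparison.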

# RESULTS

# Statistical Analysis

Descriptive statistics of mean, standard deviation, minimum, and maximum for the seven fleece traits and yearling fleece weight are shown in **Table 2**. Covariate usage was determined using proc GLM in SAS for each GWAS (**Table 1**). No significant covariates were identified for kemp percentage or greasy fleece percentage. Sex was significant for diameter and true fiber percentage with a P-value < 0.0001, and for efficiency with a P-value < 0.05. Age was significant as a covariate for diameter and staple length with a P-value < 0.05, and for fleece weight with a P-value < 0.0001. Color was significant for diameter, true fiber, staple length and fleece weight with a P-value < 0.05. The least squares mean for yearling fleece weight was 392.4 ± 25.53 in males and 298.29 ± 18.69 in females (Tukey adjusted P-value < 0.0001). We normalized yearling fleece weight based on yearling body weight due to the positive correlation between these two traits.

<sup>1</sup>The number of individuals assessed.

To further investigate the level of variation of fiber traits in the Markhoz goats, we performed PCA. Principal components 1 and 2 cumulatively accounted for 56.1% of the total variance and were retained for further analysis. The specific fiber traits with absolute loading values greater than 0.4 were used to broadly describe the variation represented by each principal component. Principal component 1 (PC1) accounted for 29.9% of the total variance with the fiber traits for diameter and kemp percentage loading strongly on the positive side, while true fiber percentage loaded negatively (**Figure 2A**). PC1 broadly describes individual fiber quality with negatively scoring individuals having desirable mohair traits such as increased true fiber percentage and efficiency as the fiber diameter decreases and kemp percentage is reduced. Principal component 2 (26.2%) generally describes fiber volume, with fleece weight and staple length loading positively in contrast to kemp percentage loading negatively (**Figure 2B**). Animals with a positive PC2 score will have longer, thicker fibers supporting greater fleece weight. Coat color was significantly associated with PC2 (P-value < 0.05). Least squares means for PC2 scores were 11.45 ± 2.03 in black animals, 11.22 ± 2.05 in brown animals, and 12.13 ± 2.11 in white animals (P-value < 0.0001). White animals, particularly in comparison to brown animals, demonstrated a higher value of PC2 (Tukey adjusted P-value < 0.05), which corresponded to white animals producing a greater volume of fiber than colored animals.

# Coat Color Genome-Wide Associations

Genome-wide association studies were performed for the coat colors of black, brown, and white utilizing an additive inheritance case-control model for each (**Figures 3A–C** and Supplementary Figures S1, S2). Utilizing the broader approach of comparing the coat color of interest (case) to all individuals not presenting that coat color (control), QTL were identified with proposed candidate genes (**Figures 3A–C**). Pairwise comparisons of each coat color also identified significantly associated markers despite the low sample sizes with results supporting the broader analysis (Supplementary Figure S2 and Supplementary Table S2). Significant regions associated with black coat color were located on chromosomes 1, 6, 13, 18, 19, and 25, with EMMAX P-values ranging from 1.20 × 10<sup>−05</sup> to 9.62 × 10<sup>−15</sup> (Supplementary Table S1). Loci for brown coat color were identified on chromosomes 13, 19, and 25, with EMMAX P-values ranging from 1.62 × 10<sup>−06</sup> to 5.18 × 10<sup>−09</sup> (Supplementary Table S1). The assessment of white coat color produced the greatest number of results with significant associations on 11 different chromosomes (passing both Bonferroni and FDR cutoffs) with EMMAX P-values ranging from 9.68 × 10<sup>−05</sup> to 5.12 × 10<sup>−12</sup> (Supplementary Table S1).

FIGURE 2 | Principal Component Analysis (PCA) depicting the loading values (x-axis) of seven mohair traits (y-axis) for PC1 (A) and PC2 (B). PC1 (blue) broadly describes "mohair quality" as indicated by the positive correlation between diameter and kemp percentage, while having negative correlation to true fiber percentage. PC2 (red) reflects fleece volume with a positive correlation between increased staple length and fleece weight.

The most significantly associated SNP for both the black and brown GWAS (**Figures 3A,B**) is located on chromosome 13 (snp55189-scaffold849-226217) within the Adenosylhomocysteinase (AHCY) gene. This SNP is within a 465 Kb block of LD encompassing AHCY, ASIP, and RALY heterogeneous nuclear ribonucleoprotein (RALY) genes, among others (**Figures 4A,B**). The second most associated SNP for black and third associated SNP for brown is also located on chromosome 13 (snp55186-scaffold849-83968) but within the itchy E3 ubiquitin protein ligase (ITCH) gene, also found in this same LD block (**Figures 4A,B**). The SNP on chromosome 25 falls within the sidekick cell adhesion molecule1 (SDK1) gene for both brown and black animals. The minor alleles and minor allele frequencies associated with these SNPs are presented in Supplementary Table S1. The pairwise comparison of black to brown supported these findings with variation in allele frequencies highlighting the same QTLs on chromosomes 13 and 25 (Supplementary Figure S2A and Supplementary Table S2). A new QTL was highlighted on chromosome 7 which narrowly missed the FDR threshold in the broader black comparison.

Despite the common association of multiple SNPs for both brown and black coat color, genotypic frequencies and haplotype association reflect specific color descriptions. Nine haplotypes, incorporating seven SNPs, were identified within the 465 Kb LD block which included the above mentioned SNPs on chromosome 13 (**Figure 4C**). In total, five of the haplotypes were significantly associated with both black and brown coat color. Of the associated haplotypes, four were observed at a higher frequency in black animals while one was seen at a higher frequency in brown animals. A sixth haplotype was associated only with brown coat color and found at a higher frequency in brown animals. The percentage of each haplotype frequency found within black or brown coat color animals is depicted in **Figure 4D** and Supplementary Table S3. Genotypic frequencies for the individually associated SNP on chromosome 25 similarly reflect coat color variation between black and brown (**Figure 5A**).

For white animals, significant SNPs were detected on chromosomes 1, 2, 5, 6, 11, 12, 13, 18, 20, 24, and 26 with EMMAX P-values ranging from 9.68 × 10<sup>−05</sup> to 5.12 × 10<sup>−12</sup> (Supplementary Table S1). The most associated SNP (snp18579-scaffold1878-601002) is located on chromosome 26 (**Figure 3C**). However, the second associated SNP (snp58053-scaffold94-3724779) is on chromosome 6, relatively close (15,450 bp upstream) to the RAS like family 11 member B (RASL11B) gene but, potentially more importantly, it is 1.6 Mb from the KIT gene (Supplementary Table S1). Association to the KIT gene is also supported by the association of snp58103-scaffold94-5833878, which is only 370 Kb downstream (Supplementary Table S1). Genotypic frequencies for these three SNPs are presented in **Figures 5B–D**. Pairwise comparisons of white to black and white to brown (Supplementary Figures S2B,C, respectively) largely reflect the multiple signals found in the broader analysis of white coat color. This is particularly evident in the comparison of the brown to white individuals. However, the comparison of black to white individuals identified many new SNPs primarily on the same chromosomes and near the same regions. Novel QTL were identified on chromosomes 8 and 28 when comparing black to white individuals but yielded no candidate genes, whereas the broader white comparison showed unique QTL on chromosomes 10, 11, 18, and 25.

# Mohair Trait Genome-Wide Association Studies

We performed eight GWASs for the mohair traits. Only GWAS for true fiber, efficiency, grease percentage and yearling fleece weight could surpass our Bonferroni or FDR corrected P-value of less than 0.05 (Supplementary Table S1 and Supplementary Figure S1). GWAS for PC1, PC2, diameter, staple length and mature fleece weight did not surpass the Bonferroni or FDR cutoff, but following one million permutations, candidate regions were identified for each trait (Supplementary Table S1 and Supplementary Figure S1).

The GWAS for PC1 representing fiber quality identified nine loci on chromosomes 1, 6, 8, 15, and 18 with permutated P-values ranging from 3.18 × 10<sup>−05</sup> to 1.53 × 10<sup>−06</sup> (**Figure 6A** and Supplementary Table S1). SNPs related to true fiber percentage were identified on chromosomes 24 and X with EMMAX P-values ranging from 8.78 × 10<sup>−6</sup> to 3.55 × 10<sup>−8</sup> (**Figure 6B**). For fiber diameter (**Figure 6C**), four associated SNPs on chromosomes 13 and 27 passed one million permutations, and additional SNPs on chromosomes 1 and 6 reached 999,991 and 826,000 permutations, respectively, with P-values from 4.36 × 10<sup>−5</sup> to 1.00 × 10<sup>−6</sup>. SNPs on chromosomes 6 and 13 were located within the secreted protein acidic and cysteine rich (SPARC) and solute carrier family 24 member 3 (SLC24A3) genes (Supplementary Table S1). There were only two significantly associated SNPs for kemp percentage, on chromosomes 7 and 9, with EMMAX P-values of 7.41 × 10<sup>−07</sup> and 7.83 × 10<sup>−08</sup>, respectively (**Figure 6D**).

Permutation testing of PC2 scores reflecting fiber volume identified six SNPs that passed one million permutations on chromosomes 1, 2, 6, and 12 with permutated P-values ranging from 3.79 × 10<sup>−05</sup> to 6.00 × 10<sup>−06</sup> (**Figure 6E**). The SNPs on chromosome 2 fell within the coiled-coil domain containing 148 (CCDC148) gene. SNPs for mature fleece weight, measured at the sampling age of the goats (1–8 years old), were unable to reach one million permutations; however, a SNP on chromosome 16 reached 734,732 permutations (**Figure 6F**) and is located within the uncharacterized LOC102169208 gene (Supplementary Table S1). Staple length produced a SNP on chromosome 29 with a permutated P-value of 1.58 × 10<sup>−05</sup> after reaching 886,838 permutations (**Figure 6G**) that fell within the potassium voltage-gated channel subfamily Q member 1 (KCNQ1) gene. The GWAS for efficiency only identified one significantly associated SNP on chromosome 4 with an EMMAX P-value of 6.38 × 10<sup>−07</sup> (**Figure 6H**) within the inner mitochondrial membrane peptidase subunit 2 (IMMP2L) gene. Grease percentage had multiple significantly associated SNPs on chromosomes 1, 2, 7, 16, and 19 with EMMAX P-values ranging from 8.36 × 10<sup>−06</sup> to 1.72 × 10<sup>−08</sup> (**Figure 6I** and Supplementary Table S1). The GWAS for yearling fleece weight identified one associated SNP on chromosome 10 with EMMAX P-value 5.21 × 10<sup>−07</sup> (**Figure 6J**) within the sorbitol dehydrogenase (SORD) gene (Supplementary Table S1).

from NCBI Genome Data Viewer (https://www.ncbi.nlm.nih.gov/genome/gdv/browser/?acc=GCF\_001704415.1&context=genome), accessed 10/11/2017).

# DISCUSSION

Markhoz goats are one of the few mohair-producing breeds and, due to their important cultural and economic roles, have unique coat color diversity with selection toward the brown coat color as opposed to the traditionally white Angora mohair. Thus, it would be extremely valuable to identify genetic variants and the underlying genes related to coat color and mohair traits. Recently, the release of an improved goat genome assembly (Bickhart et al., 2017) and the caprine 50K SNP beadchip (Tosser-Klopp et al., 2014) have offered more opportunities to examine the genetics of economically important traits in the goat. Here, we have identified multiple QTL and several putative candidate genes associated with coat color and mohair traits through genome-wide association that warrant further investigation for causative effect and potential use for genomic selection.

# Coat Color Loci

Association mapping for the black and brown coat colors separately identified a major locus on chromosome 13 (**Figures 3A,B**). Based on LD, the target region expanded to include the AHCY, ASIP, RALY, and ITCH genes (**Figures 4A,B**). The ASIP gene is well known for its role in coat color across several species and for its epistatic interaction with the MC1R gene (Graham et al., 1997). Genetic variations within ASIP have been associated with coat color variation in other goat breeds such as the Saanen breed. Fontanesi et al. (2009b) reported that the dominant AWt allele can lead to white color in Saanen goats, while a comparison of eight different goat breeds identified numerous missense mutations in the ASIP gene and a copy number variant (CNV) in the ASIP and AHCY genes (Fontanesi et al., 2009b). The CNV is suggested to be responsible for introducing the AWt allele into both the Girgentana and Saanen goat breeds in a similar manner as observed previously in sheep (Norris and Whan, 2008). A GWAS for the pink and pink neck phenotypes in Saanen goats also identified a QTL near the ASIP gene (Martin P.M. et al., 2016). Given the dominant effect of the ASIP allele for light coat color in Saanen goats and sheep, we suspect the recessive ASIP allele may contribute to the darker coat color of the Markhoz goats, but further testing is needed.
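To make the idea of expanding a target region "based on LD" concrete, the sketch below computes pairwise r<sup>2</sup> between SNPs and extends a window outward from a lead SNP while neighboring SNPs remain in strong LD with it. This is only an illustration under stated assumptions: the function names `pairwise_r2` and `ld_window`, the 0/1/2 genotype coding, and the r<sup>2</sup> threshold of 0.8 are hypothetical and are not taken from the methods of this study.

```python
import numpy as np


def pairwise_r2(genotypes):
    """Squared Pearson correlation between SNP columns.

    genotypes: animals x SNPs matrix of allele counts (0, 1, or 2),
    with SNPs ordered by position. Monomorphic SNPs would yield NaN.
    """
    g = np.asarray(genotypes, dtype=float)
    r = np.corrcoef(g, rowvar=False)
    return r ** 2


def ld_window(r2, lead_index, threshold=0.8):
    """Contiguous block of SNPs around the lead SNP with r2 >= threshold.

    Returns (left, right) as inclusive column indices.
    """
    left = lead_index
    while left > 0 and r2[lead_index, left - 1] >= threshold:
        left -= 1
    right = lead_index
    while right < r2.shape[0] - 1 and r2[lead_index, right + 1] >= threshold:
        right += 1
    return left, right
```

Genes whose coordinates overlap the returned window would then be treated as positional candidates, which is how a single lead SNP can implicate several neighboring genes.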

While ASIP is likely the major contributor to brown and black coat color in the Markhoz goat, the strong LD within the region identified specific haplotypes for brown vs. black coat color and suggests the AHCY, RALY, and ITCH genes could play regulatory roles in coat color. Significantly associated SNPs fell within each of these genes. Norris and Whan (2008) reported that a duplication containing the coding regions of ASIP and AHCY, and the ITCH promoter site is responsible for white coat color in sheep. The ITCH gene plays a role in apoptosis in melanoma cells (Yang et al., 2010). Melanoma is also a well-known skin cancer related to skin pigmentation. Individuals with fairer skin color, lighter hair and eye color are at higher risk for melanoma (Bradford, 2009) which supports a potential relationship between the ITCH gene and coat color, potentially contributing to the brown color variation observed. In horses, a duplication of the syntaxin 17 (STX17) gene is responsible for the gray coat color and melanoma, with increased susceptibility to melanoma in horses homozygous for the recessive ASIP allele (Rosengren Pielberg et al., 2008), indicating it is plausible for the ITCH gene to have similar pleiotropic effects. While further study is required to unravel the molecular interactions of the region in the Markhoz goats, the haplotypes we have identified for coat color (**Figures 4C,D**) could be valuable today in the genetic selection of black and brown animals.

A prior study of brown coat color in goats identified a nonsynonymous variant in the TYRP1 gene region on chromosome 8 (Becker et al., 2015). However, we were unable to identify this region within our own sample population. We did identify a SNP within the RALY gene that is present only in our black animals. The lethal yellow Ay allele of ASIP is known to disrupt the structure and expression of the RALY gene (Michaud et al., 1993). The RALY gene has also been associated with the saddle tan phenotype in black and tan Basset Hounds and Pembroke Welsh Corgis (Dreger et al., 2013). Therefore, we suspect that RALY, together with the ASIP gene, could have a role in black coat color.

Association mapping for white coat color identified 98 significantly associated SNPs spanning 11 chromosomes, substantially more than for either black or brown coat color. Retrospective analysis of the dataset showed that all 26 white animals also had wattles (Supplementary Figure S3). Wattles are hair-covered appendages consisting of skin, blood vessels, muscle, and core cartilage with an unclear biological function (Imagawa et al., 1994). None of our black or brown animals had wattles; therefore, we cannot determine whether our results are associated with white coat color or with the presence of wattles. We suspect we have captured regions associated with both traits, as genes within the regions have roles in keratinocyte differentiation, tissue morphogenesis, and coat color.

KIT and platelet derived growth factor receptor alpha (PDGFRA), both on chromosome 6, are the most promising genes we detected for white coat color (**Figure 3C**). To date, no study has associated the KIT gene with coat color in goats despite several studies in other species identifying KIT's role in coat color (Marklund et al., 1998; Pielberg et al., 2002; Fontanesi et al., 2010a,b, 2014; Wong et al., 2013; David et al., 2014; Yan et al., 2014; Holl et al., 2017). In pigs, the PDGFRA gene has been shown to be tightly associated with the dominant white coat color (Johansson et al., 1992; Johansson Moller et al., 1996). Within the same region as KIT and PDGFRA is the KDR gene, well known for playing a role in angiogenesis, vascular development, and hematopoiesis regulation (Risau, 1997; Gogat et al., 2004). The gene complex of KDR, KIT, and PDGFRA has been associated with the reddening coat color pattern in Angus cattle (Hanna et al., 2014). While KDR was identified as a putative candidate for coat color in the cattle study, we suspect KDR is more likely contributing to wattle vascular development in our goats based on its known role in angiogenesis. The fact that these three genes are in the same region, with KIT and PDGFRA likely influencing white coat color and KDR possibly contributing to wattle development, could explain why wattles are only present in the white goats within our dataset.

Additional candidate genes functionally related to pigmentation, hair growth, and keratinocyte differentiation have a less obvious influence on white coat color; they suggest either complex regulation of the trait or a relationship with the development of wattles or increased fiber volume instead. Indeed, PC2 was directly linked to white animals, which produced more fiber. This further complicates the interpretation of genetic signatures associated with white animals but provides some insight as to why the GWAS for white produced so many QTL.

The GWAS for white coat color and PC2 independently highlighted overlapping QTL on chromosome 2, with the midpoints of the respective QTL 312 Kb apart. Within this region were the MREG gene, which regulates melanosome transfer and whose inhibition results in skin lightening (Wu et al., 2012), and the ATP-binding cassette sub-family A (ABC1), member 12 (ABCA12) gene, which regulates keratinocyte differentiation and epidermal lipid transportation (Akiyama, 2014). The fibronectin 1 (FN1) gene, which regulates tissue morphogenesis (Foolen et al., 2016), is also in this region but seems a more likely candidate for wattle development.

The following genes were highlighted in QTL for white coat color and, retrospectively, wattles. The RAB11 family interacting protein 2 (RAB11FIP2) gene is thought to suppress the internalization of epidermal growth factor receptors (EGFR) (Cullis et al., 2002), which activate hair growth in both the mouse and human (Moore et al., 1981; Mak and Chan, 2003), while the protein tyrosine phosphatase receptor type K (PTPRK) gene can be regulated by the TGF-β pathway, which decreases EGFR activation in human primary keratinocytes (Xu et al., 2015). The nuclear factor of activated T cells 1 (NFATC1) gene plays a role in the regulation of skin tumorigenesis via DMBA metabolism, in which loss of expression of this gene decreases skin tumorigenesis (Goldstein et al., 2015). The glutaredoxin and cysteine rich domain containing 1 (GRXCR1) gene is known to influence hair cell development, with protein expression in the sensory epithelia of the inner ear. Mutations in GRXCR1 have been linked to hearing loss in both mice and humans (Odeh et al., 2010).

When focusing solely on candidate genes plausible for wattle development, we identified the previously mentioned FN1, as well as the sarcoglycan gamma (SGCG), and iroquois homeobox 2 (IRX2). Mutations in SGCG have been associated with muscle degradation (El Kerch et al., 2014) and IRX2 plays a role in digit formation (Zulch et al., 2001). Thus, we hypothesized that these genes may influence wattle development due to their roles in tissue morphogenesis, muscle, and digit development, respectively. Our data did not reveal a QTL in the FMN1/GREM1 region on chromosome 10 which was previously associated with wattle formation in a genome-wide analysis of nine Swiss goat breeds (Reber et al., 2015). Breed variation or our confounding overlap of white coat color and fiber volume traits may have influenced the different results. In general, further studies are needed to decipher the roles of these genes for which their functional annotation suggests potential roles in coat color, wattle formation, and/or fiber volume.

# Mohair Traits

Two different strategies were applied for mapping mohair traits in our goat population. First, PCA was used to group overlapping mohair qualities correlating to broader characteristics such as fiber quality and volume, which were then mapped using the resulting principal components 1 and 2, respectively. Second, we mapped each individual fiber trait. This mapping strategy was planned for three reasons: (1) to identify QTL and candidate genes related to broader characteristics of mohair quality and quantity which are economically important, (2) to compare the results from the PCA GWAS to individual trait mapping to look for potential overlapping regions, and (3) identify regions driving the regulation of specific mohair traits.
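The first strategy, in which correlated fiber measurements are collapsed into principal components that are then mapped as phenotypes, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `principal_component_scores`, the standardization of each trait to unit variance, and the use of a singular value decomposition are assumptions made for the example.

```python
import numpy as np


def principal_component_scores(traits, n_components=2):
    """PCA of an animals x traits matrix via SVD of standardized traits.

    Returns per-animal scores, the proportion of variance explained by
    each retained component, and the trait loadings of each component.
    """
    x = np.asarray(traits, dtype=float)

    # Standardize so traits on different scales (e.g. percentages and
    # fleece weights) contribute equally; an assumed preprocessing step.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components], vt[:n_components]
```

Under this sketch, `scores[:, 0]` and `scores[:, 1]` would play the roles of the PC1 and PC2 phenotypes, and the loadings indicate which individual traits drive each component. Note that the sign of a component is arbitrary, so the direction of a correlation between a trait and a PC depends on the decomposition.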

Principal component 1 described overall fiber quality, with true fiber scores being negatively correlated to increased kemp and larger diameter. Permutation mapping of PC1 identified a region on chromosome 1 near the POU class 1 homeobox 1 (POU1F1) gene. This gene is known to play an important role in wool production in sheep and in greasy fiber percentage and staple length in cashmere goats (Lan et al., 2009; Zeng et al., 2011; Sun et al., 2013). Additionally, other studies have demonstrated that POU1F1 has some effect on growth and milk production in other mammals (Mura et al., 2012; Ozmen et al., 2014; Sadeghi et al., 2014). This is the first study to apply PCA to fiber traits for a broader perspective of the genetic regulation of overall mohair quality and quantity.

Association mapping for the individual traits of true fiber percentage and kemp percentage did not identify genes with obvious roles in hair or fiber characteristics. However, association studies on both PC1 and fiber diameter highlighted the same region on chromosome 1 close to the POU1F1 gene described above. With PC2, we were able to describe overall fleece volume related to fleece weight and staple length. Candidate genes such as MREG and ABCA12 were previously described as they overlapped with white coat color QTL. ABCA12 has a more plausible relationship with fiber volume as it regulates keratinocyte differentiation and epidermal lipid transportation. Laminin subunit beta 3 (LAMB3) and hydroxysteroid 11-beta dehydrogenase 1 (HSD11B1) genes, which influence hair morphogenesis and dermatitis in both mice and humans, respectively, were identified as the most likely candidates for influencing mature fleece weight within the associated QTL (Imanishi et al., 2014; Terao et al., 2016). While the GWAS for fiber efficiency highlighted the novel gene IMMP2L, previously unassociated with fiber traits, the association mapping for greasy fleece percentage produced more intriguing results. These included the adhesion G protein-coupled receptor V1 (ADGRV1) gene, which has a role in the development of auditory hair bundles in mice and is related to Usher syndrome, a highly heritable disease with various symptoms including hearing loss and vision impairment (McGee et al., 2006; Kahrizi et al., 2014; Yang et al., 2016). It would be of interest to explore the role of auditory hairs, which collect wax within the ear canal, and the incidence of deafness among the goats. Notably, both ADGRV1 and GRXCR1 are associated with auditory hair development, and mutations in both are linked to hearing loss. Another candidate is the histamine N-methyltransferase (HNMT) gene, which has a link to skin lesions in mice (Furukawa et al., 2009). As these genes are involved in hair and skin disorders, they might influence fiber development and fiber traits both directly and indirectly via different pathways.

Lastly, the GWAS for yearling fleece weight, which is related to a finer mohair yield due to the younger age of the animals, was conducted. As the goats increase in age, the mohair fiber becomes coarser as the fiber diameter increases (Mohair South Africa, 2017). The SORD, dual oxidase 1 (DUOX1), dual oxidase maturation factor 2 (DUOXA2), dual oxidase 2 (DUOX2), dual oxidase maturation factor 1 (DUOXA1), and arginine:glycine amidinotransferase (GATM) genes on chromosome 10 reside in the associated QTL for yearling fleece weight. The SORD gene is regulated by androgens and expressed in epithelial cells (Szabo et al., 2010). Coincidentally, our data showed a positive correlation between increased yearling fleece weight and the male sex, which we hypothesize may be due to differential regulation of the SORD gene. Studies in humans have reported that DUOX1 plays a role in the expression levels of normal keratinocytes (Choi et al., 2014). As keratinocytes produce keratin, the main protein for hair, nail, and skin synthesis, we hypothesize that this gene may be associated with additional fiber development. Although not related to fiber, a deficiency in the GATM gene is related to an autosomal-recessive disorder with varying symptoms including myopathy (Choe et al., 2013; Stockler-Ipsiroglu et al., 2015). The analysis of yearling fleece weight was normalized as it was positively correlated with yearly weight, likely reflecting muscle mass.

In all, many of our QTL differed from previous mapping studies for fiber related traits (Cano et al., 2007, 2009; Mohammad Abadi et al., 2009; Visser et al., 2011; Roldan et al., 2014). Disagreement in our findings compared to these studies is likely due to breed differences as well as marker placement and density used in the analysis. Fine mapping and expression studies of some of the unique fiber quality related genes such as POU1F1, ADGRV1, MREG, LAMB3, and HSD11B1 may lead to new insight into biological pathways influencing hair development and growth. In contrast, our QTL related to coat color highlighted candidate genes extensively documented for influencing color patterns in goats as well as a variety of other species. Future fine mapping of the identified regions, especially the ASIP, RALY, KIT, and PDGFRA genes, is needed to identify causal mutations or other structural phenomena such as CNVs that are contributing to coat color of the Markhoz breed. Unexpectedly, we were also able to highlight genes potentially influencing the presence of wattles, which currently remain a biological mystery.

Despite advances in the genome assembly and tool development in several species over the last 10 years, there have been few genome scale studies for traits considered economically important in goats, particularly in Angora and Cashmere goats. This is the first genetic study to identify regions associated with coat color and fiber traits in the Markhoz breed as well as potential SNPs and haplotypes for genetic selection of coat color. The relatively small cohort of goats investigated was a limiting factor yet provided biological insight for these traits and a foundation to further genomic research on coat color and mohair traits in goats.

# DATA AVAILABILITY

All SNP genotype data are available at the publicly accessible Zenodo repository (https://zenodo.org/record/1198730#.Wq lmbcPwZdg).

# AUTHOR CONTRIBUTIONS

All authors were engaged in the development of the overall research plan and assisted with research advisement. AN-G was the lead researcher performing data collection, statistical and genomic analysis, result interpretation, and drafting of the manuscript. AR provided assistance with sample and data collection of the Markhoz goats. HM-Y and SM-A provided primary advisement during sample collection and laboratory support for DNA extraction. ES assisted with the genomic methodology, data analysis, result interpretation, and initial manuscript draft. HH managed the genotyping and directed the genomic data analysis, result interpretation, and was the primary editor of the manuscript.

# FUNDING

Funding for the genomic research and analysis was supported by the laboratory of HH.

# ACKNOWLEDGMENTS

Animal sampling was made possible through the cooperation of the Markhoz goat performance testing station staff and through the assistance of Mr. Jafari, Rashid, Kakehkhani and also Ms. Karimi for providing mohair traits.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00105/full#supplementary-material

# REFERENCES


FAOSTAT (2008). Available at: http://faostat.fao.org/default.aspx

Fontanesi, L., Beretti, F., Riggio, V., Dall'Olio, S., González, E. G., Finocchiaro, R., et al. (2009a). Missense and nonsense mutations in melanocortin 1 receptor (MC1R) gene of different goat breeds: association with red and black coat colour phenotypes but with unexpected evidences. BMC Genet. 10:47. doi: 10.1186/1471-2156-10-47


Johansson, M., Ellegren, H., Marklund, L., Gustavsson, U., Ringmar-Cederberg, E., Andersson, K., et al. (1992). The gene for dominant white color in the pig is closely linked to ALB and PDGFRA on chromosome 8. Genomics 14, 965–969. doi: 10.1016/S0888-7543(05)80118-1

Johansson Moller, M., Chaudhary, R., Hellmen, E., Hoyheim, B., Chowdhary, B., and Andersson, L. (1996). Pigs with the dominant white coat color phenotype carry a duplication of the KIT gene encoding the mast/stem cell growth factor receptor. Mamm. Genome 7, 822–830. doi: 10.1007/s003359900244

Kahrizi, K., Bazazzadegan, N., Jamali, L., Nikzat, N., Kashef, A., and Najmabadi, H. (2014). A novel mutation of the USH2C (GPR98) gene in an Iranian family with Usher syndrome type II. J. Genet. 93, 837–841. doi: 10.1007/s12041-014-0443-3

Kang, H. M., Sul, J. H., Service, S. K., Zaitlen, N. A., Kong, S. Y., and Freimer, N. B. (2010). Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354. doi: 10.1038/ng.548

Klungland, H., Vage, D. I., Gomez-Raya, L., Adalsteinsson, S., and Lien, S. (1995). The role of melanocyte-stimulating hormone (MSH) receptor in bovine coat color determination. Mamm. Genome 6, 636–639. doi: 10.1007/bf00352371

Lan, R., Zhu, L., and Yao, X.-R. (2015). Genome-wide association study of lambing number in goat. Acta Vet. Zootechn. Sin. 46, 549–554.

Lan, X. Y., Shu, J. H., Chen, H., Pan, C. Y., Lei, C. Z., Wang, X., et al. (2009). A PstI polymorphism at 3′UTR of goat POU1F1 gene and its effect on cashmere production. Mol. Biol. Rep. 36, 1371–1374. doi: 10.1007/s11033-008-9322-4

Li, X., Su, R., Wan, W., Zhang, W., Jiang, H., Qiao, X., et al. (2017). Identification of selection signals by large-scale whole-genome resequencing of cashmere goats. Sci. Rep. 7:15142. doi: 10.1038/s41598-017-15516-0

Mak, K. K., and Chan, S. Y. (2003). Epidermal growth factor as a biologic switch in hair growth cycle. J. Biol. Chem. 278, 26120–26126. doi: 10.1074/jbc.M212082200

Marklund, S., Kijas, J., Rodriguez-Martinez, H., Ronnstrand, L., Funa, K., Moller, M., et al. (1998). Molecular basis for the dominant white phenotype in the domestic pig. Genome Res. 8, 826–833. doi: 10.1101/gr.8.8.826

Martin, P., Palhière, I., Tosser-Klopp, G., and Rupp, R. (2016). Heritability and genome-wide association mapping for supernumerary teats in French Alpine and Saanen dairy goats. J. Dairy Sci. 99, 8891–8900. doi: 10.3168/jds.2016-11210

Martin, P. M., Palhiere, I., Ricard, A., Tosser-Klopp, G., and Rupp, R. (2016). Genome wide association study identifies new loci associated with undesired coat color phenotypes in Saanen goats. PLoS One 11:e0152426. doi: 10.1371/journal.pone.0152426

McGee, J., Goodyear, R. J., McMillan, D. R., Stauffer, E. A., Holt, J. R., Locke, K. G., et al. (2006). The very large G-protein-coupled receptor VLGR1: a component of the ankle link complex required for the normal development of auditory hair bundles. J. Neurosci. 26, 6543–6553. doi: 10.1523/jneurosci.0693-06.2006

Menzi, F., Keller, I., Reber, I., Beck, J., Brenig, B., Schütz, E., et al. (2016). Genomic amplification of the caprine EDNRA locus might lead to a dose dependent loss of pigmentation. Sci. Rep. 6:28438. doi: 10.1038/srep28438

Michaud, E. J., Bultman, S. J., Stubbs, L. J., and Woychik, R. P. (1993). The embryonic lethality of homozygous lethal yellow mice (Ay/Ay) is associated with the disruption of a novel RNA-binding protein. Genes Dev. 7, 1203–1213. doi: 10.1101/gad.7.7a.1203

Mohair South Africa (2017). Available at: http://www.mohair.co.za/page/mohair\_knowledge\_and\_information\_database

Mohammad Abadi, M. R., Askari, N., Baghizadeh, A., and Esmailizadeh, A. K. (2009). A directed search around caprine candidate loci provided evidence for microsatellites linkage to growth and cashmere yield in Rayini goats. Small Rumin. Res. 81, 146–151. doi: 10.1016/j.smallrumres.2008.12.012

Moore, G. P., Panaretto, B. A., and Robertson, D. (1981). Effects of epidermal growth factor on hair growth in the mouse. J. Endocrinol. 88, 293–299. doi: 10.1677/joe.0.0880293

Mura, M. C., Daga, C., Paludo, M., Luridiana, S., Pazzola, M., Bodano, S., et al. (2012). Analysis of polymorphism within POU1F1 gene in relation to milk production traits in dairy Sarda sheep breed. Mol. Biol. Rep. 39, 6975–6979. doi: 10.1007/s11033-012-1525-z

Norris, B. J., and Whan, V. A. (2008). A gene duplication affecting expression of the ovine ASIP gene is responsible for white and black sheep. Genome Res. 18, 1282–1293. doi: 10.1101/gr.072090.107

Odeh, H., Hunker, K. L., Belyantseva, I. A., Azaiez, H., Avenarius, M. R., Zheng, L., et al. (2010). Mutations in Grxcr1 are the basis for inner ear dysfunction in the pirouette mouse. Am. J. Hum. Genet. 86, 148–160. doi: 10.1016/j.ajhg.2010.01.016

Ozmen, O., Kul, S., and Unal, E. O. (2014). Polymorphism of sheep POU1F1 gene exon 6 and 3′UTR region and their association with milk production traits. Iran J. Vet. Res. 15, 331–335.

Parsons, Y. M., Cooper, D. W., and Piper, L. R. (1994). Evidence of linkage between high-glycine-tyrosine keratin gene loci and wool fibre diameter in a Merino half-sib family. Anim. Genet. 25, 105–108. doi: 10.1111/j.1365-2052.1994.tb00088.x

Pielberg, G., Olsson, C., Syvanen, A. C., and Andersson, L. (2002). Unexpectedly high allelic diversity at the KIT locus causing dominant white color in the domestic pig. Genetics 160, 305–311.

Rashidi, A., Mokhtari, M. S., and Gutiérrez, J. P. (2015). Pedigree analysis and inbreeding effects on early growth traits and greasy fleece weight in Markhoz goat. Small Rumin. Res. 124, 1–8. doi: 10.1016/j.smallrumres.2014.12.011

Rashidi, A., Ramazanian, M., and Torshizi, R. V. (2006). Genetic Parameter Estimates for Growth Traits and Fleece Weight in Markhoz Goats. Minas Gerais: Instituto Prociência.

Reber, I., Keller, I., Becker, D., Flury, C., Welle, M., and Drogemuller, C. (2015). Wattles in goats are associated with the FMN1/GREM1 region on chromosome 10. Anim. Genet. 46, 316–320. doi: 10.1111/age.12279

Risau, W. (1997). Mechanisms of angiogenesis. Nature 386, 671–674. doi: 10.1038/386671a0

Roldan, D., Debenedetti, S., Cano, E. M., Taddeo, H. R., and Poli, M. A. (2014). "Preliminar refined localization of QTL for fleece traits in five goat chromosomes using SNP markers in a backcross population," in Proceedings of the 10th World Congress of Genetics Applied to Livestock Production, Vancouver, BC, 885.

Rosengren Pielberg, G., Golovko, A., Sundstrom, E., Curik, I., Lennartsson, J., Seltenhammer, M. H., et al. (2008). A cis-acting regulatory mutation causes premature hair graying and susceptibility to melanoma in the horse. Nat. Genet. 40, 1004–1009. doi: 10.1038/ng.185

Sadeghi, M., Jalil-Sarghale, A., and Moradi-Shahrbabak, M. (2014). Associations of POU1F1 gene polymorphisms and protein structure changes with growth traits and blood metabolites in two Iranian sheep breeds. J. Genet. 93, 831–835. doi: 10.1007/s12041-014-0438-0

Sambrook, J., and Russell, D. W. (2006). Purification of nucleic acids by extraction with phenol:chloroform. CSH Protoc. 2006:pdb.prot4455. doi: 10.1101/pdb.prot4455

Stockler-Ipsiroglu, S., Apatean, D., Battini, R., DeBrosse, S., Dessoffy, K., Edvardson, S., et al. (2015). Arginine:glycine amidinotransferase (AGAT) deficiency: Clinical features and long term outcomes in 16 patients diagnosed worldwide. Mol. Genet. Metab. 116, 252–259. doi: 10.1016/j.ymgme.2015.10.003

Sturm, R. A., Teasdale, R. D., and Box, N. F. (2001). Human pigmentation genes: identification, structure and consequences of polymorphic variation. Gene 277, 49–62. doi: 10.1016/S0378-1119(01)00694-1

Sun, W., Ni, R., Yin, J. F., Musa, H. H., Ding, T., and Chen, L. (2013). Genome array of hair follicle genes in lambskin with different patterns. PLoS One 8:e68840. doi: 10.1371/journal.pone.0068840

Switonski, M., Mankowska, M., and Salamon, S. (2013). Family of melanocortin receptor (MCR) genes in mammals—mutations, polymorphisms and phenotypic effects. J. Appl. Genet. 54, 461–472. doi: 10.1007/s13353-013-0163-z

Szabo, Z., Hamalainen, J., Loikkanen, I., Moilanen, A. M., Hirvikoski, P., Vaisanen, T., et al. (2010). Sorbitol dehydrogenase expression is regulated by androgens in the human prostate. Oncol. Rep. 23, 1233–1239.

Terao, M., Itoi, S., Matsumura, S., Yang, L., Murota, H., and Katayama, I. (2016). Local glucocorticoid activation by 11beta-hydroxysteroid dehydrogenase 1 in keratinocytes: the role in hapten-induced dermatitis. Am. J. Pathol. 186, 1499–1510. doi: 10.1016/j.ajpath.2016.01.014

Tosser-Klopp, G., Bardou, P., Bouchez, O., Cabau, C., Crooijmans, R., Dong, Y., et al. (2014). Design and characterization of a 52K SNP chip for goats. PLoS One 9:e86227. doi: 10.1371/journal.pone.0086227

USDA (2016). Available at: https://www.ars.usda.gov/office-of-internationalresearch-programs/ars-international-action-agin-iv/

Våge, D. I., Klungland, H., Lu, D., and Cone, R. D. (1999). Molecular and pharmacological characterization of dominant black coat color in sheep. Mamm. Genome 10, 39–43. doi: 10.1007/s003359900939


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Nazari-Ghadikolaei, Mehrabani-Yeganeh, Miarei-Aashtiani, Staiger, Rashidi and Huson. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome-Wide Association Analyses Highlight the Potential for Different Genetic Mechanisms for Litter Size Among Sheep Breeds

Song-Song Xu<sup>1,2†</sup>, Lei Gao<sup>3,4†</sup>, Xing-Long Xie<sup>1,2</sup>, Yan-Ling Ren<sup>5</sup>, Zhi-Qiang Shen<sup>5</sup>, Feng Wang<sup>6</sup>, Min Shen<sup>3,4</sup>, Emma Eyþórsdóttir<sup>7</sup>, Jón H. Hallsson<sup>7</sup>, Tatyana Kiseleva<sup>8</sup>, Juha Kantanen<sup>9</sup> and Meng-Hua Li<sup>1,2</sup>\*

### Edited by:

Joram Mwashigadi Mwacharo, International Center for Agricultural Research in the Dry Areas (ICARDA), Ethiopia

### Reviewed by:

Shahin Eghbalsaied, Islamic Azad University, Iran; Clare A. Gill, Texas A&M University, United States; David Wragg, The University of Edinburgh, United Kingdom; Mourad Rekik, International Center for Agricultural Research in the Dry Areas (ICARDA), Jordan

### \*Correspondence:

Meng-Hua Li, menghua.li@ioz.ac.cn

†These authors have contributed equally to this work.

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 13 December 2017 Accepted: 23 March 2018 Published: 10 April 2018

### Citation:

Xu S-S, Gao L, Xie X-L, Ren Y-L, Shen Z-Q, Wang F, Shen M, Eyþórsdóttir E, Hallsson JH, Kiseleva T, Kantanen J and Li M-H (2018) Genome-Wide Association Analyses Highlight the Potential for Different Genetic Mechanisms for Litter Size Among Sheep Breeds. Front. Genet. 9:118. doi: 10.3389/fgene.2018.00118

<sup>1</sup> CAS Key Laboratory of Animal Ecology and Conservation Biology, Institute of Zoology, Chinese Academy of Sciences (CAS), Beijing, China, <sup>2</sup> College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China, <sup>3</sup> Institute of Animal Husbandry and Veterinary Medicine, Xinjiang Academy of Agricultural and Reclamation Science, Shihezi, China, <sup>4</sup> State Key Laboratory of Sheep Genetic Improvement and Healthy Breeding, Xinjiang Academy of Agricultural and Reclamation Science, Shihezi, China, <sup>5</sup> Shandong Binzhou Academy of Animal Science and Veterinary Medicine Academy, Binzhou, China, <sup>6</sup> Institute of Sheep and Goat Science, Nanjing Agricultural University, Nanjing, China, <sup>7</sup> Faculty of Natural Resources and Environmental Sciences, Agricultural University of Iceland, Borgarnes, Iceland, <sup>8</sup> All-Russian Research Institute of Genetics and Farm Animal Breeding, Russian Academy of Sciences, Moscow, Russia, <sup>9</sup> Production Systems, Natural Resources Institute Finland, Jokioinen, Finland

Reproduction is an important trait in sheep breeding as well as in other livestock. However, despite its importance, the genetic mechanisms of litter size in domestic sheep (Ovis aries) are still poorly understood. To explore genetic mechanisms underlying the variation in litter size, we conducted multiple independent genome-wide association studies in five sheep breeds of high prolificacy (Wadi, Hu, Icelandic, Finnsheep, and Romanov) and one breed of low prolificacy (Texel) using the Ovine Infinium HD BeadChip. We identified different sets of candidate genes associated with litter size in different breeds: BMPR1B, FBN1, and MMP2 in Wadi; GRIA2, SMAD1, and CTNNB1 in Hu; NCOA1 in Icelandic; INHBB, NF1, FLT1, PTGS2, and PLCB3 in Finnsheep; ESR2 in Romanov; and ESR1, GHR, ETS1, MMP15, FLI1, and SPP1 in Texel. Further annotation of genes and bioinformatics analyses revealed that different biological pathways could be involved in the variation in litter size of females: hormone secretion (FSH and LH) in Wadi and Hu, placenta and embryonic lethality in Icelandic, folliculogenesis and LH signaling in Finnsheep, ovulation and preovulatory follicle maturation in Romanov, and estrogen and follicular growth in Texel. Taken together, our results provide new insights into the genetic mechanisms underlying the prolificacy trait in sheep and other mammals, suggesting targets for selection where the aim is to increase prolificacy in breeding projects.

Keywords: sheep, prolificacy, genome-wide association study, biological pathways, regulation

# INTRODUCTION

Reproduction is one of the most important traits in livestock production, particularly for females. Selection for higher prolificacy in domestic sheep (Ovis aries) has led to variable litter size (LS) within and among breeds. For example, individual litter sizes of 1 to 8 have been recorded in Hu sheep and Finnsheep (Yue, 1996; Davis et al., 2006a).

Previous studies reported that the exceptional prolificacy of the Booroola Merino was attributable to a single major gene, while a number of mutations with a major effect on litter size have been identified in other sheep breeds (**Table 1**; see also Xu and Li, 2017). Vage et al. (2013) detected a mutation, FecG<sup>F</sup>, in the gene GDF9 that was strongly associated with litter size in Norwegian White Sheep and Finnish Landrace (Finnsheep) using a genome-wide association analysis. Demars et al. (2013) reported the mutations FecX<sup>Gr</sup> in Grivette sheep and FecX<sup>O</sup> in Olkuska sheep, which were associated with the highly prolific phenotype in a genome-wide association analysis. Cao et al. (2016) found that nine candidate genes, including the well-known FecB mutation, played important roles in the variable litter size in Hu and Small-tailed Han sheep using methylated DNA-immunoprecipitation sequencing data. Miao et al. (2016) identified a set of differentially expressed genes (e.g., FecB) between low- and high-prolificacy breeds (Dorset vs. Small-tailed Han sheep) through an integrated analysis of miRNAs and lncRNAs. Lassoued et al. (2017) found the mutation FecX<sup>Bar</sup> to be associated with prolificacy in Tunisian Barbarine sheep. Despite its great importance, the genetic mechanisms of the high prolificacy trait in domestic sheep are still poorly understood, partly because few studies have been conducted across multiple prolific sheep breeds. To date, numerous fecundity-associated mutations have been identified in different sheep breeds, but very few mutations have been consistently detected across breeds. Although the reproduction of ewes can be affected by the complex interactions of environmental conditions (i.e., climate, density, and food abundance) (Wilson et al., 2009), previous studies suggested that genetic factors could play important roles in the variable litter size of ewes.

In this study, we conducted multiple independent genome-wide association studies (GWAS) on litter size in sheep breeds of high (Wadi, Hu, Icelandic, Finnsheep, and Romanov) and low (Texel) prolificacy, with litter sizes ranging from 1 to 6, from different geographic regions (**Figure 1A**) and genetic origins (**Figure 1B**) of the world. Wadi sheep is a high-prolificacy native breed from the Shandong Province of China (Peng et al., 2017). Hu sheep is famous for early sexual maturity and high fecundity, and is distributed in the Taihu Lake area of Eastern China (Yue, 1996). Icelandic sheep and Finnsheep (Finnish Landrace) are northern European high-fecundity breeds (Mullen and Hanrahan, 2014; Eiriksson and Sigurdsson, 2017). Romanov sheep from the Volga Valley show outstanding reproductive qualities: early sexual maturity, out-of-season breeding and extraordinary prolificacy (Deniskova et al., 2017). The Texel sheep is a relatively low-prolificacy breed originally from the island of Texel in the Netherlands and excels in muscle growth and lean carcasses (Casas et al., 2004). Our results will be important for further genetic improvement of the trait and for a better understanding of the molecular basis of reproduction in sheep as well as other mammals.

# MATERIALS AND METHODS

# Sample Collection and Phenotyping

A total of 522 ewes from five sheep breeds of high prolificacy (Wadi, n = 160; Hu, n = 117; Icelandic, n = 54; Finnsheep, n = 54; and Romanov, n = 78) and one breed of low prolificacy (Texel, n = 59) were sampled from farms in China, Iceland, Finland, and Russia (**Figure 1A**). The animals included were as unrelated as possible based on analysis of pedigree records and farmers' knowledge. Data for the phenotype of litter size and the total number of litters collected from farm records are shown in **Figure 2**. The litter size ranged from 1 to 6 across parities 1 to 11 in the six sheep breeds. Genomic DNA was extracted from ear marginal tissue following a standard phenol/chloroform method (Köchl et al., 2005) and was diluted to 50 ng/µl for the SNP BeadChip genotyping, except for the Icelandic samples, for which DNA was isolated from whole blood using the MasterPure™ Complete DNA Purification Kit (Epicentre Biotech) following the manufacturer's protocol.

# Genotyping and Quality Control

All the samples were genotyped using the Ovine Infinium HD BeadChip according to the manufacturer's protocol. Genotypes of a total of 606,006 SNPs were obtained (genotype and phenotype datasets<sup>1</sup>). We implemented quality control of these SNPs using PLINK v1.07 (Purcell et al., 2007). SNPs or individuals were excluded if they met any of the following criteria: (1) no chromosomal or physical location, (2) call rate < 0.95, (3) missing genotype frequency > 0.05, and/or (4) minor allele frequency (MAF) < 0.05. SNPs were also excluded from the analysis if the p-value of Fisher's exact test for Hardy–Weinberg equilibrium was less than 0.001.
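The call-rate and MAF filters listed above can be illustrated with a short sketch. This is not the PLINK implementation: the genotype coding (0, 1, or 2 copies of one allele, `None` for a missing call) and the function names `call_rate`, `minor_allele_frequency`, and `passes_qc` are assumptions made for this example; only the two thresholds (0.95 and 0.05) are taken from the text.

```python
from typing import List, Optional

# Assumed coding: 0, 1 or 2 copies of one allele; None marks a missing call.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype at one SNP."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes: List[Genotype]) -> float:
    """Frequency of the rarer allele among the non-missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def passes_qc(genotypes: List[Genotype],
              min_call_rate: float = 0.95,
              min_maf: float = 0.05) -> bool:
    """Keep a SNP only if it meets both the call-rate and the MAF threshold."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


if __name__ == "__main__":
    rare_snp = [0] * 19 + [1]   # MAF = 1/40 = 0.025, below the 0.05 cut-off
    common_snp = [0, 1] * 10    # MAF = 10/40 = 0.25
    print(passes_qc(rare_snp), passes_qc(common_snp))
```

The Hardy–Weinberg filter is left out of the sketch because an exact test needs more code than fits a short illustration.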

# Genetic Relationships and Population Structure

<sup>1</sup>https://www.animalgenome.org/repository/pub/CAAS2018.0302/

To investigate the genetic relationships and population structure among the six sheep breeds, we calculated the global FST, constructed a neighbor-joining (NJ) tree, and performed principal component analysis (PCA). The global FST value was calculated using GENEPOP v4.2 (Raymond and Rousset, 1995). The genetic distances between populations were calculated using an identity by state (IBS) similarity matrix (Kang et al., 2010). These distances were then used to construct an NJ tree with 1000 bootstraps using the package PHYLIP v.3.695 (Felsenstein, 1989). In addition, PCA was conducted on the genotype data using the SmartPCA program from the EIGENSOFT package version 4.2 (Patterson et al., 2006).
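To make the IBS measure concrete, the sketch below computes the average proportion of alleles shared identical by state between two individuals. It is an illustration under stated assumptions: the 0/1/2 genotype coding, the use of 1 − IBS as a distance, and the function names are choices made for this example, and the text does not describe how individual values were aggregated into between-population distances.

```python
from typing import List, Optional


def ibs_similarity(a: List[Optional[int]], b: List[Optional[int]]) -> float:
    """Mean proportion of alleles shared IBS over SNPs called in both animals.

    Genotypes are assumed to be coded as 0, 1 or 2 copies of one allele, so
    two genotypes share 2 - |ga - gb| alleles out of 2 at each SNP.
    """
    shared = 0
    compared = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip SNPs with a missing call in either animal
        shared += 2 - abs(ga - gb)
        compared += 1
    if compared == 0:
        raise ValueError("no SNP is called in both individuals")
    return shared / (2 * compared)


def ibs_distance(a: List[Optional[int]], b: List[Optional[int]]) -> float:
    """Distance defined here as 1 - IBS similarity (an assumption)."""
    return 1.0 - ibs_similarity(a, b)


if __name__ == "__main__":
    ewe_1 = [0, 1, 2, None]
    ewe_2 = [1, 1, 0, 2]
    print(ibs_similarity(ewe_1, ewe_2), ibs_distance(ewe_1, ewe_2))
```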

# Genome-Wide Association Analysis

To explore genetic structure within the breeds, multidimensional scaling (MDS) analysis was performed based on independent SNPs using PLINK v1.07. First, we applied the option 'indep-pairwise 50 5 0.05' in PLINK v1.07, which calculates pairwise linkage disequilibrium (LD) in a 50-SNP window shifted in steps of five SNPs. If the LD estimate was r<sup>2</sup> > 0.05, one SNP of the pair was removed (Purcell et al., 2007). The independent SNPs retained by the LD criterion were then used in the MDS analysis, and the results were plotted using the GenABEL package in R v3.2.2 (Aulchenko et al., 2007).
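The windowed LD pruning described above can be sketched as follows. This is a simplified greedy version written for illustration, not PLINK's code: r<sup>2</sup> is taken as the squared Pearson correlation of 0/1/2 genotype codes, the later SNP of a correlated pair is always the one dropped (PLINK's own rule for choosing which SNP to remove may differ), missing genotypes are not handled, and the function names are hypothetical.

```python
from statistics import mean
from typing import List


def r_squared(x: List[int], y: List[int]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (sxy * sxy) / (sxx * syy)


def ld_prune(snps: List[List[int]], window: int = 50, step: int = 5,
             threshold: float = 0.05) -> List[int]:
    """Return indices of SNPs kept after greedy pruning in sliding windows."""
    keep = [True] * len(snps)
    for start in range(0, len(snps), step):
        stop = min(start + window, len(snps))
        in_window = [i for i in range(start, stop) if keep[i]]
        for pos, i in enumerate(in_window):
            if not keep[i]:
                continue  # already removed earlier in this window
            for j in in_window[pos + 1:]:
                if keep[j] and r_squared(snps[i], snps[j]) > threshold:
                    keep[j] = False  # drop the later SNP of the pair
    return [i for i, kept in enumerate(keep) if kept]


if __name__ == "__main__":
    snp_a = [0, 1, 2, 0, 1, 2]
    snp_b = [0, 1, 2, 0, 1, 2]  # identical to snp_a, so r^2 = 1
    snp_c = [1, 0, 1, 1, 2, 1]  # uncorrelated with snp_a, so r^2 = 0
    print(ld_prune([snp_a, snp_b, snp_c]))
```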

We performed genome-wide association studies within five sheep breeds of high prolificacy (Wadi, Hu, Icelandic, Finnsheep, and Romanov) and one breed of low prolificacy (Texel) using a case/control design. We ranked all individuals within each breed according to their litter size from the highest to the lowest. Then, we selected individuals from the two tails for each breed as 'cases' and 'controls,' respectively. Based on the distribution of phenotypes, 114 samples (LS ≥ 2) in Wadi, 66 samples (LS ≥ 2) in Hu, 20 samples (LS > 2) in Icelandic, 37 samples (LS ≥ 2.5) in Finnsheep, 40 samples (LS ≥ 2.5) in Romanov and 28 samples (LS ≥ 1.6) in Texel sheep were selected as 'cases,' while 28 samples (LS = 1) in Wadi, 15 samples (LS = 1) in Hu, 15 samples (LS ≤ 1.75) in Icelandic, 9 samples (LS ≤ 2) in Finnsheep, 26 samples (LS ≤ 2) in Romanov and 14 samples (LS ≤ 1.33) in Texel sheep were selected as 'controls.' In the GWAS, we used the "qtscore" function in the GenABEL package. Associated SNPs were identified at both the genome-wide and chromosome-wise significance levels (p < 0.05) after the Bonferroni correction (Bonferroni, 1936). To account for systematic biases caused by within-population substructure, the first and second dimensions from the MDS analyses were used as covariates (Price et al., 2006). The correlation analysis between litter size and parity within breeds showed that parity had a significant effect on litter size in four breeds (Wadi, Hu, Icelandic, and Texel), and that the effect of parity 1 on litter size was smaller than that of parities 2 through 10 (**Supplementary Table S1** and **Supplementary Figure S1**). However, because the number of parities differed among individuals within breeds, we focused on the mean litter size of each individual (total litter size/parity) within each breed and therefore excluded the effect of parity from the model. Quantile–Quantile (Q–Q) plots were produced by plotting the distribution of obtained vs. expected genome-wide p-values. For the genotype effects of potential SNPs on litter size in each breed, differences between means were analyzed using Student's t-test, and p < 0.05 was considered statistically significant. All results are presented as mean ± standard error (SE). We implemented pairwise tests of linkage disequilibrium (LD) between the most significant SNPs and their flanking SNPs within approximately 1 Mb upstream and downstream using PLINK v1.07. Regional association plots were generated using R v3.2.2.
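As a worked illustration of the Bonferroni correction, the significance threshold is the nominal level divided by the number of tests. The helper names below are hypothetical, and the text does not state how many tests were counted in each breed; inverting the Wadi genome-wide threshold given in the Results (p < 1.92 × 10<sup>−6</sup>) only shows the number of tests that this threshold implies at a nominal level of 0.05.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold for a family-wise error rate alpha."""
    return alpha / n_tests


def implied_number_of_tests(alpha: float, threshold: float) -> float:
    """Number of tests for which `threshold` is the Bonferroni cut-off."""
    return alpha / threshold


if __name__ == "__main__":
    # Threshold for an illustrative (made-up) count of 1,000 tests.
    print(bonferroni_threshold(0.05, 1000))
    # Number of tests implied by the Wadi genome-wide threshold in the Results.
    print(round(implied_number_of_tests(0.05, 1.92e-6)))
```

Under these assumptions the Wadi threshold corresponds to roughly 26,000 tests, which is fewer than the number of SNPs retained after quality control in that breed; the text does not say which set of SNPs was counted.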

# Bioinformatics Analysis

We annotated the genes associated with litter size in each breed using the O. aries assembly Oar\_v.4.0<sup>2</sup>. Further, we submitted the genes to the DAVID (database for annotation, visualization and integrated discovery) database<sup>3</sup> for gene ontology (GO) enrichment and pathway analyses (Huang et al., 2009a,b). GO terms with a p-value below 0.1 and at least two genes from the input gene list in the enriched category were considered enriched. We also investigated the protein–protein interaction network for the candidate genes using the STRING database version 10.5 (Szklarczyk et al., 2017). In addition, differential expression of the candidate genes in various tissues was examined using the EMBL-EBI Expression Atlas database<sup>4</sup> (Petryszak et al., 2016).
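The enrichment criteria above (p < 0.1 and at least two input genes in a category) can be illustrated with a plain hypergeometric over-representation test. This is a sketch rather than DAVID's procedure: DAVID reports a modified Fisher's exact p-value (the EASE score), whereas the code uses the unmodified upper tail, and the function names and all counts in the example are made up for illustration.

```python
from math import comb


def hypergeom_upper_tail(hits: int, list_size: int,
                         annotated: int, background: int) -> float:
    """P(X >= hits) when `list_size` genes are drawn without replacement
    from `background` genes, of which `annotated` carry the GO term."""
    total = comb(background, list_size)
    upper = min(list_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(background - annotated, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / total


def is_enriched(hits: int, p_value: float,
                max_p: float = 0.1, min_genes: int = 2) -> bool:
    """Apply the two criteria stated in the text to one GO term."""
    return hits >= min_genes and p_value < max_p


if __name__ == "__main__":
    # Made-up counts: 3 of 4 input genes carry a term that annotates
    # 3 genes in a background of 10.
    p = hypergeom_upper_tail(hits=3, list_size=4, annotated=3, background=10)
    print(p, is_enriched(3, p))
```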

# RESULTS

# Population Relationship and Differentiation

Pairwise FST values varied from 0.023 to 0.104 among the populations, with the least genetic differentiation observed between the Wadi and Hu sheep breeds (**Supplementary Table S2**). The NJ tree showed that these breeds clustered into two major groups according to their Chinese and European origins (**Figure 1B**). A similar geographic pattern was seen in the PCA, with the grouping of Wadi and Hu sheep separated from the four European breeds (**Supplementary Figure S2**).

# Genome-Wide Association Analysis

<sup>2</sup>http://www.ncbi.nlm.nih.gov/genome?term=ovis%20aries

<sup>3</sup>https://david.ncifcrf.gov/

<sup>4</sup>https://www.ebi.ac.uk/gxa/home/

After the quality control, 508,444 SNPs and 114 individuals (91 cases vs. 23 controls) in Wadi, 506,031 SNPs and 80 individuals (66 cases vs. 14 controls) in Hu, 443,125 SNPs and 23 individuals (8 cases vs. 15 controls) in Icelandic, 492,165 SNPs and 37 individuals (28 cases vs. 9 controls) in Finnsheep, 465,794 SNPs and 38 individuals (29 cases vs. 9 controls) in Romanov, and 475,955 SNPs and 39 individuals (28 cases vs. 11 controls) in Texel sheep were retained in the working dataset for the GWAS. We found several animals lying outside the clusters of cases, which might bias the association analyses (**Supplementary Figure S3**). We repeated the association analyses without these animals and found that the results were very similar. Thus, because of the small sample sizes of the breeds, we did not exclude these animals from the association analyses. The resulting genomic inflation factors were 1.07 in Wadi, 1.14 in Hu, 1.12 in Icelandic, 1.14 in Finnsheep, 1.10 in Romanov, and 1.05 in Texel sheep, suggesting that population stratification was well controlled (**Supplementary Figure S4**).
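For readers unfamiliar with the genomic inflation factor, a common definition is the median of the observed 1-df chi-square association statistics divided by the median expected under the null (about 0.455). The sketch below follows that definition using only the Python standard library; it is an illustration, not the GenABEL routine, and the function names are assumptions.

```python
from statistics import NormalDist, median
from typing import Iterable


def chi2_from_p(p: float) -> float:
    """1-df chi-square statistic whose upper-tail probability is p (0 < p < 1)."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


def genomic_inflation(p_values: Iterable[float]) -> float:
    """Median observed chi-square divided by the null median (about 0.455)."""
    null_median = chi2_from_p(0.5)
    observed_median = median([chi2_from_p(p) for p in p_values])
    return observed_median / null_median


if __name__ == "__main__":
    # Evenly spaced p-values mimic a null GWAS, so lambda is close to 1.
    null_like = [(i + 0.5) / 1000 for i in range(1000)]
    print(genomic_inflation(null_like))
```

Values close to 1, such as those reported for the six breeds, indicate little inflation of the test statistics.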

In Wadi sheep, we detected 59 and 8 SNPs at the chromosome-wise and genome-wide (p < 1.92 × 10<sup>−6</sup>) 5% significance levels after the Bonferroni correction, respectively (**Figure 3A** and **Supplementary Tables S3**, **S4**). We observed a high level of LD between the top significant SNP rs416717560 and rs421635584, both located in the gene BMPR1B (**Figure 4A**). For the SNP rs416717560, the average litter size of individuals with the G/G genotype (n = 115, LS = 2.05 ± 0.06) was significantly (p < 0.01) higher than that of ewes with the A/G genotype (n = 15, LS = 1.47 ± 0.16) (**Figure 5A**). We also found three additional significant SNPs (rs429416173, rs402803857, and rs160917020) in or near the genes BMPR1B, FBN1, and MMP2 (**Table 2** and **Supplementary Table S3**).
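The genotype comparison reported for rs416717560 can be checked from the summary statistics alone. The sketch below recovers group variances from the standard errors (SE = SD/√n) and computes a pooled-variance Student's t statistic for the Wadi G/G and A/G groups. The function names are assumptions, the pooled-variance form is assumed to match the t-test named in the Methods, and the p-value uses a normal approximation to the t distribution, which is reasonable only because the degrees of freedom are large here.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def pooled_t_from_summary(mean1: float, se1: float, n1: int,
                          mean2: float, se2: float, n2: int) -> Tuple[float, int]:
    """Student's t statistic and degrees of freedom from means, SEs and ns."""
    var1 = se1 ** 2 * n1  # SE = SD / sqrt(n), so variance = SE^2 * n
    var2 = se2 ** 2 * n2
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t = (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


def approx_two_sided_p(t: float) -> float:
    """Two-sided p-value using a normal approximation (large df only)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


if __name__ == "__main__":
    # Wadi, rs416717560: G/G (n = 115, 2.05 +/- 0.06) vs. A/G (n = 15, 1.47 +/- 0.16)
    t, df = pooled_t_from_summary(2.05, 0.06, 115, 1.47, 0.16, 15)
    print(round(t, 2), df, approx_two_sided_p(t) < 0.01)
```

With these inputs the statistic is about 3.3 on 128 degrees of freedom, in line with the reported p < 0.01.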

In Hu sheep, we identified 98 and 9 SNPs at the chromosome-wise and genome-wide (p < 2.18 × 10<sup>−6</sup>) 5% significance levels after the Bonferroni correction, respectively (**Figure 3B** and **Supplementary Tables S3**, **S4**). The regional plot showed that the top significant SNPs rs429755189 and rs420460180 on chromosome 17 were in an LD block that contained the gene GRIA2 (**Figure 4B**). For rs429755189, the average litter sizes of individuals with the genotypes G/G (n = 38, LS = 1.99 ± 0.07) and A/G (n = 52, LS = 1.94 ± 0.06) were significantly (p < 0.001) higher than that of ewes with the genotype A/A (n = 20, LS = 1.40 ± 0.09) in the present population (**Figure 5B**). Among the significant SNPs, three (rs406357666, rs427436644, and rs412185353) are located within or near the genes SMAD1 and CTNNB1 (**Table 2** and **Supplementary Table S3**).

In Icelandic sheep, we found 22 SNPs at the chromosome-wise 5% significance level after the Bonferroni correction (**Figure 3C** and **Supplementary Tables S3**, **S4**). The top significant SNP rs429836421 on chromosome 3 was located within the gene NCOA1 (**Figure 4C**). For rs429836421, the average litter size of individuals with the A/G genotype (n = 19, LS = 2.03 ± 0.05) was significantly (p < 0.05) higher than that of ewes with the genotype A/A (n = 33, LS = 1.81 ± 0.04) (**Figure 5C**).

In Finnsheep, we detected 102 and 6 SNPs at the chromosome-wise and genome-wide (p < 3.64 × 10<sup>−6</sup>) 5% significance levels after the Bonferroni correction, respectively (**Figure 3D** and **Supplementary Tables S3**, **S4**). The regional plot revealed strong LD between the top significant SNP rs412280524 and its neighboring SNPs rs401960737 and rs407751830 in a region harboring the gene INHBB (**Figure 4D**). For the SNP rs412280524, the litter size of ewes with the genotype A/A (n = 40, LS = 2.84 ± 0.09) was significantly (p < 0.001) higher than that of ewes with the genotype A/G (n = 13, LS = 2.08 ± 0.16) (**Figure 5D**). Also, five additional significant SNPs (rs160509574, rs417444297, rs404890873, rs401746929, and rs402764237) were located in or near the genes FLT1, NF1, PTGS2, and PLCB3 (**Table 2** and **Supplementary Table S3**).

In Romanov sheep, we identified 77 and 2 SNPs at the chromosome-wise and genome-wide (p < 4.56 × 10<sup>−6</sup>) 5% significance levels after the Bonferroni correction, respectively (**Figure 3E** and **Supplementary Tables S3**, **S4**). The top significant SNP rs423810437 on chromosome 7 was in the gene ESR2 (**Figure 4E**). For rs423810437, the litter size of ewes with the genotype A/A (n = 69, LS = 2.50 ± 0.06) was significantly (p < 0.001) higher than that of ewes with the genotype A/G (n = 8, LS = 1.79 ± 0.18) (**Figure 5E**).

In Texel sheep, we observed 133 SNPs at the chromosome-wise 5% significance level after the Bonferroni correction (**Figure 3F** and **Supplementary Tables S3**, **S4**). The regional plot showed that the top significant SNPs rs161146164 and rs413776054 on chromosome 16 were in a strong LD region containing one functional gene, GHR (**Figure 4F**). For rs161146164, the litter size of ewes with the genotype A/A (n = 53, LS = 1.64 ± 0.05) was significantly (p < 0.01) higher than that of ewes with the genotype A/C (n = 6, LS = 1.15 ± 0.14) (**Figure 5F**). The two mutations (rs161146164, Asn > His; rs413776054, Pro > Ser) cause amino acid changes in the coding region of the GHR gene. In addition, we found eight additional significant SNPs (rs426666828, rs409969387, rs410595930, rs401207152, rs413148060, rs405994606, rs161612044, and rs412251543) surrounding the genes ESR1, ETS1, FLI1, SPP1, and MMP15 (**Table 2** and **Supplementary Table S3**).

TABLE 2 | Genome-wide and chromosome-wise significant SNPs and associated genes.

| Population | SNP | Chr | Position (bp) | MAF | p-unadjusted | p-adjusted | Genes | Location |
|---|---|---|---|---|---|---|---|---|
| Wadi | rs416717560<sup>∗</sup> | 6 | 29295803 | 0.07 | 3.65E-08 | 8.19E-09 | BMPR1B<sup>1</sup> | 3′ UTR |
| | rs421635584<sup>∗</sup> | 6 | 29361782 | 0.05 | 4.36E-06 | 9.78E-07 | BMPR1B<sup>1</sup> | Intron |
| | rs429416173 | 6 | 29302788 | 0.2 | 7.55E-05 | 2.75E-05 | BMPR1B<sup>1</sup> | CDS |
| | rs402803857 | 7 | 58598895 | 0.1 | 4.96E-05 | 2.93E-05 | FBN1<sup>1</sup> | Intron |
| | rs160917020<sup>∗</sup> | 14 | 23133427 | 0.19 | 1.10E-06 | 3.71E-07 | MMP2 | Downstream |
| Hu | rs429755189<sup>∗</sup> | 17 | 41621298 | 0.43 | 1.94E-06 | 3.21E-07 | GRIA2<sup>1</sup> | Intron |
| | rs420460180 | 17 | 41621269 | 0.29 | 8.50E-06 | 2.43E-06 | GRIA2<sup>1</sup> | Intron |
| | rs406357666 | 17 | 12487861 | 0.19 | 1.40E-05 | 2.66E-05 | SMAD1<sup>1</sup> | Intron |
| | rs427436644 | 19 | 13639996 | 0.32 | 7.69E-05 | 2.14E-05 | CTNNB1 | Downstream |
| | rs412185353 | 19 | 13641870 | 0.33 | 1.51E-04 | 4.49E-05 | CTNNB1 | Downstream |
| Icelandic | rs429836421 | 3 | 32030054 | 0.16 | 4.55E-05 | 3.63E-05 | NCOA1<sup>1</sup> | Intron |
| Finnsheep | rs412280524<sup>∗</sup> | 2 | 184578329 | 0.09 | 2.62E-05 | 5.32E-07 | INHBB | Downstream |
| | rs401960737<sup>∗</sup> | 2 | 184579671 | 0.09 | 2.62E-05 | 5.32E-07 | INHBB | Downstream |
| | rs160509574 | 10 | 31933001 | 0.27 | 1.50E-05 | 4.71E-05 | FLT1<sup>1</sup> | Intron |
| | rs417444297 | 11 | 18552961 | 0.11 | 4.20E-05 | 5.65E-05 | NF1 | Downstream |
| | rs404890873 | 12 | 65662842 | 0.05 | 1.87E-04 | 1.59E-05 | PTGS2 | Upstream |
| | rs401746929 | 21 | 41915064 | 0.08 | 1.85E-03 | 1.75E-04 | PLCB3 | Upstream |
| | rs402764237 | 21 | 41919836 | 0.08 | 1.85E-03 | 1.75E-04 | PLCB3 | Upstream |
| Romanov | rs423810437<sup>∗</sup> | 7 | 73335157 | 0.07 | 1.65E-05 | 3.12E-06 | ESR2<sup>1</sup> | 5′ flanking region |
| Texel | rs409969387 | 8 | 75353388 | 0.08 | 1.11E-03 | 1.21E-04 | ESR1 | Intron |
| | rs410595930 | 14 | 23645021 | 0.06 | 1.33E-04 | 1.46E-04 | SPP1<sup>1</sup> | Intron |
| | rs401207152 | 14 | 25147418 | 0.06 | 1.33E-04 | 1.46E-04 | MMP15 | Downstream |
| | rs161146164 | 16 | 31834495 | 0.06 | 1.33E-04 | 9.11E-06 | GHR<sup>1</sup> | CDS |
| | rs413776054 | 16 | 31834942 | 0.06 | 1.33E-04 | 9.11E-06 | GHR | CDS |
| | rs426666828 | 16 | 31882869 | 0.18 | 1.88E-04 | 7.54E-05 | GHR<sup>1</sup> | Intron |
| | rs413148060 | 21 | 30950537 | 0.15 | 1.02E-04 | 4.17E-05 | ETS1 | Upstream |
| | rs405994606 | 21 | 31001548 | 0.15 | 1.02E-04 | 4.17E-05 | ETS1<sup>1</sup> | Intron |
| | rs161612044 | 21 | 31009743 | 0.14 | 5.41E-04 | 1.01E-04 | ETS1<sup>1</sup> | Intron |
| | rs412251543 | 21 | 31178275 | 0.1 | 4.01E-03 | 1.46E-04 | ETS1/FLI1 | Upstream/Downstream |

For genes the best SNP of which is located outside of upstream/downstream 150 kb region. Chr, chromosome; MAF, minor allele frequency. The p-unadjusted value corresponds to the exact p for Fisher's test. The p-adjusted value corresponds to the corrected significance of the GWAS after principal component adjustment. SNPs marked with (<sup>∗</sup>) are Bonferroni-corrected genome-wide significant SNPs. Genes marked with (<sup>1</sup>) contain the SNP (intragenic); otherwise, they are the nearest genes upstream or downstream of the tested SNPs.

In addition to the source breed in which each target SNP was detected, we assessed the genotype effects of the most significant SNPs on litter size in the other five sheep breeds. In general, genotypes of the target SNPs did not show a significant association with increased litter size in breeds other than the source breed (**Supplementary Table S7**). Nevertheless, we observed some exceptions. For example, the genotype A/G of rs429836421, which was identified in Icelandic sheep, showed significant associations with increased litter size in both Icelandic and Hu sheep. However, the lack of homozygotes for some SNPs, such as the genotype G/G for rs412280524 in Finnsheep, G/G for rs423810437 in Romanov and C/C for rs161146164 in Texel sheep, could be due to the low frequency of the mutations and the small sample size.

# Bioinformatics Analysis

We found significantly (p < 0.1) enriched GO terms associated with reproduction for the candidate genes. The GO clusters were primarily enriched in the categories of ovarian and oocyte development (PTGS2, BMPR1B, INHBB, CTNNB1, MMP2, MMP15, FBN1, GHR, and SPP1), phospholipase C activity (FLT1 and ESR1), SMAD protein (INHBB and SMAD1) and BMP signaling (SMAD1 and BMPR1B), and positive regulation of transcription (NCOA1, FLI1, ESR1, ESR2, CTNNB1, ETS1, and BMPR1B), all of which are involved in folliculogenesis, follicle growth and granulosa cell proliferation (**Figure 6** and **Supplementary Table S5**). Another relevant GO category was hindbrain development (SMAD1 and CTNNB1), which participates in regulating ovulation (Baird et al., 2006). In addition, we detected 11 genes (i.e., PLCB3, ESR1, ESR2, MMP2, NCOA1, CTNNB1, INHBB, SMAD1, BMPR1B, PTGS2, and GRIA2) involved in the estrogen, thyroid hormone, TGF-beta, retrograde endocannabinoid and hippo signaling pathways, which play important roles in regulating follicle growth and ovulation in livestock (**Supplementary Table S5**). However, we observed different GO terms for the candidate genes in different sheep breeds. For example, I-SMAD binding was enriched in Hu sheep, and chromatin binding was enriched in Texel sheep (**Supplementary Table S6**). In the gene network analysis, we observed that 16 genes (i.e., BMPR1B, FBN1, MMP2, SMAD1, CTNNB1, GRIA2, NCOA1, FLT1, NF1, PTGS2, PLCB3, ESR2, ESR1, ETS1, SPP1, and GHR) showed protein–protein interactions in the network (**Figure 7**). Expression data further showed that the genes BMPR1B, FBN1, MMP2, GRIA2, SMAD1, CTNNB1, NCOA1, NF1, FLT1, PTGS2, PLCB3, ESR2, ESR1, GHR, ETS1, MMP15, FLI1, and SPP1 were either highly or moderately expressed in reproduction-related tissues such as ovary, uterine cervix, placenta, corpus luteum, cerebellum, pituitary gland or uterus in sheep (**Figure 8**). Also, the gene INHBB showed high expression in the ovary and uterus of Mus musculus<sup>5</sup>.

# DISCUSSION

In this study, we conducted multiple independent GWAS in different sheep breeds to investigate the genetic mechanisms underlying litter size in sheep. Coupled with population relationship and bioinformatics analyses, the GWAS identified different genes associated with litter size in different breeds and revealed different genetic regulation mechanisms associated with follicle growth and ovulation in the reproduction of ewes.

<sup>5</sup>https://www.ebi.ac.uk/gxa/home/

The diverse biological pathways identified from the annotation of the novel genes play an important role in follicle growth and ovulation of females in different sheep breeds (**Figure 9**). The three genes identified in Wadi sheep, BMPR1B, FBN1, and MMP2, all play a crucial role in regulating hormone secretion (Mulsant et al., 2001; Basini et al., 2011; Zhang et al., 2011; Zhai et al., 2013). For example, the BMPR1B gene can lead to an increased density of follicle-stimulating hormone (FSH) and luteinizing hormone (LH) receptors with a concurrent reduction in apoptosis, increasing the ovulation rate of ewes (Regan et al., 2015; Hu et al., 2016). As the main component of microfibrils in the extracellular matrix, the gene FBN1 regulates cumulus cell apoptosis by reducing the expression level of BMP15, which is involved in estrogen signaling in porcine ovaries (Zhai et al., 2013). The MMP2 gene plays a key role in ovulation and follicle atresia by regulating FSH and insulin-like growth factor 1 (IGF1) (Knapp and Sun, 2017). In Hu sheep, the three genes GRIA2, SMAD1, and CTNNB1 are related to the estrogen response element (Chang et al., 2013; Kumar et al., 2016; Vastagh et al., 2016). For example, the gene GRIA2 has been shown to participate in the glutamatergic pathway that regulates gonadotropin-releasing hormone (GnRH), a known prerequisite of the subsequent hormonal cascade inducing ovulation in mice (Vastagh et al., 2016). The gene SMAD1 encodes an intracellular BMP signaling molecule, which is involved in mediating the ovulation rate of ewes (Xu et al., 2010). The CTNNB1 gene enhances FSH and LH actions in follicles by stimulating the WNT/CTNNB1 pathway and G protein-coupled gonadotropin receptors in females (Fan et al., 2010). In Icelandic sheep, the gene NCOA1 can alter the expression of multiple key genes, PBP, AIB3, and FGFR2, which are involved in aberrant labyrinth morphogenesis of the placenta and embryonic lethality (Chen et al., 2010; Huang et al., 2011). In Finnsheep, the five candidate genes INHBB, NF1, FLT1, PTGS2, and PLCB3 play important roles in folliculogenesis and LH signaling (Ding et al., 2006; Tal et al., 2014; De Cesaro et al., 2015; Ben Sassi et al., 2016; Cadoret et al., 2017). For example, the INHBB gene encodes an inhibitor of apoptosis, which regulates porcine ovarian follicular atresia (Terenina et al., 2017). The coding region of the gene NF1 presents non-CpG methylation in the murine oocyte, which plays a critical role in mammalian development (Haines et al., 2001). The FLT1 gene has an important role in the activity of vascular endothelial growth factor, which is linked to folliculogenesis (Celik-Ozenci et al., 2003). The PTGS2 gene plays a critical role in ovulation by stimulating LH signaling in zebrafish (Tang et al., 2017). The PLCB3 gene is highly expressed in bovine cells of ovulatory-sized follicles, with the role of activating LH/LHR signaling (Castilho et al., 2014). In Romanov sheep, the gene ESR2 activates ovulation and regulates preovulatory follicle maturation by regulating the estrogen response element (Laliotis et al., 2017; Rumi et al., 2017). In Texel sheep, the six candidate genes ESR1, GHR, ETS1, MMP15, FLI1, and SPP1 are relevant to estrogen and follicular growth (Putnova et al., 2001; Bachelot et al., 2002; Munoz et al., 2007; Xiao et al., 2009; Hatzirodos et al., 2015; Ogiwara and Takahashi, 2017). As a key gene affecting estrogen biosynthesis, the ESR1 gene functions similarly to ESR2 and is critical for follicular growth and successful ovulation in ewes (Foroughinia et al., 2017). The GHR gene plays a role in follicular growth by stimulating IGF1 in mice (Bachelot et al., 2002). The ETS1 gene has been linked to the regulator of G protein signaling protein-2 (RGS2), which is involved in ovulation in cattle (Sayasith et al., 2014). As a proteolytic enzyme gene, the MMP15 gene has been shown to mediate LH and its receptor in the preovulatory follicles of the teleost medaka (Ogiwara and Takahashi, 2017). The FLI1 gene encodes a critical transcription factor, which regulates the gene ETS1 (Vo et al., 2017). The SPP1 gene accounts for establishing and maintaining cellular interactions between steroidogenic and non-steroidogenic cells during the development of the corpus luteum (Poole et al., 2013). In addition, the GO categories as well as the protein–protein network and expression analyses showed that these genes play an essential role in follicle growth and ovulation of ewes. However, further expression analyses of these genes in each breed are necessary in future studies. Taken together, the apparent differences in litter size among the breeds might be explained by diverse regulation mechanisms.

Also, we calculated genetic differentiation among populations using the global FST, PCA, and NJ tree methods to obtain a refined picture of population genetic relationships. The result showed that the genetic groups were consistent with the geographic origins of the breeds. The different genetic mechanisms associated with physiological processes for the litter size among sheep breeds could be related to the various environments in different geographic regions.

We noticed that previous studies had identified several genes of major effect, such as BMPR1B, BMP15, and GDF9, for prolificacy in ewes (**Table 1**). Unlike earlier investigations, we detected a set of novel genes for litter size in ewes. The main reason could be that most earlier studies were based on genome-wide selection tests between prolific and non-prolific breeds using a lower density of SNPs. Instead, here we implemented GWAS within specific sheep breeds of high or low prolificacy using a high-density SNP BeadChip array, which should lead to more reliable associations. In addition, the difference in the threshold values used to define the 'case' and 'control' groups for each breed was another potentially influential factor. When we implemented the GWAS using a two-step approach via the general linear model and genome-wide efficient mixed-model analysis (GEMMA), we did not find interesting candidate genes associated with reproduction across the six breeds (see **Supplementary Material** for further details). The absence of such candidate genes could be because the power to detect associations is weak when the trait of interest is treated as quantitative, given the small sample size. Also, these populations could have been subjected to selection on litter size through environmental variables such as climate and diet. However, we did not obtain data on local environmental variables. Thus, environmental variables as well as the age of reproduction of the ewes were not taken into account in the GWAS model; including them would be essential in future studies.

# CONCLUSION

We revealed a set of novel functional genes for litter size in different sheep breeds across the world. Our results suggest different genetic regulation mechanisms for the functional genes in the reproduction of sheep. The significant SNPs and genes identified here will be useful for future molecular-based breeding for higher fertility. Our results also provide important insights into the regulation of reproduction in sheep and other mammals.

# AUTHOR CONTRIBUTIONS

M-HL conceived and designed the project. FW, Z-QS, Y-LR, MS, EE, JH, JK, and TK collected the samples. X-LX extracted the DNA. JK provided help in Beadchip genotyping. S-SX and LG analyzed the data. S-SX wrote the paper with contributions from M-HL. All authors reviewed and approved the final manuscript.

# FUNDING

This work was supported by grants from the National Natural Science Foundation of China (Grant Nos. 91731309 and 31661143014), the Taishan Scholars Program of Shandong Province (No. ts201511085), the National Transgenic Breeding Project of China (2014ZX0800952B), the Academy of Finland (Grant No. 250633), and the Climate Genomics for Farm Animal Adaptation (ClimGen) Project.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00118/full#supplementary-material

FIGURE S1 | Parity effect for litter size in the six breeds. X-axis is labeled as the number of parity and Y-axis represents litter size. Pairwise statistical comparisons between means of litter size in parity's clades were performed using Student's t-test. <sup>∗</sup>p < 0.05; ∗∗p < 0.01, and ∗∗∗p < 0.001.

FIGURE S2 | Principal component plots for 522 ewes from the six sheep breeds (WAD: Wadi sheep, HUS: Hu sheep, ICE: Icelandic sheep, FIN: Finnish sheep, ROM: Romanov sheep, and TEX: Texel sheep).

FIGURE S3 | Multidimensional scaling (MDS) plots in (a) Wadi, (b) Hu, (c) Icelandic, (d) Finnish, (e) Romanov, and (f) Texel sheep. The red squares indicate

# REFERENCES


animals from the case group (highly prolific ewes), and the purple dots represent animals in the control group (normally prolific ewes).

FIGURE S4 | Q–Q (quantile–quantile) plots of GWAS in (a) Wadi, (b) Hu, (c) Icelandic, (d) Finnish, (e) Romanov, and (f) Texel sheep. Gray and black rings represent association statistics before and after correction for population stratification, respectively.

TABLE S1 | Parity effect for litter size and pairwise statistical comparisons between means of litter size in parity's clades in the six breeds.

TABLE S2 | Pairwise FST value among six breeds.

TABLE S3 | Bonferroni-corrected 5% chromosome-wise significance threshold in the six sheep breeds, respectively.

TABLE S4 | Bonferroni-corrected genome-wide and chromosome-wise significant SNPs and their nearest gene based on the GWAS.

TABLE S5 | GO enrichment analysis of the genes associated with the target SNPs at the chromosome-wise level as identified by the GWAS.

TABLE S6 | GO enrichments of the novel genes identified by the GWAS at the chromosome-wise level for the six sheep breeds, respectively.

TABLE S7 | Genotype effects of the most significant SNPs on litter size in six sheep breeds, respectively.






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Xu, Gao, Xie, Ren, Shen, Wang, Shen, Eyþórsdóttir, Hallsson, Kiseleva, Kantanen and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Milk Composition for Admixed Dairy Cattle in Tanzania

### Evans K. Cheruiyot<sup>1,2</sup>\*, Rawlynce C. Bett<sup>1</sup>, Joshua O. Amimo<sup>1</sup> and Fidalis D. N. Mujibi<sup>2,3</sup>

<sup>1</sup> Department of Animal Production, College of Agriculture and Veterinary Sciences, University of Nairobi, Nairobi, Kenya, <sup>2</sup> Nelson Mandela Africa Institution of Science and Technology, Arusha, Tanzania, <sup>3</sup> USOMI Limited, Nairobi, Kenya

It is well established that milk composition is affected by the breed and genotype of a cow. The present study investigated the relationship between the proportion of exotic genes and milk composition in Tanzanian crossbred dairy cows. Milk samples were collected from 209 animals kept under smallholder production systems in Rungwe and Lushoto districts of Tanzania. The milk samples were analyzed for the content of components including fat, protein, casein, lactose, solids-not-fat (SNF), and total solids (TS) through infrared spectroscopy using a Milko-Scan FT1 analyzer (Foss Electric, Denmark). Hair samples for DNA analysis were collected from individual cows and breed composition was determined using 150,000 single nucleotide polymorphism (SNP) markers. Cows were grouped into four genetic classes based on the proportion of exotic genes present: 25–49, 50–74, 75–84, and >84%, to mimic a backcross to indigenous zebu breed, F1, F2, and F3 crosses, respectively. The breed types were defined based on international commercial dairy breeds as follows: RG (Norwegian Red X Friesian, Norwegian Red X Guernsey, and Norwegian Red X Jersey crosses); RH (Norwegian Red X Holstein crosses); RZ (Norwegian Red X Zebu and Norwegian Red X N'Dama crosses); and ZR (Zebu X GIR, Zebu X Norwegian Red, and Zebu X Holstein crosses). The results indicate low variation in milk composition traits between genetic groups and breed types. For all the milk traits except milk total protein and casein content, no significant differences (p > 0.05) were observed among genetic groups. Protein content was significantly (p < 0.05) higher for genetic group 75–84% at 3.4 ± 0.08% compared to 3.18 ± 0.07% for genetic group >84%. Casein content was significantly lower for genetic group >84% (2.98 ± 0.05%) compared to 3.18 ± 0.09 and 3.16 ± 0.06% for genetic group 25–49 and 75–84%, respectively (p < 0.05). There was no significant difference (p > 0.05) between breed types with respect to milk composition traits. These results suggest that selection of breed types to be used in smallholder systems need not place much emphasis on milk quality differences, as most admixed animals would have similar milk composition profiles. However, a larger sample size would be required to quantify any meaningful differences between groups.

Keywords: milk composition, breed type, genetic group, genomic markers, SNP, crossbred cows, Tanzania

### Edited by:

Johann Sölkner, Universität für Bodenkultur Wien, Austria

### Reviewed by:

John B. Cole, United States Department of Agriculture, United States; Gustavo Augusto Gutierrez Reynoso, National Agrarian University, Peru; Negar Khayatzadeh, Universität für Bodenkultur Wien, Austria

### \*Correspondence:

Evans K. Cheruiyot, evans.kiptoo@usomi.com; evanskip1@gmail.com

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 24 August 2017; Accepted: 06 April 2018; Published: 24 April 2018

### Citation:

Cheruiyot EK, Bett RC, Amimo JO and Mujibi FDN (2018) Milk Composition for Admixed Dairy Cattle in Tanzania. Front. Genet. 9:142. doi: 10.3389/fgene.2018.00142

# INTRODUCTION

Development of efficient strategies to optimize milk composition has long been an active area of research and continues to attract increasing interest for the global dairy industry. Milk component levels and characteristics are important factors that have a significant effect on dairy product quality and yield (Murphy et al., 2016). Farmers in many developed countries are currently paid for milk deliveries based on fat and protein levels (Bailey et al., 2005; Cunha et al., 2010) such that milk composition has taken new importance in the dairy industry having a direct impact on farmers' income and product processing. As such, the dairy industry must make strategic decisions on optimizing factors that affect milk composition to better meet the ever-changing technological requirements and consumer preferences.

In East Africa, milk component pricing based on milk fat, true proteins, and other dairy solids has not been adopted. However, major dairy processors in the region have expressed strong interest in implementing a quality-based pricing system and routinely offer bonus payments depending on other measures of milk quality (which include compositional completeness as well as somatic cell and bacterial counts; Foreman and De Leeuw, 2016). This has been largely driven by the demand for high-quality dairy products that meet consumer and export market requirements. Kenya is currently the only country in Africa to have implemented a quality-based milk payment system (QBPS; Foreman and De Leeuw, 2016).

Whereas there are three broad options for modifying milk composition, namely (i) cow nutrition and management, (ii) genetic intervention, and (iii) dairy manufacturing technologies, long-term changes in milk parameters can be achieved through breeding and other genetic interventions (Walker et al., 2004).

Significant progress has been made in the past to improve the gross composition of milk through selective breeding and nutrition management of cows (Jenkins and McGuire, 2006). Bovine milk composition is influenced by many factors including breed and genotype (Coleman et al., 2010; Palladino et al., 2010; Gustavsson et al., 2014), nutrition (Welter et al., 2016), season (Heck et al., 2009), parity (Yang et al., 2013), stage of lactation (Stoop et al., 2009), as well as the physiological state of the animal (Gurmessa and Melaku, 2012) which offer many practical ways of altering milk composition. Previous studies have established the potential to exploit variation of milk composition among breeds to improve milk quality (Glantz et al., 2009; Heck, 2009).

According to De Marchi et al. (2008), the breed of the cow is the main genetic aspect affecting milk quality characteristics, cheese making technology, and quality of dairy products. Variations in the milk composition among breeds have been widely demonstrated in the literature (see review Schwendel et al., 2015). Although it is well established that there is significant variation in milk quality among cattle breeds, little is known about the variation in milk composition of different dairy crosses with varying admixture levels. The limited studies available have shown that increasing the proportion of exotic genes in a cow leads to decreased milk component levels (Haile et al., 2008; Islam et al., 2014).

In smallholder systems, pedigree records are typically unavailable. The only way to estimate an animal's breed composition is by way of molecular markers and admixture analysis. The use of single nucleotide polymorphism (SNP) markers for prediction of breed composition of admixed animals is gaining popularity with the substantial decrease in genotyping costs. Kuehn et al. (2011) and Frkonja et al. (2012) have demonstrated accurate prediction of breed composition using SNP markers in admixed cattle populations. The information on breed composition obtained through SNP markers is not only useful in understanding the variation of milk traits in crossbred animals, but also allows its incorporation into genomic selection programs to improve milk quality traits (VanRaden and Cooper, 2015).

Crossbreeding of local indigenous breeds with exotic cattle has been widely adopted in Tanzania since independence, mainly with the aim of increasing the productivity of local breeds. Often, these breeding practices are carried out indiscriminately, resulting in animals with unknown and highly variable breed composition (Weerasinghe et al., 2013). Therefore, the complex within-herd genetic composition and variability in Tanzania provides a unique opportunity to investigate the effect of breed admixture on milk quality traits in a smallholder setting as well as under a wide range of production environments. Understanding the milk quality profile of crossbred cattle is critical in planning the extent to which smallholder farmers, who are the main suppliers of milk in East Africa, can participate and maximize their incomes in the QBPS.

The aim of this study was to evaluate the relationship between individual animal exotic gene proportions and associated milk composition profiles. In addition, the study examines the effect of breed types and other environmental factors on milk components.

# MATERIALS AND METHODS

# Ethics Statement

This study was performed following the International Livestock Research Institute (ILRI) Institutional Animal Care and Use Committee (IACUC) guidelines, with approval reference number 2014.35. Animals were handled by experienced animal health professionals to minimize discomfort and injury.

# Site Selection and Animal Recruitment

The study was undertaken in two districts of Tanzania, namely, Rungwe and Lushoto located in the Southern Highlands and the Usambara Mountains in Tanga, respectively. Study sites were chosen based on the possible availability of a wide range of breeds, the density of improved dairy cattle, the presence of other dairy cattle projects led by ILRI under the "Maziwa Zaidi" platform, and the site having been identified as being in an emerging high dairy potential region in Tanzania.

Households selected to participate in the study were recruited based on strict entry criteria. They had to have at least two cows, one of which was in milk or have a crossbred bull in active service. Additionally, unrelated animals were preferred and where possible households with observable breed diversity were sought. Animal recruitment was purposive within households. To qualify, animals had to be either pregnant heifers or cows in the third trimester of pregnancy or a cow that had calved 3 months prior to recruitment.

# Hair Samples and Genotyping

Hair samples were collected from the tail switch of the animals, taking care to avoid fecal contamination following the protocol described by the Animal Genetics Laboratory (2013). A total of 839 samples were obtained from 490 animals in Rungwe district and 349 animals in Lushoto district. Samples were genotyped at Geneseek (Neogen Corporation, Lincoln, NE, United States) using the Geneseek Genomic Profiler (GGP) high-density (HD) SNP array consisting of 150,000 SNPs. Data quality control on the merged data (study and reference) was undertaken using PLINK v1.9 (Purcell et al., 2007). Data quality control included removal of SNPs with less than 90% call rate, less than 5% minor allele frequency (MAF), and samples with more than 10% missing genotypes. A total of 4,324 SNPs were removed, leaving 129,971 SNPs available for analysis. The unsupervised model-based clustering method implemented by the program ADMIXTURE v. 1.3.0 (Alexander et al., 2009) was used to estimate the breed composition of individual animals.
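The quality-control thresholds described above can be illustrated with a minimal Python sketch. This is not the PLINK implementation used in the study: the function names, the genotype coding (0, 1, or 2 copies of a counted allele, `None` for a missing call), and the order of the filtering steps (samples first, then SNPs) are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple

Genotype = Optional[int]  # 0, 1 or 2 copies of the counted allele; None = missing


def call_rate(values: List[Genotype]) -> float:
    """Fraction of non-missing genotype calls."""
    return sum(v is not None for v in values) / len(values)


def minor_allele_frequency(values: List[Genotype]) -> float:
    """MAF computed from the non-missing calls of one SNP."""
    called = [v for v in values if v is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def quality_control(
    matrix: List[List[Genotype]],
    min_call_rate: float = 0.90,       # SNP call rate threshold from the text
    min_maf: float = 0.05,             # MAF threshold from the text
    max_sample_missing: float = 0.10,  # sample missingness threshold from the text
) -> Tuple[List[int], List[int]]:
    """matrix[s][m] is the genotype of sample s at SNP m.

    Returns (kept_sample_indices, kept_snp_indices).
    """
    n_snps = len(matrix[0])
    # Step 1 (assumed order): drop samples with too many missing genotypes.
    kept_samples = [
        s for s, row in enumerate(matrix)
        if 1.0 - call_rate(row) <= max_sample_missing
    ]
    # Step 2: drop SNPs with low call rate or low MAF among the kept samples.
    kept_snps = []
    for m in range(n_snps):
        column = [matrix[s][m] for s in kept_samples]
        if (call_rate(column) >= min_call_rate
                and minor_allele_frequency(column) >= min_maf):
            kept_snps.append(m)
    return kept_samples, kept_snps
```

The default thresholds mirror the values stated in the text; on a toy matrix with only a handful of SNPs, the sample threshold has to be relaxed for any sample with a missing call to pass.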

# Genetic Groups and Breed Types

Cows were classified into four genetic groups based on the individual admixture profile and level of exotic dairy genes (the whole complement of genetic material derived from international commercial dairy breeds). The groups were defined as follows: 25–49% exotic level (n = 20), 50–74% exotic level (n = 64), 75–84% exotic level (n = 43), and cows with >84% exotic level (n = 81) to mimic a backcross to indigenous zebu breed, F1, F2, and F3 crosses, respectively. Two explanations informed this definition. First, the range indicated around the classic proportions (50, 75%, etc.) expected provides for possible outcomes of Mendelian sampling. Second, due to the need to balance the number of individuals in each genetic group, a hard cutoff point was not considered, e.g., instead of the F3 starting at 82.5%, we used >84% to ensure that a sufficient number of animals were available in the lower group. One individual cow had less than 25% exotic gene composition and was excluded from the study. Additionally, cows were categorized into four breed types according to the level of international commercial dairy breeds as follows: RG (Norwegian Red X Friesian, Norwegian Red X Guernsey, and Norwegian Red X Jersey); RH (Holstein X Norwegian Red and Norwegian Red X Holstein); RZ (Norwegian Red X Zebu and Norwegian Red X N'Dama); and ZR (Zebu X GIR, Zebu X Norwegian Red, and Zebu X Holstein). The first breed in the combinations is the dominant breed in terms of proportions of exotic genes present. This grouping resulted in 9, 51, 109, and 39 individuals for the RG, RH, RZ, and ZR types, respectively. Both genetic group and breed types were assigned to each cow using the admixture methodology.
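As an illustration of the grouping rule, the sketch below assigns a cow to a genetic group from its estimated exotic percentage. The function name and the handling of fractional values that fall between the stated integer bounds (for example 49.5%) are assumptions; the text only gives the ranges 25–49, 50–74, 75–84, and >84%.

```python
from typing import Optional


def genetic_group(exotic_pct: float) -> Optional[str]:
    """Map an exotic-gene percentage (0-100) to one of the four genetic groups.

    Returns None for animals below 25%, which were excluded from the study.
    Treatment of fractional values between the stated bounds is an assumption.
    """
    if exotic_pct < 25:
        return None
    if exotic_pct < 50:
        return "25-49%"
    if exotic_pct < 75:
        return "50-74%"
    if exotic_pct <= 84:
        return "75-84%"
    return ">84%"


# Group sizes reported in the text, used here to recover each group's share.
reported_counts = {"25-49%": 20, "50-74%": 64, "75-84%": 43, ">84%": 81}
total = sum(reported_counts.values())
shares = {g: round(100 * n / total) for g, n in reported_counts.items()}
```

With the reported counts, the four groups sum to 208 cows, which together with the one excluded animal accounts for the 209 sampled cows.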

# Cluster Analysis

The clusters used in this study were obtained from classification done as part of the larger AgriTT (Agricultural Technology Transfer) project (manuscript in preparation). Briefly, baseline data encompassing the totality of farm characteristics as well as farmer's behavior and characteristics were subjected to cluster analysis to group households into production/management groups. Next, factor analysis was performed and five broad factors that can be used to describe smallholder farmers in the study sites were derived: supplementation intensity and diversity of supplement use, milk productivity and sale, use of maize germ and bran, household wealth, and the purchase and the intensity of use of Napier grass. These extracted factors were subjected to cluster analysis. The squared Euclidean distance and Ward's linkage method were used as the criterion for determining inter-object distance. Duda and Hart's index stopping rule was used to decide on the optimum number of clusters. The analysis revealed four distinct production clusters. The main factors that determined the production environment groupings were: the intensity of feed supplementation as well as the diversity of supplements used; the level of milk productivity and sales per cow; the off-farm income and the size of land owned; and the use of maize germ or maize bran and the extent of purchase of Napier grass, the main source of cultivated forage in the country. Households in cluster 1 (26%) were characterized by low production and sale of milk as well as low usage of maize germ supplements. Cluster 2 households (33%) were characterized by the intense use of supplements such as maize germ and oil by-products and higher milk production. Cluster 3 households (24%) were characterized by low intensity and diversity of supplement usage. Cluster 4 had households (17%) that predominantly used maize germ at high intensity as the main supplement with little diversity of other supplements. 
Given that the herd sizes were very small (some farmers had only two qualifying animals in the analysis), these production clusters served as the contemporary group used in the association analysis.

# Milk Samples

Approximately 10 ml of raw milk was collected in the months of June 2015 and December 2015 from each of 209 cows in both Rungwe and Lushoto districts. A larger sample size could not be obtained given that milk yields in the target households are often low and farmers would not agree to larger samples being drawn.

Sampling was done once per animal for either morning or evening milk. The samples collected were immediately put under ice and transported to a field lab for storage at −20◦C until later transportation to the ILRI, Nairobi, for analysis. Transportation from the field labs to ILRI was done with the samples placed under dry ice.

Information regarding parity, the age of the cow, and season of calving for each cow was also collected. Other variables related to production system including farm characteristics, feeding practices, as well as general health management practices were recorded and used to determine production clusters. Since the cows in the study sites were managed differently, cluster analysis was necessary to group animals into homogeneous clusters in order to minimize the confounding effect of production management on milk component traits. The number of milk samples available for the present study from each cluster was 57, 90, 37, and 25 for cluster 1, 2, 3, and 4, respectively. Only one milk sample was available for each cow.

# Laboratory Analysis of Milk Composition

Milk samples were evaluated for the content of fat, protein, casein, lactose, solids-not-fat (SNF), and total solids (TS) by infrared spectroscopy using a Milko-Scan FT1 analyzer (Foss Electric, Denmark) at the ILRI, Nairobi, Kenya.

The Milko-Scan FT1 analyzer requires a minimum of 26 ml of milk for duplicate analysis of each sample. However, since the total milk sample volume obtainable was low (8–12 ml), samples had to be diluted to obtain the optimum volume suitable for analysis. Consequently, and before analysis, two dilution procedures were undertaken based on the exact volume of each milk sample. Samples with 10 ml volume were diluted to 33.3% (v/v) in distilled water to obtain 30 ml while samples with less than 10 ml were diluted to 16.7% (v/v).
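The dilution arithmetic can be checked with a short sketch. It assumes that the stated percentages are rounded forms of one part milk in three (33.3% v/v) and one part milk in six (16.7% v/v); the function names and the 8 ml example volume are illustrative and not taken from the study protocol.

```python
MIN_VOLUME_ML = 26.0  # minimum volume stated for duplicate analysis


def final_volume_ml(sample_ml: float, milk_fraction: float) -> float:
    """Total volume after diluting so that milk makes up milk_fraction (v/v)."""
    return sample_ml / milk_fraction


def water_to_add_ml(sample_ml: float, milk_fraction: float) -> float:
    """Volume of distilled water needed to reach the target milk fraction."""
    return final_volume_ml(sample_ml, milk_fraction) - sample_ml


# Dilution 1: a 10 ml sample at one part milk in three (about 33.3% v/v).
dilution1_total = final_volume_ml(10.0, 1 / 3)
dilution1_water = water_to_add_ml(10.0, 1 / 3)

# Dilution 2: a hypothetical 8 ml sample at one part milk in six (about 16.7% v/v).
dilution2_total = final_volume_ml(8.0, 1 / 6)
```

Under these assumptions a 10 ml sample needs about 20 ml of water to reach the 30 ml stated in the text, and both dilutions clear the 26 ml minimum.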

# Statistical Analysis

To obtain regression models for predicting the actual milk composition for the diluted study samples, 50 ml fresh milk samples from 15 individual cows were collected from the University of Nairobi farm. The milk samples were collected purposely from crossbred cows to be comparable with the study cows with respect to genetic composition. The cows at the University of Nairobi farm are managed semi-intensively and were milked twice a day. Samples were analyzed immediately after collection using Milko-Scan FT1 analyzer (Foss Electric, Denmark). Three sets of estimates [undiluted milk, dilution 1 (33.3% v/v), and dilution 2 (16.7% v/v)] for milk component content were obtained for each sample.

After checking each analyzed milk trait (fat, protein, casein, lactose, and SNF) for normality and the presence of outliers, prediction models were obtained by regressing the milk composition estimates for the undiluted samples on those for the diluted samples using the REG procedure of SAS version 9.2 (SAS Institute, Inc., 2008), giving a separate model for each of the two dilutions. Before analysis, the values obtained for fat percentage were log transformed to base 10 to correct for non-uniform variance and skewness. None of the other milk components (protein, casein, lactose, SNF, and TS) showed any obvious deviation from normality or non-constancy of variance, and hence they were not log transformed. The actual milk component content of the study cows was determined as the values predicted by the models defined for the respective dilutions.
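The calibration step described above amounts to a simple linear regression of undiluted readings on diluted readings. The sketch below is a plain-Python stand-in for the SAS REG procedure, not the authors' code. The function names are hypothetical, and applying the base-10 log transform to both the diluted and undiluted fat values is an assumption, since the text does not specify which side of the regression was transformed.

```python
import math
from typing import List, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x: List[float], y: List[float], intercept: float, slope: float) -> float:
    """Coefficient of determination of the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def predict_undiluted(diluted: float, intercept: float, slope: float,
                      log10_scale: bool = False) -> float:
    """Predict the undiluted content from a diluted reading.

    With log10_scale=True the model is assumed to have been fitted on
    log10-transformed values (as for fat), and the prediction is back-transformed.
    """
    if log10_scale:
        return 10 ** (intercept + slope * math.log10(diluted))
    return intercept + slope * diluted


# Made-up calibration readings for one trait (not data from the study).
diluted_readings = [1.0, 1.1, 1.2, 1.3]
undiluted_readings = [3.0, 3.3, 3.6, 3.9]
b0, b1 = fit_line(diluted_readings, undiluted_readings)
```

One model per trait and per dilution would be fitted this way on the calibration samples and then applied to the diluted study samples.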

To assess the effects of breed type and genetic group on the predicted milk composition traits, data were analyzed using the MIXED procedure in SAS version 9.2. Fixed effects included in the model were the genetic group, breed type, the age of the cow (at the time of milk sample collection), the month of sampling, and production cluster membership of cows (cluster).

Component trait averages for each genetic group and the breed type were obtained by fitting two separate statistical models, Model 1 and Model 2 for breed types and genetic group, respectively.

Model 1: $Y_{ijkl} = \mu + \text{breed-type}_i + \text{age}_j + \text{month}_k + \text{cluster}_l + e_{ijkl}$

Model 2: $Y_{ijkl} = \mu + \text{genetic-group}_i + \text{age}_j + \text{month}_k + \text{cluster}_l + e_{ijkl}$,

where $Y_{ijkl}$ = individual sample measurement of fat, protein, casein, lactose, SNF, or TS content; $\mu$ = overall mean; $\text{breed-type}_i$ = fixed effect of breed type $i$ ($i$ = RG, RH, RZ, and ZR); $\text{genetic-group}_i$ = fixed effect of genetic group $i$ ($i$ = 25–49% exotic level, 50–74% exotic level, 75–84% exotic level, and >84% exotic level); $\text{age}_j$ = fixed effect of the $j$th age in years ($j$ = 2, 3, 4, 5–10, and >10); $\text{month}_k$ = fixed effect of the $k$th month of milk sample collection ($k$ = June and December); $\text{cluster}_l$ = fixed effect of the $l$th cluster ($l$ = 1, 2, 3, and 4); and $e_{ijkl}$ = random residual term $\sim N(0, \sigma_e^2)$. The degrees of freedom were calculated according to the Satterthwaite method (DDFM = Satterth).

Although farmers provided parity information for study cows, this information was mainly based on guesses and estimates (since most farmers purchase cows that are already in production and have calved several times before). As such, parity information was deemed unreliable and was excluded from the analysis. The significance of the fixed effects included in the two models was tested using the F statistic (p < 0.05). For the main effects of genetic group and breed type, multiple comparisons of least square means were performed using the Tukey test with significance set at p < 0.05.

# RESULTS

# Summary Statistics

**Table 1** summarizes the number of animals per breed type, genetic group, and cluster included in the analysis. Most animals consisted of crosses of Norwegian Red and East African Shorthorn Zebu (RZ) breeds. Compared to other genetic groups, a relatively high proportion (39%) of cows were represented in the genetic group with greater than 84% exotic genes (>84%). On the other hand, the lowest proportion (10%) of animals was represented in the genetic group with 25–49% exotic genes. Overall, the differences between means were small for all traits, within breed types, genetic groups, and production clusters.

# Effect of Milk Dilution on Parameter Estimates

Milk samples were diluted in order to obtain the volume required by the infrared spectrometer to quantify the content of the milk components. Regression equations were then used to determine the predicted component content of the undiluted milk samples. The prediction models' coefficient of determination (R<sup>2</sup>), root-mean-square error (RMSE), and coefficient of variation (CV) for the analyzed milk traits are presented in **Table 2**. The coefficients of determination of the prediction models for the milk traits ranged from 91 to 99%. The parameter estimates for all the milk traits were slightly lower for dilution 2 (16.7% v/v) compared to dilution 1 (33.3% v/v). Fat content exhibited the largest CV: 2.1 and 4.6 for dilution 1 and dilution 2, respectively (**Table 2**). On the other hand, lactose had the smallest relative variability (CV = 0.77 and 1.23 for dilution 1 and dilution 2, respectively).

TABLE 1 | Summary of the number of cows per breed type, genetic group, and production cluster included in the study and their respective raw means ± SD of each milk trait.


<sup>1</sup>Breed types were classified based on the individual breed composition estimated from SNP markers: RG = (Norwegian Red X Friesian, Norwegian Red X Guernsey, and Norwegian Red X Jersey); RH = (Holstein X Norwegian Red and Norwegian Red X Holstein); RZ = (Norwegian Red X Zebu and Norwegian Red X N'Dama); and ZR = (Zebu X GIR, Zebu X Norwegian Red, and Zebu X Holstein).

<sup>2</sup>Proportion of exotic genes estimated using SNP genotype markers and classified into four classes as cows with 25–49% exotic genes, 50–74% exotic genes, 75–84% exotic genes, and those with greater than 84% exotic genes.

<sup>3</sup>Classification of households based on the farm characteristics; SNF, solids-not-fat; TS, total solids.

TABLE 2 | Coefficient of determination (R<sup>2</sup>), root-mean-square error (RMSE), and coefficient of variation (CV) of the prediction models for the milk traits derived from the University of Nairobi dairy cattle used as a training population.


SNF, solids-not-fat; TS, total solids.

# Estimates for Milk Composition Traits

The descriptive statistics and CV for the analyzed milk traits are presented in **Table 3**. The mean contents of fat, protein, casein, lactose, SNF, and TS were 3.70, 3.24, 2.95, 4.28, 7.49, and 11.64%, respectively. Of all the milk traits, fat and lactose content had the largest (38.23%) and smallest (9.63%) CV, respectively. Milk total protein and casein displayed moderate and similar CVs, with content ranging from 2.24 to 4.78% for protein and from 2.14 to 4.22% for casein.
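For readers who want to relate the reported CVs to absolute spread, the sketch below computes a CV from raw values and backs out the standard deviation implied by a reported mean and CV. The function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption.

```python
import statistics
from typing import List


def coefficient_of_variation(values: List[float]) -> float:
    """CV in percent: 100 * sample standard deviation / mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def sd_from_cv(mean: float, cv_percent: float) -> float:
    """Standard deviation implied by a reported mean and CV (percent)."""
    return mean * cv_percent / 100.0


# Implied SDs from the means and CVs reported in the text.
fat_sd = sd_from_cv(3.70, 38.23)      # roughly 1.41 percentage points
lactose_sd = sd_from_cv(4.28, 9.63)   # roughly 0.41 percentage points
```

On this reading, fat content varied by roughly 1.4 percentage points around its mean, against roughly 0.4 for lactose.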

# Effects of Genetic and Non-genetic Factors on Milk Constituents

## Genetic Factors

A fixed model was used to determine the relationships between milk component content and a set of fixed effects. The fixed effects included in the model were breed type, dairyness (proportion of exotic genes), age of the cow, production cluster, and month of sampling.

### **Age of the cow**

The least square means for the effect of the age of the cow are provided in **Table 4**. Overall, the age of the cow did not have a significant effect on any of the milk component traits (p > 0.05).

### **Genetic group**

The least square means for the effect of genetic group are provided in **Table 5**. The genetic group of the cows had a significant effect on total protein and casein content (p < 0.05). The total protein content was higher (3.4 ± 0.08%) in the 75–84% genetic group compared to 3.18 ± 0.07% in the >84% genetic group. Similarly, casein content significantly (p < 0.05) differed in three genetic groups: 25–49, 75–84, and >84%, with the highest content observed for genetic group 25–49% (3.18 ± 0.1%) and the lowest for genetic group >84% (2.98 ± 0.05%). We observed no significant difference (p > 0.05) for fat, lactose, SNF, and TS between genetic groups.

TABLE 3 | Means and the coefficients of variation of the predicted milk traits for the study samples (Tanzanian milk data).

CV, coefficient of variation; SNF, solids-not-fat; TS, total solids.

TABLE 4 | Least square means and standard errors for milk component traits for the age of the cow.

<sup>1</sup>Age of the cows reported by farmers at the time of milk collection and defined in four classes (3, 4, 5–10, and >10 years). SE, standard errors; SNF, solids-not-fat; TS, total solids.

TABLE 5 | Least square means and standard errors for milk component traits for each genetic group.

<sup>1</sup>Proportion of exotic genes estimated using SNP genotype markers and classified into four classes as cows with 25–49% exotic genes, 50–74% exotic genes, 75–84% exotic genes, and those with greater than 84% exotic genes.

<sup>2</sup>Number of samples in each genetic group.

<sup>3</sup>Standard errors.

SNF, solids-not-fat; TS, total solids.

<sup>a,b</sup>Means within a row with different superscripts differ significantly (p < 0.05).

Plots of least square mean estimates for milk component traits by genetic group are shown in **Figures 1**, **2**. Overall, the mean was highest for the 75–84% genetic group and lowest for the 25–49% genetic group. For fat and protein, the trend seems to suggest a general increase in component levels as dairyness increases, with a sharp drop for the animals in the >84% group. For lactose and casein, the trend is not clear. However, the drop for the >84% group is consistent for all components evaluated.

### **Breed type**

**Table 6** gives the least square means and associated standard errors of the milk traits for each breed type. Overall, we observed no significant difference in milk composition among breed types (p > 0.05). The RG breed type (consisting of crossbreeds of Jersey, Guernsey, Holstein, and Norwegian Red breeds) had the highest average fat content (4.05 ± 0.51%), while breed type ZR (consisting of crossbreeds of Zebu and Norwegian Red breeds) had the lowest average fat content (3.04 ± 0.25%). The total protein and casein content was similar across breed types.

## Non-genetic Factors

### **Effect of season**

In this study, the months of sampling coincided with the two seasons in Tanzania: wet season (June) and dry season (December). Least square means and the respective standard errors for the effect of season on milk traits are given in **Table 7**. The month of sampling had a significant effect (p < 0.001) on the content of milk fat, casein, and SNF. Casein content was higher in milk sampled in the wet season (3.27 ± 0.06%) than in the dry season (2.88 ± 0.06%), with a mean difference of 0.39 ± 0.08%. Similarly, SNF was greater in the wet season (7.81 ± 0.13%) than in the dry season (7.32 ± 0.12%), with a mean difference of 0.49 ± 0.18%. In contrast, fat content was significantly (p < 0.001) higher in the dry season (3.97 ± 0.24%) than in the wet season (2.59 ± 0.24%), with a mean difference of 1.38 ± 0.34%. The TS and lactose contents were not affected by the month of sampling (p = 0.089).
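The reported seasonal differences can be reproduced approximately from the two seasonal estimates. The sketch below treats the two least square means as independent, which is an assumption; the mixed-model standard error of a difference comes from the model's covariance structure, so agreement with the rounded published values should be read as a consistency check rather than as the authors' calculation. The function name is hypothetical.

```python
import math
from typing import Tuple


def difference_with_se(mean_a: float, se_a: float,
                       mean_b: float, se_b: float) -> Tuple[float, float]:
    """Difference between two estimates and its standard error.

    Assumes the two estimates are independent, so the variances add.
    Both values are rounded to two decimals, as in the text.
    """
    diff = mean_a - mean_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return round(diff, 2), round(se_diff, 2)


casein = difference_with_se(3.27, 0.06, 2.88, 0.06)  # wet minus dry
snf = difference_with_se(7.81, 0.13, 7.32, 0.12)     # wet minus dry
fat = difference_with_se(3.97, 0.24, 2.59, 0.24)     # dry minus wet
```

Under the independence assumption, all three pairs match the differences and standard errors quoted in the text to two decimal places.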

### **Effect of production environment**

The least square means of the clusters are shown in **Table 8**. Cluster membership of cows significantly (p < 0.05) affected the total protein, casein, SNF, and TS content. Casein content was higher for cows in cluster 3 (3.23 ± 0.08%) and cluster 4 (2.91 ± 0.08%) (p = 0.0001 and p = 0.021, respectively). On the other hand, protein content was significantly lower in cluster 4 (3.05 ± 0.1%) compared to cluster 1 (3.35 ± 0.09%), cluster 2 (3.32 ± 0.08%), and cluster 3 (3.38 ± 0.1%). SNF content was higher in cluster 1 (7.62 ± 0.15%) than cluster 4 (7.28 ± 0.17%). TS content was significantly (p < 0.05) greater in cluster 1 (12.10 ± 0.35%) than in cluster 2 (11.0 ± 0.39%), cluster 3 (11.35 ± 0.32%), and cluster 4 (10.51 ± 0.42%). There was no significant difference in lactose and fat content among the clusters.

TABLE 6 | Least square means and standard errors for milk component traits per breed type.

<sup>1</sup>Breed types were classified based on the individual breed composition estimated from SNP markers: RG = (Norwegian Red X Friesian, Norwegian Red X Guernsey, and Norwegian Red X Jersey); RH = (Holstein X Norwegian Red and Norwegian Red X Holstein); RZ = (Norwegian Red X Zebu and Norwegian Red X N'Dama); and ZR = (Zebu X GIR, Zebu X Norwegian Red, and Zebu X Holstein).

SE, standard errors; SNF, solids-not-fat; TS, total solids.

TABLE 7 | Least square means and standard errors for milk component traits for month of sampling.

SNF, solids-not-fat; TS, total solids.

<sup>a,b</sup>Means within a row with different superscripts differ significantly (p < 0.05).

# DISCUSSION

# Summary Statistics and Parameter Estimates

The small differences between means for all traits, within breed types, genetic groups, and production clusters observed in this study (**Table 1**) are likely because of small sample sizes within each grouping. Differences in the parameter estimates between the two dilutions used for prediction suggest a noticeable effect of dilution on the variability of milk composition and that the relationship between dilutions is not linear. Fat content, for instance, exhibited the largest CV: 2.1 and 4.6 for dilution 1 and dilution 2, respectively (**Table 2**). This variability may be partly attributed to the stability of the milk fat emulsion and the varying sizes of fat globules (Suranindyah and Pretiwi, 2015), which probably become more unstable with increased dilution. On the other hand, the small relative variability of lactose (0.77 and 1.23 for dilution 1 and dilution 2, respectively) largely reflects its greater solubility in water. We undertook to collect milk samples for prediction specifically from crossbred cows at the University of Nairobi farm in order to be comparable to the study samples. However, it is important to point out that the milk samples obtained from the farm were from one herd and collected in the same season. On the contrary, the milk samples from Tanzanian cows were collected over two seasons and from different management systems. The prediction equations obtained from dilution of samples from the University of Nairobi farm were useful because they provided a mechanism to understand how dilution affects milk component content, and the resultant equations could then be used to predict the milk component content for the undiluted target samples. Because the training data used to produce the equations came from only a small sample set, the estimates for undiluted components may carry some bias.

# Estimates for Milk Composition Traits

Overall, the average milk component content recorded in this study was within the range of values reported in previous studies for Holstein–Friesian dairy breeds (Glantz et al., 2009; Palladino et al., 2010; Penasa et al., 2014). This is not surprising given that our analysis of admixture and genetic composition of the study population indicated a dominant Holstein–Friesian origin. Compared to studies by Heck et al. (2009) and Penasa et al. (2014), this study had larger CVs. However, results similar to those obtained in this study were reported by Varotto et al. (2015) except for fat content whose CV was much higher (38.23%) in the present study. It should be emphasized that the results observed in this study are predicted mean values obtained from the diluted milk samples. The large CV for the content of milk fat might be partly due to the effect of dilution.

# Effects of Genetic and Non-genetic Factors on Milk Constituents

### Genetic Factors

### **Age of cow**

Results indicated that the age of cow did not have a statistically significant effect on any milk component trait. This observation was expected given that the age information provided by the farmers was based on estimates rather than written records, since most farmers purchase mature cows already in production, with no accompanying pedigree or performance records.

TABLE 8 | Least square means and standard errors for milk component traits for each cluster.

<sup>1</sup>Groups of households defined based on the farm characteristics.

<sup>2</sup>Number of samples in each cluster.

<sup>3</sup>Standard errors.

SNF, solids-not-fat; TS, total solids.

<sup>a,b</sup>Means within a row with different superscripts differ significantly (p < 0.05).

### **Genetic group**

Previous studies have demonstrated the relationship between breed type and milk quality (Carroll et al., 2006; Palladino et al., 2010). However, smallholder production systems in sub-Saharan Africa utilize non-descript crossbred animals with unknown breed type. We used admixture analysis to estimate the breed proportions of known dairy breeds in the study cattle. Based on the extent of the dairyness (proportion of exotic genes) of the animals, they were grouped into four genetic groups as follows: 25–49, 50–74, 75–84, and >84%, to mimic a backcross, F1, F2, and F3 exotic crosses, respectively.

Although it is well documented that the genotype of the cow has a significant effect on milk composition (Coleman et al., 2010; Schwendel et al., 2015), failure to detect any relationship between the genetic group and the majority of the milk component traits studied (fat, lactose, SNF, and TS) could be related to the fact that our milk samples were obtained from smallholder farms characterized by diverse dairy management practices. These estimates are therefore confounded by other environmental influences that act on the genotype and cannot be accounted for in our model, especially owing to the very small herd sizes. Additionally, the breed composition of the cow was based on admixture from many different breeds, which also adds to the complexity of estimating genetic effects. To understand the lack of significant difference between observed means, we performed a post hoc power analysis, which revealed that a sample size of 580 was required to detect a difference for an effect size of 0.23 (the difference observed between fat content for genetic group 25–49 and >84%), considering a power of 0.8 and a 5% significance level. This was well beyond the available sample size (101 cows in the two genetic groups being compared) and reinforces the need for a larger study with appropriate sample size.
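The sample size calculation described above can be illustrated with a minimal sketch. The source does not state which test or software was used, so this assumes a two-sided, two-sample comparison of means under a normal approximation, and treats 0.23 as a standardized effect size; the function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Approximate animals needed per group (normal approximation, assumed design)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n = n_per_group(0.23)
print(f"per group: {n}, total for two groups: {2 * n}")
```

Under these assumptions the sketch gives 297 animals per group (594 in total), which is of the same order as the 580 reported; the gap may reflect a different test or software in the original analysis.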

The trends for fat percentage observed here are quite contrary to expectations, since indigenous animals tend to have milk with higher fat percentage (Haile et al., 2008). Available data from literature indicate that average milk fat content ranges between 2.0 and 6.1% in animals fed total mixed ration (TMR) (Kelsey et al., 2003) and between 2.68 and 4.50% for grass-fed cattle (Kay et al., 2005; Myburgh et al., 2012). Our results fall within the range for grass-fed cattle, as expected given that most animals subsist on leafy greens (mostly Guatemala or elephant/Napier grass) as the main feed. Additionally, it is well established that there are breed differences with respect to milk fat content (reviewed by Samková et al., 2012). Further, indicine cattle tend to have higher fat content than taurine cattle (Haile et al., 2008). Based on this premise, we expected that animals with relatively high indicine proportions (25–49% genetic group) would have higher milk fat content. The disparity between our expectations and what was observed is likely due to a management effect, where the animals are kept in confinement but receive little supplemental feed and subsist only on leafy greens whenever available. There are limited published studies of equivalent systems and animal types to compare our results to. In their study of crosses of Holstein–Friesian and Boran, Haile et al. (2008) reported that the content of milk components decreased with increasing proportion of exotic gene content. This runs contrary to our results. These results could be due to differences in the relative sizes of the additive and heterosis effects, which likely differed among genetic groups (Cunningham and Syrstad, 1987). From our results, it would appear that the 75–84% genetic group maximizes the heterotic effects obtainable from the crossbred population studied.

### **Breed type**

We found no significant difference in milk component content among breed types. However, the relatively higher fat percentage in breed type RG is probably due to the excess of Jersey and Guernsey genes in this breed type, which is in conformity with numerous studies that indicate superior milk quality due to Jersey and Guernsey genes (Croissant et al., 2007; De Marchi et al., 2008; Palladino et al., 2010). The lack of variability in mean estimates among breeds is likely a function of our definition of breed types (as a combination of the breeds making up the top 75% of the gene composition of each cow, with the breed name being defined by the breed of highest presence). It is also possible that an increase in sample size would allow the confounding effects to average out such that true differences can be estimated.

### Non-genetic Factors


### **Effect of season**

In Tanzania, there is an extreme seasonal fluctuation in milk production due to changes in rainfall and feed production for dairy animals (Nell et al., 2014). The seasonal variation of milk component levels observed in this study can be explained by seasonal changes in the composition of the feeds available to the animals. Jenkins and McGuire (2006) observed that lactose content in milk is less sensitive to dietary changes. The findings of this study are similar to those of other studies such as Auldist et al. (2000) and Heck et al. (2009), who also observed large seasonal variation in the major milk components in Holstein dairy cows. The higher fat content in the dry season compared to the wet season is likely related to reduced moisture levels in feeds as well as the feeding practices adopted. Typically, dairy feeding in smallholder systems is largely based on crop residues, roadside grazing, and occasionally on fodder crops. However, the dry season in Tanzania is usually characterized by scarcity and poor quality of feeds. Farmers, therefore, tend to increase the use of commercial supplements such as oil by-products, maize germ, cottonseed cake, and sunflower cake. The use of these concentrates has been shown to result in an increase in the content of milk fat (Carroll et al., 2006).

### **Effect of production environment**

As described in the previous section, one of the key factors used for defining clusters in this study was the animal feeding practices adopted by smallholder farmers in the study sites. It is well established that diet has a profound effect on both milk composition and yield (Turner et al., 2006). Carroll et al. (2006) demonstrated that casein proportion decreases linearly with increased supplemental fat. It is not surprising, therefore, that cluster 4 characterized by intensive use of maize germ as supplement had lower casein content compared to cluster 3 which was characterized by the low intensity in the use of supplements. It has been proposed that increased use of concentrate supplements leads to decreased release of somatotropin which reduces mammary extraction of amino acids (Cant et al., 1993) and thus a decline in casein content.

Compared to cluster 1, cows in cluster 2 were managed intensively with diverse use of supplements such as maize bran and oilseed by-products. Notably, the farmers in cluster 1 practiced subsistence dairy farming, characterized by minimal supplementation that manifested as low productivity and low milk sales. Given the negative correlation between milk yield and TS (Bobe et al., 2007), the low milk yield and high TS content were expected. Based on the results of this study, it would appear that cluster 1 and cluster 3 maximize the milk component content of the study population.

# CONCLUSION

The results obtained in this study indicate low variability in milk composition traits among breed types and genetic groups (defined by the level of the exotic genes). The 75–84% genetic group tended to have superior performance with regard to maximizing milk component content. However, it is clear that a more rigorous and larger study would be required to understand how breed type and genetic group affect milk quality in systems with highly admixed animals. Such an understanding is critical in recommending the types of crossbred cows farmers need to keep in order to produce milk that meets market demand. Additionally, these results will be valuable in assessing the viability of an offtaker payment scheme based on the quality of milk delivered by farmers.

# AUTHOR CONTRIBUTIONS

FM conceived, designed, and obtained funding for the study. EC performed the experiment. EC, FM, and JA analyzed the data. EC and FM drafted the manuscript. RB and JA made suggestions and corrections. All authors read and approved the final manuscript.

# FUNDING

This study was made possible with funding obtained through AgriTT Research Challenge Fund from the DFID, United Kingdom.

# ACKNOWLEDGMENTS

Genotypes for the reference breeds were kindly provided by Olivier Hanotte (East African Shorthorn Zebu), Tad Sonstegard (Norwegian Red, Holstein, Guernsey, Jersey, N'Dama, and Gir), and Edinburgh Genetic Evaluation Services (EGENES), Scotland's Rural College, Edinburgh (Friesian). We thank the University of Nairobi for providing us with milk samples used for prediction analysis.




**Conflict of Interest Statement:** EC and FM are currently employees of USOMI LTD. However, they completed the work detailed in the submitted manuscript before coming into said employment.

The other authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer NK and handling Editor declared their shared affiliation.

Copyright © 2018 Cheruiyot, Bett, Amimo and Mujibi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genetic Diversity of Seven Cattle Breeds Inferred Using Copy Number Variations

### Magretha D. Pierce<sup>1</sup> \*, Kennedy Dzama<sup>2</sup> and Farai C. Muchadeyi <sup>3</sup>

*<sup>1</sup> Animal Production, Agricultural Research Council, Pretoria, South Africa, <sup>2</sup> Department of Animal Sciences, University of Stellenbosch, Stellenbosch, South Africa, <sup>3</sup> Biotechnology Platform, Agricultural Research Council, Pretoria, South Africa*

Copy number variations (CNVs) comprise deletions, duplications, and insertions found within the genome larger than 50 bp in size. CNVs are thought to be primary role-players in breed formation and adaptation. South Africa boasts a diverse ecology with harsh environmental conditions and a broad spectrum of parasites and diseases that pose challenges to livestock production. This has led to the development of composite cattle breeds which combine the hardiness of Sanga breeds and the production potential of the Taurine breeds. The prevalence of CNVs within these respective breeds of cattle and the prevalence of CNV regions (CNVRs) in their diversity, adaptation and production is however not understood. This study therefore aimed to ascertain the prevalence, diversity, and correlations of CNVRs within cattle breeds used in South Africa. Illumina Bovine SNP50 data and *PennCNV* were utilized to identify CNVRs within the genome of 287 animals from seven cattle breeds representing Sanga, Taurine, Composite, and cross breeds. Three hundred and fifty six CNVRs of between 36 kb and 4.1 Mb in size were identified. The null hypothesis that one CNVR locus is independent of another was tested using the *GENEPOP* software. One hundred and two and seven of the CNVRs in the Taurine and Sanga/Composite cattle breeds, respectively, demonstrated a significant (*p* ≤ 0.05) association. *PANTHER* overrepresentation analyses of correlated CNVRs demonstrated significant enrichment of a number of biological processes, molecular functions, cellular components, and protein classes. CNVR genetic variation between and within breed groups was measured using PhiPT, which allows intra-individual variation to be suppressed and hence proved suitable for measuring binary CNVR presence/absence data. Estimated PhiPT within and between breed variance was 2.722 and 0.518 respectively. Pairwise population PhiPT values corresponded with breed type, with Taurine Holstein and Angus breeds demonstrating no between breed CNVR variation. Phylogenetic trees were drawn. CNVRs primarily clustered animals of the same breed type together. This study successfully identified, characterized, and analyzed 356 CNVRs within seven cattle breeds. CNVR correlations were evident, with many more correlations being present among the exotic Taurine breeds. CNVR genetic diversity of Sanga, Taurine and Composite breeds was ascertained, with breed types exposed to similar selection pressures demonstrating analogous incidences of CNVRs.

Edited by:

*Tad Stewart Sonstegard, Recombinetics, United States*

### Reviewed by:

*Yang Zhou, Huazhong Agricultural University, China Kwan-Suk Kim, Chungbuk National University, South Korea*

> \*Correspondence: *Magretha D. Pierce wangm@arc.agric.za*

### Specialty section:

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

Received: *13 December 2017* Accepted: *23 April 2018* Published: *15 May 2018*

### Citation:

*Pierce MD, Dzama K and Muchadeyi FC (2018) Genetic Diversity of Seven Cattle Breeds Inferred Using Copy Number Variations. Front. Genet. 9:163. doi: 10.3389/fgene.2018.00163*

Keywords: genetic diversity, CNVs, population structure, South African cattle, breed history, selection


# INTRODUCTION

Copy number variations are deletions, duplications, and insertions larger than 50 bp in size that modify the DNA structure and play a significant role in the genomic variability and hence diversity evident within and among breeds (Letaief et al., 2017). They have been observed to affect a greater percentage of genomic sequences relative to other forms of genomic variations like single nucleotide polymorphisms (SNPs) (Zhang et al., 2009; Hou et al., 2012; Liu and Bickhart, 2012). SNPs and microsatellite analyses have been used to assess population structures and genetic diversity in order to gain insight into origin, history and adaptation of cattle. CNVR loci have however been found within gene boundaries, with the incidence of some coinciding with breed histories and breed formation patterns (Matukumalli et al., 2009; Hou et al., 2011). Covering a greater number of sequences than SNPs, CNVs may alter gene dosage, disturb coding sequences or sway gene regulation (Stranger et al., 2007). CNVs have been proposed to play a role in genetic adaptation (Liu et al., 2010). Stranger et al. (2007) demonstrated SNPs and CNVs to capture 83.6 and 17.7% of the observed genetic variation respectively with very little overlap in the variation captured by the two variant types. It was thus hypothesized that ascertaining the genetic variations captured by CNVs will generate supplementary information regarding the genetic variation which may add to that already obtained from SNPs. CNVs may hence be a suitable genomic marker for ascertaining cattle origins and history as well as divergence amongst breeds.

The formation and fixation of CNVRs within the genome has not been fully explored. It has been proposed that forces such as recombination, selection and mutations are the primary factors driving the genomic architecture of large variations (Jimenez, 2014). Their fixation within the genome indicates an advantage that necessitates DNA repair mechanisms to not remove them from the genome. Gene ontology analyses demonstrate CNVRs to be prevalent in specific regions of the genome covering genes involved in specific biological, cellular or molecular process (Wang et al., 2015). Whether the fixation of CNVRs at one region of the genome corresponds with the fixation of another CNVR at a different region but possibly involved in the same process or a confounding process has not been explored. If CNVRs are correlated within the genome, this may indicate them to not be random events that occur subsequent to recombination errors, but that selection pressure and other biological mechanisms may be driving their formation and/or fixation at specific locations within the genome.

A number of Taurine, Sanga, and Composite breeds are found in South Africa. While exotic Taurine breeds demonstrate improved production subsequent to the development and elevated focus of intense selection programs, indigenous Sanga breeds of South Africa are recognized for their innate ability to handle the range of harsh climatic conditions and feed and water scarcity, together with a widespread array of diseases and pathogens customary to South Africa (Hoffmann, 2010; Mirkena et al., 2010). Composite breeds, like the Bonsmara, have been developed to merge the adaptive ability of indigenous cattle with the productive ability of the Taurine breeds (Bonsma, 1980). Makina et al. (2014) assessed the genetic variation of Composite, Sanga, and Taurine cattle breeds using genome wide SNP data. Considering the evidenced adaptation of Sanga breeds that have also been introgressed into Composite breeds, the determination of genetic variation of CNVRs in these breeds may hold further insight into understanding the multiple components of functional breed diversity and the subsequent implications thereof. This may have important implications for current breed management and genetic improvement practices. In addition to this, ascertaining whether or not the presence of one CNVR within the genome is correlated with another CNVR would give further insight into understanding the driving force behind CNVR formation and possible fixation within the genome.

This study therefore comprised an investigation into the diversity of seven cattle breeds sampled in South Africa (Angus, Drakensberger, Afrikaner, Holstein, Nguni, and Bonsmara) from each of three breed groups (Taurine, Sanga, and Composite) and one cross breed (Nguni X Angus) utilizing CNVRs. It was hypothesized that CNVR genetic diversity would parallel breed history and adaptation, with greater CNVR variation being present between breeds that are more distantly related or exposed to distinct selection pressures. The relationship between identified CNVRs within the genome was also explored in order to determine whether selection pressures were causing joint fixation of multiple CNVRs involved in similar or complementary processes. Illumina BovineSNP50 genotyping methodology was used in conjunction with PennCNV to identify CNVRs and subsequent genes enriched by CNVRs. CNVRs were used to ascertain levels of genetic diversity and to determine the measure of pairwise correlation in CNVR presence within and among breeds.

# MATERIALS AND METHODS

# Sample Collection and Genotyping

Genomic data were obtained from Makina et al. (2014) and Makina et al. (2015). The dataset comprised 287 animals from two Taurine (45 Holstein and 32 Angus), two Sanga (59 Nguni and 48 Afrikaner), and two Composite (46 Bonsmara and 48 Drakensberger) breeds and one crossbred group (10 Nguni X Angus), sampled throughout South Africa. Informed consent from respective breeders was obtained. The protocol utilized for the collection of samples, DNA extraction and genotyping has been published (Makina et al., 2014, 2015). Animal handling and sample collection were performed according to the University of Pretoria Animal Ethics Committee code of conduct (E087-12).

# SNP Quality Control

SNP quality control was performed for all animals using PLINK v.1.07. SNPs with a MAF of <0.02, a call rate of <95%, or a missing genotype frequency of more than 10% were excluded from further analyses. Of the 54,609 markers on the Illumina Bovine SNP50 beadchip v2, 45,924 SNPs had a call rate and MAF of greater than 0.95 and 0.02 respectively and thus remained for further analyses. A PennCNV input file containing LogR ratio and B allele frequency data of 45,925 good quality SNPs for 287 animals was generated in GenomeStudio Software 2011.1 and exported for further analysis.
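The two main thresholds can be expressed as a short filter. This is a minimal sketch and not the PLINK implementation used in the study: it assumes genotypes coded as 0, 1, or 2 copies of the B allele with `None` for a missing call, and the function name `passes_qc` is hypothetical.

```python
from typing import Optional, Sequence

Genotype = Optional[int]  # assumed coding: 0, 1, 2 copies of the B allele; None = missing


def passes_qc(genotypes: Sequence[Genotype],
              min_maf: float = 0.02,
              min_call_rate: float = 0.95) -> bool:
    """Return True if one SNP meets the call-rate and MAF thresholds."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False  # no usable genotypes for this SNP
    call_rate = len(called) / len(genotypes)
    freq_b = sum(called) / (2 * len(called))  # B allele frequency among called animals
    maf = min(freq_b, 1 - freq_b)
    return call_rate >= min_call_rate and maf >= min_maf


# Toy data: three SNPs genotyped in 20 animals (illustrative values only).
snps = {
    "snp_polymorphic": [0, 1, 2, 1] * 5,
    "snp_monomorphic": [0] * 20,
    "snp_low_call_rate": [1] * 18 + [None] * 2,
}
kept = [name for name, g in snps.items() if passes_qc(g)]
print(kept)
```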

# CNVRs Identification and Distribution

PennCNV has outperformed a number of CNV detection packages, especially with regard to specificity and sensitivity of CNV calling (Castellani et al., 2014; Zhang Q. et al., 2014). This software was therefore utilized to identify CNVs within the genome of 287 cattle. The PennCNV compile\_pfb script (Wang et al., 2015) was utilized to create a pfb file from the data. The detect\_cnv.pl script was run to detect CNVs on 29 autosomes. GC content within a 1 Mb region (500 kb per side) surrounding each marker was calculated and utilized to create the bovine gcmodel. A second analysis including the gcmodel option was also run for comparative purposes. In order to reduce the number of false positive CNVs, identified CNVs were filtered according to four different filtering stringencies as described by Wang et al. (2015). All CNVs filtered in the absence of the gcmodel with a genomic waviness of 0.04 were identified by other models and were therefore used for further analyses. In addition, the CNVRs identified were checked against the false positive CNVRs reported by Zhou et al. (2016).

The bioinformatics and evolutionary genomics VENN diagram webtool (http://bioinformatics.psb.ugent.be/webtools/Venn/) was used to create a Venn diagram demonstrating the overlap between CNVs identified in different breeds. Adjacent and overlapping CNVs were aggregated to form CNVRs utilizing bioinformatic approaches as recommended by Redon et al. (2006).
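Aggregating overlapping CNVs into CNVRs is, at its core, an interval-merging step. The sketch below illustrates that idea only; it does not reproduce the exact criteria of Redon et al. (2006), and the `max_gap` parameter used to treat nearby CNVs as adjacent is an assumption introduced for illustration.

```python
from typing import List, Tuple

CNV = Tuple[str, int, int]  # (chromosome, start, end) in base pairs


def merge_cnvs(cnvs: List[CNV], max_gap: int = 0) -> List[CNV]:
    """Merge CNVs on the same chromosome that overlap or lie within max_gap bp."""
    regions: List[CNV] = []
    for chrom, start, end in sorted(cnvs):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2] + max_gap:
            prev_chrom, prev_start, prev_end = regions[-1]
            # Extend the current region to cover the new CNV.
            regions[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            regions.append((chrom, start, end))
    return regions


# Toy coordinates (illustrative, not study data).
calls = [("1", 100, 200), ("1", 150, 300), ("1", 400, 500), ("2", 100, 200)]
print(merge_cnvs(calls))
```

Sorting first means each CNV only needs to be compared with the most recent region, so the merge runs in a single pass.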

A CNVR dataset was created from CNVRs identified in 287 animals from seven cattle breeds. Each CNVR was treated as an individual locus, and only those CNVRs identified in three or more animals were utilized so as to reduce the rate of false positives within the dataset (Jakobsson et al., 2008). Three input files were generated. The first contained individual animals with binary presence/absence data for each of the 110 CNVR loci that remained post pruning. The second dataset comprised presence/absence data of the 110 CNVR loci for each of the seven cattle breeds, while the third dataset contained information on the CNVR loci frequencies for each of the seven cattle breeds.
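The pruning step and the first (per-animal) input file can be sketched as follows. The data layout, a mapping from animal ID to the set of CNVR IDs called in that animal, and the function name `presence_absence` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def presence_absence(calls: Dict[str, Set[str]],
                     min_animals: int = 3) -> Tuple[List[str], Dict[str, List[int]]]:
    """Keep CNVRs seen in at least min_animals animals and code them as 0/1 per animal."""
    counts = Counter(cnvr for cnvrs in calls.values() for cnvr in cnvrs)
    kept = sorted(cnvr for cnvr, n in counts.items() if n >= min_animals)
    matrix = {animal: [int(cnvr in cnvrs) for cnvr in kept]
              for animal, cnvrs in calls.items()}
    return kept, matrix


# Toy calls for four animals (illustrative IDs only).
calls = {
    "animal_1": {"CNVR1", "CNVR2"},
    "animal_2": {"CNVR1"},
    "animal_3": {"CNVR1", "CNVR3"},
    "animal_4": set(),
}
loci, matrix = presence_absence(calls)
print(loci)
print(matrix)
```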

# CNVR Correlations and Representation

A pairwise association test of the null hypothesis that genotypes at one locus were independent of genotypes at the other locus was performed using GENEPOP (Raymond and Rousset, 1995). Only those CNVRs identified in three or more animals were used. CNVR correlations were run across all seven breeds and, separately, across Sanga/Composite and Taurine breeds. Contingency tables demonstrating the relationship between all pairs of loci within and between breeds were created. A Markov chain algorithm described by Raymond and Rousset (1995) computed a G-test and probability test for each table. CNVRs demonstrating a significant correlation with a p-value of <0.05 were uploaded onto UCSC to ascertain genomic region information. A PANTHER overrepresentation analysis using the Bonferroni correction for multiple testing was performed on genes covered by correlated CNVRs to ascertain whether any molecular functions, biological processes or cellular components were significantly (p < 0.05) overrepresented by correlated CNVRs.
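For a pair of binary presence/absence loci, the contingency table is 2 × 2 and the G statistic can be computed directly. The sketch below is an illustration under simplifying assumptions: GENEPOP estimates p-values with a Markov chain method, whereas this version uses the asymptotic chi-square approximation with one degree of freedom, so results may differ for sparse tables. The function name `g_test_2x2` is hypothetical.

```python
from math import erfc, log, sqrt
from typing import Sequence, Tuple


def g_test_2x2(locus_a: Sequence[int], locus_b: Sequence[int]) -> Tuple[float, float]:
    """G statistic and asymptotic p-value for independence of two 0/1 loci."""
    table = [[0, 0], [0, 0]]
    for a, b in zip(locus_a, locus_b):
        table[a][b] += 1
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            if observed > 0:  # empty cells contribute nothing to G
                expected = row[i] * col[j] / n
                g += 2 * observed * log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    p_value = erfc(sqrt(g / 2))  # chi-square survival function with 1 df
    return g, p_value


# Toy example: two CNVRs that always occur together in 8 animals.
cnvr_x = [1, 1, 1, 0, 0, 0, 0, 0]
cnvr_y = [1, 1, 1, 0, 0, 0, 0, 0]
g, p = g_test_2x2(cnvr_x, cnvr_y)
print(f"G = {g:.2f}, p = {p:.4f}")
```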

# CNVR Genetic Diversity Analyses

Analysis of molecular variance (AMOVA) and principal component analyses were subsequently performed on the pruned data comprising 110 CNVR loci in 287 samples using GenAlex software (Peakall and Smouse, 2012). A tri-matrix of squared Euclidean distances was used to calculate the pairwise population values (PhiPT) by means of an AMOVA using 9,999 permutations. PhiPT values, which are analogous to Wright's FST indices, measure population genetic differentiation from binary data and were used to measure the genetic variation of CNVRs within and among cattle breeds. This measure allows intra-individual variation to be suppressed and hence proved suitable for measuring binary CNVR presence/absence data (Teixeira et al., 2014). A genetic distance tri-matrix was utilized to determine standardised eigenvectors for principal components 1–100. Eigenvalues present the amount of genetic variation contained by each respective principal component (PC). In order to determine how many PCs to retain within the model, each eigenvalue was divided by the total sum of eigenvalues in order to establish the fraction of total variance retained versus the number of eigenvalues. Kaiser's stopping rule states that only PCs demonstrating eigenvalues over 1.00 should be considered in the analysis. This is the most widely used method for determining the number of PCs to retain in the analyses (Peres-Neto et al., 2005).
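The two eigenvalue calculations described here, the fraction of total variance per PC and Kaiser's stopping rule, are simple enough to sketch directly. The eigenvalues below are made-up illustrative numbers, not results from the study, and the function names are hypothetical.

```python
from typing import List, Sequence


def variance_fractions(eigenvalues: Sequence[float]) -> List[float]:
    """Fraction of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def kaiser_retained(eigenvalues: Sequence[float], threshold: float = 1.0) -> List[int]:
    """1-based indices of PCs whose eigenvalue exceeds the threshold."""
    return [i + 1 for i, ev in enumerate(eigenvalues) if ev > threshold]


eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]  # illustrative values only
print(variance_fractions(eigenvalues))
print(kaiser_retained(eigenvalues))
```

With these toy values, the first two PCs exceed the threshold and together carry 75% of the total variance.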

STRUCTURE v2.3.4 was utilized to perform a model-based clustering analysis of population structure as reported by Pritchard et al. (2000) and Falush et al. (2007). Analyses were run using a burn-in period of 5000 reps. The model used did not assume any specific mutation process. Because the exact mutation and inheritance patterns of CNVs are not yet fully understood (Zhang Q. et al., 2014), this model was deemed suitable for CNV analyses. Multiple analyses were performed for K = 2 to K = 8. The membership coefficient Q estimate matrix was plotted as a barplot.

The R function hclust was used to perform a hierarchical dissimilarity cluster analysis on regions with variable copy numbers, based on a distance matrix computed from binary CNVR presence/absence data for each animal. This was performed for each of the three datasets and plotted to demonstrate clusters.
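The distance step that precedes hierarchical clustering can be sketched as below. The source does not state which distance metric was passed to hclust, so squared Euclidean distance (which for 0/1 data equals the number of loci at which two animals differ) is an assumption here; the clustering itself is not reproduced, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def squared_euclidean(a: Sequence[int], b: Sequence[int]) -> int:
    """Squared Euclidean distance between two 0/1 presence/absence profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distance_matrix(profiles: Dict[str, Sequence[int]]) -> Tuple[List[str], List[List[int]]]:
    """Symmetric pairwise distance matrix, with rows and columns in sorted name order."""
    names = sorted(profiles)
    matrix = [[squared_euclidean(profiles[i], profiles[j]) for j in names] for i in names]
    return names, matrix


# Toy profiles over three CNVR loci (illustrative only).
profiles = {"A": [1, 0, 1], "B": [1, 1, 0], "C": [1, 0, 1]}
names, dist = distance_matrix(profiles)
print(names)
print(dist)
```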

# CNVR Gene Ontology and Representation

Genomic regions of CNVRs identified were uploaded into UCSC and details of the regions together with the reflink and refGene genes covered were obtained. VENN (http://bioinformatics.psb.ugent.be/webtools/Venn/) was utilized to construct a Venn diagram demonstrating the overlap of those genes enriched within CNVs identified across breeds. Gene ontologies were determined by means of the PANTHER databases (Helleday, 2003). The hypothesis that genes were over or under represented in PANTHER pathways, biological processes, cellular components, and molecular pathways was tested using the Bonferroni correction at a significance level of 0.05.
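The Bonferroni correction multiplies each raw p-value by the number of tests performed (capped at 1) before comparing it with the significance level. The sketch below shows only this adjustment, not the PANTHER enrichment test itself; the p-values are made-up illustrative numbers and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[Tuple[float, bool]]:
    """Return (adjusted p-value, significant?) for each raw p-value."""
    m = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(p * m, 1.0)  # adjusted p-values cannot exceed 1
        results.append((adjusted, adjusted < alpha))
    return results


raw = [0.001, 0.02, 0.04]  # illustrative raw p-values for three GO terms
for p, (adj, sig) in zip(raw, bonferroni(raw)):
    print(f"raw = {p}, adjusted = {adj:.3f}, significant = {sig}")
```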

# RESULTS

# CNVRs Identification and Characterisation

One thousand and fifty five unique CNVs were identified in 197 of the 287 cattle. CNVs ranged from 31 kb to 2.9 Mb in size, with an average length of 301 kb (**Table 1**). The majority (625) of the CNVs were single copy deletions. Four hundred and five single copy duplications together with 5 double copy duplications and 20 double copy deletions were reported. The smallest CNV was a single copy duplication, while the largest was a single copy deletion.

Adjacent and overlapping CNVs were joined to form 356 CNVRs (**Additional File 1**). CNVRs ranged from 36 kb to 4.1 Mb in length with an average length of 287 kb across breeds. The most CNVRs were identified on chromosomes 4 and 6, while chromosomes 22 and 28 had the least CNVRs. Chromosome 25 presented the greatest portion of its length to be covered by CNVRs. The largest CNVR was present on chromosome 11, while the smallest occurred on chromosome 1. The percentage of chromosomes covered by variations in copy number ranged from 1.15% of chromosome 28 to 14.14% of chromosome 25.

The most CNVRs were identified in the Nguni Angus breed (n = 114), followed by the Holstein (n = 102) and Angus (n = 101) breeds. The Nguni Angus breed also demonstrated the highest average CNVRs per animal at 11.41, considerably higher than the 1.30–3.15 averages of the remaining breeds. Despite the Nguni Angus cross having noticeably fewer animals in the study, the most CNVRs (114) were identified in these 10 animals. 102 and 101 CNVRs were identified in 45 and 32 Holstein and Angus animals respectively. The least CNVRs were identified in the 46 and 48 Bonsmara and Drakensberger animals (**Table 2**). The Nguni demonstrated the most CNVRs of the indigenous breeds, with an average of 1.61 CNVRs per animal.

The chromosomal distribution of CNVRs across breeds demonstrates great variation in the size and number of CNVRs identified per autosome (**Figure 1**). Chromosomes 4 and 6 possessed the most CNVRs. The largest CNVR found on chromosome 11 (CNVR11) was 4.1 Mb in length. This CNVR was present in 76 animals from all 7 breeds. The smallest CNVR of 36 kb was identified in the Afrikaner cattle breed, while the Bonsmara, despite demonstrating the fewest CNVRs, had the longest average CNVR.

Only 4 CNVRs were identified in all seven cattle breeds, with chromosome 17 and chromosome 11 presenting the 2 most common CNVRs. **Figure 2** demonstrates the spatial distribution of CNVs within each breed for the 4 shared CNVRs, which were identified in 53–78 animals. In all four instances Angus, Holstein, and Nguni Angus CNVs represented the largest portion of the CNVR, while Drakensberger CNVs represented the smallest. The consequence of such discrepancies in specific CNV regionality between breeds should be investigated. Most CNVs were shared between fewer breeds, with the Angus and Nguni Angus breeds having the most CNVs in common (**Additional File 2**).

# CNVR Correlations

Of the 110 CNVRs evident in more than 2 individuals, 22 loci demonstrated a significant pairwise association (p ≤ 0.05) with at least one other locus across all 7 breeds, 11 of which demonstrated highly significant correlations (p ≤ 0.002). These loci formed 74 significant correlations with a p-value of < 0.05 (**Additional File 3**). Zhang Q. et al. (2014) report a significant reduction in CNVR associations with increasing CNVR prevalence. Associated CNVRs in this study, however, were present in 3 to 78 animals (**Additional File 4**). On analyzing the data independently for the indigenous (Nguni, Sanga, Bonsmara, Afrikaner, Drakensberger) and exotic (Holstein, Angus) breeds, only 7 loci were significantly correlated within indigenous breeds, representing 6 significant correlations, while 102 loci within the exotic Taurine breeds presented 904 significant (p ≤ 0.05) correlations (**Additional File 5**). Deletions and duplications at the same loci were treated as independent CNVRs. Only one of the correlated loci pairs of all breeds demonstrated a deletion corresponding with a duplication. The rest exhibited correlations occurring between CNVRs of the same copy number. Within the 6 CNVR correlations of the indigenous Sanga and Composite breeds, 4 were between CNVR duplications and 2 were between a deletion and a duplication (**Additional File 6**). The significant Taurine breed CNVR associations exhibited 866 deletion associations, 38 duplication associations and 2 deletion and duplication associations. The 906 correlations evident among CNVRs of Taurine breeds encompass 849 genes. The 7 CNVR correlations evident among the indigenous animals, on the other hand, covered 76 genes. Genes represented within correlated CNVRs were involved in a number of biological, molecular and cellular pathways and are presented in **Table 3**. The representation of CNVR genes in processes, pathways and components that are involved in adaptation implicates CNVRs in adaptation. The significant overrepresentation of such ontologies (**Table 3**) among genes covered by correlated CNVRs further supports this proposal.
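A pairwise association between two CNVR loci can be sketched as a Pearson chi-squared test on a 2 × 2 presence/absence table. The supplementary files report Chi<sup>2</sup> values, but the exact settings used (for example, whether a continuity correction was applied) are not stated here, so the uncorrected test below is an assumption, and the presence vectors are hypothetical.

```python
import math


def chi2_association(presence_a, presence_b):
    """Uncorrected Pearson chi-squared test (1 d.f.) for two binary vectors.

    presence_a, presence_b: per-animal 0/1 flags for two CNVR loci.
    Returns (chi2, p_value).
    """
    if len(presence_a) != len(presence_b):
        raise ValueError("vectors must cover the same animals")
    n = len(presence_a)
    both = sum(1 for a, b in zip(presence_a, presence_b) if a and b)
    only_a = sum(1 for a, b in zip(presence_a, presence_b) if a and not b)
    only_b = sum(1 for a, b in zip(presence_a, presence_b) if b and not a)
    neither = n - both - only_a - only_b

    # Product of the four marginal totals of the 2x2 table.
    denom = (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    if denom == 0:
        return 0.0, 1.0  # one locus is constant, so no association is testable
    chi2 = n * (both * neither - only_a * only_b) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical presence/absence calls for 20 animals at two loci.
locus_a = [1] * 8 + [0] * 12
locus_b = [1] * 7 + [0] * 13
chi2, p_value = chi2_association(locus_a, locus_b)
print(f"chi2={chi2:.3f} p={p_value:.2e}")
```

With small expected cell counts, as for CNVRs present in only a few animals, an exact test may be more appropriate than the chi-squared approximation.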

# CNVR Genetic Diversity Analyses

TABLE 2 | CNVR summary statistics for each of seven cattle breeds (Afrikaner–ANG, Angus–ANG, Bonsmara–BON, Drakensberger–DRK, Holstein–HOL, Nguni–NGU, and Nguni Angus cross–NGxAN).

**Table 4** presents pairwise population PhiPT values for CNVRs of seven cattle breeds. For all breed groups, the degree of variation within populations was considerably greater than that between populations. Pairwise population PhiPT values correspond to breed type groupings, with Taurine breeds showing the least CNVR variation being captured. Sixteen percent of the CNVR genetic variation was among breeds and 84% was within breeds (**Table 5**).
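For orientation, PhiPT in an AMOVA is conventionally the among-population variance component divided by the total variance. The sketch below applies that definition to the rounded percentages quoted above; the function name `phi_pt` is an assumption for illustration, and the value obtained from rounded percentages may differ slightly from the estimate in Table 5.

```python
def phi_pt(var_among, var_within):
    """PhiPT = among-population variance / (among + within) variance."""
    total = var_among + var_within
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_among / total


# Rounded variance percentages reported in the text (16% among, 84% within).
print(round(phi_pt(16, 84), 2))
```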

Principal component analysis demonstrated the greatest amount of variation to be captured in PC 1 with an eigenvalue of 221.267, explaining 87.45% of the total variation captured among individuals (**Additional File 7**). Principal component 11 demonstrated an eigenvalue of 1.058 and was thus chosen as the cutoff component. The Nguni Angus cross animals were the most differentiated from the rest of the animals at PC1 against PC2 (**Figure 3**). With the exception of the Nguni Angus cross animals, all breeds clustered together. The Holstein animals clustered in the same region but with a larger spread, pulling toward the top of the cluster, while the Angus and Afrikaner animals clustered more to the left. The Nguni, Drakensberger, and Bonsmara animals had the most compact clustering, pulling more to the right of the x-axis.

TABLE 3 | Ontologies (GO) with significant (*p* < 0.05) enrichment by genes covered by correlated CNVRs in seven South African cattle breeds.


\**CC, cellular component; PC, protein class; BP, biological process.*

TABLE 4 | Summary results of AMOVA pairwise population CNVR PhiPT values for seven cattle breeds.


TABLE 5 | Summary AMOVA table demonstrating estimate among and within breed CNVR genetic variance for seven cattle breeds.


STRUCTURE was utilized in R to depict the population structure of breed CNVR presence. **Figure 4** demonstrates the evolution of the population structure as K increased from 3 to 7. High levels of admixture were evident in the structure-based clustering. At K = 3, genomic signatures distinct to the Nguni Angus crossbred animals were evident, while genomic signatures distinct to the Sanga breeds of cattle (Afrikaner, Drakensberger and Nguni) were picked up during progression to K = 8. Sanga cattle breeds, which are unique to Africa, comprise a crossbreed between indigenous Taurine and zebu cattle breeds (Rege, 1999).

A cluster dendrogram was generated from CNVRs identified in animals by means of R hclust (**Figure 5**). CNVRs for the most part clustered animals of the same breed together. Five of the 7 Nguni Angus cross animals clustered together with 1 Angus animal in a clade distinct from the rest of the animals. A second clade contained a mix of animals from different breeds, in which some animals clustered together within breeds while the placement of others was seemingly random. The structure of the dendrogram suggests a disparity, with some CNVRs being breed specific variations, while others may possibly be Bos taurus/Bos indicus CNVRs or possibly indicators of interindividual variation.

Hierarchical clustering analyses on CNVR frequency within breeds were performed. A cluster dendrogram of breeds is depicted in **Figure 6**. Binomial clustering of CNVR presence generated two distinct clades separating the indigenous pure breeds from the two Taurine breeds and the Nguni Angus crossbreed. CNVR presence within the Nguni Angus animals placed them right next to the Angus animals and completely separated from the Nguni. The two frequency plots, however, generated distinctly different distributions. CNVR frequency expressed as a percentage caused the Holstein and Nguni Angus animals to segregate away from the other animals, while the Angus breed moved to between the Bonsmara/Nguni and Afrikaner/Drakensberger clades. Upon using the number of animals presenting the CNVR, the Nguni Angus breed was completely isolated, while the two Taurine breeds clustered together and the indigenous breeds assembled in a stepwise fashion.

# CNVR Gene Ontology

Eight hundred and nine genes were covered by the 356 CNVRs identified across seven cattle breeds (**Table 2**). Drakensberger cattle had the fewest CNVR genes, while Angus had the most of the purebreeds and Nguni Angus had the most overall. Of the 809 genes, 6 genes [low affinity sodium-glucose cotransporter-like (LOC527441), netrin G2 (NTNG2), otopetrin 1 (OTOP1), solute carrier family 5 member 1 (SLC5A1), transmembrane protein 128 (TMEM128) and WD repeat domain 1 (WDR1)] were common to all breeds. Three hundred and eighty-nine CNVR genes were breed specific (**Additional File 8**). The most CNVR genes were shared between Angus and Nguni Angus animals. Afrikaner, Angus, Bonsmara, Drakensberger, Holstein, Nguni and Nguni Angus breeds had 17, 57, 26, 13, 19, 26, and 231 breed specific CNVR genes, respectively. Heat shock proteins HSPBP1 (heat shock binding protein 1), HSPB1 (heat shock protein family B member 1), HSPA5 (heat shock protein family A (Hsp70) member 5), and HSP90AA1 (heat shock protein 90 alpha family class A member 1), considered to play a vital role in balancing immunity and survival during times of stress (Zhang Q. et al., 2014), were covered by CNVRs in Nguni, Angus, Holstein and/or Nguni Angus breeds. Severe reductions in WDR1 (WD40 repeat protein 1), identified in 42 animals from breeds in this study, were reported to disturb megakaryocyte maturation and platelet shedding, aggravate neutrophilic autoinflammatory disease and trigger embryonic lethality in mice (Castellani et al., 2014). LSP1 (lymphocyte-specific protein 1) and IGF-II (insulin-like growth factor 2), covered by CNVRs identified in Angus and Nguni Angus animals, and IGLL1 (immunoglobulin lambda-like polypeptide 1), overlapped by CNVRs in 44 animals from all breeds except Drakensberger, were differentially expressed in cattle selected for resistance or susceptibility to intestinal nematodes (Araujo et al., 2009).
Other genes involved in immune response included GSTT3 (glutathione S-transferase theta-3), GSTT1 (glutathione S-transferase theta-1), and SMARCB1 (SWI/SNF-related matrix-associated actin-dependent regulator of chromatin subfamily B member 1) that were present in 35, 33, and 40 animals respectively from all breeds except the Drakensberger.

A PANTHER overrepresentation test using a Bonferroni correction for multiple testing was performed for genes covered by the CNVRs identified. Five GO biological processes, one molecular function and 25 cellular components demonstrated a significant (p < 0.05) overrepresentation by CNVR genes and are presented in **Additional File 9**. Only Nguni, Holstein, Angus, and Nguni Angus breeds demonstrated breed specific overrepresentation of 1, 15, 11, and 35 gene ontology processes, functions and/or components by CNVR genes respectively. Intracellular (GO:0005622), membrane-bounded organelle (GO:0043227), intracellular membrane-bounded organelle (GO:0043231), cytoplasm (GO:0005737), cytoplasmic part (GO:0044444), and intracellular part (GO:0044424) were overrepresented by CNVR genes identified in Angus, Holstein, and Nguni Angus breeds.

# DISCUSSION

CNVs are considered to play a role in breed formation and adaptation, with copy number differences occurring between breeds (Liu et al., 2010). Increasing evidence also suggests that CNVs play a primary role in interindividual diversity (Stankiewicz and Lupski, 2002; Sebat et al., 2004) attributed to both normal phenotypic variation and major variations in complex traits (Fellermann et al., 2006; Feuk et al., 2006). Great variation in the size and number of CNVRs has been reported in cattle (Hou et al., 2012; Jiang et al., 2012). In this study 1055 CNVs formed 356 CNVRs in 287 animals from 7 different cattle breeds representing Taurine, Sanga, Composite and crossbred breed groups using the Bovine 50 K Beadchip. Jiang et al. (2012) identified 367 CNVRs by means of PennCNV analyses of high-density SNP genotyping data from 96 Chinese Holsteins. Hou et al. (2011), on the other hand, report 682 CNVRs identified in 521 animals representing 21 different breeds, also identified using the Bovine50K SNP genotyping array. Discrepancies in CNVs and subsequent CNVRs between different breeds and even individuals could thus be expected, although Jiang et al. (2013) highlight that differences in the size and structure of populations could also contribute to such incongruities. Hou et al. (2012) speculated that the distinctions in selected breeds for specific traits could be linked to specific CNVs. CNVR breed characterization, correlation analyses, population structure analyses and genetic diversity analyses all demonstrate the Taurine breeds and Sanga/Composite breeds to cluster in distinct groups, with the Nguni Angus cross segregating completely alone. The two Taurine breeds presented noticeably more CNVRs than the indigenous and Composite breeds, coupled with a number of gene ontologies demonstrating overrepresentation. The greater number of CNVRs evident in the exotic Taurine breeds reflects findings of Choi et al.
(2013) who compared the genome of a Hanwoo bull to that of Holstein and Black Angus respectively using whole genome sequencing methodologies. Narang et al. (2014) proposed that the migration and adaptation of a population or breed to an environment completely different from that to which it has typically been accustomed may require considerable changes on a genomic level, which may be achieved via events like CNVs that may hence contribute toward adaptation. The introduction of exotic Taurine breeds to a new environment may have placed specific pressures on the genome, resulting in the formation of CNVRs at specific loci involved in processes, functions or components vital for adaptation. The greater number of CNVs present in the Taurine breeds may suggest that CNVs represent a response of the genome to selection pressures imposed by adverse climatic conditions on animals that have been bred for production and not necessarily for their innate ability to survive harsh conditions. Frequently encoding protein products that play a prominent role in species adaptation (Duda and Palumbi, 1999), segmental duplications are an important cause of genomic instability that results in nonallelic homologous recombination (NAHR) during meiosis and genomic innovations and are currently recognized as one of the major catalysts and hotspots for CNV formation (She et al., 2008; Alkan et al., 2009; Nicholas et al., 2009; Liu and Bickhart, 2012). This would hence explain the discrepancies between this study and that of Choi et al. (Jiang et al., 2013) with Matukumalli et al. (2009) and Hou et al. (2011) who report Taurine breeds to have fewer CNVs than Composite, Indicine and African breeds. The African and Composite breeds in the study of Hou et al. (2011) were represented by fewer animals (39 and 46 respectively) and demonstrated an average of 7.21 and 7.17 CNVs per animal. This is not much more than the 6.23 average of 366 Taurine animals, but noticeably less than the 11.41 average of the 70 Indicine animals. Choi et al. (2013) suggested CNVs to be affected by recent intensive artificial selection schemes aimed at improving economically important production traits.

Similar to the findings of Molin et al. (2014), the majority of the CNVs identified in the present study were shared between fewer breeds, with the most CNVs (30) being shared between Angus and Nguni Angus cattle (**Additional File 2**). Greater distinction can be drawn from breeds being grouped according to breed type. While genetic diversity analyses demonstrated the majority of CNVR variation to exist within populations, the between-population diversity was least between breeds of the same type. The present study demonstrated CNVR population structure segregating animals by breed type, with Nguni Angus cross animals separating at K = 3 and the Afrikaner, Drakensberger, and Nguni breeds separating at K = 8. The evolution of the CNV population structure with increasing K values depicts breed history patterns, with CNVs segregating breed groups. The Drakensberger is considered to be one of the earliest Composite breeds developed. Its segregation with the Sanga type breeds is hence not surprising considering the possible role of adaptation in CNV prevalence. Although it was developed with a Taurine component, CNV evolution may reflect the selection pressures of adaptation that are evident in the Sanga breeds. Cicconardi et al. (2013) reported little variation in CNV distribution on chromosomes across five Italian cattle breeds, proposing CNV region (CNVR) variation to be greater between individuals than between breeds. Molin et al. (2014) identified 15 breed specific CNVRs out of 72 CNVs identified in 351 dogs from 30 different breeds. CNVRs identified in a single breed may be of interest for the investigation of breed specific traits (Molin et al., 2014). This, however, differs from Zhang L. et al. (2014), who report lineage specific CNVRs, proposing CNVs in the Chinese cattle populations to be partly consequent to selective breeding during domestication but also subsequent to hybridization and introgression.
Inadequately distinguishing between CNVRs that are breed specific and those that are bovine specific may be the cause of the significantly higher degree of variation being evident within populations (**Table 5**). We postulate that a large proportion of CNVRs are animal specific events, while only a few explicit CNVR events are exclusive to breeds. In addition to this, **Figure 2** demonstrates breed specific CNV sections within 4 large CNVRs that were detected in all 7 breeds. The delineation of CNVRs within this study may hence be responsible for the low between breed diversity (**Tables 4**, **5**) and high levels of CNVR admixture observed (**Figure 4**). Pienaar (2014) found high levels of within breed diversity for Afrikaner cattle using microsatellite data. Makina et al. (2014) found the Afrikaner breed to have the greatest number of alleles per locus when compared to the 5 other purebreeds in this study, while the Nguni had the least. Drakensberger cattle have the greatest genetic diversity of the 4 indigenous Sanga and Composite breeds, while the two Taurine breeds were reported to have had the greatest gene diversity (Makina et al., 2014). The Holstein and Angus breeds of the taurus cattle group have a longer history of artificial selection that has led to enhanced production (Choi et al., 2013). The observed discrepancies evident between some breeds could very well be caused by genetic drift due to bottlenecks, natural selection, and selective breeding (Hou et al., 2011). Itsara et al. (2010) determined different mutation processes to contribute disproportionately to CNVs dependent on the size of the de novo event. The mutation rate of CNVs has been established to be considerably higher than that of SNPs, with great variation in mutation rates occurring between loci (Campbell et al., 2011). The exact mutation and inheritance patterns of CNVs are, however, not fully understood (Zhang L. et al., 2014). It has been proposed that forces such as recombination, selection, and mutations are the primary factors driving the genomic architecture of large variations (Jimenez, 2014), with CNVs comprising a mechanism by which the genome responds to selection pressures subsequent to genomic instability induced by such pressures (Redon et al., 2006). The CNVR correlations and breed type distribution observed in this study further support this theory, suggesting an external pressure acting on regions within the genome involved in specific functions (**Table 3**). Distinctions in CNVR correlations specific to breeds and breed subpopulations support the notion that selection pressures play an important role in CNV formation (Hou et al., 2011; Porto-Neto et al., 2014). Twenty-two of the 110 CNVR loci present in more than 1 animal were utilized for CNVR correlation analyses and genetic diversity assessments. These constituted 74 significant correlations in all 7 breeds. Within the two exotic Taurine breeds, 906 significant CNVR correlations were determined, while only six significant CNVR correlations were identified in the indigenous Sanga and Composite breeds. Most of the associations were between CNVR loci of the same type. Taurine breed CNVR associations exhibited 866 deletion associations, 38 duplication associations, and 2 deletion and duplication associations. Deletions interrupt genes while also causing a loss of biological function and are therefore currently seen as the most common CNV affecting phenotype (Liu and Bickhart, 2012). Increased copy number may have a positive (McCarroll, 2008) or negative (Lee and Lupski, 2006) association with gene expression levels.

Composite breeds were developed from multiple breeds with the aim of combining the adaptive ability of the local breeds with the productive capabilities of the exotic breeds (Bonsma, 1980). The inclusion of the Composite breeds as well as the Taurine Sanga crossbreed in this study provided insight into the age and evolution of CNVs and the translation of CNVs when breed groups are amalgamated in Composite breeds and crossbreeds. The study of CNVs in crossbred and Composite breeds may hold clues to CNV formation and the possible role of CNVs in factors like hybrid vigor. The crossbred Nguni Angus animals, despite being fewer in number, demonstrated considerably more CNVs than other breeds, with distinct genomic signatures. This study comprises the first characterization of crossbred bovine animals. The noticeably higher number of CNVRs in these animals could indicate that CNVRs play a role in hybrid vigor. The Nguni Angus is a popular cross in South Africa, taking advantage of the strong maternal and adaptive characteristics of the Nguni and the production potential of the Angus.

CNVs may alter gene structure, dosage or gene functioning by disrupting coding sequences or long range regulation, or by exposing recessive alleles (Zhang et al., 2009; Stankiewicz and Lupski, 2010; Liu and Bickhart, 2012). The phenotypic impact of CNVs is, however, to a large extent related to the location of the variant in relation to the genes (Buchanan and Scherer, 2008). Drakensberger cattle had the fewest CNVR genes, while Angus had the most of the purebreeds and Nguni Angus had the most overall. Only six genes were identified in all 7 South African breeds. The identification and breed distinctions of genes involved in processes vital for adaptation suggest that CNVs play a role in breed formation. Gene copy number is conventionally positively correlated with gene expression (Stranger et al., 2007), although cases of negative correlations have been reported (Lee and Lupski, 2006). A duplicated CNVR on chromosome 11 covering the AIF1L (allograft inflammatory factor 1-like) and ABL1 (protein kinase abl1) genes was correlated with a second duplication on chromosome 18 covering the NLRP5 (nacht, lrr and pyd domains-containing protein 5) gene. AIF1L is an important component of innate immunity and response to stress, while NLRP5 comprises part of the cellular defense response. ABL1 gene mutations cause resistance to tyrosine kinase inhibitors, which have been found to improve the management of chronic myeloid leukemia in humans (Shah et al., 2002; O'Hare et al., 2007). Of the six correlations present among CNVRs of the indigenous breeds, all except two were between duplicated regions. The only exceptions were correlations between a deletion on chromosome 6 and duplications on chromosomes 29 and 26 respectively. Although no genes were covered by the deleted CNVR, the correlated duplication on chromosome 29 covered 24 genes including TSPAN32 (tetraspanin-32), CDKN1C (cyclin-dependent kinase inhibitor 1C) and TNNT3 (troponin T, fast skeletal muscle) involved in a variety of biological processes, molecular functions and cellular components.

# CONCLUSION

Three hundred and fifty-six unique CNVRs were identified in 287 animals from 2 Taurine, 2 Composite, 2 Sanga, and 1 Sanga Taurine cross cattle breeds using the Bovine 50 K Beadchip. A number of cellular components, molecular functions and biological processes demonstrated overrepresentation by genes covered by or lying within 10 Mb of the CNVRs identified. Correlations in CNVR presence were evident, with considerably more CNVR correlations occurring among the commercially bred Taurine breeds. Such correlations suggest selection pressures being exerted on different genomic regions involved in specific processes and functions. CNVs may be a means by which genomes respond to selection pressures and subsequently adapt. Variation in CNVR presence between breeds was evident, with more CNVRs being present in the Nguni Angus cross and the two Taurine breeds. Composite and crossbred animals demonstrated the most within breed CNVR variation, while Sanga cattle demonstrated the least. The Nguni Angus cross demonstrated unique CNV genetic signatures, while some CNVs segregated in both the Taurine and Sanga breeds to some degree. This study indicated that CNVRs play a role in both interindividual and between breed variation. With Sanga and Taurine breeds having undergone different selection pressures, the variation in CNV incidence between these groups, combined with the CNV correlations, designates CNVRs as genomic features prevalent in selection and adaptation. The distinct properties of CNVRs in the Nguni Angus cross animals also need to be explored, with possible implications for events like hybrid vigor.

# DATA AVAILABILITY STATEMENT

Datasets supporting the conclusions of this study will be made available by the authors, without undue reservation, to any qualified researcher.

# ETHICS STATEMENT

Genomic data was obtained from Makina et al. (2014, 2015). The Agriculture Research Council, who generated the data published by Makina et al. (2014, 2015), granted permission to use the data in the present analyses.

# AUTHOR CONTRIBUTIONS

Molecular genetic, bioinformatics, and statistical analyses were performed by MP, who also drafted the manuscript. FM conceived of the study, aided the analyses of the data, and participated in the design and structure of the manuscript. KD participated in the coordination and preparation of the manuscript. All authors read and approved the final manuscript.

# ACKNOWLEDGMENTS

The authors would like to thank Dr. SO Qwabe for the generation of the raw data. Provision of animals for blood samples by the breeders and research institutions and semen from Holstein bulls by the Taurus Co-operative is also acknowledged. The Agriculture Research Council's Biotechnology Platform is acknowledged for the use of their laboratory resources for genotyping of samples. Financial support from the Agriculture Research Council is greatly appreciated. The content of this paper was published as part of Dr. MP's (née Wang) Ph.D. dissertation and can be found at http://scholar.sun.ac.za/handle/10019.1/100361 (Wang, 2016).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00163/full#supplementary-material

Additional File 1 | Table depicting the count, minimum (MinL), maximum (MaxL) and average (AvL) lengths and total length (LN) of CNVRs identified on each of the 29 chromosomes of 287 cattle from seven different breeds.

Additional File 2 | The number of CNVs (Num CNVs) and the CNVs identified that were present in more than one of the seven cattle breeds (BRDs).

Additional File 3 | Significant pairwise association Chi<sup>2</sup> and *P*-values of deletion and duplication (CN\_A and CN\_B) CNVR events (CNVR\_LocA and CNVR\_LocB) identified in all seven cattle breeds.

Additional File 4 | Associated CNVRs and number of animals (IND) in which they were identified across seven cattle breeds.

Additional File 5 | Significant pairwise association Chi<sup>2</sup> and P-values of deletion and duplication (CN\_A and CN\_B) CNVR events (CNVR\_LocA and CNVR\_LocB) identified in 2 South African Taurine cattle breeds.

Additional File 6 | Significant pairwise association Chi<sup>2</sup> and P-values of deletion and duplication (CN\_LocA and CN\_LocB) CNVR events (CNVR\_LocA and CNVR\_LocB) identified in indigenous South African Sanga and Composite cattle breeds.

Additional File 7 | Eigenvalues (EIV) of the *first* 11 Principal Components (PC) generated from a genetic distance matrix of 277 animals (AN) from seven cattle breeds (BRD).

Additional File 8 | Table demonstrating the overlap (Num GEN) of CNVR genes (GEN) identified in seven cattle breeds (BRD).

Additional File 9 | Complete GO molecular functions (MF), cellular components (CC), and biological processes (BB) with significant (*p* < 0.05) over enrichment by genes covered by CNVRs in seven cattle breeds.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Pierce, Dzama and Muchadeyi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Corrigendum: Genetic Diversity of Seven Cattle Breeds Inferred Using Copy Number Variations

Magretha D. Pierce<sup>1</sup> \*, Kennedy Dzama<sup>2</sup> and Farai C. Muchadeyi <sup>3</sup>

*<sup>1</sup> Animal Production, Agricultural Research Council, Pretoria, South Africa, <sup>2</sup> Department of Animal Sciences, University of Stellenbosch, Stellenbosch, South Africa, <sup>3</sup> Biotechnology Platform, Agricultural Research Council, Pretoria, South Africa*

Keywords: genetic diversity, CNVs, population structure, South African cattle, breed history, selection

### **A corrigendum on**

### **Genetic Diversity of Seven Cattle Breeds Inferred Using Copy Number Variations**

by Pierce, M. D., Dzama, K., and Muchadeyi, F. C. (2018). Front. Genet. 9:163. doi: 10.3389/fgene.2018.00163

In the original article, Makina et al. (2015) was not cited. The citation has now been inserted in Materials and Methods, Sample Collection and Genotyping, paragraph 1, which should read:

Genomic data was obtained from Makina et al. (2014) and Makina et al. (2015). This comprised 287 animals comprising of two Taurine (45 Holstein and 32 Angus), two Sanga (59 Nguni and 48 Afrikaner), two Composite (46 Bonsmara and 48 Drakensberger) and one crossbred (10 Nguni Angus) breeds sampled from throughout South Africa. Informed consent from respective breeders was obtained. The protocol utilized for the collection of samples, DNA extraction and genotyping has been published (Makina et al., 2014, 2015).

Similarly, the protocol utilized for the collection of samples, DNA extraction and genotyping has been published (Makina et al., 2014). Animal handling and sample collection were performed according to the University of Pretoria Animal Ethics Committee code of conduct (E087-12).

A correction has been made to Materials and Methods, Sample Collection and Genotyping, paragraph 1:

The protocol utilized for the collection of samples, DNA extraction and genotyping has been published (Makina et al., 2014, 2015).

Ethics approval was obtained for the study (Ref. Nr.: 2014/CAES/101).

Finally, we neglected to include information regarding ethical approval for this study (Ref. Nr.: 2014/CAES/101). A correction has been made to Ethics Statement, paragraph 1:

Genomic data was obtained from Makina et al. (2014, 2015). The Agriculture Research Council, who generated the data published by Makina et al. (2014, 2015), granted permission to use the data in the present analyses.

The authors apologize for these errors and state that this does not change the scientific conclusions of the article in any way.

The original article has been updated.

Edited and reviewed by: *Tad Stewart Sonstegard, Recombinetics, United States*

> \*Correspondence: *Magretha D. Pierce wangm@arc.agric.za*

### Specialty section:

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

Received: *05 June 2018* Accepted: *25 June 2018* Published: *13 July 2018*

### Citation:

*Pierce MD, Dzama K and Muchadeyi FC (2018) Corrigendum: Genetic Diversity of Seven Cattle Breeds Inferred Using Copy Number Variations. Front. Genet. 9:252. doi: 10.3389/fgene.2018.00252*

# REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Pierce, Dzama and Muchadeyi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Functional Partitioning of Genomic Variance and Genome-Wide Association Study for Carcass Traits in Korean Hanwoo Cattle Using Imputed Sequence Level SNP Data

Mohammad S. A. Bhuiyan<sup>1,2</sup>† , Dajeong Lim<sup>3</sup>† , Mina Park<sup>4</sup>† , Soohyun Lee<sup>1</sup> , Yeongkuk Kim<sup>1</sup> , Cedric Gondro<sup>5</sup> , Byoungho Park<sup>4</sup> \* and Seunghwan Lee<sup>1</sup> \*

<sup>1</sup> Department of Animal Science and Biotechnology, Chungnam National University, Daejeon, South Korea, <sup>2</sup> Department of Animal Breeding and Genetics, Bangladesh Agricultural University, Mymensingh, Bangladesh, <sup>3</sup> Division of Animal Genomics and Bioinformatics, National Institute of Animal Science, Rural Development Administration, Wanju, South Korea, <sup>4</sup> Animal Genetic Improvement Division, National Institute of Animal Science, Rural Development Administration, Seonghwan, South Korea, <sup>5</sup> College of Agriculture and Natural Resources, Michigan State University, East Lansing, MI, United States

### Edited by:

Eveline M. Ibeagha-Awemu, Agriculture and Agri-Food Canada (AAFC), Canada

### Reviewed by:

Filippo Biscarini, Consiglio Nazionale delle Ricerche (CNR), Italy Fabyano Fonseca Silva, Universidade Federal de Viçosa, Brazil

### \*Correspondence:

Byoungho Park bhpark70@korea.kr Seunghwan Lee slee46@cnu.ac.kr

†These authors have contributed equally to this work.

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 30 January 2018 Accepted: 28 May 2018 Published: 22 June 2018

### Citation:

Bhuiyan MSA, Lim D, Park M, Lee S, Kim Y, Gondro C, Park B and Lee S (2018) Functional Partitioning of Genomic Variance and Genome-Wide Association Study for Carcass Traits in Korean Hanwoo Cattle Using Imputed Sequence Level SNP Data. Front. Genet. 9:217. doi: 10.3389/fgene.2018.00217

Quantitative traits are usually controlled by numerous genomic variants with small individual effects, and variances associated with those traits are explained in a continuous manner. However, the relative contributions of genomic regions to observed genetic variations have not been well explored using sequence level single nucleotide polymorphism (SNP) information. Here, imputed sequence level SNP data (11,278,153 SNPs) of 2109 Hanwoo steers (Korean native cattle) were partitioned according to functional annotation, chromosome, and minor allele frequency (MAF). Genomic relationship matrices (GRMs) were constructed for each classified region and fitted in the model both separately and together for carcass weight (CWT), eye muscle area (EMA), backfat thickness (BFT), and marbling score (MS) traits. A genome-wide association study (GWAS) was performed to identify significantly associated variants in genic and exon regions using a linear mixed model, and the genetic contribution of each exonic SNP was determined using a Bayesian mixture model. Considering all SNPs together, the heritability estimates for CWT, EMA, BFT, and MS were 0.57 ± 0.05, 0.46 ± 0.05, 0.45 ± 0.05, and 0.49 ± 0.05, respectively, which reflected substantial genomic contributions. Joint analysis revealed that the variance explained by each chromosome was proportional to its physical length with weak linear relationships for all traits. Moreover, genomic variances explained by functional category and MAF class differed greatly among the traits studied in joint analysis.
For example, exon regions had larger contributions for BFT (0.13 ± 0.08) and MS (0.22 ± 0.08), whereas intron and intergenic regions explained most of the total genomic variances for CWT and EMA (0.22 ± 0.09–0.32 ± 0.11). Considering different functional classes of exon regions and the per SNP contribution revealed the largest proportion of genetic variance was attributable to synonymous variants. GWAS detected 206 and 27 SNPs in genic and exon regions, respectively, on BTA4, BTA6, and BTA14 that were significantly associated with CWT and EMA. These SNPs were harbored by 31 candidate genes, among which TOX, FAM184B, PPARGC1A, PRKDC, LCORL, and COL1A2 were noteworthy. BayesR analysis found that most SNPs (>93%) had very small effects and the 4.02–6.92% that had larger effects ($10^{-4} \times \sigma_A^2$, $10^{-3} \times \sigma_A^2$, and $10^{-2} \times \sigma_A^2$) explained most of the total genetic variance, confirming polygenic components of the traits studied.

Keywords: variance partitioning, genome level SNP, GWAS, carcass traits, Hanwoo cattle

# INTRODUCTION


The genetic architecture of complex traits like carcass and meat quality in cattle includes a large number of loci with small individual effects on each trait. Variations in those traits are due to interactions among the loci dispersed across the genome and are also influenced by environmental factors. It is important to know how additive genetic variances are distributed across different genomic regions to better understand the genetic composition of complex traits. Several genome-wide association studies (GWAS) using dense single nucleotide polymorphism (SNP) marker panels have shown the differential contribution of genic and non-genic (intergenic) regions of genomes to additive genetic variance in humans (Yang et al., 2011b; Lee et al., 2012), dairy and beef cattle (Koufariotis et al., 2014), and broiler chickens (Abdollahi-Arpanahi et al., 2016). These studies showed that genic regions usually contributed more additive genetic variation than non-genic regions. However, Santana et al. (2016) reported that the largest share of genomic variance was attributable to intergenic and intronic regions in beef cattle, whereas Do et al. (2015) found similar genomic contributions from annotated genic and non-genic regions in pigs. The differences among these studies might be associated with several factors, such as SNP density in the marker panel, statistical models used, species, and types of traits investigated.

The Encyclopedia of DNA Elements (ENCODE) project found that about 80% of the human genome was engaged in relevant biochemical activities, even though only about 1% of the genome encodes a defined product such as a protein or reproducible biochemical signature (ENCODE Project Consortium, 2012). Hindorff et al. (2009) reported that 88% of the significant trait-associated variants in humans were located in intron (45%) and intergenic (43%) regions. Importantly, however, SNPs in missense and promoter regions were significantly enriched, whereas SNPs in intergenic regions were underrepresented in association studies (Hindorff et al., 2009; Kindt et al., 2013). In addition, the contribution of minor allele frequency (MAF) classes varied greatly for carcass traits in Japanese Black cattle (Ogawa et al., 2016) and for 17 different complex traits in Nordic Holstein cattle (Zhang et al., 2017). Therefore, understanding how genomic regions contribute to the variances of complex traits and partitioning the genome into different categories will help to describe the genomic architecture of traits more clearly.

In GWAS, stringent statistical thresholds are usually applied to control false positives arising from multiple hypothesis testing and, therefore, many variants with small effects fail to reach significance even though some of them are causal. The proportion of phenotypic variance explained by all SNPs is generally lower than pedigree-based estimates because the former includes only the contributions of causal variants that are in linkage disequilibrium (LD) with genotyped SNPs (Visscher et al., 2010). This is known as the perceived problem of "missing heritability" (Manolio et al., 2009). Insufficient LD between genotyped SNPs and causal variants accounts for most of the deviation in variance estimates. Lack of LD can also arise if the MAF of causal variants is lower than that of the genotyped SNPs (Lee et al., 2011). Imputation infers SNP genotypes that are not directly genotyped by low-density marker panels, using information from a reference population that has been genotyped with higher-density SNP markers (Hickey et al., 2012). In GWAS, more causal variants of a given trait are expected to be detected using imputed whole-genome sequence data than with the currently used SNP marker panels. In addition, LD between SNP markers and causal variants increases in association analysis from imputed sequence level SNP data, which also ensures higher reliability of genomic predictions for quantitative traits because more SNP information can be incorporated and genomically evaluated (van Binsbergen et al., 2015; Gonzalez-Pena et al., 2016). Therefore, sequence level SNP information can be used to capture as much as possible of the additive genetic variance in the whole genome or in a particular genomic region for better estimation of traits.
Previous studies reported higher imputation accuracy from high density genotype to whole genome sequence variants which also provided better prediction for genomic selection in dairy and beef cattle (Hawlader et al., 2017; Pausch et al., 2017).

Hanwoo (Bos taurus coreanae), an indigenous cattle breed of South Korea, has been bred intensively over the last four decades for the improvement of carcass and meat quality traits. Hanwoo beef is regarded as a cultural icon and is very popular for its extensive marbling and eating quality attributes like tenderness, juiciness, and characteristic flavor (Jo et al., 2012). Presently, the genetic worth of individual Hanwoo is estimated based on carcass weight (CWT), eye muscle area (EMA), backfat thickness (BFT), and marbling score (MS) traits using both pedigree and SNP genotype data (Lee et al., 2014). Previous GWAS using a 50-K SNP marker panel detected a number of significant SNPs associated with CWT, intramuscular fat, Warner–Bratzler shear force, and sensory traits in Hanwoo (Lee et al., 2013, 2014; Dang et al., 2014). Notably, the genetic evaluation of complex traits using genomic information is increasingly being used in different cattle breeding programs. However, until now, GWAS or genetic architecture of carcass and meat quality traits using sequence level SNP information has been limited to other beef cattle breeds and has not yet been reported in Hanwoo cattle. In this study, imputed genome sequence level SNP data were used to investigate genetic variance explained by subsets of genomic regions as well as to identify genomic variants in genic and exon regions and their contributions by GWAS for four carcass and meat quality traits in Hanwoo cattle.

# MATERIALS AND METHODS

# Animals and Phenotypes

A total of 2109 Hanwoo steers born between 2004 and 2013 at Hanwoo Experiment Station, National Institute of Animal Science (NIAS), Rural Development Administration, South Korea, were used in this study. All the steers were progeny of 251 sires and unrelated dams (1–3 progeny per dam). Animal health and welfare issues were followed according to approved guidelines of the Animal Care and Use Committee (NIAS) and the ethics committee approval number was 2015-150. Feeding and management practices were uniform under feedlot conditions with a concentrate mixture and rice straw-based ration. In the total feed, the proportions of concentrate and roughage were approximately 1.5:1, 2.5:1, and 4.5:1 in the grower (4–12 months), fattening I (13–18 months), and fattening II (19–23 months) rations, respectively. Crude protein and total digestible nutrient contents in the concentrate mixtures of these three rations were 14–16 and 68–70%, 11–13 and 71–73%, and 11–12 and 72–73%, respectively. All animals were slaughtered at about 24 months of age. The carcass and meat quality traits investigated in this study were CWT, EMA, BFT, and MS. Feeding, management, and trait measurements were according to Bhuiyan et al. (2017). Briefly, the cold CWT was taken after chilling for about 24 h. Longissimus dorsi muscle samples (approximately 1.5 kg) were collected from the junction between the 12th and 13th rib for the EMA, MS, and BFT measurements. MS was assessed on a 1–9 point scale according to the Korean Beef Marbling Standard (KAPE, 2012). Descriptive statistics of carcass and meat-quality traits are summarized in Supplementary Table S1.

# SNP Genotyping and Quality Control

In total, 2605 individuals were genotyped initially using two different SNP platforms, Illumina Bovine SNP50 BeadChip (1677 animals) and Bovine HD BeadChip (928 animals). The unphased genotypes were converted into phased data using Eagle v. 2.3.2, which is based on a long-range phasing approach (Loh et al., 2016). The genotype data for all 1677 individuals were then imputed to a high-density level (671,902 SNPs) using Minimac3 (Das et al., 2016), with the high-density genotype data as the reference panel. SNPs on the sex chromosomes were excluded. Whole-genome sequence data of 203 progeny tested Hanwoo bulls (South Korea Proven Bulls) were used as the reference population for sequence level SNP imputation. Finally, high-density genotypes of 2109 Hanwoo steers were imputed one chromosome at a time to sequence level using Minimac3, where each sequenced individual had 25,676,502 SNPs. We retained SNPs with an imputation $R^2$ > 0.60, following the Cross-Disorder Group of the Psychiatric Genomics Consortium et al. (2013); this threshold retained 49.12% of the total imputed SNPs. SNP filtering was then performed using PLINK 1.9 software (Purcell et al., 2007) with the following exclusion criteria: MAF < 0.01 and a Hardy–Weinberg equilibrium test P-value < 0.0001. After quality control, 11,278,153 SNPs were retained for further analyses.
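The two exclusion criteria can be illustrated per SNP with a minimal Python sketch. This is not the PLINK implementation: the function names are hypothetical, and the Hardy–Weinberg P-value is approximated here with a 1-df chi-square goodness-of-fit test, which is an assumption made for brevity and may differ from the test PLINK applies.

```python
import math

def maf_and_hwe_p(n_aa, n_ab, n_bb):
    """Return (MAF, HWE p-value) from genotype counts at one biallelic SNP.

    The p-value uses a 1-df chi-square goodness-of-fit approximation
    (an assumption of this sketch, not necessarily PLINK's test).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1.0 - p
    maf = min(p, q)
    if maf == 0.0:
        return 0.0, 1.0                  # monomorphic SNP: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, p_value

def passes_qc(n_aa, n_ab, n_bb, min_maf=0.01, min_hwe_p=1e-4):
    """Keep a SNP only if MAF >= 0.01 and HWE p-value >= 0.0001."""
    maf, hwe_p = maf_and_hwe_p(n_aa, n_ab, n_bb)
    return maf >= min_maf and hwe_p >= min_hwe_p

# Hypothetical genotype counts for 2109 animals.
print(passes_qc(527, 1055, 527))   # common SNP close to HWE proportions
print(passes_qc(2100, 9, 0))       # rare SNP, fails the MAF filter
print(passes_qc(1000, 0, 1109))    # no heterozygotes, fails the HWE filter
```

The genotype counts in the usage lines are invented for illustration and are not taken from the study data.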

# SNP Annotation

The physical positions of the imputed SNPs were determined using the UMD 3.1 (Elsik et al., 2016) bovine genome assembly as a reference sequence. SNP annotation, filtering, and partitioning were performed using SnpEff v.4.3p (Cingolani et al., 2012b) and SnpSift software (Cingolani et al., 2012a). Total SNPs were partitioned into 14 different categories according to their functional annotations (**Table 1**) except regulatory regions. Then, all splice variants and start and stop sites were excluded because they contained a very low proportion of the total SNPs or because, in exon regions, SNPs might already be represented by coding sequences and untranslated regions (UTRs). Finally, six major functional classes of genomic regions were considered: synonymous, non-synonymous (missense), 5′- and 3′-UTRs, intron, regulatory, and intergenic regions. Regulatory regions were defined as regions located 5-kb upstream and 5-kb downstream of genes, and intergenic regions were defined as regions more than 5-kb distant from genes. In addition, the variants were categorized into six classes based on their MAF: 0.01–0.05, 0.05–0.1, 0.1–0.2, 0.2–0.3, 0.3–0.4, and 0.4–0.5.
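The MAF classification can be sketched as a simple binning step. The snippet below is illustrative only; the text does not state how SNPs lying exactly on a class boundary were assigned, so the right-closed intervals used here (for example, a MAF of exactly 0.05 falls in the 0.01–0.05 class) are an assumption, as are the names `MAF_EDGES`, `MAF_LABELS`, and `maf_class`.

```python
from bisect import bisect_left

# Class boundaries as listed in the text.
MAF_EDGES = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]
MAF_LABELS = ["0.01-0.05", "0.05-0.1", "0.1-0.2",
              "0.2-0.3", "0.3-0.4", "0.4-0.5"]

def maf_class(maf):
    """Return the MAF class label for one SNP.

    Assumption: intervals are right-closed, and the first class also
    includes its lower bound of 0.01.
    """
    if not 0.01 <= maf <= 0.5:
        raise ValueError("MAF outside the range retained after quality control")
    index = max(bisect_left(MAF_EDGES, maf) - 1, 0)
    return MAF_LABELS[index]

print(maf_class(0.03))   # 0.01-0.05
print(maf_class(0.15))   # 0.1-0.2
print(maf_class(0.5))    # 0.4-0.5
```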

# Genomic Variance Partitioning

TABLE 1 | Number of variants annotated in different functional classes in Korean Hanwoo cattle using sequence level single nucleotide polymorphism (SNP) data<sup>1</sup>.

<sup>1</sup>Functional annotation of SNP variants was performed based on the cattle genome reference sequence (UMD 3.1) using SnpEff ver. 4.3p and SnpSift software (Cingolani et al., 2012a,b). The cumulative value is higher than 100% as some variants are located in several transcripts and therefore, could be allocated to multiple regions.

To decipher the genomic architecture of traits and predictive ability of particular genomic regions, the total genomic variance was partitioned based on MAF category (six classes), chromosome (29 autosomes), and functional annotations (six classes). To do this, genomic relationship matrices (GRMs) were estimated based on the SNPs in the respective categories (MAF, chromosome, and functional class) following the method of VanRaden (2008) using genome-wide complex trait analysis (GCTA v.1.26) software (Yang et al., 2011a). The variance attributable to each category was calculated separately or by fitting all GRMs of the respective category simultaneously in a joint analysis. Restricted maximum likelihood analysis implemented in GCTA v.1.26 was performed using the following linear mixed model:

$$y = X\beta + \sum_{G=1}^{n} g_G + e$$

where $y$ is the vector of phenotypes, $\beta$ is a vector of fixed effects (year and season) and covariate (age) with its incidence matrix $X$, $n$ is the number of subsets for non-overlapping SNPs partitioning ($n = 6$ for joint analysis by MAF bin, $n = 29$ for the number of autosomes, and $n = 6$ for the functional annotation of SNPs), $g_G$ is a vector of random additive genetic effects attributed from aggregated SNP information, and $e$ is a random residual error. The variance component of phenotypic values from the joint analysis is $V_g = A_g\sigma_g^2 + I\sigma_e^2$, where $\sigma_g^2$ is the additive genetic variance tagged by SNPs, $A_g$ is the genetic relationship matrix calculated from SNP data, $\sigma_e^2$ is the error variance, and $I$ is the identity matrix. The proportion of variance captured by each category is calculated as $h_G^2 = \sigma_G^2/\sigma_P^2$, where $\sigma_P^2$ denotes the phenotypic variance explained by all autosomal SNPs.
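As an illustration of the two quantities defined above, the following Python sketch builds a GRM from 0/1/2 genotype codes using the first method of VanRaden (2008) and converts a set of variance components into per-category proportions. It is a toy example and not the GCTA implementation, whose internal scaling may differ. The function names, the toy genotypes, and the variance values are hypothetical, and taking $\sigma_P^2$ as the sum of all category variances plus the residual variance is an assumption of this sketch.

```python
def vanraden_grm(genotypes):
    """GRM following VanRaden (2008), method 1: G = ZZ' / (2 * sum(p * (1 - p))).

    genotypes: list of animals, each a list of allele counts (0, 1, or 2)
    for the SNPs of one category (a MAF bin, chromosome, or functional class).
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    # Allele frequency of the counted allele at each SNP.
    freqs = [sum(row[j] for row in genotypes) / (2 * n_animals)
             for j in range(n_snps)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    # Centre each genotype by twice the allele frequency.
    z = [[row[j] - 2 * freqs[j] for j in range(n_snps)] for row in genotypes]
    return [[sum(a * b for a, b in zip(zi, zk)) / scale for zk in z]
            for zi in z]

def variance_proportions(category_variances, residual_variance):
    """h2_G = sigma2_G / sigma2_P for each category.

    Assumption: sigma2_P is the sum of all category variances plus the
    residual variance from the joint analysis.
    """
    sigma2_p = sum(category_variances.values()) + residual_variance
    return {name: var / sigma2_p for name, var in category_variances.items()}

# Toy data: three animals genotyped at three SNPs (hypothetical values).
grm = vanraden_grm([[0, 1, 2], [1, 1, 0], [2, 1, 1]])
for row in grm:
    print([round(value, 3) for value in row])

# Hypothetical variance components from a joint analysis.
print(variance_proportions({"exon": 1.0, "intron": 2.0}, residual_variance=7.0))
```

In a joint analysis, one such matrix would be built per category and all matrices fitted together, so that each $\sigma_G^2$ is estimated conditional on the others.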

# Genome-Wide Association and Genetic Contribution of SNPs

Two different approaches were used: a single-marker association analysis of SNPs in genic (exonic or intronic) and exon regions, and a Bayesian analysis to estimate the contribution of exonic SNPs to phenotypes. Phenotypic data were adjusted using a linear mixed model for fixed effects (year and season) and covariate (animal's age at slaughter). The adjusted phenotypes and constructed GRMs were subsequently used for GWAS under a mixed linear model including all candidate SNPs implemented in GCTA v.1.26. In GCTA, the mixed linear model assumes that all markers are in LD with quantitative trait loci (QTL) in close proximity and that additive effects are derived from the SNP-mediated overall covariance. Thus, single trait association analysis was performed using the following model:

$$y = a + bX + g + e$$

where $y$ is the adjusted phenotypic value, $a$ is the mean, $b$ is the additive effect (fixed effect) of the candidate SNP to be tested for association, $X$ is the SNP genotype indicator variable coded as 0, 1, or 2 depending on the number of copies of a specified allele, $g$ is the accumulated effect of all SNPs, and $e$ is the random residual effect. Bonferroni-adjusted P-value thresholds were used to correct for multiple hypothesis testing at the genome-wide suggestive (1.0/number of SNPs tested) and significant (0.05/number of SNPs tested) levels. Manhattan plots were drawn from the genome-wide association P-values ($-\log_{10}$ transformed observed P-values) using the "gap" package (Zhao, 2014) in R. A Bayesian mixture model implemented in BayesR software<sup>1</sup>, which fits all markers simultaneously with a mixture of four distributions for each marker effect, was used to estimate the variance explained by exonic SNPs. SNP effects in the mixture model were assumed to follow normal distributions with variances of 0.00, 0.0001, 0.001, and 0.01 times the additive genetic variance, using a single chain of 50,000 samples, of which the first 20,000 cycles were discarded as burn-in (Erbe et al., 2012). The percentage of genetic contribution ($\%V_g$) accounted for by each SNP was calculated using the formula:

$$\%V_{\text{g}} = 100 \times \frac{2pq\beta^2}{\sigma_{\text{A}}^2}$$

where $p$ and $q$ are the allele frequencies of the SNP, $\beta$ is the additive effect of the SNP, and $\sigma_A^2$ is the additive genetic variance for the trait. In addition, the per SNP genetic variance explained by each annotated class was estimated according to the method described by Koufariotis et al. (2014) using the following formula:

$$VarPerSNP = \frac{[(h^2 \div n) \times 100]}{10^{-4}}$$

where $h^2$ is the heritability and $n$ is the total number of SNPs in the respective annotated class; results were multiplied by 100 to obtain the percentage (%) of the genetic variance explained and divided by $10^{-4}$ for visualization of the data. The variance components ($\sigma_A^2$ and $\sigma_P^2$) derived during individual SNP effect calculation were used for the $h^2$ estimates. Subsequently, we performed functional annotation of the significant SNPs and searched for candidate genes using SnpEff v.4.3q and variant effect predictor (VEP) tools supported by Ensembl (McLaren et al., 2016).
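The two formulas in this section translate directly into code. The sketch below is a literal transcription for illustration; the function names and the numerical inputs are hypothetical and do not come from the study.

```python
def percent_vg(p, beta, sigma2_a):
    """%Vg = 100 * 2pq * beta^2 / sigma2_A for a single SNP."""
    q = 1.0 - p
    return 100.0 * 2.0 * p * q * beta ** 2 / sigma2_a

def var_per_snp(h2, n_snps):
    """VarPerSNP = [(h2 / n) * 100] / 1e-4 for one annotated class."""
    return (h2 / n_snps) * 100.0 / 1e-4

# Hypothetical SNP: allele frequency 0.5, additive effect 1.0,
# additive genetic variance 50.0.
print(percent_vg(p=0.5, beta=1.0, sigma2_a=50.0))

# Hypothetical class: heritability 0.2 spread over 100,000 SNPs.
print(var_per_snp(h2=0.2, n_snps=100_000))
```

Because VarPerSNP divides the class heritability by the class size, a small class such as the synonymous variants can show a large per SNP value even when its total contribution is modest.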

# RESULTS

# Annotation and Distribution of Variants Across the Genome

Genome sequence level SNP data were annotated into 14 different functional classes (**Table 1**). However, because of the low SNP proportion in some classes, only six major classes (synonymous, non-synonymous, 5′- and 3′-UTRs, intron, regulatory, and intergenic regions) were included in our analysis. As expected, intergenic variants were the most common, followed by intron, upstream and downstream, and exon variants, representing 70.30, 28.79, 8.24, and 0.88% of the total SNPs, respectively. The proportions of SNPs in the other functional categories were very low (0.002–0.22% of the total SNPs). In a previous study using bovine next-generation sequencing data, Aßmus et al. (2011) found similar proportions of intron (28.04%) and exon variants (0.90%) in cattle; however, they reported relatively lower proportions of intergenic (64.36%) and regulatory region (6.38%) variants. Our results are close to the findings of Koufariotis et al. (2014) who reported the proportion of SNPs based on 777-K data in the aforementioned four classes to be 67.0, 31.0, 8.0, and 1.0%, respectively, in beef cattle. Santana et al. (2016) reported the distribution of SNPs in intergenic, intron, and exon regions was 63.64, 28.17, and 1.46%, respectively, in Nellore cattle, which also supports our results. Taken together, the results indicate that several attributes like SNP density, LD among SNPs, poor functional annotation, and types of traits may affect the annotation results.

<sup>1</sup>https://github.com/syntheke/bayesR

# Partition of Genomic Variance Explained by Individual Chromosomes

The proportions of genomic variance attributed to all SNPs were found to be 0.57, 0.44, 0.45, and 0.49 for CWT, EMA, BFT, and MS, respectively (**Table 2**), suggesting that a substantial genomic contribution explained the phenotypic variation in the studied population. To determine what proportions of the variance were explained by individual chromosomes, we performed a joint analysis by fitting 29 GRMs (from 29 autosomes) simultaneously. The chromosomes contributed to the total genomic variance in various degrees; namely, from 0.000 to 0.089 for CWT, from 0.000 to 0.064 for EMA, from 0.000 to 0.044 for BFT, and from 0.000 to 0.047 for MS. Moreover, the sum of variances attributed to individual chromosomes was slightly lower than the estimated total genomic variance for all four traits (Supplementary Table S2). Notably, with few exceptions, the amount of variance explained by each chromosome was found to be proportional to its physical length for all four traits (**Figure 1**). However, the magnitudes of linear relationships ($R^2$) were comparatively low and varied from 0.06 to 0.15 among the four traits studied.
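The strength of the relationship between chromosome length and variance explained can be summarized with the coefficient of determination of a simple linear regression, sketched below. The helper `r_squared` and the toy inputs are hypothetical; the per-chromosome estimates themselves are in Supplementary Table S2 and are not reproduced here.

```python
def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Toy values only: x stands for chromosome length, y for variance explained.
lengths = [1.0, 2.0, 3.0, 4.0]
variance_explained = [1.0, 3.0, 2.0, 4.0]
print(r_squared(lengths, variance_explained))
```

An $R^2$ in the range reported here (0.06 to 0.15) means that chromosome length accounts for only a small part of the between-chromosome differences in variance explained.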

# Partition of Genomic Variance Explained by Functional Annotation

To determine the genomic variation that was explained by the six major functional classes, similarity matrices of each category were first used separately and then all the matrices were fitted simultaneously in a joint analysis. The separate analyses showed that the six classes explained substantial amounts of the genomic variations for all traits, and their contributions were larger than those from the joint analysis (**Tables 2**, **3**). For the separate analyses, the LD between SNPs in the different functional classes might have led to overestimation of the genomic variance for each class. For the joint analysis, the genomic variances explained by genic (synonymous, non-synonymous, and 5′- and 3′-UTRs) and upstream and downstream regulatory variants were negligible and close to zero (data not shown) for the four traits studied. Therefore, the variants in those functional classes were merged with the exon and intergenic classes, respectively. The sums of variances for both the functional classes and the MAF categories were similar to the estimates for the separate and joint analyses using all the SNPs (**Tables 2**, **4**) for all four traits, which indicates that the genome partitioning analysis fitted well.

In the joint analysis, the genomic variances accounted for by the functional classes varied among the carcass and meat quality traits. For example, the genomic heritability explained by exons was 0.13 and 0.22 for the BFT and MS traits, respectively, but close to zero for the CWT and EMA traits, whereas the genomic heritability explained by intron and intergenic regions ranged from 0.22 to 0.32 for the CWT and EMA traits, and from 0.09 to 0.19 for the BFT and MS traits. These results suggest that distinct genetic architectures underlie the processes involved in muscle development and fat biosynthesis in the studied population. In particular, when the different functional classes in the exon regions (5′- and 3′-UTRs, synonymous and non-synonymous) were considered in the joint analysis, the genomic variances attributable to the synonymous class were significantly larger than those attributable to the 5′- and 3′-UTRs and non-synonymous classes for all four traits. In the joint analysis, the genetic variance explained by each SNP was estimated to determine the contribution of the SNPs in each class. Regardless of the trait studied, the per SNP analysis also revealed that the variants in coding and UTR regions contributed more to the variance than variants in the intron and intergenic regions. Specifically, the largest proportion of the genetic variance was explained per SNP in the synonymous class, particularly for the CWT, BFT, and MS traits (**Figure 2**). Relatively lower genetic variance was explained per SNP in the UTRs for the CWT, EMA, and MS traits, and by SNPs in the non-synonymous class for the BFT and MS traits. In the intron class, the genetic variance explained per SNP was low, but higher than that for the upstream and downstream and intergenic classes for all four traits.

TABLE 2 | Estimates of the variance explained by the SNPs located in exon, intron, and intergenic regions for four carcass and meat quality traits in Korean Hanwoo cattle.


<sup>∗</sup>Separate means individual analysis was performed for each trait considering the SNPs of respective functional annotation, joint means all three categories (exon, intron, and intergenic) were considered in a single analysis, values in the parentheses denote standard error of h<sup>2</sup> estimates, CWT, carcass weight; EMA, eye muscle area; BFT, backfat thickness; MS, marbling score.

FIGURE 1 | Estimated proportion of variance explained by each chromosome for carcass weight (CWT), eye muscle area (EMA), backfat thickness (BFT), and marbling score (MS) against its length. Genomic partitioning was performed by joint analysis. The numbers in the circles represent the chromosome numbers.

# Partition of Genomic Variance Explained by MAF Class

The distribution of SNPs in the six different MAF classes was 27.90, 14.70, 18.40, 14.20, 12.60, and 12.10% of the total SNPs (**Table 4**). Similar to the results for the functional annotations, the variance explained by the six different MAF bins from a joint analysis varied greatly among the traits and MAF categories. In general, two common-allele groups (0.10–0.20 and 0.30–0.40) contributed more to the variance for all traits than the other allele groups. Specifically, the highest genomic variance was explained by SNPs in MAF category 0.10–0.20 for the CWT (0.26) and EMA (0.14) traits, and by SNPs in MAF category 0.30–0.40 for the MS (0.23) and BFT (0.14) traits. Remarkably, the low-frequency alleles (MAF < 0.05) accounted for the highest variance only for the BFT (0.15) trait. The other three MAF bins explained comparatively lower proportions of the genetic variance (from close to zero to 0.10) for all four traits investigated. This finding supports the idea that different genomic architectures exist between carcass and meat quality traits in Hanwoo cattle.

TABLE 3 | Estimated proportion of variance explained by the synonymous, non-synonymous, and 5′- and 3′-UTR SNPs for four carcass and meat quality traits<sup>1</sup>.

<sup>∗</sup>SNPs in exon regions were analyzed either separately for each functional category (synonymous, non-synonymous, and UTR) or jointly in a single analysis. <sup>1</sup>See Table 2 for trait abbreviations.

TABLE 4 | Estimated proportion of variance explained by different minor allele frequency (MAF) category for four carcass and meat quality traits in Korean Hanwoo cattle<sup>1</sup>.

<sup>∗</sup>Separate means five analyses were performed separately for traits under each MAF bin, joint means all five MAF categories were considered in a single analysis, values in the parentheses denote standard error of h<sup>2</sup> estimates, values in the square brackets represent the proportion of SNPs in each MAF category. <sup>1</sup>See Table 2 for trait abbreviations.

# Identification of Genomic Variants Through GWAS

A genome-wide association study was performed using SNPs in both genic (exon and intron together) and exon regions to identify variants associated with the four traits studied. Considering all the SNPs in the genic region (a total of 3,345,931 SNPs), the mixed linear model-based GWAS revealed 206 SNPs significantly associated with CWT ($P < 1.49 \times 10^{-8}$) and six SNPs significantly associated with EMA. These significant SNPs were located on BTA6 and 14, and were harbored by 24 candidate genes (**Figure 3**, **Table 5**, and Supplementary Table S3). The most significant SNPs (rs109438687 and rs109467519) were located in the introns of FAM184B on BTA6 and were associated with CWT. The top seven intronic SNPs were in TOX on BTA14 (rs41724548, rs41724547, rs41724546, rs42406058, rs42406039, rs109374728, and rs41724619) and had the second highest association with CWT. Significant SNPs for the CWT and EMA traits were located at 3.32 Mb on BTA6 and were in LAP3, FAM184B, NCAPG, LCORL, and SLIT2. In addition, significantly associated SNPs for CWT spanned a 13.69 Mb region on BTA14 that harbored 19 genes, among which PRKDC, XKR4, IMPAD1, SDCBP, TOX, DNAJC5B, PREX2, C8orf46, and C8orf34 were notable (**Table 5** and Supplementary Table S3). These results indicate that these two regions of BTA6 and BTA14 were potential candidates for carcass traits in Hanwoo cattle. However, none of the SNPs reached significant levels for the BFT and MS traits (Supplementary Figure S1).

In GWAS, only a few markers with the largest effects cross the significance threshold after correction for multiple hypothesis testing, and most variants fail to reach statistical significance, even though some of them are causal. To overcome the limitations of stringent criteria, we selected only the exonic SNPs (a total of 99,204) for further association study. The mixed linear model-based GWAS identified a total of 27 significant SNPs on BTA4, 6, and 14 (**Table 6** and Supplementary Figures S2, S3) for the CWT and EMA traits (P < 5.04 × 10<sup>−7</sup>). The significant exonic SNPs were harbored by 14 candidate genes, seven of which had already been detected when the SNPs in genic (exon and intron together) regions were used in the mixed linear model-based GWAS. Among the candidate genes, TOX, COL1A2, PPARGC1A, PRKDC, IMPAD1, DNAJC5B, and CRH were noteworthy (**Table 6**). Importantly, the coding variants in COL1A2, PPARGC1A, and CRH reached significance only in the exon-only analysis. The most significant SNP (rs110132121) was located in the 3′-UTR of TOX (P < 5.31 × 10<sup>−15</sup>) on BTA14 for CWT, followed by two synonymous SNPs (rs461493029 and rs449968016) in PRKDC (P < 6.22 × 10<sup>−14</sup>), also for CWT.
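Both reported thresholds appear consistent with a Bonferroni correction at a 5% genome-wide level, i.e., 0.05 divided by the number of SNPs tested (Table 6 states this explicitly for the exonic set; for the genic set it is an inference from the numbers). A minimal sketch follows; the function name `bonferroni_threshold` is an illustrative choice, not something from the study.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


# SNP counts are taken from the text above.
genic_threshold = bonferroni_threshold(3_345_931)
exonic_threshold = bonferroni_threshold(99_204)

print(f"genic SNPs:  P < {genic_threshold:.2e}")   # 1.49e-08
print(f"exonic SNPs: P < {exonic_threshold:.2e}")  # 5.04e-07
```

Restricting the analysis to exonic SNPs relaxes the threshold by a factor of about 34 simply because fewer tests are performed, which is one reason additional variants reach significance in the second analysis.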

# Contributions of Genomic Variants

The SNP effects were estimated using BayesR to determine the proportion of genetic variance explained by individual SNPs and are presented in **Table 7** and Supplementary Figures S2–S5. We limited the analysis to the SNPs in the exon regions because of the heavy computational requirements of BayesR. The SNPs that had the largest effects for the investigated traits were located mostly on BTA2, 4, 6, 12, 14, 17, 19, and 24; however, these effects were small compared with the total genetic variance. Notably, 93–96% of the SNPs had close to zero effects, and the other 4–7% had different degrees of genetic contribution to the traits studied (**Table 6**). In particular, the proportion of SNPs that had the largest effects (10<sup>−3</sup> × σ<sup>2</sup><sub>A</sub> and 10<sup>−2</sup> × σ<sup>2</sup><sub>A</sub>) ranged from 0.26 to 0.41% of the total number of SNPs but explained 33.42–62.73% of the total genetic variance.
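To illustrate how a small fraction of SNPs can account for a large share of variance under a BayesR-type prior, the sketch below draws SNP effects from a four-component mixture (a point mass at zero plus normal distributions with variances 10<sup>−4</sup>, 10<sup>−3</sup>, and 10<sup>−2</sup> × σ<sup>2</sup><sub>A</sub>) and summarizes each component. This is a simulation of the prior only, not the BayesR sampler; the mixing proportions are hypothetical, and the sum of squared effects is used as a rough proxy for variance explained, which assumes standardized genotypes.

```python
import numpy as np

# Variance of each mixture component as a multiple of sigma2_a (BayesR-type prior).
SCALES = np.array([0.0, 1e-4, 1e-3, 1e-2])
# Hypothetical mixing proportions, chosen only for illustration.
PROPORTIONS = np.array([0.95, 0.04, 0.007, 0.003])


def simulate_effects(n_snps, proportions, scales, sigma2_a, rng):
    """Draw a component label and an effect for each SNP from the mixture."""
    component = rng.choice(len(proportions), size=n_snps, p=proportions)
    sd = np.sqrt(scales[component] * sigma2_a)
    effects = rng.standard_normal(n_snps) * sd  # zero-variance component gives exact zeros
    return component, effects


def summarize(component, effects, n_components):
    """Share of SNPs and share of summed squared effects per component."""
    counts = np.bincount(component, minlength=n_components)
    sum_sq = np.bincount(component, weights=effects ** 2, minlength=n_components)
    return counts / counts.sum(), sum_sq / sum_sq.sum()


rng = np.random.default_rng(0)
component, effects = simulate_effects(99_204, PROPORTIONS, SCALES, 1.0, rng)
snp_share, var_share = summarize(component, effects, len(SCALES))

for scale, s_share, v_share in zip(SCALES, snp_share, var_share):
    print(f"scale {scale:g}: {s_share:.2%} of SNPs, {v_share:.2%} of squared effects")
```

In an actual analysis, the component memberships and effects would come from the posterior samples of the fitted model rather than from draws of the prior.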

# DISCUSSION

Quantitative traits are controlled by the additive effects of a large number of genes spread across the entire genome. Therefore, it is important to identify the genomic regions that contribute most to the genetic variation of complex traits such as carcass and meat quality. In this study, we investigated, for the first time in Korean Hanwoo cattle, the genomic variances explained by different functional classes and performed GWAS using sequence level SNP information.

# Partitioning of Genomic Variance by Chromosome

We found a linear but weak relationship between the variance explained by each chromosome and its length, which is consistent with the study of Jensen et al. (2012). They reported low R<sup>2</sup>-values (between 0.11 and 0.21) for the regression of chromosomal variance on chromosomal length for complex traits in Holstein cattle. They also stated that aggregated chromosomal variance accounted for 96–97% of the total genomic variance, which is similar to our findings (Supplementary Table S2). Pimentel Eda et al. (2011) found a broader range of linear relationships (R<sup>2</sup> from 0.03 to 0.77) for milk production and milk composition traits in Holstein cattle, which is in partial agreement with the present study. Similar results were also found by Yang et al. (2011b) and Lee et al. (2012), who reported low to strong (R<sup>2</sup> = 0.03–0.80) linear relationships between the genetic variance explained by each chromosome and its length for four complex traits and a complex genetic disorder, schizophrenia, in humans. Remarkably, we observed notable differences in genetic contribution among chromosomes of similar lengths, which is supported by the findings of Yang et al. (2011b). Taken together, these results indicate that the low R<sup>2</sup>-values between chromosomal lengths and their contributing genomic variances reflected only a weak relationship, which may be because genes that had large effects contributed a greater proportion of genomic variance for the harboring chromosome. The results of the present study also indicate that major genes or QTLs are not evenly distributed across the Hanwoo genome. For instance, DGAT1 and PLAG1 on BTA14 are known to make large contributions to genomic variance for carcass and milk traits in cattle, and accordingly we found the highest variance was attributed to BTA14, which is a

### TABLE 5 | Significant genic SNPs harbored genes for CWT and EMA traits in Korean Hanwoo cattle.


<sup>1</sup>Gene ID names were retrieved from Ensembl database using variant effect predictor (VEP) tools (McLaren et al., 2016) based on Bos taurus genome reference assembly UMD 3.1; <sup>2</sup>Bos taurus autosome; <sup>3</sup>Only first and last variant positions are presented for each gene; <sup>4</sup>Represents the lowest P-value among the SNPs identified in a gene.


### TABLE 6 | Significant SNPs of exon regions in genome-wide association study (GWAS) for CWT and EMA traits in Korean Hanwoo cattle.

<sup>1</sup>Positions are based on Bos taurus genome reference assembly UMD 3.1. <sup>2</sup>Minor allele frequency. <sup>3</sup>Significance threshold at the 5% genome-wide level with Bonferroni correction was P = 5.04 × 10<sup>−7</sup>. <sup>4</sup>Locations of SNP variants or genes were determined according to the cattle genome reference sequence (UMD 3.1) using SnpEff ver. 4.3p (Cingolani et al., 2012b) and variant effect predictor (VEP) tools (McLaren et al., 2016). <sup>5</sup>Genetic contribution of each SNP was calculated using a Bayesian mixture model.

small sized autosome. However, SNP density in the marker panel, statistical model used, types of traits investigated, and species of interest are major contributing factors to differences between our results and previous results. Overall, we found that the genomic contribution varied across all chromosomes, which supports a polygenic model for carcass and meat quality traits and is similar to the findings of Pimentel Eda et al. (2011) and Jensen et al. (2012) for dairy traits in Holstein cattle.
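The R<sup>2</sup>-values discussed in this section come from regressing the variance explained by each chromosome on chromosome length. A minimal sketch of that calculation is shown below; the chromosome lengths and variance fractions are invented for illustration and are not values from this or any cited study.

```python
import numpy as np


def r_squared(x, y):
    """Coefficient of determination for the simple linear regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical chromosome lengths (Mb) and fractions of genomic variance explained.
chrom_length_mb = [158, 137, 121, 120, 119, 84, 75, 62, 51, 42]
variance_fraction = [0.05, 0.03, 0.04, 0.02, 0.06, 0.09, 0.02, 0.03, 0.01, 0.02]

r2 = r_squared(chrom_length_mb, variance_fraction)
print(f"R^2 of chromosomal variance on chromosome length: {r2:.2f}")
```

A short chromosome that carries a large-effect locus (the sixth entry in this made-up example) pulls the fit away from the length trend and lowers R<sup>2</sup>, which is the pattern the text attributes to BTA14.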

# Partitioning of Genomic Variance by Functional Annotation and MAF Class

In agreement with our results, Abdollahi-Arpanahi et al. (2016) found that synonymous regions explained the largest proportion of genetic variance among six functional classes for body weight, hen-house egg production, and breast muscle measurement traits in broiler chicken. In humans and cattle, Koufariotis et al. (2014) and Yang et al. (2011b) reported that more genetic variance was attributed to genic regions than to intron and intergenic regions, which supports our findings. Moreover, the per SNP analysis revealed that both missense and synonymous classes made the largest contributions to total genetic variance (Koufariotis et al., 2014), which partially agrees with the present findings. Importantly, there has been increasing interest in synonymous SNPs, even though they do not change the amino acid in a polypeptide chain. Previous studies reported that synonymous mutations were associated with more than 50 human diseases (Sauna and Kimchi-Sarfaty, 2011), and also affected pre-mRNA splicing, the secondary structure and stability of mRNA, protein folding, and the functions of translated proteins (Hunt et al., 2014). However, Morota et al. (2014) found that non-genic regions better explained genomic variance than genic regions for body weight and hen-house egg production traits in chicken, whereas for the breast muscle measurement trait, genic regions contributed more than non-genic regions. This discrepancy with our findings might be due to differences in species of interest, number of SNPs investigated, and extent of LD between markers and QTLs. Overall, we found both genic and non-genic regions explained substantial amounts of genomic variance for the



Nsnp, number of SNPs in model; σ<sup>2</sup><sub>G</sub>, total genetic variance explained by the SNPs; values in parentheses are proportion of SNPs in each mixture component; σ<sup>2</sup><sub>A</sub>, genetic variance explained by the respective mixture component and values are presented in the square brackets. CWT, carcass weight; EMA, eye muscle area; BFT, backfat thickness; MS, marbling score.

carcass and meat quality traits, which favors the infinitesimal theory and highlights the importance of SNPs spread over the entire genome.

van Binsbergen et al. (2015) reported that the proportion of low-MAF variants increased with SNP density and that low frequency alleles made up 25 to 30% of the total SNPs in imputed sequence level SNP data and in whole-genome sequences. This result is in agreement with our present findings. Using sequence level SNP data in dairy cattle, Zhang et al. (2017) found the highest relative contribution to genomic variance was attributed to the common variants (MAF > 0.05–0.50) for production traits, whereas rare and low frequency alleles were more highly represented in the explained variance for fertility, longevity, and health-related traits. Their findings pointed toward a polygenic component of production traits and support our findings. Ogawa et al. (2016) reported that a higher proportion of additive genetic variance was associated with common alleles in the MAF category from 0.20 to 0.30 for the CWT trait in Japanese Black cattle. They also found that three major QTLs previously identified on BTA6, 8, and 14 were within the cited allele frequency range and potentially contributed to the higher genetic variance. However, the differences in MAF distribution for the CWT trait between previous and present findings may be associated primarily with SNP marker density. Taken together, these results suggest that common alleles make substantial contributions to the total genetic variance for quantitative traits and also support the present findings for carcass and meat quality traits in the Hanwoo population.

# GWAS and Contribution of Genomic Variants

Previous GWAS using both 50K and 777K data have revealed major QTL(s) on BTA14 associated with CWT and bovine stature in different cattle breeds including Hanwoo (Lee et al., 2013). Here, a wider range of significant SNPs was detected on BTA14 as well as on BTA4 and BTA6 using sequence level SNP information. These findings may help to identify more causal variants associated with economically important traits in cattle. Earlier studies reported that genetic variants in and around PLAG1 and a nearby major QTL on BTA14 were associated with bovine stature (Karim et al., 2011), CWT (Nishimura et al., 2012), early life body weight, and peripubertal weight (Littlejohn et al., 2012), as well as birth weight (Utsunomiya et al., 2013) in different cattle populations. In our study, variants of neighboring genes of PLAG1 were found to be significantly associated with CWT, but the most significant SNP marker (rs41724548) was located in TOX, which is 1.61 Mb distant from PLAG1; this confirms the previous findings of Lee et al. (2013). Based on 50K SNP chip data, Lee et al. (2013) reported that PLAG1, CHCHD7, FAM110B, CYP7A1, SDCBP, and TOX were positional and functional candidate genes for a CWT QTL in Hanwoo cattle, which supports our findings. In addition, they reported that the variants located near PLAG1 and CHCHD7 had non-significant associations with CWT, which is similar to the present findings. TOX acts as a transcription factor in the hypothalamus and plays a key role in the development of puberty in Brahman cattle (Fortes et al., 2012). Causal variants of TOX were associated with reproductive traits in Nellore cattle (de Camargo et al., 2015). Altogether, previous studies have reported that SNP variants associated with carcass traits were centered on PLAG1. However, we found that SNP variants in an extended region between 20.7 and 34.4 Mb were associated with CWT, suggesting synergistic effects of multiple genes for the major QTL(s) on BTA14 in the Hanwoo population.

In previous studies, a QTL on BTA6 around the NCAPG–LCORL region was found to be associated with CWT and body frame size in Japanese Black cattle (Setoguchi et al., 2009, 2011) and birth, weaning, and yearling weight in crossbred beef cattle (Snelling et al., 2010). Setoguchi et al. (2011) found that an LD block spanning a 591 kb region encompassed FAM184B, DCAF16, NCAPG, and LCORL, with a causal variant (Ile442Met) located in NCAPG. Recently, Xia et al. (2017) reported 11 significant SNPs associated with a skeleton trait in Simmental cattle that were located in or near LAP3, FAM184B, LCORL, and NCAPG on BTA6, which have been regarded as positional candidate regions for carcass and growth traits in cattle (Lindholm-Perry et al., 2011). Importantly, we found a number of significant markers within this region associated with CWT and EMA, and confirmed the previously reported association using sequence level SNP data for the first time in Hanwoo cattle.

In addition, similar to our study, a number of coding variants in PPARGC1A, COL1A2, and CRH have been documented for their association with growth, carcass, and meat quality traits in mammals including cattle. PPARGC1A encodes a transcriptional coactivator that regulates the genes involved in lipid and glucose metabolism, and has been regarded as a positional and functional candidate gene for carcass traits in beef cattle (Shin and Chung, 2013). The synonymous (c.396G > A) and missense (g.1181G > A) mutations of this gene had significant associations with body weight and average daily gain in Nanyang cattle (Li et al., 2014), as well as with growth, slaughter, and meat quality traits in Brangus steers (Soria et al., 2009). In addition, Shin and Chung (2013) reported two intronic SNPs in PPARGC1A to be significantly associated with the carcass trait EMA in Hanwoo, which supports our findings. CRH plays important roles in growth and development in mammals, and two coding SNPs (synonymous and missense) of this gene had significant associations with CWT in our study. A missense mutation of CRH (G1084A) was significantly associated with the EMA trait in Hanwoo (Seong and Kong, 2015), which is in agreement with the present study. COL1A2, which encodes the pro-alpha2 chain of type I collagen, has been extensively investigated in humans. Mutations in this gene were associated with several bone-related pathologies such as osteogenesis imperfecta and dental fluorosis. We found significant associations between variants of COL1A2 and CWT in Hanwoo. Above all, the coding variants detected in our study spanned three different genomic regions on BTA4, 6, and 14, whereas earlier studies documented major QTL(s) for carcass traits only on BTA14 in Hanwoo populations. Using sequence level SNP data, we detected two additional genomic regions (a 0.58 Mb region on BTA4 and a 1.61 Mb region on BTA6) in this study that may be new candidate loci for carcass traits in the investigated population. This information can be used to detect causal variants as well as in genomic selection programs in Hanwoo cattle.

Our results on the effect sizes of SNPs are in agreement with the infinitesimal theory as well as with the findings of Erbe et al. (2012) and Moser et al. (2015). Previous studies suggested that the minimum number of effective loci was between 400 and 4000 for capturing almost all of the genetic variance for milk production and disease resistance traits (Pimentel Eda et al., 2011; Erbe et al., 2012). In another investigation, Moser et al. (2015) reported that the number of SNPs with larger effects (10<sup>−4</sup> × σ<sup>2</sup><sub>A</sub>, 10<sup>−3</sup> × σ<sup>2</sup><sub>A</sub>, and 10<sup>−2</sup> × σ<sup>2</sup><sub>A</sub>) varied greatly (between 2633 and 9411) among seven human diseases. Moreover, they found that more than 96% of the SNPs had very small effects, close to zero. In our study, the number of large effect SNP variants in exon regions needed to explain almost all of the total genetic variance varied between 3979 (CWT) and 6859 (MS) among the investigated traits, whereas the majority of the SNPs (>93%) accounted for the remaining genetic variance. This indicates that the traits are polygenic in nature and is consistent with previously reported findings in livestock and humans. The types of traits investigated and the total number and category of SNP variants (exon, intron, or intergenic) included in the analysis might be major contributing factors to the differences between previous and present studies.

# CONCLUSION

Imputed genome sequence level data revealed the contributions of both genic and non-genic SNPs to phenotypic variations for four carcass and meat quality traits. Intragenic SNPs explained more genomic variance than intergenic variants, and the highest variance was attributed to synonymous SNPs. Genomic regions partitioned based on functional annotations, chromosome, and MAF category showed distinct differences in the variance explained for carcass and meat quality traits, and thus depicted different genetic architectures between the two types of traits. A wide range of significant SNPs and their contributions were established through this study. Some of these variants or genes that harbor them, first reported in this study, could be included in the genomic evaluation of quantitative traits in Hanwoo. Only 4–7% of the genic variants potentially contributed to the total explained genetic variance, while the remaining thousands had close to zero contribution and largely point toward the polygenic composition of these traits.

# DATA ACCESSIBILITY

The high density SNP genotypic data and full genome sequence data of Korean Hanwoo cattle used in this study are deposited in the digital repository of NIAS, South Korea (website) and are available to interested researchers upon request.

# AUTHOR CONTRIBUTIONS

SeL and MB conceived and designed the study. MB drafted the manuscript. DL and CG were responsible for imputation of 50K and 777K genotype data to sequence level. DL and SoL were responsible for phenotypic data collection. SoL and YK contributed to quality control of genotype data, partitioning of the genome, and SNP annotation. MB and YK performed GWAS. All authors read and agreed on the contents of the manuscript.

# FUNDING

This study was supported by grants from the AGENDA projects (Nos. PJ0126872018 and PJ01261101) of the National Institute of Animal Science, Rural Development Administration, South Korea.

# ACKNOWLEDGMENTS

We thank the National Agricultural Cooperative Federation, Seosan, South Korea, for providing semen samples of KPN bulls.

# REFERENCES


# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00217/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Bhuiyan, Lim, Park, Lee, Kim, Gondro, Park and Lee. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genomics for Ruminants in Developing Countries: From Principles to Practice

Vincent Ducrocq<sup>1</sup> \*, Denis Laloe<sup>1</sup> , Marimuthu Swaminathan<sup>2</sup> , Xavier Rognon<sup>1</sup> , Michèle Tixier-Boichard<sup>1</sup> and Tatiana Zerjal<sup>1</sup>

<sup>1</sup> Génétique Animale et Biologie Intégrative, Institut National de la Recherche Agronomique, AgroParisTech, Université Paris-Saclay, Jouy-en-Josas, France, <sup>2</sup> BAIF Development Research Foundation, Pune, India

Using genomic information, local ruminant populations can be better characterized and compared to selected ones. Genetic relationships between animals can be established even without systematic pedigree recording, provided a budget is available for genotyping. Genomic selection (GS) can rely on a subset of the total population and does not require a costly national infrastructure, e.g., based on progeny testing. Yet, the use of genomic tools for animal breeding in developing countries is still limited. We identify three main reasons for this: (i) the instruments for cheap recording of phenotypes and data management are still limiting; (ii) many developing countries are recurrently exposed to unfavorable conditions (heat, diseases, poor nutrition) requiring special attention to fitness traits; (iii) a high level of expertise in quantitative genetics, modeling, and data manipulation is needed to perform genomic analyses. Yet, the potential outcomes go well beyond genetic improvement and can improve the resilience of the whole farming system. They include better management of the genetic diversity of local populations, more balanced genetic progress, and the possibility to unravel the genetic basis of adaptation of local breeds through whole genome approaches. A GS program being developed by BAIF, a large Indian NGO, is analyzed as a pilot case. It relies on the creation of a female reference population of Bos indicus and crossbreds, whose production and fertility performances are recorded at low cost in tiny herds using modern technology (e.g., smartphones). Finally, recommendations for the implementation of GS in developing countries are proposed.

Keywords: genomic selection, dairy cattle, India, genetic resources, adaptation, NGO, capacity building

# INTRODUCTION

The demand for animal products in developing countries is growing at an unprecedented rate due to a combination of factors, including steady population growth, diffuse urbanization and rising levels of family incomes (Steinfeld et al., 2006; Rothschild and Plastow, 2014). Environmental constraints, both current and expected under climate change, are particularly severe in developing countries and require a new balance between adaptation and productivity, as compared to breeding programs in temperate countries where the environment is usually better controlled. Consequently, the two main features to consider for animal breeding in developing countries are (i) the need for more balanced selection objectives, and (ii) the interest of crossbred or composite populations, to combine adaptation and production ability in various environments (Rege et al., 2011).

### Edited by:

Max F. Rothschild, Iowa State University, United States

### Reviewed by:

Dirk-Jan De Koning, Swedish University of Agricultural Sciences, Sweden Scott Newman, Genus, United Kingdom

> \*Correspondence: Vincent Ducrocq vincent.ducrocq@inra.fr

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 20 April 2018 Accepted: 25 June 2018 Published: 13 July 2018

### Citation:

Ducrocq V, Laloe D, Swaminathan M, Rognon X, Tixier-Boichard M and Zerjal T (2018) Genomics for Ruminants in Developing Countries: From Principles to Practice. Front. Genet. 9:251. doi: 10.3389/fgene.2018.00251

The aim of this paper is to analyze, through a pilot case, how genomics can be used to set up novel breeding programs matching the specific needs of developing countries.

# PART 1: CURRENT CONTRIBUTION OF GENOMIC INFORMATION TO ANIMAL BREEDING IN DEVELOPING COUNTRIES

# New Knowledge Brought by Genomics

Genomics has already greatly improved our knowledge of animal genetic resources in developing countries. Many studies were initiated with microsatellite markers and are now extended to high density (HD) SNP marker sets and whole genome sequencing, as illustrated in goats (Ajmone-Marsan et al., 2014). These studies have consistently observed higher genetic diversity in local populations of developing countries for all livestock species (Groeneveld et al., 2010), including cattle (Kim et al., 2017). They also made possible the identification of introgression events from exotic breeds and showed that original local populations were still present, thus constituting a genetic resource for animal breeding in developing countries. Analysis of HD SNP data sets on local populations could detect selection signatures associated with adaptation to harsh conditions, mainly those of tropical countries exhibiting hot conditions and pathogen pressure (Gautier et al., 2009; Perez O'Brien et al., 2014; Taye et al., 2017). Thus, selection objectives for breeding in developing countries should not be directly copied from what is applied in temperate countries, even for production systems where environmental conditions can be controlled.

Although molecular data significantly improve our knowledge of animal genetic resources in developing countries, they do not yet benefit breeding programs in these countries. Classical selection requires an elaborate multi-step breeding program, including pedigree recording, phenotyping and breeding value estimation, which is particularly difficult to organize in a developing country. Could molecular data change the picture?

# Making Use of Molecular Data by Genomic Selection in Ruminants

Genomic selection has completely changed the organization of selection in dairy cattle (Boichard et al., 2016). The possibility of using a whole-genome set of markers to improve the accuracy of breeding value prediction was first described by Meuwissen et al. (2001). It consists of using a set of genotyped and phenotyped animals, called the reference population, to estimate marker-phenotype associations, which makes it possible to predict the breeding value of a calf without the need for progeny testing (PT), thereby reducing generation interval and cost of testing. Key factors of success are the size and the design of the reference population and access to an informative SNP chip suited to the population (Boichard et al., 2016). Moreover, at least in theory, a higher number of bulls can be proposed to farmers and the management of genetic variability within a breed can be better monitored. Here, genotypes can replace pedigree recording and the set-up of a breeding program may start on a new basis, as compared to mandatory pedigree recording, often a limiting factor in developing countries.
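One common way in which genotypes stand in for pedigree is a genomic relationship matrix computed directly from SNP data. The sketch below uses a widely used formulation (centering allele counts by twice the allele frequency and scaling by the sum of 2p(1 − p), often attributed to VanRaden); it is offered as an illustration under that assumption, not as the method of any study cited here, and the toy genotypes are made up.

```python
import numpy as np


def genomic_relationship_matrix(genotypes):
    """Genomic relationship matrix from an (animals x SNPs) matrix coded 0/1/2.

    Allele frequencies are estimated from the genotyped animals themselves.
    """
    m = np.asarray(genotypes, dtype=float)
    p = m.mean(axis=0) / 2.0               # allele frequency per SNP
    z = m - 2.0 * p                        # centered allele counts
    scale = 2.0 * np.sum(p * (1.0 - p))    # expected total marker variance
    if scale == 0.0:
        raise ValueError("all SNPs are monomorphic; cannot scale the matrix")
    return z @ z.T / scale


# Toy data: 4 animals x 5 SNPs (made-up values).
toy_genotypes = np.array([
    [0, 1, 2, 1, 0],
    [1, 1, 2, 0, 0],
    [2, 0, 1, 1, 1],
    [1, 2, 0, 2, 1],
])

g_matrix = genomic_relationship_matrix(toy_genotypes)
print(np.round(g_matrix, 2))
```

Off-diagonal elements estimate the realized relationship between pairs of animals, so animals with no recorded parents can still be connected in a genetic evaluation.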

Such a concept was tested on a real data set of 1,013 dairy cows in Kenya, which exhibited various degrees of crossbreeding with exotic breeds (Brown et al., 2016). A principal component analysis based on SNP data showed that individuals could be clustered in three groups according to the proportion of exotic breeds, with a reference and a validation data set for each group. The accuracy of genomic prediction (measured as the correlation between milk yield deviation and genomic breeding value) ranged from 0.32 to 0.41 with GBLUP and from 0.28 to 0.39 with BayesC, with no significant difference in performance between the two methods. Considering that pedigree recording was totally missing, this approach opens the way to the set-up of a breeding program, but limitations were identified regarding the cost of genotyping and the collection of more phenotypic data.
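The reference/validation design described above can be sketched on simulated data. The example below uses SNP-BLUP (ridge regression on centered genotypes), which is generally regarded as equivalent to GBLUP under matching assumptions, and reports accuracy as the correlation between predicted and observed phenotypes in the validation set. All sizes, the heritability, and the shrinkage parameter `lam` are assumptions chosen for illustration; in particular, `lam` is computed from simulated quantities that would have to be estimated in practice.

```python
import numpy as np


def fit_snp_blup(z_ref, y_ref, lam):
    """Estimate the mean and SNP effects by ridge regression (SNP-BLUP)."""
    mu = y_ref.mean()
    lhs = z_ref.T @ z_ref + lam * np.eye(z_ref.shape[1])
    rhs = z_ref.T @ (y_ref - mu)
    return mu, np.linalg.solve(lhs, rhs)


rng = np.random.default_rng(42)
n_ref, n_val, n_snps, h2 = 400, 200, 100, 0.5   # hypothetical design

# Simulate genotypes, true SNP effects, and phenotypes.
freq = rng.uniform(0.1, 0.5, n_snps)
genotypes = rng.binomial(2, freq, size=(n_ref + n_val, n_snps)).astype(float)
z = genotypes - 2.0 * freq                       # centered allele counts
true_effects = rng.standard_normal(n_snps)
breeding_values = z @ true_effects
noise_sd = np.sqrt(breeding_values.var() * (1.0 - h2) / h2)
phenotypes = breeding_values + rng.standard_normal(n_ref + n_val) * noise_sd

# Ratio of residual variance to SNP-effect variance (known only because simulated).
lam = noise_sd ** 2 / true_effects.var()

# Train on the reference animals, predict the validation animals.
mu, snp_effects = fit_snp_blup(z[:n_ref], phenotypes[:n_ref], lam)
predicted = mu + z[n_ref:] @ snp_effects
accuracy = np.corrcoef(predicted, phenotypes[n_ref:])[0, 1]
print(f"validation accuracy (correlation): {accuracy:.2f}")
```

Because the validation correlation is taken against phenotypes rather than true breeding values, it is bounded above by the square root of the heritability, which helps put reported accuracies such as 0.3 to 0.4 in context.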

# PART 2: IMPLEMENTING GENOMIC SELECTION IN DEVELOPING COUNTRIES: A CASE STUDY IN INDIA

In this section, we use the example of BAIF Development Research Foundation<sup>1</sup> , a large Indian NGO, as a pilot case to describe examples of constraints and challenges faced when developing a large-scale dairy cattle breeding program in tropical conditions. For 50 years, BAIF's main mission has been to provide sustainable livelihood to Indian smallholder dairy farmers, in particular by promoting genetic improvement of "non-descript" low yielding cattle (and also buffaloes, but they are not considered here). This is carried out through artificial insemination (AI) using frozen semen technology.

# Characteristics of BAIF's Selection Program

BAIF was one of the pioneer organizations to introduce AI crossbreeding of cows with "exotic" Bos taurus bulls (Holstein and Jersey) in India, which now contributes to more than 50% of the country's milk production. It expanded to such a point that in 2016, BAIF's semen stations produced 12.5 million doses of semen from: (i) purebred "exotic" Holstein and Jersey bulls born in BAIF's bull dam nucleus herd which was created about 40 years ago from heifers imported from Canada and Denmark; (ii) purebred indigenous Bos indicus bulls, mainly of Gir and Sahiwal breeds which have a greater milk production potential, but also of other local (draft) breeds for the purpose of genetic resources conservation; (iii) crossbred bulls exhibiting a range of 50–75% exotic blood.

About 4,500 BAIF AI technicians, each covering 12–15 villages, provide AI at the doorsteps of poor families as well as basic guidance on animal nutrition, health, and management. BAIF is currently serving over four million rural households in 16 states all over India (roughly excluding the extreme South, North, and East states) with very diverse agro-climatic conditions,

<sup>1</sup>http://www.baif.org.in

in terms of temperature, water resources, farming systems and production constraints. The most striking common feature is the very small herd size (<2).

# Initial Selection Practices

Since 1994, BAIF has been part of a field PT program run by the Indian Council of Agricultural Research (ICAR). Under this program, phenotype recording is only on milk yield and is quite costly given the herd size: each cow is recorded every 14 days, in order to obtain an accurate lactation yield. Recording takes place mostly in Maharashtra villages with a long experience with BAIF. Unfortunately, up to 70% of the records are lost, mainly because of unknown sire, animal identification errors or transcription mismatches when entering information in the database. As a result, only a small fraction of all BAIF Holstein and Holstein crossbred bulls have been progeny tested. The best PT bulls have been used as sires of sons and in the most productive villages, which are also the ones that have practiced crossbreeding for the longest time. In practice, non-progeny tested bulls as well as bulls waiting for PT results have to be used continuously (no lay-off period). Therefore, PT, which has made dairy cattle selection so efficient in many countries, is just costly, inappropriate, and quite ineffective under Indian conditions. Clearly, the main bottleneck for a more ambitious bull selection based on PT was, and still is, the implementation of low cost, large scale recording in tiny herds.

# Selection Objectives

There are other important limitations with BAIF's current PT program. It concentrates mainly on the recording and selection of just one trait, milk production, despite the fact that in India, milk price depends heavily on fat content. Also, the huge heterogeneity of agro-climatic conditions generates large genotype × environment interactions, which have to be accounted for in selection at different levels (choice of breed, of fraction of exotic blood for crossbred bulls, of individual bulls). Selecting only on production traits strongly favors animals with (too) high levels of exotic origin, and adaptation to the local conditions can be rapidly lost.

Cow longevity is an obvious trait reflecting adaptation, but is not pertinent in India where slaughter of unproductive cows is not permitted. Considering morphological traits such as good udders, feet, and legs can help but is not enough. The infrastructure for large scale recording of health traits, in particular resistance to mastitis, does not exist yet. A more accessible trait to collect, which can be considered as a proxy for general adaptation, may be fertility: an unfit or unhealthy cow is less likely to be fertile. At BAIF, AI information is of good quality, with a systematic pregnancy diagnosis two months after each insemination. Combined with proper tagging, good AI and calving records are also important prerequisites to ensure the correct pedigree information required in genetic evaluations. Another frequently overlooked aspect of bull selection in India is the farmers' expectations and beliefs (coat color or pattern, shape of horns or ears, etc.), which matter for good acceptance in the field.

# Low Cost Collection of Phenotypes

The possibility to collect field data at BAIF on a much larger scale was investigated through a project (the "Godhan project") sponsored by the Bill and Melinda Gates Foundation (BMGF): 170 AI technicians were equipped with multi-component software, installed first on dedicated "data loggers" and later on mobile phones. Originally developed to follow the economic and social status of BAIF farmers over time, the software was extended to include technical data. Soon, hundreds of thousands of good quality records were gathered, in particular on fertility, avoiding the error-prone process of data entry and validation (Potdar et al., 2017). It was originally planned to also ask the AI technician to directly collect milk production data from the farmers but this appeared to be difficult, probably because the farmers – as well as the AI technicians – were not motivated enough with incentives and above all, proper feedback. Hence, large scale, low cost milk sample collection and analysis (for fat and protein content or for somatic cell counts) remain an issue.

# Toward Genomic Selection

Even with low cost, large scale performance recording, generating a group of progeny tested bulls of reasonable size to start genomic evaluation is a long and complex process, in particular because of the very limited population with pedigree information: an incompressible preliminary period is necessary before tagged daughters from known sires start being recorded.

Most of the constraints and challenges indicated above lead to the notion of promoting the development of female reference populations (FRPs), which replace the requirement for a large-scale recording infrastructure by a more realistic collection of phenotypes from a set of genotyped cows. These phenotypes should cover the traits identified in the selection objective and come from herds with carefully documented environmental and management characteristics, hence offering the possibility to actually measure G × E interactions on all traits. Absence of known pedigree relationships is overcome by using genomic information, the cost of which cannot be covered by small farmers. Since the constitution of an FRP requires strong and long-term financial and technical support from governmental or international institutions, the BAIF project benefits from substantial BMGF sponsorship for 5 years, under which more than 15,000 pure and crossbred indigenous cows, mainly coming from six very diverse Indian states, are phenotyped, and a substantial portion of them are being genotyped.

# Technology and Infrastructure

The commercially available medium- or low-density SNP chips were primarily designed for Bos taurus cattle. For Bos indicus and crossbred animals at BAIF, these chips are suboptimal because a substantial number of SNPs have a very low minor allele frequency, a low heterozygosity or are fixed (Strucken et al., 2018). In other words, they are less informative.
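To make "less informative" concrete, the following minimal Python sketch summarises a single bi-allelic SNP from its genotype counts. The function name, the 0.05 cut-off and the dictionary keys are illustrative assumptions, not values or tools from the BAIF study.

```python
def snp_informativeness(n_aa, n_ab, n_bb, maf_threshold=0.05):
    """Summarise one bi-allelic SNP from its genotype counts.

    n_aa, n_ab, n_bb: numbers of animals with genotypes AA, AB and BB.
    maf_threshold: hypothetical cut-off below which the SNP is flagged
    as poorly informative (an assumption, not a value from the study).
    """
    n_animals = n_aa + n_ab + n_bb
    if n_animals == 0:
        raise ValueError("at least one genotyped animal is required")
    # Frequency of allele A: each AA animal carries two copies, each AB one.
    freq_a = (2 * n_aa + n_ab) / (2 * n_animals)
    maf = min(freq_a, 1.0 - freq_a)
    return {
        "maf": maf,
        # Heterozygosity expected under Hardy-Weinberg proportions.
        "expected_het": 2.0 * freq_a * (1.0 - freq_a),
        "observed_het": n_ab / n_animals,
        "fixed": maf == 0.0,
        "low_maf": maf < maf_threshold,
    }
```

A SNP that is fixed or nearly fixed in the target population contributes little to genomic prediction, whatever its informativeness in the breeds used to design the chip.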

In terms of infrastructure, the actual constitution of a completely new reference population is obviously a long and complicated task requiring huge investments in human and material resources and a strong centralized coordination. BAIF could rely on its existing AI technician networks. It must be emphasized that collection of field data requires constant motivation and follow-up at all levels (farmers, technicians, supervisors). At a central level, the design and maintenance of a high quality database is also essential.

A critical step toward genomic selection is the data analysis and the development of prediction equations. They require a high level of expertise in quantitative genetics, modeling and data manipulation. A potentially overlooked difficulty is the choice of a proper genetic evaluation model, actually reflecting the factors contributing to the observed variability of performances. Developing a sophisticated genomic evaluation based on a simplistic genetic evaluation is strongly counterproductive. At BAIF, technical support from University of New England, Australia, and INRA, France, is available for this applied research work.

A final challenge is transforming research developments and results into a continual data stream and a sustainable genomic evaluation procedure that will routinely provide genomic breeding values of bull and bull dam candidates to selection.

# PART 3: RECOMMENDATIONS FOR APPLICATION OF GENOMIC TOOLS TO ANIMAL BREEDING IN DEVELOPING COUNTRIES

# Involving all Stakeholders in the Breeding Program

In 2010, FAO guidelines recommended Community-Based-Breeding-Programs for the management of animal genetic resources. Benefits and limitations of the approach have been previously discussed (Wurzinger et al., 2011). Practical situations analyzed in Bangladesh (Bhuiyan et al., 2017) have led to a set of recommendations underlining the need to: (i) define breeding objectives relevant for the community; (ii) identify the relevant traits to record; (iii) develop inexpensive and easy-to-use devices for phenotype recording; (iv) promote feedback on the program and information exchange. In addition, both studies highlighted the importance of governmental support, with national breeding policies and enabling measures to scale up the programs.

In the case of BAIF, a major lever is the monitoring of the performance of each AI technician as compared to his/her local colleagues. Technicians are equipped with mobile devices that accelerate data collection and improve data quality. This could be a way to provide farmers with rapid feedback on their practices, allowing for improved management of reproduction and nutrition of their cows. Ultimately, the genetic improvement program aims to improve rural livelihoods. The potential long-term outcomes go beyond genetic improvements and can improve the resilience of the whole farming system.

Selection objectives must reflect a real balance between general adaptation, health, and production. This balance has to be carefully addressed because it influences the long term sustainability of farming. Lessons from the BAIF case suggest the need to identify a trait able to represent the expected balance between production and adaptation, considering local constraints and farmers' preference. Consequently, fertility has been preferred to longevity in India, whereas the latter could be preferred in another context.

# Building the Reference Population

The choice of the breed type and of breed composition in crossbreds should align with the local agro-climatic environment and socio-cultural context, giving priority to animals that cope well with harsh climatic, nutritional, and health conditions. The few examples considering genomic selection tend to favor crossbreeding. Lessons from the BAIF case suggest that a portfolio of purebred or crossbred genotypes is the best answer to the various needs, a strategy which is also described in Bangladesh for ruminants (Bhuiyan et al., 2017). Maintaining various alternatives allows preserving and improving purebred indigenous populations, thus exploiting their specific adaptive features, together with the local production and dissemination of crossbreds.

Whatever the type of animals considered, a very close relationship between the FRP and the on-farm population to be improved is key for the accuracy of genomic prediction. Thus, the FRP must represent the current genetic structure/diversity of the population to be improved, the range of crossbreeding if any, and the range of production conditions (environment and management) because of potential G × E interactions. Thus, a large-scale FRP is required to obtain reliable genomic predictions for populations distributed over a large territory, with little exchange among herds. Two difficulties may arise: (i) inconsistency among agriculture public policies in the case of transboundary populations, (ii) competing initiatives within a country or across countries.

For breeds managed in a large number of small herds, data recording should preferably be standardized among herds, unless appropriate methods are used to account for data heterogeneity (see Methodological Challenges). Data should be analyzed centrally, requiring a full-scale data sharing and a good level of organization.

Cumulative constitution of the FRP is necessary to ensure sustainability of the genomic selection program and a progressive increase in prediction accuracy.

# Methodological Challenges

Lessons from BAIF show that adapting genetic evaluation models (e.g., random regression models) based on test-day records makes possible a better correction for the large environmental changes over the year, and relaxes the requirement of rather strict intervals between consecutive records of a cow (Duclos et al., 2008). Furthermore, the challenge of accounting for local environmental conditions in very small herds could be addressed by including in genetic models a "(group of) village(s) by month" contemporary group as a proxy for herd management.
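As an illustration of the "(group of) village(s) by month" contemporary group, the sketch below labels each test-day record with such a class and collects the records that share it. The record layout, the label format and the function names are hypothetical; the source does not describe how BAIF encodes these groups.

```python
from collections import defaultdict
from datetime import date


def contemporary_group(village_group, record_date):
    """Label a test-day record with its village-group-by-month class."""
    return f"{village_group}_{record_date.year:04d}-{record_date.month:02d}"


def group_records(records):
    """Collect test-day records by contemporary group.

    records: iterable of (cow_id, village_group, record_date, milk_kg)
    tuples (a hypothetical layout chosen for this example).
    """
    groups = defaultdict(list)
    for cow_id, village_group, record_date, milk_kg in records:
        label = contemporary_group(village_group, record_date)
        groups[label].append((cow_id, milk_kg))
    return dict(groups)
```

Pooling neighbouring villages within a month gives each class enough records to estimate its effect, which a herd-level effect cannot do when herds hold only one or two cows.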

As noted above for BAIF, the existing SNP chips, developed for Bos taurus in developed countries, do allow the genotyping of bovine populations in developing countries, but they may not be fully operational: they are less informative than expected, especially for pure Bos indicus or Bos indicus × Bos taurus crosses. Three alternatives are then possible: (i) accept a loss in accuracy, which may be compensated by a higher number of genotypes with a cheaper Bos taurus chip; (ii) create an imputation population of animals genotyped with the HD chip, whose design included some Bos indicus breeds, and impute the HD genotype of the whole reference population; (iii) design a new chip fully adapted to Bos indicus and crossbred animals. The best option depends on the local conditions. In particular, if the market for a new chip is limited, e.g., when different stakeholders each want to develop a different chip for the same target population, option (iii) may be the least effective one. Such a new chip could be developed as part of a South–South collaboration, involving scientists and breeders from countries concerned by a particular set of breeds to be improved and willing to set up such breeding programs. The sharing of data will lead to a common SNP database from which a suitable chip can be created, to be used for marker-phenotype association in the FRP. A possibility in the case of BAIF is to envision a transition over time between these options, from (i) + (ii) to (iii). Other options, such as genotyping-by-sequencing, have been proposed (Gorjanc et al., 2015), but they should be considered with caution because patent derogations for developing countries may be required.

Exploiting a large number of genotypes at the whole-genome level also opens new possibilities for animal breeding:

– the numerous genotypes being collected for males and females could be used to monitor inbreeding at the genome level and better manage population diversity;

– identification of genomic regions that are common across breeds (with identical directions of allele effects) and that are significantly associated with traits to be improved may help improve across-breed genomic evaluations (Purfield et al., 2015).
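The first point above can be sketched with a simple excess-homozygosity estimator of genomic inbreeding. This is one common method-of-moments estimator chosen for illustration, not necessarily the one used at BAIF, and the function name and input layout are assumptions.

```python
def genomic_inbreeding(genotypes, allele_freqs):
    """Excess-homozygosity inbreeding coefficient for one animal.

    genotypes: allele counts (0, 1 or 2) per SNP; None marks a missing call.
    allele_freqs: population frequency of the counted allele at each SNP.
    Returns F = (O - E) / (N - E), where O is the observed and E the
    expected number of homozygous genotypes over N non-missing SNPs.
    """
    observed_hom = 0
    expected_hom = 0.0
    n_snps = 0
    for g, p in zip(genotypes, allele_freqs):
        if g is None:
            continue  # skip missing genotype calls
        n_snps += 1
        if g != 1:
            observed_hom += 1
        # Probability of a homozygote under Hardy-Weinberg proportions.
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    if n_snps == expected_hom:
        raise ValueError("no polymorphic SNP available")
    return (observed_hom - expected_hom) / (n_snps - expected_hom)
```

Values near zero indicate heterozygosity in line with population allele frequencies, while positive values indicate an excess of homozygous genotypes that a mating plan could aim to limit.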

# Technological Challenges

Developing countries suffer from deficient tools and infrastructures (Rothschild and Plastow, 2014; Helmy et al., 2016), which limits the use of genomic information in breeding programs.

Reliable marker genotypes require good management of samples for DNA extraction, easy access to experienced genotyping platforms and a proper database infrastructure. Such structures are often missing or weakly supported for livestock. Therefore, the opportunity of using genotyping or sequencing platforms developed for human genetics should be encouraged to save the cost of establishing expensive dedicated platforms (Glenn, 2011). However, the crucial step for any breeding organization is to master bioinformatics expertise and secure access to computing facilities. As an example, a pan-African network was set up for the "Human Heredity and Health in Africa" initiative<sup>2</sup>, to support access to technologies, facilitate the funding of infrastructures and offer training. In the case of livestock, the interstate Research Center "Centre International de Recherche-Développement sur l'Elevage en zone Subhumide (CIRDES)," based in Burkina Faso and resulting from the partnership between seven West African countries, could play this role in the sub-region.

The lack of sperm production, preservation and dissemination facilities in developing countries has long been reported (Timon, 1993) and remains relevant in many of them (Rothschild and Plastow, 2014). Lessons from BAIF show the value of controlling a large-scale infrastructure for AI in order to benefit fully from the use of genetic information, especially when serving small farmers.

Internet access and easy communication tools (mobile apps) are also very important enhancers, both for the technical supervision of the farms and for the farmers themselves, to facilitate their involvement and appropriation of breeding programs as well as data collection. Thus, internet connections must be effective. Even when such a network exists (Helmy et al., 2016), the lack of stability of the country's energy infrastructures often causes power cuts and weakens internet reliability (Karikari, 2015).

# Capacity Building

In terms of capacity building, constraints observed in developing countries to enable the implementation of genomics applied to livestock are many and involve human, institutional, logistical and financial aspects (Rothschild and Plastow, 2014; van Marle-Köster et al., 2015; Helmy et al., 2016).

The use of genomic data requires expertise in database development and support, quantitative genetics, and statistical modeling to guarantee accurate and stable genomics analyses.

Yet, setting up genetic improvement programs is worthwhile only when animals' maintenance feed requirements are covered (McDowell, 1989; Timon and Baber, 1989). To this extent, farmer training courses should provide, on the one hand, basic guidance on animal nutrition, health and management to improve animal welfare and, on the other hand, should explain the requirements in terms of data recording, and raise awareness of pros and cons regarding the choice of a bull or bull type, i.e., purebred or crossbred.

Training programs for scientists and managers of breeding programs are needed in quantitative genetics, genomics and bioinformatics, with access to scientific literature resources (Rothschild and Plastow, 2014; Karikari, 2015; Helmy et al., 2016). South–South and North–South co-operations are to be encouraged to facilitate training.

# Investment

The main drawback of setting up a reference population is the genotyping cost of a large number of animals: the amount of phenotypic information associated with each genotype and available for genomic evaluation is substantially smaller for cows than for progeny tested bulls (Goddard, 2009). This reduction may be even larger in developing countries for two main reasons: a larger equivalent population size of populations with a limited selection history (e.g., for Bos indicus cattle) and lower heritability traits due to a much more variable environment and a small herd size.
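The two effects described above can be illustrated with a widely used deterministic approximation from the genomic prediction literature, r = sqrt(N q / (N q + Me)), where N is the reference population size, q the reliability of each animal's record and Me the number of independent chromosome segments. The formula is not given in this paper; applying it here, and the function names used, are assumptions made for illustration.

```python
import math


def expected_accuracy(n_animals, reliability, n_segments):
    """Approximate accuracy of genomic prediction.

    n_animals: size of the reference population (N).
    reliability: squared accuracy of each animal's record (q); roughly
        the heritability for a cow with her own performance, and higher
        for a progeny tested bull.
    n_segments: number of independent chromosome segments (Me), which
        grows with effective population size.
    """
    signal = n_animals * reliability
    return math.sqrt(signal / (signal + n_segments))


def animals_needed(target_accuracy, reliability, n_segments):
    """Reference population size required to reach a target accuracy."""
    r2 = target_accuracy ** 2
    # Solve r2 = N*q / (N*q + Me) for N.
    return math.ceil(r2 * n_segments / (reliability * (1.0 - r2)))
```

Under this approximation, halving the reliability of each record doubles the number of genotyped animals needed for the same accuracy, and so does doubling the number of independent segments.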

Using genomic information for the management of genetic variability may be relatively easy, provided that the genotyping cost is affordable, which is not so obvious for small populations. To decrease costs, a multi-breed SNP chip is an option to recommend.

<sup>2</sup>https://h3africa.org

Lessons from BAIF show that a major investor is needed to start a sustainable program, which should be a donor, either a public institution, or a private foundation supporting common goods, such as BMGF. It is of utmost importance to orient these donors toward breeding programs aimed at empowering local communities. Then, long-term operations require a professional and self-supporting organization.

# CONCLUSION

Genomic selection has the potential to overcome the difficulties encountered by developing countries in implementing classical breeding programs where pedigree recording is a pre-requisite. The aim is not to copy breeding programs from temperate countries but to benefit from new methods to better answer the needs of farmers in developing countries. The analysis of a case study provided by BAIF helps to identify the critical factors of success, including: importance of a representative reference population in terms of diversity of genotypes and of environmental conditions; definition of balanced selection objectives and appropriate traits as proxy for adaptation; involvement of farmers and technicians with incentives and quick feedback to them; building local expertise in quantitative genetics and bioinformatics. Challenges consist in accounting for genotype × environment interactions, decreasing genotyping cost by using common tools, taking full advantage of genomic data to combine preservation of genetic diversity with improvement of animal performance, and building a sustainable economic model complementary to donor support. A balanced and well monitored use of local and exotic genetic resources is possible. This deserves appropriate public policies allowing for the development of new breeding programs without compromising the preservation of local genetic resources.

# AUTHOR CONTRIBUTIONS

MT-B conceived the paper, drafted part of the paper, and read, discussed, and approved the whole manuscript. VD proposed the case study, drafted part of the paper, and read, discussed, and approved the whole manuscript. MS provided information on the case study, and read and approved the whole manuscript. DL, XR, and TZ drafted parts of the paper, and read, discussed, and approved the whole manuscript.

# FUNDING

Authors' salaries come from their host institutions: INRA for VD, DL, MT-B, and TZ, AgroParisTech for XR, BAIF for MS. The case study described in this study was funded by the Bill and Melinda Gates Foundation, as described in the body of the manuscript.

# REFERENCES

opportunities in Southern Africa. Food Res. Int. 76, 971–979. doi: 10.1016/j.foodres.2015.05.057

Wurzinger, M., Sölkner, J., and Iñiguez, L. (2011). Important aspects and limitations in considering community-based breeding programs for low-input smallholder livestock systems. Small Rumin. Res. 98, 170–175. doi: 10.1016/j.smallrumres.2011.03.035

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ducrocq, Laloe, Swaminathan, Rognon, Tixier-Boichard and Zerjal. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Whole-Genome Resequencing of Red Junglefowl and Indigenous Village Chicken Reveal New Insights on the Genome Dynamics of the Species

Raman A. Lawal<sup>1</sup>\*, Raed M. Al-Atiyat<sup>2,3</sup>, Riyadh S. Aljumaah<sup>3</sup>, Pradeepa Silva<sup>4</sup>, Joram M. Mwacharo<sup>5</sup> and Olivier Hanotte<sup>1,6</sup>\*

*<sup>1</sup> Cells, Organisms and Molecular Genetics, School of Life Sciences, University of Nottingham, Nottingham, United Kingdom, <sup>2</sup> Genetics and Biotechnology, Animal Science Department, Agriculture Faculty, Mutah University, Karak, Jordan, <sup>3</sup> Animal Production Department, College of Food and Agriculture Sciences, King Saud University, Riyadh, Saudi Arabia, <sup>4</sup> Department of Animal Sciences, Faculty of Agriculture, University of Peradeniya, Peradeniya, Sri Lanka, <sup>5</sup> Small Ruminant Genomics, International Centre for Agricultural Research in the Dry Areas, Addis Ababa, Ethiopia, <sup>6</sup> LiveGene – CTLGH, International Livestock Research Institute, Addis Ababa, Ethiopia*

### Edited by:

*Meng-Hua Li, Institute of Zoology (CAS), China*

### Reviewed by:

*Shikai Liu, Ocean University of China, China; Ed Smith, Virginia Tech, United States*

### \*Correspondence:

*Raman A. Lawal, lawalakinyanju@yahoo.com; Olivier Hanotte, olivier.hanotte@nottingham.ac.uk*

### Specialty section:

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

Received: *06 March 2018* Accepted: *29 June 2018* Published: *20 July 2018*

### Citation:

*Lawal RA, Al-Atiyat RM, Aljumaah RS, Silva P, Mwacharo JM and Hanotte O (2018) Whole-Genome Resequencing of Red Junglefowl and Indigenous Village Chicken Reveal New Insights on the Genome Dynamics of the Species. Front. Genet. 9:264. doi: 10.3389/fgene.2018.00264*

The red junglefowl *Gallus gallus* is the main progenitor of domestic chicken, the commonest livestock species, outnumbering humans by an approximate ratio of six to one. The genetic control of production traits has been well studied in commercial chicken, but the selection pressures underlying unique adaptation and production to local environments remain largely unknown in indigenous village chicken. Likewise, the genome regions under positive selection in the wild red junglefowl remain untapped. Here, using the pool heterozygosity approach, we analyzed indigenous village chicken populations from Ethiopia, Saudi Arabia, and Sri Lanka, alongside six red junglefowl, for signatures of positive selection across the autosomes. Two red junglefowl candidate selected regions were shared with all domestic chicken populations. Four candidate sweep regions, unique to and shared among all indigenous domestic chicken, were detected. Only one region includes annotated genes (*TSHR* and *GTF2A1*). Candidate regions that were unique to each domestic chicken population, with functions relating to adaptation to temperature gradient, production, reproduction and immunity, were identified. Our results provide new insights on the consequence of the selection pressures that followed domestication on the genome landscape of the domestic village chicken.

Keywords: red junglefowl, Gallus gallus, indigenous village chicken, chicken domestication, chicken adaptation, environmental adaptation, positive selection, candidate sweep regions

# INTRODUCTION

Since Charles Darwin proposed a single ancestry of chicken from the red junglefowl, its status as either monophyletic or polyphyletic has been debated (Darwin, 1868; Beebe, 1918; Danforth, 1958; Morejohn, 1968; Fumihito et al., 1994). While the red junglefowl is the main ancestor, some studies are now supporting genetic contributions from other junglefowl species (Eriksson et al., 2008; Lawal, 2017).

The evidence is also controversial as to the timing and places where chicken domestication first occurred (Zeuner, 1963; Crawford, 1984; West and Zhou, 1988; Fumihito et al., 1996; Liu et al., 2006; Xiang et al., 2014, 2015; Peters et al., 2015). A study on mitochondrial DNA suggests multiple centers of chicken domestication (Liu et al., 2006) from which chicken dispersed to different parts of the world through humans' influence. They entered North Africa, the Middle East and Sri Lanka from the Indian subcontinent, while maritime introductions, likely originating initially in South-East Asia, occurred along the coast of East Africa as well as Sri Lanka (Silva et al., 2009; Gifford-Gonzalez and Hanotte, 2011; Mwacharo et al., 2011). Following these migration events, natural and artificial selection have shaped the genome landscape of domestic chicken, resulting in a wide spectrum of breeds and ecotypes.

Aside from the fancy breeds, domestic chickens primarily come under two major categories: commercial and indigenous village chickens (Schmid et al., 2015). In developing countries, the latter play prominent roles in the livelihood of smallholder farmers, being adapted to their local environmental conditions. They are often under the custody of women and children, mainly kept as dual purpose (eggs and meat) birds. Furthermore, indigenous village chicken showing special visual appeal such as comb type, skin and feather colors may have been selected by smallholder farmers, thereby increasing the frequencies of desirable phenotypes (Dana et al., 2010; Desta et al., 2013). Extensive phenotypic variations such as plumage color and other morphological characteristics, behavioral, and production traits, which are present in domestic chicken but absent in the red junglefowl, are the result of domestication, adaptation to various agro-ecosystems and stringent human selection for production and/or aesthetic values (Schütz et al., 2001; Keeling et al., 2004; Tixier-Boichard et al., 2011).

In commercial chicken lines, the genetic factors that control growth, development, reproduction, and production traits have been well studied (Rubin et al., 2010; Fu et al., 2016). Meanwhile, the genetic mechanisms underlying unique adaptations to tropical environmental pressures and productivity remain poorly studied in indigenous chicken. Likewise, in the red junglefowl, little is known about the genetic control of its adaptation and survival in its natural habitat. Here, we investigate, using whole-genome sequence data, footprints of positive selection in the genome of red junglefowl and domesticated indigenous village chicken in order to better understand the evolutionary pressures during the domestication of the species and its adaptation to different production environments.

# MATERIALS AND METHODS

## Sampling and Sequencing

A total of 27 indigenous domestic village chickens were sampled and then grouped into three populations based on the countries of origin. These comprise Ethiopian domestic chicken from two districts, Horro (n = 6, altitude around 2,320 m above sea level (asl)) and Jarso (n = 5, altitude around 1,870 m asl); Saudi Arabian domestic chicken from three villages in the Eastern Province, Al Qurin (n = 2, altitude around 130 m asl), Goligglah (n = 2, altitude around 130 m asl) and Al Oyoun (n = 1, altitude around 110 m asl); and Sri Lankan domestic chicken from Puttalam district (n = 11, altitude around 60 m asl). Horro is a sub-humid region, with an annual rainfall of 1,685 mm and an average temperature of around 19 °C. Jarso is semi-arid with an average annual temperature of 21 °C and annual average rainfall of 700 mm (Desta et al., 2013). The Eastern Province of Saudi Arabia has an average annual temperature of 26 °C (ranging from 21.2 to 50.8 °C) and average annual rainfall of 74 mm. Puttalam district of Sri Lanka has an average annual rainfall of ∼1,000 mm and temperature of 27 °C.

Blood samples were collected from the wing vein, and genomic DNA was extracted using ammonium acetate precipitation (Bruford et al., 1998) and phenol-chloroform protocols. A minimum of 3 µg at 30 ng/µl DNA concentration was used for whole genome re-sequencing. Samples were sequenced at the Beijing Genomic Institute (BGI) or at Novogene on a HiSeq 2000/2500 Illumina platform. Paired-end libraries with a 500 bp insert size, read lengths of 90–100 bp and genome coverage of between 10X and 30X (Table S1) were generated. Adapter contamination in the raw reads and sequences with quality scores ≤5 were removed at source by BGI/Novogene.

For the six red junglefowl, one whole-genome sequence (15X genome coverage) from a captive bird (Koen Vanmechelen private collection)<sup>1</sup> and five whole genome sequences (12X−36X genome coverage) from Wang et al. (2015) were included in the analyses (Table S1). The five red junglefowl were sampled in Yunnan (altitude ∼3,000 m asl) and Hainan (altitude ∼1,840 m asl) provinces, China. Yunnan is a subtropical highland or humid tropical zone with an annual rainfall range of between 600 mm and 2,300 mm, and an annual temperature range of between 8 and 27 °C. For the humid tropical Hainan province, the average annual rainfall is about 2,000 mm and temperature ranges between 16 and 29 °C. Fastq files for all samples newly sequenced in this study have been deposited at NCBI under the SRA accession number SRP142580 and are accessible through https://www.ncbi.nlm.nih.gov/sra/SRP142580.

# Sequence Alignment and Variants Calling

The 33 whole-genome sequences were independently aligned to Galgal 4.0, which has a reference genome size of 1.07 Gb (Hillier et al., 2004), using Burrows-Wheeler Aligner (BWA) version 0.7.5a (Li and Durbin, 2010). Sorting the alignment files into coordinate order, marking the duplicate reads and indexing the binary alignment map (bam) files were done using Picard tools version 1.105<sup>2</sup>. Using the genome analysis toolkit (GATK) version 3.4.0 (McKenna et al., 2010; DePristo et al., 2011; Auwera et al., 2013), we performed a two-step protocol for local realignment around insertions and deletions (indels) to clean up artifacts arising from misalignments during the initial mapping steps. Finally, we applied a quality score recalibration step for each base call to remove any errors carried over during the sequencing.

<sup>1</sup>http://www.ccrp.be/

<sup>2</sup>http://picard.sourceforge.net

To call variants, we ran "HaplotypeCaller" from GATK for each sample bam file to create a single-sample "gVCF" using the "-emitRefConfidence GVCF" option. We then followed the multisample aggregation approach, which jointly genotyped variants by merging the records of all genome data from each population. Using the "-selectType SNP" option along with "SelectVariants" from GATK, we extracted the SNPs from the raw genotype file before filtering the extracted SNPs using "VariantFiltration." All investigations were restricted to bi-allelic single nucleotide polymorphisms (SNPs), selected using bcftools version 1.2 (Li et al., 2009), on the autosomes (chromosomes 1–28) and the full mitochondrial DNA (mtDNA).
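The restriction to bi-allelic autosomal SNPs can be sketched as a predicate on a tab-separated VCF data line. This is only an illustration of the criterion, not the bcftools command used in the study; the function name and the optional "chr" prefix handling are assumptions.

```python
AUTOSOMES = {str(i) for i in range(1, 29)}  # chromosomes 1-28


def is_biallelic_autosomal_snp(vcf_line):
    """Return True if a VCF data line is a bi-allelic SNP on chromosomes 1-28.

    vcf_line: one tab-separated VCF record (not a '#' header line).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, ref, alt = fields[0], fields[3], fields[4]
    if chrom.startswith("chr"):
        chrom = chrom[3:]  # tolerate 'chr1'-style names (an assumption)
    if chrom not in AUTOSOMES:
        return False
    if "," in alt:
        return False  # more than one alternative allele
    bases = ("A", "C", "G", "T")
    return ref in bases and alt in bases
```

Records on the sex chromosomes, on unplaced scaffolds, with indels or with several alternative alleles are all rejected by this check.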

The mapping metrics including the percentage of read pairs that properly mapped to the same chromosome, mean depth coverage, total reads mapped, percentage of the genome with bases covered by at least 5, 10 and 20 reads were calculated using samtools version 0.1.19 (Li et al., 2009). Using Ensembl's "VEP" version 85 (Aken et al., 2016), we predicted the consequences of the variants while the total number of SNPs in each sample/population were identified using VCFtools version 0.1.11 (Danecek et al., 2011). The "VennDiagram" package (Chen and Boutros, 2011) in R was used to plot the unique and shared SNPs between the domestic chicken and red junglefowl.
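The depth-based metrics named above can be illustrated with a short sketch that takes one read depth per genome position. The input is assumed to have been parsed beforehand (for example from the depth column of per-position samtools output, with uncovered positions included as zeros); the function name and dictionary keys are hypothetical.

```python
def coverage_summary(depths, thresholds=(5, 10, 20)):
    """Mean depth and percentage of positions covered by at least t reads.

    depths: per-position read depths, including zeros for uncovered
    positions (an assumed, pre-parsed input).
    """
    depths = list(depths)
    if not depths:
        raise ValueError("no positions supplied")
    n_positions = len(depths)
    summary = {"mean_depth": sum(depths) / n_positions}
    for t in thresholds:
        covered = sum(1 for d in depths if d >= t)
        summary[f"pct_ge_{t}x"] = 100.0 * covered / n_positions
    return summary
```

Leaving uncovered positions out of the input would inflate both the mean depth and the percentages, so the denominator should be the full genome length analysed.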

# Population Structure and Genetic Differentiation

We removed SNPs in linkage disequilibrium using PLINK version 1.9<sup>3</sup> to establish the genetic structure of each population and the relationships between samples. We then assessed the structure of each population in unsupervised mode using ADMIXTURE version 1.3.0 (Alexander et al., 2009). Using the default cross-validation setting (folds = 5), we ran the analysis for 10 clusters (K). For the principal component analysis (PCA), we ran the smartpca program in eigenstrat version 6.0.1 (Price et al., 2006). The proportion of variance explained by each eigenvector was calculated by dividing the corresponding eigenvalue by the sum of all the eigenvalues.
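The variance calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name and the eigenvalues are hypothetical and are not taken from the smartpca output of this study.

```python
def variance_explained(eigenvalues):
    """Return the proportion of variance explained by each eigenvector.

    Each proportion is the eigenvalue divided by the sum of all eigenvalues.
    """
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [value / total for value in eigenvalues]


# Hypothetical eigenvalues, for illustration only (not values from this study).
example_eigenvalues = [4.0, 2.0, 1.0, 1.0]
proportions = variance_explained(example_eigenvalues)
for index, proportion in enumerate(proportions, start=1):
    print(f"PC{index}: {proportion:.1%} of variance")
```

In practice the eigenvalues would be read from the eigenvalue file written by the PCA program, one value per component.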

Genome-wide nucleotide diversity (π) within populations and genetic differentiation (FST) between populations were calculated in 20 kb windows with a 10 kb sliding step using VCFtools version 0.1.11 (Danecek et al., 2011). For FST, the pairwise values were calculated between each domestic chicken population and the red junglefowl.
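To make the windowing scheme concrete, the following sketch generates 20 kb windows with a 10 kb step and computes a per-base nucleotide diversity for one window. It is a simplified stand-in and not the VCFtools implementation: the function names are hypothetical, SNPs are assumed to be bi-allelic, and the per-site estimator 2k(n − k)/(n(n − 1)) for k alternate alleles among n sampled chromosomes is an assumption of this example.

```python
def sliding_windows(chrom_length, size=20_000, step=10_000):
    """Yield 1-based inclusive (start, end) windows along a chromosome."""
    start = 1
    while start <= chrom_length:
        # The last windows are truncated at the chromosome end.
        yield start, min(start + size - 1, chrom_length)
        start += step


def site_pi(alt_count, n_chromosomes):
    """Expected pairwise differences at one bi-allelic site."""
    k, n = alt_count, n_chromosomes
    return 2 * k * (n - k) / (n * (n - 1))


def window_pi(snps, start, end, n_chromosomes):
    """Per-base nucleotide diversity in the window [start, end].

    snps is a list of (position, alt_allele_count) tuples.
    """
    total = sum(
        site_pi(alt_count, n_chromosomes)
        for position, alt_count in snps
        if start <= position <= end
    )
    return total / (end - start + 1)


# Toy data: three SNPs typed in 5 diploid birds (10 chromosomes).
toy_snps = [(100, 5), (15_000, 1), (25_000, 10)]
for start, end in sliding_windows(45_000):
    print(start, end, window_pi(toy_snps, start, end, n_chromosomes=10))
```

Dividing by the window length rather than by the number of SNPs gives a per-base value, which is the scale on which the π estimates in the Results are reported.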

# Mitochondrial DNA Analysis

The full mitochondrial consensus sequence was extracted from the whole-genome sequence of each of the 33 samples using the "consensus" option in bcftools version 1.2 (Li et al., 2009). Multiple sequence alignment was conducted for the 33 mtDNA genomes using ClustalX version 2.1 (Larkin et al., 2007). To identify the best-fit nucleotide substitution model, we ran jModeltest version 2.1.7 (Darriba et al., 2012). The HKY+I+G model (Hasegawa et al., 1985) was selected as the best, based on the Akaike Information Criterion (AIC), and was subsequently used to construct an unrooted maximum likelihood tree using phyml 3.0 (Guindon and Gascuel, 2003). The tree was then viewed in MEGA 7.0 (Kumar et al., 2016).

To assess the haplogroup (clade) of each mtDNA sequence, we extracted the hypervariable region (HVR; the first 397 bp) of the D-loop from the full mitochondrial sequences, using as references the mtDNA sequence of Komiyama et al. (2003) (NCBI accession number AB098668) and six haplogroups sensu Mwacharo et al. (2011) (Table S2). A haplotype data file including all 40 D-loop HVR sequences was generated using DnaSP version 5.1 (Librado and Rozas, 2009), from which the median-joining network was constructed using Network 5.0.0.1<sup>4</sup>.

# Selective Sweep Analysis

To detect putative selection sweeps, we used the pool heterozygosity (Hp) method (Rubin et al., 2010). It was performed using a 20 kb window size with a 10 kb sliding step following the equation:

$$H_{\text{p}} = \frac{2\sum n_{\text{MAJ}}\sum n_{\text{MIN}}}{\left(\sum n_{\text{MAJ}} + \sum n_{\text{MIN}}\right)^2} \tag{1}$$

where Σn<sub>MAJ</sub> and Σn<sub>MIN</sub> are the sums of major and minor allele frequencies, respectively, for all the SNPs in the 20 kb window. The Hp values calculated for each window were then Z-transformed using the equation:

$$Z(H_{\text{p}}) = \frac{H_{\text{p}} - \overline{X}(H_{\text{p}})}{\sigma(H_{\text{p}})} \tag{2}$$

where X̄ is the mean and σ is the standard deviation of Hp.

A genome-wide score of Z(Hp) ≤−4.0 was taken as the threshold after examining the distribution plot of the Z(Hp) values (Figures S1A–D). The size of each candidate selective sweep region was calculated by adding up the overlapping adjacent windows that passed the genome-wide threshold.

Since the accuracy of detecting selective sweeps depends on the number of SNPs in each window, and considering the high polymorphism identified within populations, only windows with at least 50 SNPs were considered. Following this criterion, 52, 103, 56, and 39 windows were excluded from the Ethiopian, Saudi Arabian and Sri Lankan chicken populations and red junglefowl datasets, respectively.

# Haplotype Trees

To assess whether a single or multiple haplotypes were selected across populations, we built haplotype trees for the common candidate "domesticated" regions and the regions shared between all domestic chicken and red junglefowl. Only significant window(s) shared across populations were used to define each region. For this purpose, we included the haplotype sequences from all junglefowl species used in the Lawal (2017) study. Maximum likelihood trees were rooted with the green junglefowl and built using Phyml 3.0 (Guindon and Gascuel, 2003) after the evolutionary model was predicted using jModeltest 2.1.7 (Darriba et al., 2012). Genome sequences of the non-red junglefowl species and G. gallus bankiva are available at the DNA Data Bank of Japan Sequence Read Archive (accession no. DRA003951) (Ulfah et al., 2016).

<sup>3</sup>https://www.cog-genomics.org/plink2

<sup>4</sup>http://www.fluxus-engineering.com/sharenet.htm

# Remapping the Galgal 4.0 Sweep Regions to Galgal 5.0 Coordinates

Following the release of the new reference genome Galgal 5.0 (Warren et al., 2017), we remapped the Galgal 4.0 candidate sweep regions to the corresponding Galgal 5.0 coordinates using NCBI remapper (February 2017 release). All remapping options were left at their default thresholds. Selective sweep regions based on Galgal 4.0 and their corresponding positions in Galgal 5.0 are reported in Tables S4–S7, including changes in the annotated genes between the two reference genomes. Only genes annotated at the Galgal 5.0 positions of the candidate regions are reported and discussed herein.

# Gene Ontology and Pathways Analysis

To establish the biological significance of the genes found in each candidate selection sweep region, we performed gene ontology and pathways analysis using the Database for Annotation, Visualization, and Integrated Discovery (DAVID version 6.8)<sup>5</sup> and the Kyoto Encyclopaedia of Genes and Genomes (KEGG) (KOBAS version 3.0)<sup>6</sup>. The default threshold of Fisher's exact P < 0.05 was used to identify over-represented genes.
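The idea behind an over-representation test can be illustrated with a plain one-sided Fisher's exact test, which is equivalent to the upper tail of the hypergeometric distribution. This sketch is not the procedure implemented in DAVID or KOBAS, which apply their own variants and corrections; the function name and all counts below are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, list_size, term_size, background_size):
    """One-sided Fisher's exact P-value for over-representation.

    hits            -- genes in the candidate list annotated with the term
    list_size       -- number of genes in the candidate list
    term_size       -- genes in the background annotated with the term
    background_size -- total number of background genes
    """
    max_hits = min(list_size, term_size)
    total = comb(background_size, list_size)
    # Sum the hypergeometric probabilities of observing `hits` or more.
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 3 of 5 candidate genes carry a term that
# annotates 5 of 20 background genes.
p_value = enrichment_p_value(hits=3, list_size=5, term_size=5, background_size=20)
print(f"P = {p_value:.4f}; over-represented at 0.05: {p_value < 0.05}")
```

With realistic gene lists the background would contain thousands of genes, and a multiple-testing correction across terms would normally be considered.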

# RESULTS

# Sequencing and SNPs Identification

Following filtering for quality and adapter contamination, the clean sequence reads for each domestic chicken sample ranged between 108.8 and 408.9 million base pairs (bp) depending on the extent of genome coverage (10X−30X) (see Table S1). For each domestic chicken, the proportion of nucleotides with quality score >20 (Q20) ranged from 94 to 96%.

More than 90% of the read pairs in all samples were properly mapped to the same chromosome. Except for the red junglefowl\_koen sample with 94.69% of mapped reads, ≥97% of all the reads were mapped to the reference genome. On average, ≥97% of the bases were covered by at least 5 reads, while ≥89% of the bases had minimum support of 10 reads (Table S1).

The intermediate genomic variants generated for individual birds using "HaplotypeCaller" from GATK (Auwera et al., 2013) were used to jointly genotype all samples belonging to a population into a single variants file. Excluding multiallelic sites, the average number of SNPs in each sample was ∼6 million (∼6 SNPs/kb). The only exceptions were the red junglefowl5 and red junglefowl\_koen samples, which had ≥7 million SNPs. Around 60% of the SNPs were heterozygous in each sample, except in three domestic chickens (JB1A25B, JB2A04B, and Saudi Arabia1), which showed ∼45% heterozygous SNPs (Table S1). At the population level, we identified 13.07 (∼12 SNPs/kb), 10.23 (∼9 SNPs/kb) and 14.46 (∼13 SNPs/kb) million SNPs in Ethiopian, Saudi Arabian, and Sri Lankan domestic chickens, respectively, and 15.31 (∼14 SNPs/kb) million SNPs in the red junglefowl. This corresponds to a total of 17.0 million SNPs (∼16 SNPs/kb) for the domestic chicken populations combined, and 20.81 million SNPs (∼19 SNPs/kb) after combining the genomes of all the domestic chicken populations and red junglefowl (Table S3; Figure S2).
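The SNP densities quoted above are simple ratios of SNP counts to sequence length. The sketch below assumes the 1.07 Gb Galgal 4.0 reference size quoted in the Methods as the denominator; the exact denominator and rounding used for the reported values are not stated, so the output should be read as approximate. The function and variable names are hypothetical.

```python
GENOME_SIZE_BP = 1.07e9  # Galgal 4.0 reference size quoted in the Methods


def snps_per_kb(n_snps, region_bp=GENOME_SIZE_BP):
    """Return the SNP density per kilobase for a genome or a region."""
    return n_snps / (region_bp / 1_000)


# Population-level SNP counts as reported in the text.
population_snps = {
    "Ethiopian": 13.07e6,
    "Saudi Arabian": 10.23e6,
    "Sri Lankan": 14.46e6,
    "Red junglefowl": 15.31e6,
}
for name, n_snps in population_snps.items():
    print(f"{name}: {snps_per_kb(n_snps):.1f} SNPs/kb")

# The same function applies to a single region, e.g. 132 variable sites
# in a 20 kb candidate region.
print(f"Region density: {snps_per_kb(132, region_bp=20_000):.1f} SNPs/kb")
```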

Around 11.05 million SNPs were shared between domestic chicken and red junglefowl, while 5.4 and 3.8 million SNPs were unique to domestic chicken and the red junglefowl, respectively (Figure S2). We identified 1.76 million (13% of the total number of SNPs), 1.03 million (10%), and 2.33 million (16%) novel SNPs in Ethiopian, Saudi Arabian and Sri Lankan domestic chickens, respectively, and 4.45 million (29%) in the red junglefowl. More than 54% of the SNPs occurred within introns, 30% in intergenic regions, and 5.7 and 4.3% in upstream and downstream gene regions, respectively. The 3′ and 5′ UTR variants accounted for 1.8 and 0.4% of the SNPs, respectively (Table S3).

# Population Structure

Population structure at the autosomal level was examined using Principal Component (PC) (**Figure 1**) and Admixture analyses (**Figure 2**). PC1 and PC2 separate all the domestic populations from the red junglefowl, a result that was also obtained at K = 4 in the admixture analysis. The other admixture plots (5 ≤ K ≤ 10) are shown in Figure S3.

# Diversity and Genetic Differentiation

Across populations, we observe the highest genome nucleotide diversity (π = 0.0052) in the red junglefowl. Among the domestic chicken populations, Sri Lankan domestic chicken show the highest nucleotide diversity (π = 0.0046), followed by the Ethiopian Horro (π = 0.0040), Saudi Arabian (π = 0.0039) and Ethiopian Jarso domestic chicken (π = 0.0036).

FIGURE 1 | Principal Component Analysis (PCA) plot. The top left label defines colors for each population. Individuals with name annotations have been uniquely identified for comparison purpose with Figures 2, 3. The proportion of variance explained by the eigenvector in the x- and y-axes are denoted beside the PCA1 and PCA2.

junglefowl (1, 2, 3, 4, 5, and koen) samples in Table S1.

<sup>5</sup>https://david.ncifcrf.gov/

<sup>6</sup>http://kobas.cbi.pku.edu.cn/

For the pairwise FST analysis, we calculated the genetic distances between the red junglefowl and each of the domestic chicken populations to evaluate the levels of autosomal genetic differentiation between domestic chicken and red junglefowl. The Ethiopian Jarso returns the highest FST value (0.148), followed by Ethiopian Horro (FST = 0.113), Saudi Arabian (FST = 0.095) and the Sri Lankan domestic chicken (FST = 0.062) populations.

# Mitochondrial Phylogenetic Relationships

The 33 individual mitochondrial genomes were used to construct an unrooted maximum likelihood tree using Phyml 3.0 (Guindon and Gascuel, 2003) (**Figure 3**). Sri Lankan domestic chicken are divided into two clusters. The first cluster belongs to the same lineage as the Ethiopian Horro and Saudi Arabian chicken. The second cluster includes the red junglefowl and Ethiopian Jarso chicken, with the Sri Lankan domestic chicken being closer to the former than the latter.

To assess the possible maternal origins of our indigenous village chicken mitochondrial DNA, we extracted the hypervariable region (spanning the first 397 bp) of the mitochondrial DNA D-loop region. We included in our analysis reference haplotypes representing six major chicken haplogroups sensu Mwacharo et al. (2011) (Table S2). Haplogroups A, B, C, and D were observed in our dataset (**Figure 4**). Within a single segregating site, all Ethiopian Horro, four Saudi Arabian and two Sri Lankan domestic chicken haplotypes are linked to haplogroup D. Four Sri Lankan haplotypes are separated by three mutations from the reference D haplotype. Other Sri Lankan domestic chicken haplotypes (n = 5) link to haplogroups B and C and a single Saudi haplotype was also close to haplogroup B. The Ethiopian Jarso chicken haplotypes were found closer to haplogroup A.
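The number of mutations separating two haplotypes, as read off a median-joining network, corresponds to the number of differing sites between the aligned sequences. A minimal sketch is given below; the function names are hypothetical, the input is assumed to be an already aligned D-loop sequence starting at its first base, and the short sequences are toy data rather than real haplotypes.

```python
def extract_hvr(d_loop_sequence, length=397):
    """Return the first `length` bases of an aligned D-loop sequence."""
    return d_loop_sequence[:length]


def count_differences(seq_a, seq_b):
    """Count the sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)


# Toy aligned fragments, for illustration only (not real haplotypes).
reference_fragment = "ACGTACGT"
sample_fragment = "ACGTTCGA"
print(count_differences(reference_fragment, sample_fragment))
```

Alignment gaps are counted as ordinary mismatches here, which is a simplification; network software may treat them differently.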

# Mean Genome Heterozygosity

We calculated the average level of within-population genome heterozygosity (Hp, 20 kb window size). The genome heterozygosity of the red junglefowl averages 0.32 ± 0.028 (n = 6). Among the domestic chicken populations, the Ethiopian chicken population shows the lowest level of genome heterozygosity (mean 0.31 ± 0.051, n = 11), followed by the Sri Lankan chicken population (0.32 ± 0.039, n = 11). The Saudi Arabian chicken population shows the highest level of genome heterozygosity (0.36 ± 0.048, n = 5) (**Table 1**).
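The window statistic of Equation 1 can be sketched as follows. The example assumes that each SNP in a window is summarized by a pair of allele counts, from which the major and minor counts are taken; the Methods describe these quantities as allele frequencies, so the use of counts is an assumption of this sketch. The function name is hypothetical, and the `min_snps` default mirrors the 50-SNP filter described in the Methods.

```python
def pooled_heterozygosity(allele_counts, min_snps=50):
    """Compute Hp for one window from per-SNP (count_a, count_b) pairs.

    Returns None when the window holds fewer than `min_snps` SNPs,
    mirroring the exclusion of sparsely covered windows.
    """
    if len(allele_counts) < min_snps:
        return None
    sum_major = sum(max(a, b) for a, b in allele_counts)
    sum_minor = sum(min(a, b) for a, b in allele_counts)
    total = sum_major + sum_minor
    if total == 0:
        raise ValueError("window contains no allele observations")
    return 2 * sum_major * sum_minor / total ** 2


# Toy window with three SNPs (the filter is relaxed for the demonstration).
toy_window = [(10, 0), (6, 4), (5, 5)]
print(pooled_heterozygosity(toy_window, min_snps=1))
```

Hp ranges from 0, when every SNP in the window is fixed for one allele, to 0.5, when major and minor alleles are equally frequent, so a window depleted of variation by a sweep yields a low value.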

# Selection Sweeps Detection in Red Junglefowl

A total of 434 out of 90,170 windows passed the genome-wide threshold of ≤−4, resulting in 190 candidate sweep regions (**Table 1**; Table S4). Genome-wide, a single ∼20 kb window located on chromosome 5 (Galgal 5.0 position 51895684–51909028 bp) had the lowest Z(Hp) score (−5.93) (**Figure 5**; Table S4). The region with the largest fragment size (∼210 kb, Galgal 5.0 position 2376153–2590429 bp, Z(Hp) score = −4.63 ± 0.653) is on chromosome 22. Two other candidate regions >100 kb in size are also present: a ∼110 kb region on chromosome 2 (Galgal 5.0 position 33529–143341 bp) and a ∼150 kb region on chromosome 22 (Galgal 5.0 position 578106–728044 bp). Ninety-one of the 190 candidate sweep regions have fragment sizes of 20 kb, 44 have sizes of 30 kb, 13 have sizes of 40 kb, 17 have sizes of 50 kb, and 25 have sizes of 60 kb and above. We did not identify any peaks passing the Z(Hp) ≤−4 threshold on chromosomes 14, 16, 20, 21, 24, 25, 27, and 28 (**Figure 5**).
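The step from significant windows to candidate regions can be illustrated as below: Hp values are Z-transformed as in Equation 2, and overlapping or adjacent windows with Z(Hp) ≤ −4 are merged into one region. The function names are hypothetical, the use of the population standard deviation is an assumption (the text does not say which estimator was used), and the Z scores in the example are invented.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Z-transform a list of window Hp values (Equation 2)."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption of this sketch
    return [(value - mu) / sigma for value in values]


def merge_candidate_windows(windows, z_scores, threshold=-4.0):
    """Merge overlapping or adjacent windows with Z <= threshold.

    windows is a list of (start, end) tuples sorted by start position
    on a single chromosome; z_scores holds one score per window.
    """
    regions = []
    for (start, end), z in zip(windows, z_scores):
        if z > threshold:
            continue
        if regions and start <= regions[-1][1] + 1:
            # Extend the previous region to cover this window.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Five 20 kb windows with a 10 kb step and invented Z(Hp) scores.
toy_windows = [(1, 20_000), (10_001, 30_000), (20_001, 40_000),
               (30_001, 50_000), (40_001, 60_000)]
toy_z = [-4.5, -4.2, -1.0, 0.3, -5.0]
for start, end in merge_candidate_windows(toy_windows, toy_z):
    print(f"{start}-{end}: {(end - start + 1) // 1_000} kb")
```

Because consecutive windows overlap by 10 kb, merged regions grow in 10 kb increments, which matches the 20, 30, 40 and 50 kb size classes reported for the candidate regions.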

# Selection Sweep Detection in the Domestic Chicken

Out of the 89,443 windows analyzed in Ethiopian domestic chicken, 247 windows passed the genome-wide threshold of ≤−4. They define 84 candidate sweep regions (**Table 1**; Table S5). The ∼50 kb candidate region on chromosome 5 (Galgal 5.0 position 40828747–40878736 bp) has the lowest Z(Hp) score (−5.8 ± 0.289) and spans the TSHR and GTF2A1 genes. Genome-wide, the largest candidate sweep region (∼210 kb in size, Galgal 5.0 position 424781–634785 bp; Z(Hp) score = −4.29 ± 0.055) is on chromosome 8 (**Figure 6**; Table S5). Three other candidate regions have fragment sizes >100 kb: two on chromosome 3 with sizes of ∼110 kb (Galgal 5.0 position 103157991–103267894 bp) and ∼150 kb (Galgal 5.0 position 103517529–103667817 bp), respectively, and the other on chromosome 8 (Galgal 5.0 position 164536–274537 bp) with a size of ∼110 kb (Table S5). The analysis of the fragment sizes of the sweep regions passing the genome-wide threshold of Z(Hp) ≤−4 reveals that 36 candidate regions are 20 kb in size, 13 are 30 kb, nine are 40 kb, ten are 50 kb, and 16 have sizes ≥60 kb. We did not identify any peaks on chromosomes 6, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, and 28 (**Figure 6**).

For the Saudi Arabian domestic chicken, we analyzed a total of 87,646 windows, of which 565 passed the genome-wide threshold, defining 212 candidate sweep regions (**Table 1**; Table S6). The peak with the lowest Z(Hp) score (−7.27 ± 0.087) is a ∼30 kb region on chromosome 8 (Galgal 5.0 position 204536–234537 bp). The largest sweep region (∼210 kb in size, Galgal 5.0 position 424781–634785 bp; Z(Hp) score = −4.78 ± 0.272) occurs on chromosome 8 at the same position as the largest candidate selected region in Ethiopian chicken (**Figure 7**; Table S6). Five other candidate selection sweep regions have sizes >100 kb. These include two regions on chromosome 2 (a ∼140 kb region at Galgal 5.0 position 75375947–75512081 bp, and a ∼110 kb region at Galgal 5.0 position 147224789–147334917 bp), one region on chromosome 4 (∼120 kb in size, Galgal 5.0 position 28881313–29001315 bp) and two regions on chromosome 8 (a ∼196 kb region at Galgal 5.0 position 8806310–9002909 bp, and a ∼113 kb region at Galgal 5.0 position 9108796–9221862 bp) (Table S6). Analysis of the fragment sizes of the selection sweep regions shows that 81 of the 212 candidate regions have a fragment size of 20 kb, 55 have a fragment size of 30 kb, 27 are 40 kb in size, 15 are 50 kb in size, and 35 are ≥60 kb in size. We did not identify any peaks below our threshold on chromosomes 12, 13, 16, 17, 19, 20, 21, 22, 24, 25, 26, 27, and 28 (**Figure 7**).

*<sup>a</sup>Total number of windows that passed the genome-wide threshold.*

In Sri Lankan domestic chicken, of the 89,701 windows analyzed, 299 passed the genome-wide threshold, resulting in 127 candidate sweep regions (**Table 1**; Table S7). As in Ethiopian chicken, the lowest genome-wide Z(Hp) score (−6.32 ± 1.634) occurs in the ∼50 kb region on chromosome 5 (Galgal 5.0 position 40828747–40878736 bp) (**Figure 8**; Table S7). The candidate region with the largest fragment size (∼290 kb; Z(Hp) score = −4.65 ± 0.454) is located on chromosome 2 (Galgal 5.0 position 82190953–82481139 bp). Two other candidate regions have fragment sizes >100 kb. They include a ∼130 kb region on chromosome 3 (Galgal 5.0 position 111008970–111138863 bp) and a ∼220 kb region on chromosome 5 (Galgal 5.0 position 22371859–22591888 bp). The analysis of the 127 candidate regions reveals that 63 are 20 kb in length, 30 are 30 kb, 15 are 40 kb, five are 50 kb, and 14 have sizes ≥60 kb. We did not identify any peak below our threshold on chromosomes 13, 14, 15, 16, 17, 19, 20, 21, 24, 25, and 28 (**Figure 8**).

FIGURE 7 | Manhattan plots for selection sweep analysis performed using the standardized pool heterozygosity Z(*H*p) approach. The horizontal line represents the arbitrary threshold for Z(*H*p) ≤−4. This figure shows the Saudi Arabian chicken population.

TABLE 2 | Candidate selection sweep regions shared between/among populations.




*<sup>a</sup>Significant regions in the three domestic chicken populations.*

*<sup>b</sup>Significant regions in domestic chicken population and red junglefowl.*

*<sup>x</sup>means the candidate region is found selected in the respective population.*

# Overlapping Sweep Regions Across Populations

At the genome level, only two sweep regions are common to all domestic chicken and the red junglefowl. They include a ∼20 kb candidate region on chromosome 7 (Galgal 5.0 position 8578942–8598945 bp) within an intergenic region and a ∼30 kb region on chromosome 23 (Galgal 5.0 position 5521861–5551860 bp) spanning three functional genes (HPCAL4, TRIT1 and MYCL) (**Table 2**). Haplotype tree analysis for the two regions illustrates the variation within the selected haplotypes (**Figure 9**; Figure S4). One hundred and thirty-two and 181 variable sites are present across domestic and red junglefowl samples in the 20 and 30 kb regions, respectively (**Table 3**). This corresponds to an average of 7 and 6 SNPs/kb, well below the genome average of 19 SNPs/kb for the combined domestic chicken and red junglefowl populations (Figure S2, **Table 3**).

Four candidate selected regions shared between the three domestic chicken populations are identified. One is located on chromosome 1 (∼20 kb: Galgal 5.0 position 190947207–190967194 bp), one on chromosome 2 (∼20 kb: Galgal 5.0 position 147254792–147274793 bp) and two on chromosome 5 (∼50 kb: Galgal 5.0 position 40828747–40878736 bp and ∼40 kb: Galgal 5.0 position 41868268–41908264 bp) (**Table 3**). We identified two genes (TSHR and GTF2A1) within the 50 kb region of chromosome 5, while the 20 kb region on chromosome 2 includes an exon of the transcript ENSGALT00000026040. The two other candidate regions are found within intergenic/intronic regions. **Figure 10** and Figures S5–S7 illustrate the haplotype variation. Between 179 and 217 variable sites were identified across these regions, or an average of 4 to 11 SNPs/kb (**Table 3**), lower than the genome average of 16 SNPs/kb calculated for the combined domestic chicken population genomes (Figure S2).

Among the domestic chicken populations, 18 candidate sweep regions, out of a total of 70, are shared between Ethiopian and Saudi Arabian domestic chicken (**Table 2**). Four of the regions span annotated genes: HMGN3 (chromosome 3), LCORL (chromosome 4), C14orf37 (chromosome 5) and GK5 (chromosome 9). Two of the six candidate regions that are shared between the Ethiopian and Sri Lankan domestic chickens overlap with genes, including TACR3 (chromosome 4) and PLOD5 (chromosome 9). The genes present in the 13 candidate sweep regions shared between Saudi Arabian and Sri Lankan domestic chickens include 5S\_rRNA and KCNQ3 (chromosome 2), RIMS1 (chromosome 3), BAZ2B, Mar-07 and NAA20 (chromosome 7) (**Table 2**).

# Functional Annotations for the Enriched Genes Within the Sweep Regions

To identify the functions of candidate genes that may have played significant roles in adaptation to production environments and the domestication process, we performed enrichment analysis for all genes identified within the candidate sweep regions. Only classes of genes with the default Fisher's exact P < 0.05 were considered over-represented in the GO and KEGG pathway analyses. The GO results for all populations are found in Table S8 and those of the KEGG pathway analysis in Table S9.

# DISCUSSION

The autosomal genetic background and adaptation to local production environments of three populations of indigenous domestic village chicken were analyzed alongside the wild progenitor, the red junglefowl, using whole-genome re-sequencing data. Our objectives were to identify candidate positively selected regions (i) shared between wild red junglefowl and domestic chicken, (ii) shared among domestic chicken only and (iii) specific to individual domestic chicken and red junglefowl population.

### TABLE 3 | Number of variable sites (SNPs) within the selected regions.

\**Average SNPs/kb in the selected region calculated as total number of SNPs in the selected region divided by the length (kb) of the region.*

# Common Genome Regions Selected in Both Domestic and Red Junglefowl

Common regions under selection are expected between a domesticate and its wild ancestor considering their shared evolutionary history. They may correspond, for example, to species-specific signatures of selection underlying shared morphological and behavioral phenotypes. This may be particularly true for indigenous village chicken, where human selection pressures have been lower than in commercial and fancy chicken breeds.

We identified two candidate sweep regions that are shared between all domestic chicken and the red junglefowl. While we could not identify any functional genes within the region on chromosome 7, suggesting a possibly important regulatory role for the region, the one on chromosome 23 spanned three candidate genes (HPCAL4, TRIT1, MYCL). HPCAL4 is known to play a role in the development of the central nervous system (Kobayashi et al., 1998). While the biological functions of MYCL are still being studied (Brägelmann et al., 2017), both TRIT1 and MYCL have been linked to the maintenance of tumors (Smaldino et al., 2015; Brägelmann et al., 2017). All three genes may be of importance in both the domestic chicken and its wild ancestor: HPCAL4 in relation to behavioral characteristics, and TRIT1 and MYCL in relation to adaptation to retrovirus infection, in particular to viruses causing tumors (e.g., leukosis and Marek virus) that commonly affect chicken (Cheng et al., 2010; Wragg et al., 2015).

# Domestic Chicken Specific Signature of Selection

Candidate signatures of positive selection specific to domestic chicken may originate from the domestication process itself or, after domestication, from geographic dispersion and local responses to human and natural selection pressures. The distinction between the two is difficult, although it may be approached using ancient DNA studies (Flink et al., 2014; Loog et al., 2017). We can also expect that genome regions selected at an early stage of the domestication process, prior to the geographic dispersion of the domesticate, will be present in most if not all populations. Compared to fancy chicken breeds and commercial chicken lines, which are characterized by smaller effective population sizes and are heavily selected by humans, the indigenous domestic village chicken, with large effective population sizes, uncontrolled mating and relaxed artificial selection, may represent a better model for the identification of such regions.

We identified four candidate genome regions under positive selection in all the domestic chicken populations but not in the red junglefowl (see **Table 2**). Excluding one region on chromosome 2, these regions have all been previously identified in commercial broilers and layers (Rubin et al., 2010), adding support to their being regions selected early in domestication. For the region on chromosome 2, Johnsson et al. (2016) also reported a selected candidate region on this chromosome (Galgal 5.0 position 147194251–147234789 bp), which lies 20 kb away from ours (Galgal 5.0 position 147254792–147274793 bp). This region is only found in domestic chicken and not in feral birds, and it may therefore be of relevance to the domestication process.
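The distance between the two chromosome 2 regions follows directly from their coordinates, as the short sketch below shows. The helper function is hypothetical; the coordinates are the Galgal 5.0 positions quoted in the text.

```python
def interval_gap(region_a, region_b):
    """Return the distance in bp between two (start, end) intervals.

    The result is 0 when the intervals overlap or touch.
    """
    (start_a, end_a), (start_b, end_b) = sorted([region_a, region_b])
    return max(0, start_b - end_a)


johnsson_region = (147194251, 147234789)  # Johnsson et al. (2016)
shared_region = (147254792, 147274793)    # region shared by the three populations
saudi_region = (147224789, 147334917)     # ~110 kb Saudi Arabian region

print(interval_gap(johnsson_region, shared_region))  # → 20003
print(interval_gap(saudi_region, shared_region))     # → 0
```

The same test, applied pairwise across the population-specific region lists, is one simple way to flag overlapping candidate regions.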

Of the remaining three regions, the 50 kb selected region on chromosome 5 includes two genes: the TSHR locus, involved in metabolic regulation and the reproduction process (Yoshimura et al., 2003; Hanon et al., 2008; Rubin et al., 2010), and GTF2A1, a candidate biomarker for detecting ovarian tumors (Huang et al., 2009). Hanon et al. (2008) report that TSH-expressing cells of the pars tuberalis are linked to seasonal reproductive control in vertebrates and therefore to the onset of egg laying (Loog et al., 2017). We now know from the studies of Flink et al. (2014) and Loog et al. (2017) that selection at the TSHR locus in European chicken likely followed the selection for higher egg production characteristics. Our study indicates that similar selection pressures may have acted on Ethiopian, Saudi Arabian, and Sri Lankan domestic chicken. Analysis of chicken populations from other parts of the world, e.g., East and South Asia, is required.

# Signatures of Selection in Relation to the Production Environments

Response to selection is environmentally driven, either naturally or artificially (Oleksyk et al., 2010). The ancestral species of domestic chicken, the red junglefowl, has a very large geographic range (Delacour, 1977). While different wild red junglefowl subspecies and domestic chicken populations may face different environmental challenges (e.g., altitudes), all live in regions characterized by a rather warm climate and substantial rainfall, which may however show considerable annual variation (e.g., monsoon cycles) or daily variation (e.g., temperature difference during the day). Accordingly, signatures of selection related to thermotolerance, including temperature and humidity, may be expected in domestic chicken and the red junglefowl.

In Ethiopian chicken, we identified two candidate genes, HRH1 and AGTR1, associated with "vasoconstriction regulation." Vasoconstriction has been linked to a reduction in peripheral blood flow leading to an increase in internal body temperature (Sessler et al., 1990). These genes may play important roles in thermoregulation (Collier and Collier, 2011; Su et al., 2011). The reduction in evaporative heat loss and stress through decreased cutaneous blood flow has been reported previously in cattle and birds (Collier and Collier, 2011; Klotz et al., 2016). Compared to the average chicken body temperature of 41◦C (Bolzani et al., 1979), the ambient temperatures of the Horro and Jarso districts are relatively low (19 to 21◦C), and the two selected candidate genes may have played important roles in adaptation to their local environments. In contrast, Saudi Arabia is very dry, with extreme daytime heat that can rise above 50◦C in July/August. Here, we identified several GO terms such as "blood circulation," "regulation of heart contraction," "regulation of muscle system process," "regulation of muscle adaptation," and "regulation of cardiac muscle contraction" that may be linked to the control of blood flow and evaporative cooling (Collier and Collier, 2011). Other studies have associated some of these GO terms with the response to oxygen deprivation in high-altitude adaptation (Li et al., 2013; Wang et al., 2015). However, this explanation is unlikely in our case because the Saudi Arabian chicken were sampled at an altitude of about 100 m asl. Considering the climatic conditions of the sampling area, we favor here the link to heat loss in response to extreme heat. The significantly enriched GO terms, cellular response to hydrogen peroxide and toll-like receptor signaling pathways, observed in Saudi Arabian chickens may also suggest strong selection in response to disease challenges (Medzhitov, 2001; Stone and Yang, 2006).

In the genomes of Saudi Arabian and Sri Lankan domestic chicken, alongside the red junglefowl, we uncovered the KCNMA1 gene, which may be linked to the response to hypoxia. The region harboring this gene was not significant in the Ethiopian chicken. KCNMA1 is associated with the regulation of smooth muscle contraction through the activation of calcium ions (Williams et al., 2004). An increase in calcium ions stimulates hypoxia-inducible factor-1 (Hui et al., 2006). However, the biological roles played by this gene in red junglefowl and in Saudi Arabian or Sri Lankan domestic chicken may be different. In the two domestic chicken populations, it may be related to heat tolerance and stress control, considering the low elevations of the sampling sites; in the red junglefowl, it may rather play a role in adaptation to high altitudes. Both the Yunnan (altitude ∼3,000 m asl) and Hainan (altitude ∼1,840 m asl) provinces, where the two red junglefowls were sampled, are mountainous. High elevations are associated with a decrease in arterial oxygen content (Simonson et al., 2010). Another gene detected in our red junglefowl, ADAM9, which plays a role in the development of the cardiorespiratory system, has also been proposed to be involved in adaptation to high altitude in Tibetan chicken (Zhang et al., 2016).

KCNMA1 and ADAM9 were not detected in the candidate regions in Ethiopian chicken. These chickens live at an altitude of around 2,000 m asl. Perhaps neither the climate nor the altitude where the Horro and Jarso populations live results in strong selection pressures on their genomes. Analysis of Ethiopian chicken living at much higher altitudes may provide further insights into the possible roles of KCNMA1 and ADAM9 in altitude adaptation in African domestic chicken.

In addition, one of the genes previously reported to be under selection in commercial chicken (Rubin et al., 2010; Johnsson et al., 2016), NT5C1A, was also identified in the red junglefowl and Sri Lankan indigenous domestic chicken studied here. Importantly, this gene is known to be involved in regulating the levels of heart adenosine during hypoxia and ischemia, especially when blood supply becomes inadequate in some parts of the body (Hunsucker et al., 2001). The detection of hypoxia adaptation in both the red junglefowl and domestic chicken may or may not be related to environmental conditions. However, it is well documented that activities relating to extreme exercise may induce hypoxia (Springer et al., 1991; Lindholm and Rundqvist, 2016). Wild and domestic cocks are most often aggressive in nature, with the latter having a long history of being selected for cock fighting (Delacour, 1977). We could then argue that the aggressiveness already present in the wild relative, due in part to predator evasion and sexual selection behaviors, which can be seen as extreme exercise, may have undergone positive selection in most domestic chicken populations.

# CONCLUSIONS

Examining signatures of selection in both domestic chicken and red junglefowl, our study reveals that only two candidate positively selected regions are common to both, while four regions are shared across the domestic populations only. With the proviso of the relatively low number of red junglefowl examined and the lack of consensus on the geographic origin of the domestication centers of the species, our results illustrate the major impact of human selection activities on the species, and the consequences for the genome landscape of adaptations to new environments. The study exemplifies how quickly a domestic species may evolve when under selection pressures in new environments.

# AUTHOR CONTRIBUTIONS

RL and OH conceived and designed the project. PS contributed the DNA and provided knowledge on the Sri Lankan chicken. RA-A, RA, and JM provided the Saudi Arabian chicken samples and their genome sequences. RA-A provided knowledge on the Saudi Arabian chicken and the sampling area. RL performed the analyses and OH supervised the project and contributed substantial knowledge on the interpretation of the results. RL prepared and wrote the manuscript. JM and OH revised the manuscript. All the authors read and approved the final manuscript.

# ACKNOWLEDGMENTS

This study was conducted during Raman Akinyanju Lawal's Ph.D. study, which was supported by the University of Nottingham Vice Chancellor's Scholarship (International) award. The Saudi Arabian sampling and genome sequencing were supported through grant 12-AGR2555-02 from the National Plan for Science, Technology and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, Kingdom of Saudi Arabia. Sampling of the Ethiopian chicken was supported by a grant jointly sponsored by the Biotechnology and Biological Sciences Research Council (BBSRC), the UK Department for International Development (DFID), and the Scottish Government (CIDLID program, BB/H009396/1, BB/H009159/1 and BB/H009051/1). Publication cost was met by CGIAR - Livestock CRP. We also thank Addie Vereijken for providing the Red Junglefowl\_Koen sample.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00264/full#supplementary-material

# REFERENCES




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Lawal, Al-Atiyat, Aljumaah, Silva, Mwacharo and Hanotte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Incorporating Prior Knowledge of Principal Components in Genomic Prediction

Sayed M. Hosseini-Vardanjani <sup>1</sup> , Mohammad M. Shariati <sup>1</sup> \*, Hossein Moradi Shahrebabak <sup>2</sup> and Mojtaba Tahmoorespur <sup>1</sup>

<sup>1</sup> Department of Animal Science, Ferdowsi University of Mashhad, Mashhad, Iran, <sup>2</sup> Department of Animal Science, University College of Agriculture and Natural Resources, University of Tehran, Tehran, Iran

Genomic prediction using a large number of markers is challenging due to the curse of dimensionality as well as multicollinearity arising from linkage disequilibrium between markers. Several methods have been proposed to solve these problems, such as Principal Component Analysis (PCA), which is commonly used to reduce the dimension of predictor variables by generating orthogonal variables. Usually, the knowledge from PCA is incorporated in genomic prediction by assuming equal variance for the PCs or a variance proportional to the eigenvalues; both approaches treat the variances as fixed. Here, three prior distributions, namely normal, scaled-t, and double exponential, were assumed for PC effects in a Bayesian framework with a subset of PCs. These developed PCR models (dPCRm) were compared to routine genomic prediction models (RGPM), i.e., ridge and Bayesian ridge regression, BayesA, BayesB, and PC regression with a subset of PCs but with PC variances predefined as proportional to the eigenvalues (PCR-Eigen). The performance of the methods was compared by simulating a single trait with heritability of 0.25 on a genome consisting of 3,000 SNPs on three chromosomes, with QTL numbers of 15, 60, and 105. After 500 generations of random mating as the historical population, a population was isolated and mated for another 15 generations. Generations 8 and 9 of the recent population were used as the reference population and the next six generations as validation populations. The accuracy and bias of predictions were evaluated within the reference population and in each of the validation populations. The accuracies of dPCRm were similar to those of RGPM (0.536 to 0.664 vs. 0.542 to 0.671) and higher than the accuracies of PCR-Eigen (0.504 to 0.641) within the reference population over different QTL numbers. Accuracies in the validation populations declined from 0.633 to 0.310, 0.639 to 0.313, and 0.617 to 0.298 using dPCRm, RGPM, and PCR-Eigen, respectively. Prediction biases of dPCRm and RGPM were similar and always much smaller than the biases of PCR-Eigen. In conclusion, assuming PC variances as random variables via prior specification yielded higher accuracy than PCR-Eigen and the same accuracy as RGPM, while fewer predictors were used.

Keywords: genomic selection, statistical models, variable selection, principal component analysis, accuracy

### Edited by:

Johann Sölkner, Universität für Bodenkultur Wien, Austria

### Reviewed by:

Ed Smith, Virginia Tech, United States

Fabyano Fonseca Silva, Universidade Federal de Viçosa, Brazil

\*Correspondence: Mohammad M. Shariati mm.shariati@um.ac.ir

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 15 October 2017 Accepted: 11 July 2018 Published: 02 August 2018

### Citation:

Hosseini-Vardanjani SM, Shariati MM, Moradi Shahrebabak H and Tahmoorespur M (2018) Incorporating Prior Knowledge of Principal Components in Genomic Prediction. Front. Genet. 9:289. doi: 10.3389/fgene.2018.00289

# INTRODUCTION

Advances in high-throughput genotyping technology allow the collection and storage of thousands to millions of SNP markers from many livestock species (Van Tassell et al., 2008; Matukumalli et al., 2009). These genotyped markers are a rich source of information, which can greatly enhance the performance of the selection process for the genetic improvement of livestock. The information embedded in genotyped markers can be efficiently extracted by accurate models that describe and predict the genetic merit of animals.

In genomic selection, a relatively small number of phenotypes or pseudo-phenotypes is regressed simultaneously on a large number of marker variables (Meuwissen et al., 2001). Regressing phenotypes on many marker variables raises several statistical and computational issues, such as how to confront the so-called "curse of dimensionality" as well as the complexity of a genetic mechanism that can involve various types and orders of interactions (Pérez and de Los Campos, 2014). Such data imbalance between markers and phenotypes is expected to remain the main constraint on the implementation of genomic selection, especially for breeds other than Holstein (Pintus et al., 2012). Besides the "curse of dimensionality," another challenging problem is multicollinearity arising from the inter-correlation of marker genotypes due to linkage disequilibrium (Long et al., 2011). These statistical challenges have been considered before, and several methods, such as partial least squares regression (Wold, 1985) and principal component analysis (Pearson, 1901; Hotelling, 1933), have been proposed to reduce the dimensionality of a data set.

Principal Component Analysis (PCA) belongs to the general framework of multivariate analysis and is one of the classical data analysis tools for dimension reduction (Jolliffe, 2002). In PCA, we seek to reduce the dimensionality of an m-dimensional data vector to a smaller p-dimensional vector, where p ≪ m, which represents an embedding of the data in a lower dimensional space. This technique is widely used in genome-wide association studies to reduce the number of correlated traits (Bolormaa et al., 2010), to trace the respective contributions of population structure and LD between single nucleotide polymorphisms (SNP) and quantitative trait loci (QTL) to the accuracy of genomic predictions (Price et al., 2006; Daetwyler et al., 2012), and for genomic prediction (Solberg et al., 2009a; Pintus et al., 2012). Macciotta et al. (2010) applied a PCA approach to PC-BLUP genomic prediction using eigenvalues as prior PC variances and concluded that the results were better than those obtained under the earlier assumption of equal variance for PC effects in Solberg et al. (2009a), since the assumption of one single variance for all PC effects could be unrealistic.

In practice, however, when some principal components are excluded from the analysis by a selection criterion, the sum of the eigenvalues of the remaining principal components is no longer equal to one. The estimated variance will therefore be smaller than the original variance, which makes scaling inevitable. In addition, when some variables are excluded from the analysis, the ranking of the remaining variables is not necessarily the same as before. Exploiting this information may enhance the accuracy of predictions in a statistical analysis. Unfortunately, neither the assumption of equal variance nor the assumption of eigenvalues as prior variances for the predictors can accommodate such information, as both techniques consider the variance(s) of predictors as fixed quantities.

External information can be incorporated into the regression on principal components through a Bayesian analysis, in which all parameters are considered as random effects with a probability density function that describes their contributions. Bayesian methods are common in genomic prediction with markers; however, genomic prediction models with PCs using realistic prior specifications for PC effects have not yet been investigated. The aim of this study was therefore to investigate the performance of a new Bayesian technique for genomic prediction with principal components, which seeks to improve the accuracy of predictions by incorporating prior knowledge on PC effects and their variances.

# MATERIALS AND METHODS

# Simulation Genome and Population

Data were simulated using the QMSim software package (Sargolzaei and Schenkel, 2009) in 10 replicates for each scenario as follows. A single trait with a phenotypic variance of one and a heritability of 0.25 was simulated. The genome consisted of three chromosomes, each one Morgan long. In total, 3,000 bi-allelic marker loci (single nucleotide polymorphisms; SNP) and 105, 60, or 15 multi-allelic QTL were simulated on the genome. Marker and QTL positions were randomly selected across the genome. The mutation rate was set to 1 × 10<sup>−3</sup> for markers and 1 × 10<sup>−5</sup> for QTL. All genetic variance was due to additive QTL effects, which were randomly sampled from a gamma distribution with shape parameter 0.4. Phenotypes were generated for both sexes by adding random residuals drawn from N(0, **I**σ<sup>2</sup><sub>e</sub>) to the sum of QTL effects; therefore, no sex difference was simulated.
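The relationship between heritability, genetic variance, and residual variance described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the QMSim procedure: the function names, the gamma scale of 1.0, the bi-allelic toy QTL genotypes (the study used multi-allelic QTL), and the rescaling of breeding values to the target genetic variance are all hypothetical choices made for the example.

```python
import random
import statistics


def residual_variance(h2, var_p=1.0):
    """Residual variance implied by heritability h2 and phenotypic variance var_p."""
    return (1.0 - h2) * var_p


def sample_qtl_effects(n_qtl, rng, shape=0.4):
    """Additive QTL effects from a gamma distribution (scale 1.0 is an assumption)."""
    return [rng.gammavariate(shape, 1.0) for _ in range(n_qtl)]


def simulate_phenotypes(tbv, h2, rng, var_p=1.0):
    """Rescale true breeding values to variance h2 * var_p and add normal residuals."""
    mean_g = statistics.fmean(tbv)
    sd_g = statistics.pstdev(tbv)
    target_sd = (h2 * var_p) ** 0.5
    scaled = [(g - mean_g) / sd_g * target_sd for g in tbv]
    sd_e = residual_variance(h2, var_p) ** 0.5
    phenotypes = [g + rng.gauss(0.0, sd_e) for g in scaled]
    return scaled, phenotypes


rng = random.Random(2018)
effects = sample_qtl_effects(15, rng)
# Toy QTL genotypes: allele counts 0/1/2 at each of 15 QTL for 200 animals
# (a simplification; the simulated QTL in the study were multi-allelic).
genotypes = [[rng.choice((0, 1, 2)) for _ in effects] for _ in range(200)]
tbv = [sum(a * g for a, g in zip(effects, row)) for row in genotypes]
scaled_tbv, y = simulate_phenotypes(tbv, h2=0.25, rng=rng)
```

With a phenotypic variance of one and a heritability of 0.25, the genetic variance is 0.25 and the residual variance is 0.75, which is what `residual_variance(0.25)` returns.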

In order to achieve mutation-drift balance, the historical population was started with 400 females and 20 males and continued as follows. During 100 generations of random mating, the population size increased to 1,000 animals. The population was then randomly mated at the same size for 400 more generations. The number of male animals in the last generation was increased to 70. From generation 500, 35 males and 455 females were randomly selected as generation zero and were mated for 15 generations. The mating design in the last 15 generations was also random, but to mimic a situation with selection, males and females were selected from the best animals of the previous generation, i.e., those with high breeding values. Generations 8 and 9 were used as training animals and generations 10 to 15 as selection candidates.

# Statistical Computation

In this research, two groups of models were studied: SNP based and PC based models, which used SNPs and PC scores as independent variables, respectively.

The general model for the record of individual i, y<sub>i</sub>, with observed genotype at marker j denoted **Z**<sub>ij</sub>, in the first group of models was:

$$y_i = \mu + \text{sex}_k + \sum_{j=1}^{m} \mathbf{Z}_{ij} b_j + \mathbf{e}_i,\tag{1}$$

where µ is the overall mean, sex<sub>k</sub> is the effect of the kth sex, b<sub>j</sub> is the effect of marker j, m is the number of markers, and e<sub>i</sub> is the residual. In matrix notation the model is written as:

$$\mathbf{y} = \mathbf{X}\mathbf{s} + \mathbf{Z}\mathbf{b} + \mathbf{e},\tag{2}$$

where **y** is a column vector of records of length n, **s** is a vector of fixed effects, **X** is the incidence matrix that relates observations to fixed effects, and **Z** is an n × m matrix whose elements **Z**<sub>ij</sub> represent the marker genotypes coded as −1, 0, and 1. A SNP genotype was removed if the SNP minor allele frequency (MAF) was less than 0.01 and if it deviated greatly from Hardy–Weinberg equilibrium (P < 1 × 10<sup>−5</sup>).
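The quality control step can be illustrated with a minimal sketch. The text does not state which Hardy–Weinberg test was used, so the one-degree-of-freedom chi-square goodness-of-fit test below is an assumption, as are the function names and the reading that a SNP failing either threshold is discarded.

```python
import math


def maf(genotypes):
    """Minor allele frequency for genotypes coded -1, 0, 1."""
    n = len(genotypes)
    p = sum(g + 1 for g in genotypes) / (2 * n)  # frequency of the counted allele
    return min(p, 1.0 - p)


def hwe_pvalue(genotypes):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium (assumed test)."""
    n = len(genotypes)
    p = sum(g + 1 for g in genotypes) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: no deviation can be tested
    observed = [genotypes.count(-1), genotypes.count(0), genotypes.count(1)]
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def keep_snp(genotypes, maf_min=0.01, hwe_min=1e-5):
    """Keep a SNP only if it passes both thresholds (either failure removes it)."""
    return maf(genotypes) >= maf_min and hwe_pvalue(genotypes) >= hwe_min
```

For example, a SNP with genotype counts 25, 50, and 25 is in exact equilibrium and is kept, whereas a SNP for which every animal is heterozygous has a very small P-value and is removed.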

The alternative methods in the first group included Bayesian ridge regression (Bayes-Ridge), BayesA, and BayesB, which differ in the prior used for **b** and are well known and commonly used in genomic selection (Meuwissen et al., 2001; Habier et al., 2007, 2011). In Bayes-Ridge, the column vector of SNP effects is assumed to follow the normal distribution **b** ∼ N(0, **I**σ<sup>2</sup><sub>b</sub>), where σ<sup>2</sup><sub>b</sub> is the variance of SNP effects, which is assigned a scaled inverse chi-square prior with scale parameter S<sup>2</sup><sub>b</sub> and ν<sub>b</sub> degrees of freedom as hyper-parameters. In BayesA, the marginal distribution of marker effects is a scaled-t density. However, it was shown that this is equivalent to assuming that the marker effect at locus j has a univariate normal distribution with a null mean and unknown locus-specific variance σ<sup>2</sup><sub>bj</sub> (Gianola et al., 2009). In BayesB, marker effects are assigned IID priors that are mixtures of a point mass at zero and a slab that is a scaled-t density. The slab is structured as in BayesA, and an additional parameter π represents the prior proportion of zero effects. Because the shrinkage of SNP effects is affected by π, it should be treated as an unknown to be inferred from the data (Habier et al., 2011); it was therefore assigned a Beta prior with the default hyperparameters set by BGLR (Pérez and de Los Campos, 2014). In all the Bayesian models, a flat prior (Sorensen and Gianola, 2002) was used for fixed effects and, conditional on the residual variance σ<sup>2</sup><sub>e</sub>, a normal distribution with null mean and covariance matrix **I**σ<sup>2</sup><sub>e</sub> was used for the vector of residuals. Further, σ<sup>2</sup><sub>e</sub> was treated as an unknown with a scaled inverse chi-square prior. Variance hyper-parameters, i.e., scale and degrees of freedom, were set to the BGLR defaults such that a proper but weakly informative prior distribution is postulated (Pérez and de Los Campos, 2014). Variance components with weakly informative priors will be less dependent on the prior setting, and their posterior distribution will be dominated by the data (Sorensen and Gianola, 2002). The fourth model in the first group was ridge regression BLUP (Ridge-R), which used σ<sup>2</sup><sub>a</sub>/m as the variance of SNP effects. The mixed model equations of Ridge-R were solved in a non-Bayesian manner by Cholesky decomposition in R.
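To make the Ridge-R step concrete, the following is a minimal sketch of solving ridge regression equations by Cholesky decomposition. It is written in plain Python rather than R, and it is not the authors' code: the function names are hypothetical, the variances are taken as known, and the fixed effects are replaced by simply centering the records, which is a simplifying assumption.

```python
import math


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def cholesky_solve(a, rhs):
    """Solve a x = rhs by forward and back substitution on the Cholesky factor."""
    low = cholesky(a)
    n = len(rhs)
    z = [0.0] * n
    for i in range(n):  # forward substitution: L z = rhs
        z[i] = (rhs[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = z
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def ridge_blup(Z, y, var_a, var_e):
    """SNP effects from (Z'Z + lambda I) b = Z'y with lambda = var_e / (var_a / m)."""
    n, m = len(Z), len(Z[0])
    lam = var_e / (var_a / m)
    mean_y = sum(y) / n
    yc = [v - mean_y for v in y]  # centering stands in for the fixed effects
    lhs = [[sum(Z[i][r] * Z[i][c] for i in range(n)) + (lam if r == c else 0.0)
            for c in range(m)] for r in range(m)]
    rhs = [sum(Z[i][r] * yc[i] for i in range(n)) for r in range(m)]
    return cholesky_solve(lhs, rhs)
```

The ratio `lam` is the same for every SNP, which is what distinguishes Ridge-R from the Bayesian models above, where the variances are inferred from the data.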

The second group of models, which used PC scores as predictor variables, was fitted as follows. PCA was implemented on the correlation matrix of marker genotypes (**W**<sub>m×m</sub>) as below (Janss et al., 2012):

$$\mathbf{W} = \mathbf{U}\mathbf{D}\mathbf{U}^{\mathrm{T}} = \sum_{j=1}^{m} \lambda_{j}\mathbf{U}_{j}\mathbf{U}_{j}^{\mathrm{T}},\tag{3}$$

where **U** = [**U**<sub>1</sub>, **U**<sub>2</sub>, . . . , **U**<sub>m</sub>] of order m × m is the matrix of eigenvectors of **W**, with **U**<sub>j</sub> representing the jth column, and **D** is a diagonal matrix with elements equal to the eigenvalues λ<sub>1</sub>, λ<sub>2</sub>, . . . , λ<sub>m</sub> associated with the m eigenvectors. The eigenvalues and eigenvectors satisfy λ<sub>1</sub> > λ<sub>2</sub> > . . . > λ<sub>m</sub> and **UU**<sup>T</sup> = **U**<sup>T</sup>**U** = **I**, respectively. The choice of the number of PCs to be retained is arbitrary and several methods have been proposed (Jolliffe, 2002). In this study, we retained the first k components, such that the cumulative proportion of variance explained reached 0.999, and PC scores were then calculated for the animals as:

$$\mathbf{Z}_{pc} = \mathbf{Z}_{x \times m} \mathbf{U}_{m \times k},\tag{4}$$

where x denotes the number of individuals in the training population or in each of the selection candidate sets. The matrix **Z**<sub>pc</sub> replaced **Z** as the incidence matrix in the PC based models as follows:

$$\mathbf{y} = \mathbf{X}\mathbf{s} + \mathbf{Z}_{\mathrm{pc}}\mathbf{b}_{\mathrm{pc}} + \mathbf{e},\tag{5}$$
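The retention rule and the score calculation in Equation (4) can be sketched as follows. The sketch assumes that the eigenvalues and eigenvectors of the correlation matrix have already been obtained from a linear algebra routine, with eigenvector columns ordered by decreasing eigenvalue; the function names are hypothetical.

```python
def n_components(eigenvalues, threshold=0.999):
    """Smallest k whose leading eigenvalues explain at least `threshold` of the variance."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, lam in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += lam
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


def pc_scores(Z, U, k):
    """Equation (4): multiply the x-by-m genotype matrix by the first k eigenvector columns."""
    m = len(U)
    return [[sum(row[j] * U[j][c] for j in range(m)) for c in range(k)] for row in Z]


# Two markers with correlation 0.9: the eigenvalues are 1.9 and 0.1, and the
# eigenvectors are (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
s = 2 ** -0.5
U = [[s, s], [s, -s]]
k = n_components([1.9, 0.1], threshold=0.9)
scores = pc_scores([[1, 1], [1, -1]], U, k)
```

In this toy case one component already explains 95% of the variance, so a 0.9 threshold retains a single PC, whereas the 0.999 threshold used in the study would retain both.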

TABLE 1 | Average number of SNPs and PCs after quality control, over 10 replicates.

The alternative PC based methods described hereinafter differ only in the prior used for the vector of PC effects, **b**<sub>pc</sub>, and their variances. Principal component regression with eigenvalues as prior variances of the predictors (PCR-Eigen) assumes that the contribution of each PC score is proportional to its eigenvalue; therefore, the variance of each PC effect was calculated as σ<sup>2</sup><sub>pcj</sub> = σ<sup>2</sup><sub>a</sub>λ<sub>j</sub>, where σ<sup>2</sup><sub>a</sub> is the additive genetic variance (Macciotta et al., 2010). Its BLUP mixed model equations were constructed and solved in R using Cholesky decomposition. In Bayesian principal component regression with a normal distribution (PCR-Normal), regression coefficients are assigned IID normal distributions with mean zero and variance σ<sup>2</sup><sub>pc</sub>, and the variance parameter is assigned a scaled inverse chi-square density with parameters df<sub>pc</sub> and S<sub>pc</sub>. Bayesian principal component regression with a t-density (PCR-t) assumed a scaled-t density as the marginal distribution of predictor effects, with parameters df<sub>pc</sub> and S<sub>pc</sub>. However, as discussed in Gianola et al. (2009), this density is implemented as a univariate normal with null mean and unknown predictor-specific variance σ<sup>2</sup><sub>pcj</sub>, and each variance parameter is assigned an IID scaled inverse chi-square density with parameters df<sub>pc</sub> and S<sub>pc</sub>. A double exponential distribution was assumed as the marginal distribution of PC effects in Bayesian principal component regression with a LASSO density (PCR-Lasso). The double exponential prior can be represented as an infinite mixture of scaled normal distributions (Park and Casella, 2008). First, predictor effects are assigned independent normal densities with null mean and predictor-specific variance τ<sup>2</sup><sub>pcj</sub> × σ<sup>2</sup><sub>ε</sub>, where σ<sup>2</sup><sub>ε</sub> is the residual variance. Second, the τ<sup>2</sup><sub>pcj</sub> are assigned IID exponential densities with rate parameter γ<sup>2</sup>/2. Finally, γ<sup>2</sup> is assigned a Gamma prior. A Gibbs sampling algorithm was used to estimate PC effects and their variances simultaneously.
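As an illustration of how PC effects and their variance can be sampled jointly, the following is a minimal single-site Gibbs sampler for the PCR-Normal model. It is a sketch under stated assumptions and not the implementation used in the study: the function name, the hyper-parameter values, the chain length, and the use of an intercept in place of the sex effect are all hypothetical choices made for the example.

```python
import random


def gibbs_pcr_normal(Zpc, y, n_iter=2000, burn_in=500,
                     df_pc=5.0, S_pc=0.1, df_e=5.0, S_e=0.5, seed=1):
    """Posterior means of PC effects under b_j ~ N(0, var_b) with unknown var_b."""
    rng = random.Random(seed)
    n, k = len(Zpc), len(Zpc[0])
    cols = [[Zpc[i][j] for i in range(n)] for j in range(k)]
    ztz = [sum(v * v for v in col) for col in cols]
    mu = sum(y) / n
    b = [0.0] * k
    e = [yi - mu for yi in y]          # residuals given the current mu and b
    var_b, var_e = S_pc, S_e           # starting values (assumed)
    post_b = [0.0] * k
    kept = 0
    for it in range(n_iter):
        # Intercept with a flat prior: N(mean of y - Zb, var_e / n).
        new_mu = rng.gauss(mu + sum(e) / n, (var_e / n) ** 0.5)
        e = [ei - (new_mu - mu) for ei in e]
        mu = new_mu
        # Each PC effect in turn, conditional on everything else.
        for j in range(k):
            col = cols[j]
            rhs = sum(c * ei for c, ei in zip(col, e)) + ztz[j] * b[j]
            c_j = ztz[j] + var_e / var_b
            new_b = rng.gauss(rhs / c_j, (var_e / c_j) ** 0.5)
            delta = new_b - b[j]
            e = [ei - c * delta for c, ei in zip(col, e)]
            b[j] = new_b
        # Scaled inverse chi-square updates; chi2(df) is Gamma(shape df/2, scale 2).
        var_b = (sum(v * v for v in b) + df_pc * S_pc) / rng.gammavariate((k + df_pc) / 2.0, 2.0)
        var_e = (sum(v * v for v in e) + df_e * S_e) / rng.gammavariate((n + df_e) / 2.0, 2.0)
        if it >= burn_in:
            kept += 1
            post_b = [p + v for p, v in zip(post_b, b)]
    return [p / kept for p in post_b]
```

PCR-t would differ only in drawing a separate variance for each PC effect, and PCR-Lasso in replacing the variance update by the exponential mixture described above. In contrast, PCR-Eigen fixes the variances at σ<sup>2</sup><sub>a</sub>λ<sub>j</sub> and therefore needs no sampling.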

# Predictive Ability

Different models were compared on how accurately they predicted the true breeding values of animals. The correlation between genomic estimated breeding values and true breeding values was used as the accuracy of a model. The accuracies of genomic estimated breeding values were calculated using two approaches. In the first approach, training animals were divided into five groups; in turn, four groups were used to estimate marker effects and the left-out group was used to calculate accuracies. In the second approach, in order to investigate the persistency of accuracy over generations, marker effects estimated from animals in the reference population were used repeatedly to measure the accuracies in the candidate animals (candidate populations) from generations 10 to 15. Unbiasedness of genomic predictions was measured by the regression of true breeding values on genomic estimated breeding values. The regression coefficient does not deviate greatly from one if the prediction is unbiased.
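The two evaluation statistics and the five-fold split can be written compactly. The sketch below uses hypothetical function names and a random assignment of animals to folds, which is an assumption since the text does not state how the groups were formed.

```python
import random


def pearson(x, y):
    """Pearson correlation, used here as prediction accuracy (GEBV versus TBV)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def regression_tbv_on_gebv(tbv, gebv):
    """Intercept b0 and slope b1 of the regression of TBV on GEBV."""
    n = len(tbv)
    mt, mg = sum(tbv) / n, sum(gebv) / n
    b1 = (sum((g - mg) * (t - mt) for g, t in zip(gebv, tbv))
          / sum((g - mg) ** 2 for g in gebv))
    return mt - b1 * mg, b1


def five_fold_indices(n, rng, folds=5):
    """Randomly assign n animals to `folds` validation groups of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[f::folds] for f in range(folds)]


# GEBV that are perfectly correlated with TBV but twice as dispersed:
tbv = [1.0, 2.0, 3.0, 4.0]
gebv = [2.0, 4.0, 6.0, 8.0]
accuracy = pearson(gebv, tbv)
b0, b1 = regression_tbv_on_gebv(tbv, gebv)
```

In this toy case the accuracy is one while the slope is 0.5, which shows that accuracy and bias measure different things: a slope below one indicates that the predictions are too widely spread relative to the true breeding values.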

# RESULTS

**Table 1** illustrates the average number of SNP markers and retained PCs which explain 0.999 of the original variance.

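FIGURE 1 | Percentage of variance explained by each PC and cumulative variance of PCs for replicate 1 in the scenario with 105 QTL.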

TABLE 2 | Pearson correlations between predicted genomic breeding values and true breeding values for different methods with five-fold cross validation in training populations.


Ridge-R, Ridge regression-BLUP; Bayes-Ridge, Bayesian Ridge regression; PCR-Normal, Bayesian principal component regression with normal distribution of effects; PCR-t, Bayesian principal component regression with scaled t distribution of effects; PCR-Lasso, Bayesian principal component regression with double exponential distribution of effects; PCR-Eigen, Principal component regression-BLUP with eigenvalues as prior variance of effects.

TABLE 3 | Intercept and regression coefficient of true breeding value on predicted genomic breeding value and coefficient of determination for different estimation methods for 5-fold cross validation in training population.


b0, Intercept; b1, regression coefficient; R<sup>2</sup> , determination coefficient.

Although we used a non-strict criterion for retaining PCs, the number of PCs was nearly half the number of SNPs. This illustrates the ability of PCA to reduce the number of variables without considerable loss of variance. Dimauro et al. (2011) selected a strict criterion for retaining PCs and reported that 300 and 700 PCs explained 85 and 95% of the original variance, respectively.

The percentage of variance explained by each PC and the cumulative variance of PCs for replicate 1 in the scenario with 105 QTL are shown in **Figure 1** as an example. The first five and the first 100 PCs are adequate for explaining 60% and 90% of the original variance, respectively. The curve of cumulative variance reached a plateau around the 200th PC. In agreement with previous findings on simulated data, PCA was able to efficiently reduce the number of predictors. Since only a small amount of variance is explained by each PC after the plateau, a large number of PCs must be included in the model to capture a relatively small amount of variance; in this study, about 1,400 PCs after the plateau explained less than 1% of the original variance. These results highlight that PC analysis can compress the total variation into a smaller set of variables.

Cross validation accuracies of genomic predictions obtained using SNP/PC based models are shown in **Table 2**. On average, the accuracies were highest in the 105 QTL scenario. The accuracy of genomic predictions clearly declined as the number of QTL decreased from 105 to 15 in all eight methods. As expected, the relative performance of BayesA and BayesB improved with decreasing number of QTL, and at 15 QTL they outperformed the Bayes-Ridge model. Previous studies have reported that a BLUP mixed model, assuming equal variance for all SNP, performs as well as variable selection models for most traits in dairy cattle (Hayes et al., 2009; VanRaden et al., 2009), but for traits controlled by major genes, such as fat percentage, variable selection models are superior to BLUP models (Cole et al., 2009; Legarra et al., 2011). Across all scenarios, Ridge-R among the SNP based models and PCR-Eigen among the PC based models had the lowest accuracies.

In all scenarios, PCR-Normal performed better than the other three PC based models, but the differences between PCR-Normal and PCR-t were negligible. Macciotta et al. (2010) investigated the accuracy of PC based estimated breeding values differently. They sequentially added PCs to a PC-BLUP model to reach the highest accuracy and found that the accuracy increased up to a plateau at 250 to 300 PCs. Retaining more PCs in their study did not increase accuracy.

In the scenario with 105 QTL, the accuracy of Bayes-Ridge (0.671) was similar to that of PCR-Normal (0.664), while in the latter the number of predictors was nearly half. That is a large reduction in predictor variables with virtually no loss of prediction accuracy. In this scenario, the accuracy of PCR-Normal was equal to the accuracy of BayesA. In the 60 QTL scenario these two models yielded similar accuracies (0.654 vs. 0.651). This is also true for BayesA and PCR-t, both of which use the same prior for the unknown parameters, the former for SNPs and the latter for PC scores. BayesB had the highest accuracy in the 15 QTL scenario, 0.556, which is only 0.014 higher than the accuracy obtained with PCR-Normal but 0.052 higher than that of PCR-Eigen, which assumes that predictor variances are fixed quantities scaled proportionally to their eigenvalues.

A necessary condition for unbiased genomic prediction is that the regression coefficient of true breeding values on genomic prediction is close to 1. Compared with the BLUP models (Ridge-R and PCR-Eigen), the bias in Bayesian models was reduced (**Table 3**). PCR-Eigen overestimated the genomic breeding values, with a regression coefficient of less than 1. In a simulation study by Macciotta et al. (2010), the regression slope was 0.76 with eigenvalues as prior variances and 0.69 with a single prior variance for PCs. In a simulation study with PCs extracted from different marker densities assuming a single PC variance, regression slopes varied from 0.65 to 0.695 (Solberg et al., 2009a). The data simulated in these studies were different, but the methods were comparable to our PCR-Eigen. In contrast to the models with a fixed variance for predictors, Bayesian PC models produced unbiased predictions (**Table 3**). The unbiased models in the 105 QTL scenario were Bayes-Ridge and PCR-Normal, and in the 60 QTL scenario they were BayesA, followed by Bayes-Ridge and PCR-Normal. PCR-t led to unbiased estimated genomic breeding values in the 15 QTL scenario.

(Middle): 60 QTL; (Bottom): 15 QTL.

**Figure 2** depicts the persistency of selection accuracy over six generations of selection candidates using SNP/PC based models. Accuracies decreased as the number of QTL decreased and as generations advanced. The figure shows only marginal differences between SNP based and PC based models for different numbers of QTL, such that it is difficult to determine which model outperforms the others over the generations. The superiority of Bayesian PCR models over PCR-Eigen is more evident in the scenario with 60 QTL, followed by that with 15 QTL.

**Figure 3** shows the regression coefficients of true breeding values on estimated breeding values over six generations. Across all models, absolute values of regression coefficients decreased as generations advanced. PCR-Lasso had an inflated regression slope in the training populations of the 105 QTL (b1 = 1.2) and 60 QTL (b1 = 1.09) scenarios, but in generations 10, 11, and even 12 the slope was around 1. PCR-Eigen consistently overpredicted breeding values, such that the regression slope at generation 15 in the 15 QTL scenario fell to 0.19.

# DISCUSSIONS

Genomic prediction faces the statistical challenge of having fewer observations than markers. Some research in this decade has focused on this challenge, and several solutions have been proposed. VanRaden et al. (2009) compared a 40K SNP set with 20K and 10K subsets obtained by keeping every other or every fourth SNP sequentially across the genome, respectively, and reported more accurate predictions using the 40K SNP panel. The reduction of predictor variables by selecting subsets of SNPs that were evenly spaced or chosen based on their relevance to the trait was investigated by Vazquez et al. (2010). They reported that the accuracy of genomic prediction decreased substantially with subsetting of SNPs. Moser et al. (2009) compared several methods to predict genomic breeding values and showed that least squares regression, which exploits a reduced subset of selected SNP, consistently had lower accuracy and a larger bias of prediction than the other methods using all SNP. Weigel et al. (2009) sorted markers based on the magnitude of the estimated marker effects and included only those with the largest effects in the model, but accuracies always declined with subsetting of SNPs. In all the methods mentioned, eliminating some SNPs produced lower accuracies, whereas in genomic prediction reducing the dimension of the model is advantageous only if accuracy does not drop considerably. Compared to subset selection of variables, multivariate reduction via PCA has the advantage that no marker is discarded, while a smaller set of uncorrelated predictors preserves as much as possible of the variation present in the original markers.

With huge numbers of dense SNPs, the multicollinearity problem due to linkage disequilibrium is unavoidable (Long et al., 2011). Solberg et al. (2009a) employed partial least squares regression (PLSR) and PCA to reduce the dimensionality and showed that when marker density is low, the accuracy of both methods is comparable with that of BayesB, but with denser markers, BayesB outperforms PLSR and PCA. They concluded that the reduction in computational complexity via multivariate methods did not counterbalance their lower accuracy compared with BayesB. Accuracies of genomic predictions obtained using PCR and G-BLUP models were also investigated by Dadousis et al. (2014), who reported that, across test datasets and traits, G-BLUP outperformed the PCR model. However, in the present study Bayesian estimation of the effects and variances of PC scores led to accuracies similar to those of BayesB and better than those of PCR-Eigen, where PC variances were proportional to the eigenvalues. The three Bayesian PCR methods performed the same, but for reasons of parsimony PCR-Normal, with a single variance parameter for PCs, is preferred in practice. The performance of models characterized by different prior specifications showed negligible differences in this study. However, the differences in performance between these PCR methods may become more visible under broader differences in the genetic architecture of the traits.

The persistence of the accuracy of genomic prediction over generations depends largely on the extent of LD and the ability of statistical methods to exploit LD information. BayesB exploits LD information considerably better than Bayesian ridge regression and thus is expected to produce stable accuracy (Habier et al., 2007). Recombination between markers and QTL over generations breaks down linkage disequilibrium and reduces the accuracy of selection. Depending on the cost of genotyping and the number of markers, genomic selection programs will be more cost effective if the estimated marker effects can be used over multiple generations (Solberg et al., 2009b). In this study, there was little difference between Bayesian SNP based methods and Bayesian PC based methods in the persistency of accuracies across scenarios, with BayesB being slightly better than the others. Habier et al. (2010) reported that the accuracy of GEBVs decayed over generations but that this decay was smaller in BayesB than in G-BLUP.

In all scenarios, the accuracy of GEBV increased when a prior density was assumed for the effects and variances of PC scores instead of specifying predefined weights for the PCs, i.e., PCR-Eigen. Although the heterogeneous structure of variance can be considered by specifying eigenvalues as prior variances for PC scores, the assumption of fixed quantities limits the ability of this approach. In a Bayesian setting, assigning an informative prior density for the PC variance(s), combined with the information brought by the data, leads to more robust estimation of PC effects, which in turn leads to greater accuracy. The decay of accuracy in selection candidates over generations tended to be smaller for the developed Bayesian PCR models; this was evident even when the QTL number was smaller.

# CONCLUSION

The present study assessed the performance of PC based models as a dimensionality reduction method in comparison to commonly used SNP based models. Accuracies of genomic predictions using prior knowledge of PC effects and variances in a Bayesian hierarchical framework were considerably higher than those obtained by specifying fixed PC variances proportional to eigenvalues. Bayesian PC based models and SNP based models performed similarly at different QTL densities, while the number of predictors in PC models was nearly half the number of SNPs. Reducing the dependency among predictors due to LD, reducing dimension by forming PCs, and then Bayesian updating of the PC variance(s) can potentially improve prediction accuracies. Finally, the methods developed in this study are recommended, on the basis of their ease of implementation and good statistical properties, for the analysis of the correlated high dimensional datasets that are becoming available. These results, when confirmed on real data sets, will support the use of Bayesian PCR in genomic predictions.

# AUTHOR CONTRIBUTIONS

SH-V designed and ran the analyses, interpreted the results, and wrote the manuscript. MS assisted with the study design, interpretation of results, and critically contributed to the manuscript. HM and MT helped in the interpretation of results and edited the drafted manuscript. All authors read and approved the final manuscript.

# ACKNOWLEDGMENTS

The authors are extremely grateful to two reviewers for their comments and suggestions which improved the present study.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Hosseini-Vardanjani, Shariati, Moradi Shahrebabak and Tahmoorespur. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome-Wide Characterization of Selection Signatures and Runs of Homozygosity in Ugandan Goat Breeds

Robert B. Onzima<sup>1,2</sup>\*, Maulik R. Upadhyay<sup>1,3</sup>, Harmen P. Doekes<sup>1</sup>, Luiz F. Brito<sup>4</sup>, Mirte Bosse<sup>1</sup>, Egbert Kanis<sup>1</sup>, Martien A. M. Groenen<sup>1</sup> and Richard P. M. A. Crooijmans<sup>1</sup>

<sup>1</sup> Animal Breeding and Genomics, Wageningen University and Research, Wageningen, Netherlands, <sup>2</sup> National Agricultural Research Organization (NARO), Entebbe, Uganda, <sup>3</sup> Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden, <sup>4</sup> Department of Animal Biosciences, Centre for Genetic Improvement of Livestock (CGIL), University of Guelph, Guelph, ON, Canada

### Edited by:

Max F. Rothschild, Iowa State University, United States

### Reviewed by:

Heather Jay Huson, Cornell University, United States

Gábor Mészáros, Universität für Bodenkultur Wien, Austria

### \*Correspondence:

Robert B. Onzima robert.onzima@wur.nl; robertonzima@gmail.com

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 27 February 2018 Accepted: 25 July 2018 Published: 14 August 2018

### Citation:

Onzima RB, Upadhyay MR, Doekes HP, Brito LF, Bosse M, Kanis E, Groenen MAM and Crooijmans RPMA (2018) Genome-Wide Characterization of Selection Signatures and Runs of Homozygosity in Ugandan Goat Breeds. Front. Genet. 9:318. doi: 10.3389/fgene.2018.00318

Both natural and artificial selection are among the main driving forces shaping genetic variation across the genome of livestock species. Selection typically leaves signatures in the genome, which are often characterized by high genetic differentiation across breeds and/or a strong reduction in genetic diversity in regions associated with traits under intense selection pressure. In this study, we evaluated selection signatures and genomic inbreeding coefficients, FROH, based on runs of homozygosity (ROH), in six Ugandan goat breeds: Boer (n = 13), and the indigenous breeds Karamojong (n = 15), Kigezi (n = 29), Mubende (n = 29), Small East African (n = 29), and Sebei (n = 29). After genotyping quality control, 45,294 autosomal single nucleotide polymorphisms (SNPs) remained for further analyses. A total of 394 and 6 breed-specific putative selection signatures were identified across all breeds, based on marker-specific fixation index (FST-values) and haplotype differentiation (hapFLK), respectively. These regions were enriched with genes involved in signaling pathways associated directly or indirectly with environmental adaptation, such as immune response (e.g., IL10RB and IL23A), growth and fatty acid composition (e.g., FGF9 and IGF1), and thermo-tolerance (e.g., MTOR and MAPK3). The study revealed little overlap between breeds in genomic regions under selection, and the regions generally did not display typical classic selection signatures, as expected given the complex nature of the traits. In the Boer breed, candidate genes associated with production traits, such as body size and growth (e.g., GJB2 and GJA3), were also identified. Furthermore, analysis of ROH in indigenous goat breeds showed very low levels of genomic inbreeding (with the mean FROH per breed ranging from 0.8% to 2.4%), as compared to higher inbreeding in Boer (mean FROH = 13.8%). Short ROH were more frequent than long ROH, except in Karamojong, providing insight into the developmental history of these goat breeds. This study provides insights into the effects of long-term selection in Boer and indigenous Ugandan goat breeds, which are relevant for implementation of breeding programs and conservation of genetic resources, as well as their sustainable use and management.

Keywords: Capra hircus, homozygosity, adaptation, genomic inbreeding, genetic diversity, selective sweeps, candidate genes

# INTRODUCTION

fgene-09-00318 August 13, 2018 Time: 8:29 # 2

Goats are among the most important livestock species in developing countries, such as Uganda, playing a significant socio-economic, nutritional and cultural role in smallholder production systems (MAAIF, 2011). The total goat population in Uganda is estimated at over 14 million animals, predominantly from indigenous breeds (98%), with a small proportion (2%) from exotic breeds (MAAIF and UBOS, 2009; UBOS, 2015). Exotic breeds have been artificially selected for production traits over several generations, whereas indigenous breeds have undergone no or less intense artificial selection. The effect of natural selection (i.e., adaptation to the specific environment) is therefore expected to be more apparent in the indigenous breeds and to have played an important role in their development. Based on this hypothesis, indigenous breeds are expected to exhibit resistance to gastro-intestinal parasites and local diseases, tolerance to heat and water scarcity, and the ability to use low quality fodder. High order traits such as adaptation to environmental stress are often influenced by several traits acting in combination. Adaptation is a complex trait that involves many biological processes and quantitative trait loci, each having a small but cumulative effect on the overall expression of the phenotype (Kim et al., 2016; Yang et al., 2016; Mwacharo et al., 2017).

Selection (both natural and artificial) is one of the main driving forces shaping genetic variation across genomes of livestock species. Under strong positive selection pressure, the frequency of favorable alleles will increase over time (Maynard and Haigh, 2007). This may result in genomic regions with high genetic differentiation across breeds and/or specific haplotypes rising to high frequencies. Such regions can thus be regarded as selection signatures. Analysis of selection signatures aims to identify genomic regions or loci showing deviations from neutrality. Other forces, such as migration, admixture events, and population bottlenecks, may also have a profound effect on genomic variability, locally increasing or reducing the genetic variation.

Two well-established methods to detect selection signatures are the fixation index (FST) (Wright, 1949; Weir and Cockerham, 1984; Porto-Neto et al., 2013) and the haplotype differentiation statistic hapFLK (Fariello et al., 2013). FST is one of the most popular methods to detect selection signatures when data are available for multiple populations. The FST approach measures population differentiation based on locus-specific allele frequencies and can detect highly differentiated alleles undergoing divergent selection among populations (McRae et al., 2014; Zhao et al., 2015). A drawback of the approach is that the FST statistic assumes that all populations are of similar effective population size and are derived independently from the same ancestral population. The hapFLK statistic measures differences in haplotype frequencies between populations and accounts for the hierarchical structure of the populations (Fariello et al., 2013). Combining haplotype information with the hierarchical structure of populations results in greater power for the detection of selection signatures.

Selection signature analyses using genome-wide SNPs have been widely applied in exploring the genomes of livestock species such as sheep (Kijas et al., 2012; Purfield et al., 2017; Rochus et al., 2018), cattle (Porto-Neto et al., 2014; Zhao et al., 2015; Taye et al., 2017), and goats (Burren et al., 2016; Kim et al., 2016; Brito et al., 2017). These studies have identified genes associated with a variety of traits including thermo-tolerance, immune response, reproduction functions, skin and hair structure, feed intake, and metabolism.

Ugandan indigenous goat breeds can be phenotypically categorized into three main breeds: Kigezi, Mubende, and Small East African (Mason and Maule, 1960). Other indistinct ecotypes of indigenous goat breeds also exist, including Karamojong and Sebei (Nsubuga, 1996). These breeds show high genetic diversity, but weak population sub-structuring (Onzima et al., 2018). The weak population structure results in low levels of inbreeding and in some of the breeds having similar selection signatures (Msalya et al., 2017). The indigenous breeds present a high degree of adaptation to parasites and heat, and survive on poor quality fodder, while also maintaining good reproductive rates (Mwacharo et al., 2017). However, production levels are much lower compared to specialized breeds. Therefore, Boer goats were introduced in Uganda in the early 1990s to genetically improve the growth rate and body size of the indigenous breeds (Nsubuga, 1996). Because of community-based small ruminant breeding programs and the use of a limited number of Boer breeding males for cross breeding, the increase in inbreeding levels is a major concern to the industry.

The increase in inbreeding in livestock at a genomic level over generations leads to a reduction in genetic diversity. When an offspring is inbred, it may inherit autozygous chromosomal segments from both parents that are identical by descent (IBD), i.e., segments that are derived from a common ancestor (Broman and Weber, 1999). The result is continuous homozygous segments in the genome, also known as runs of homozygosity (ROH). The extent of ROH can be used to estimate the inbreeding coefficient (Bosse et al., 2012; Marras et al., 2015; Peripolli et al., 2018). ROH can be used to disclose the genetic relationships among individuals, usually estimating with high accuracy the autozygosity at the individual and/or population level (Ferenčaković et al., 2011, 2013a). They can also be used to establish the level of selection pressure on the populations (Zhang et al., 2015). The length and frequency of ROHs may also be used to distinguish distant from more recent inbreeding, since the length of IBD segments follows an inverse exponential distribution with a mean of 1/(2g) Morgans, where g is the number of generations from a common ancestor (Howrigan et al., 2011).

The objectives of this study were to: (1) identify unique selection signatures in the genome and the genes under selection in Ugandan goat breeds, and (2) assess the occurrence and distribution of ROH and ROH-based genomic inbreeding in Ugandan goat breeds.

# MATERIALS AND METHODS

# Animals and Genotype Quality Control

The data used in this study were derived from 144 animals from 6 goat populations and have been described in detail previously (Onzima et al., 2018). The animals were from the five indigenous breeds, Mubende (n = 29), Kigezi (n = 29), Small East African (n = 29), Karamojong (n = 15) and Sebei (n = 29), and from the exotic Boer breed (n = 13). All animals were genotyped with the Illumina GoatSNP50 BeadChip (Tosser-Klopp et al., 2014), which features 53,347 single nucleotide polymorphisms (SNPs). Genotype quality control (QC) procedures were performed using PLINK v1.90 (Chang et al., 2015). All samples passed the sample-level quality criterion (a missing genotype rate threshold of 0.1) and were used in the analysis. SNPs with a call rate below 0.95, a minor allele frequency (MAF) lower than 0.05, located on non-autosomal chromosomes, or not in Hardy Weinberg Equilibrium (at p < 0.001) were discarded. After QC procedures, 46,105 autosomal SNPs remained. For these SNPs, the position on the genome was obtained from the goat reference genome assembly ARS1 release 102 (Bickhart et al., 2017). After removing SNPs with unknown position on the ARS1 genome assembly, 45,294 autosomal SNPs from 144 goats remained in the final dataset.
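The two simplest SNP-level filters above can be illustrated with a short Python sketch. This is not the PLINK implementation used in the study. It assumes genotypes coded as 0, 1 or 2 copies of one allele with `None` for a missing call, the function names are hypothetical, and the Hardy Weinberg and autosome filters are left out for brevity.

```python
def snp_call_rate(calls):
    """Fraction of non-missing genotype calls for one SNP."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from genotypes coded as 0/1/2 allele counts (None = missing)."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    freq = sum(observed) / (2 * len(observed))  # frequency of the counted allele
    return min(freq, 1 - freq)


def passes_snp_qc(calls, min_call_rate=0.95, min_maf=0.05):
    """Apply the call-rate and MAF thresholds quoted in the text."""
    return (snp_call_rate(calls) >= min_call_rate
            and minor_allele_frequency(calls) >= min_maf)
```

For example, a SNP with 2 missing calls out of 20 animals has a call rate of 0.9 and would be discarded regardless of its MAF.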

# Relatedness Within and Between Breeds

The level of relatedness between individuals (both within and between breeds) was determined using genomic similarities. For each pair of individuals i and j, the genomic similarity ($\text{SIM}_{\text{SNP}_{ij}}$) was determined according to Malécot (1948):

$$\text{SIM}_{\text{SNP}_{ij}} = \frac{\sum_{k=1}^{n_{\text{SNP}}} (I_{11,k} + I_{12,k} + I_{21,k} + I_{22,k})}{4n_{\text{SNP}}}$$

where $n_{\text{SNP}}$ is the total number of markers and $I_{xy,k}$ is an indicator variable that was set to 1 when allele x of individual i and allele y of individual j at marker k were identical by state (IBS), and to 0 otherwise. Note that, as self-similarities were included, the average similarity in a breed was equivalent to the expected homozygosity in that breed.
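A minimal Python sketch of this similarity is given below. It assumes each individual is stored as a list of allele pairs, one pair per marker, with no missing calls; the function name and data layout are illustrative and not taken from the study.

```python
from itertools import product


def genomic_similarity(geno_i, geno_j):
    """IBS-based similarity between two individuals.

    geno_i, geno_j: lists of (allele1, allele2) tuples, one per marker.
    Missing calls are not handled (an assumption of this sketch).
    """
    if len(geno_i) != len(geno_j) or not geno_i:
        raise ValueError("genotype lists must be non-empty and of equal length")
    matches = 0
    for (a1, a2), (b1, b2) in zip(geno_i, geno_j):
        # I_11 + I_12 + I_21 + I_22: each allele of i against each allele of j
        matches += sum(x == y for x, y in product((a1, a2), (b1, b2)))
    return matches / (4 * len(geno_i))
```

With this definition, the self-similarity of an individual is 1 at a homozygous marker and 0.5 at a heterozygous marker, which is why the average similarity within a breed (self-similarities included) matches the expected homozygosity.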

# Identifying Selection Signatures

To increase the likelihood of detecting true selection signatures (i.e., to limit false positive results), multiple approaches can be used (Simianer, 2014). The methods adopted for the analysis of selection signatures need to be robust enough to disentangle selective pressures from other effects on the population, such as migration, admixture and population bottlenecks. In this study, we used allele-specific population differentiation, defined as FST (Wright, 1949), and a haplotype-based differentiation approach, hapFLK (Fariello et al., 2013), which accounts for the haplotype structure of populations and for variable effective population sizes.

## Fixation Index (FST)

Selection signatures for each breed were identified using an FST-statistic per SNP that compares the allele frequency in the breed to the allele frequency in a combined population of the remaining breeds, following the unbiased estimator proposed by Weir and Cockerham (1984) and implemented in PLINK (Chang et al., 2015). For example, differences between the exotic Boer breed on the one hand, and the indigenous goat breeds on the other, were investigated by calculating the FST-values between Boer and a combined population of all the indigenous breeds. We also computed FST-values for the indigenous breeds while excluding Boer from the analysis. However, as the exclusion of Boer did not influence the results for the indigenous breeds, only results with Boer included are reported in the subsequent sections.

In general, genomic regions showing high FST-values, both as moving averages (ma) and for single SNPs, indicate strong breed differentiation or selection, while low FST-values suggest no or limited population differentiation. Negative FST-values were set to zero, as they imply no genetic differentiation between the two groups.

To visualize and infer region-specific differences from the erratic pattern of individual SNPs, we computed a ma of FST (maFST) over 5 adjacent SNPs and plotted it against the chromosomal position for all goat autosomes (CHI coordinates). The SNPs with a maFST above the 95% quantile of the empirical distribution of raw FST-values were considered putative selection signatures. The ma is a simple approach for identifying regions of interest in the genome. It has been implemented successfully in analyzing systematic differences in the response of genetic variation to pedigree and genome-based selection methods in chicken (Heidaritabar et al., 2014) and genome-wide genetic diversity in Dutch dairy cattle (Doekes et al., 2018). By using maFST, rather than FST for single SNPs, we aimed to reduce the influence of the small sample sizes on the results.
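The smoothing and thresholding steps can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements from the text: the window is centered on each SNP, SNPs near chromosome ends without a full window are skipped, the quantile uses linear interpolation, and the threshold is taken after negative values are set to zero. All function names are hypothetical.

```python
def moving_average(values, window=5):
    """Centered moving average over `window` adjacent SNPs of one chromosome.

    SNPs without a full window on both sides get None (assumes an odd window).
    """
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out


def empirical_quantile(values, q=0.95):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def candidate_snps(raw_fst, window=5, q=0.95):
    """Indices of SNPs whose maFST exceeds the q-quantile of single-SNP FST."""
    fst = [max(0.0, f) for f in raw_fst]      # negative FST set to zero
    threshold = empirical_quantile(fst, q)
    ma = moving_average(fst, window)
    return [i for i, m in enumerate(ma) if m is not None and m > threshold]
```

In practice the function would be applied per chromosome, so that windows do not span chromosome boundaries.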

## Haplotype Differentiation (hapFLK)

To account for haplotype structure and varying effective population sizes, we used hapFLK (Fariello et al., 2013) to detect potential selection signatures in the six goat populations. The procedure used has been described in detail by Brito et al. (2017). Briefly, hapFLK was applied to the unphased genotype data to identify putative regions under selection, by estimating the neighbor joining tree and a kinship relationship matrix based on Reynolds' genetic distances between the breeds. The pairwise Reynolds' distances (Reynolds et al., 1983) between populations (including an outgroup) are computed for each SNP and averaged over the genome. Using the genotype data and kinship matrix, and assuming 6 clusters in the fastPHASE model (−k, 6), the program was run and the hapFLK statistic was computed as an average of 20 expectation maximization iterations to fit the Linkage Disequilibrium (LD) model. With hapFLK values generated for each SNP, p-values were computed based on a chi-square distribution of the numerical values. The mean and variance of the hapFLK distribution were estimated and used to standardize each SNP-specific value. p-values were then computed from a standard normal distribution, and −log10 of the p-values was plotted against the genomic positions. To minimize the number of false positives, a q-value threshold of 0.01 was set to control the false discovery rate (FDR) at the 1% level. Putative selection signatures were defined as the regions with p < 0.005.
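The standardization and p-value steps can be illustrated with a short Python sketch. It is not the hapFLK software. It assumes a plain (non-robust) mean and standard deviation, an upper-tail normal p-value, and Benjamini-Hochberg adjusted p-values as one common way of obtaining q-values; the text does not specify these choices, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def upper_tail_p(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def standardized_p_values(stats):
    """Standardize per-SNP statistics and convert them to upper-tail p-values."""
    mu, sd = mean(stats), pstdev(stats)
    return [upper_tail_p((s - mu) / sd) for s in stats]


def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # from the largest p to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q
```

SNPs would then be retained where the adjusted value falls below the chosen FDR level (0.01 in the text).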

# Identification of Candidate Genes Associated With Selection Signatures

Genes within putative selection signature regions were retrieved from NCBI<sup>1</sup>, using the goat reference assembly ARS1. Genes overlapping, either partially or fully, the regions above the 95% threshold of the empirical distribution of the raw FST-values, or the regions with p < 0.005 for hapFLK, were considered putative selection candidate genes.

For each of the breeds, gene enrichment analyses were performed based on the FST and hapFLK results with the web-based tool Database for Annotation, Visualization, and Integrated Discovery (DAVID) v6.8 (Huang et al., 2009; Jiao et al., 2012), which allows for the investigation of the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Kanehisa et al., 2012) and Gene Ontology (GO) for biological processes (Ashburner et al., 2000). Fisher's exact test (p-value = 0.05) was applied to identify significantly enriched GO biological and functional processes. More stringent settings, such as Bonferroni correction, FDR, and fold enrichment, were not considered given the limited scope of the study. Human gene ontologies were used since the goat genome has not yet been thoroughly annotated. Moreover, the human genome is more thoroughly annotated than those of closely related species such as cattle, thereby increasing the probability of retrieving GO terms for goat genes. Phenotypes known to be affected by the identified candidate genes were checked against the literature and the AnimalQTLdb at: https://www.animalgenome.org/cgibin/QTLdb/index.
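As an illustration of the enrichment test, a one-sided Fisher's exact test for a single GO term can be written as a hypergeometric upper-tail probability. The sketch below uses only the Python standard library; the function and argument names are hypothetical, and DAVID's own scoring may differ in detail from this plain version.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    N: genes in the background set
    K: background genes annotated with the term
    n: genes in the candidate list
    k: candidate genes annotated with the term
    Returns P(X >= k) under random sampling without replacement.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total
```

A term would be reported as enriched when this probability falls below the chosen cutoff (0.05 in the text, without multiple-testing correction).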

# Runs of Homozygosity (ROH) and Genomic Inbreeding (FROH)

For each individual, ROHs were identified using an in-house script which incorporates a set of criteria for defining regions of homozygosity.

An ROH was called if the following criteria were fulfilled: (1) 20 or more consecutive SNPs were homozygous, (2) a minimum physical length of 2 Mb, (3) a maximum gap between two consecutive SNPs of 500 Kb, and (4) a maximum of 2 missing genotypes and no heterozygous calls within the ROH. These rather stringent criteria were used to minimize incorrect discovery of ROH (false positives) within regions of low marker density. The minimum expected length of homozygous DNA segments was based on the time frame of approximately 25 generations over which goats are believed to have been characterized as separate breeds in Uganda (Mason and Maule, 1960). The length of ROH derived from a common ancestor g generations ago follows an inverse exponential distribution with a mean equal to 100/(2g) cM (Fisher, 1954; Thompson, 2013). A genetic distance of approximately 1 cM per Mb is often assumed in cattle (Arias et al., 2009); assuming a similar relationship for goats, the mean length of ROH derived from a common ancestor 25 generations ago would be 2 Mb.
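The four criteria can be sketched as a single pass over the SNPs of one chromosome. The in-house script itself is not shown in the text, so the Python below is only an illustration under stated assumptions: genotypes are coded 0 or 2 for homozygous, 1 for heterozygous and `None` for missing; positions are sorted and in base pairs; a third missing call ends the current run; and missing calls at the edges of a run are trimmed. All names are hypothetical.

```python
def _evaluate(run, positions, genotypes, min_snps, min_len):
    """Return (start_bp, end_bp, n_homozygous) for a candidate run, or None."""
    idx = list(run)
    # Trim missing calls at the edges so the segment starts and ends
    # on a homozygous SNP.
    while idx and genotypes[idx[-1]] is None:
        idx.pop()
    while idx and genotypes[idx[0]] is None:
        idx.pop(0)
    n_hom = sum(genotypes[i] is not None for i in idx)
    if not idx or n_hom < min_snps:
        return None
    start, end = positions[idx[0]], positions[idx[-1]]
    return (start, end, n_hom) if end - start >= min_len else None


def call_roh(positions, genotypes, min_snps=20, min_len=2_000_000,
             max_gap=500_000, max_missing=2):
    """Call ROH on one chromosome using the thresholds quoted in the text."""
    rohs, run, n_missing = [], [], 0
    for i, g in enumerate(genotypes):
        gap_break = bool(run) and positions[i] - positions[run[-1]] > max_gap
        het_break = g == 1
        missing_break = g is None and n_missing == max_missing
        if gap_break or het_break or missing_break:
            roh = _evaluate(run, positions, genotypes, min_snps, min_len)
            if roh:
                rohs.append(roh)
            run, n_missing = [], 0
        if het_break or missing_break:
            continue                  # this SNP cannot belong to any run
        run.append(i)
        n_missing += g is None
    roh = _evaluate(run, positions, genotypes, min_snps, min_len)
    if roh:
        rohs.append(roh)
    return rohs
```

This greedy scan is the simplest design that respects all four criteria; sliding-window callers, such as the one in PLINK, handle missing and heterozygous calls differently and can give slightly different segments.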

The proportion of ROH per animal in comparison to the whole genome SNP coverage provides a useful indication of the level of inbreeding. Genomic inbreeding coefficients based on ROHs were computed as the length of the autosomes covered by ROHs divided by the overall length of the autosomes covered by the SNPs (McQuillan et al., 2008):

$$F_{\rm ROH,i} = \frac{L_{\rm ROH}}{L_{\rm AUTO}}$$

where LROH is the sum of the total length of ROH in individual i and LAUTO is the total length of the autosomes covered by the SNPs (2.463 Gb). The number of ROHs and FROH were also evaluated for different ROH length categories. We focused on length classes from 2 to 16 Mb to investigate more ancient inbreeding (2 and 16 Mb are the expected lengths of ROH derived from common ancestors 25 and 3 generations ago, respectively) and >16 Mb to assess more recent inbreeding (expected length of ROH derived from ancestors ≤ 3 generations ago).
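As a worked check of these class boundaries, and assuming (as in the text) roughly 1 cM per Mb, the expected length $\ell$ of a single ROH (a symbol introduced here for illustration) derived from a common ancestor $g$ generations back is:

$$E[\ell] = \frac{100}{2g}\ \text{cM}, \qquad g = 25:\ \frac{100}{50} = 2\ \text{cM} \approx 2\ \text{Mb}, \qquad g = 3:\ \frac{100}{6} \approx 16.7\ \text{cM} \approx 16.7\ \text{Mb}$$

The 16 Mb boundary is therefore a rounded value for three generations.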

The ROH were estimated in each individual separately and then classified into four length categories: 2–4 Mb, 4–8 Mb, 8–16 Mb, and >16 Mb, following the classification used in similar studies (Kirin et al., 2010; Ferenčaković et al., 2013a; Marras et al., 2015), specified from now on as ROH<sub>2–4 Mb</sub>, ROH<sub>4–8 Mb</sub>, ROH<sub>8–16 Mb</sub>, and ROH<sub>&gt;16 Mb</sub>, respectively. For each length category in each of the individuals of each breed, we computed the total number of ROH identified (nROH), the mean sum of ROH coverage in Mb (SROH, defined as the sum of all ROH per individual divided by the number of animals per breed), and the average length of ROH (LROH, Mb).
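The inbreeding coefficient and its split over length classes can be sketched in Python as below. The autosomal length of 2,463 Mb is the value quoted in the text; treating class boundaries as lower-inclusive (so that an ROH of exactly 16 Mb falls in the top class) is an assumption of this sketch, and the names are hypothetical.

```python
AUTOSOME_MB = 2463.0  # autosomal length covered by SNPs (2.463 Gb in the text)

# (lower bound, upper bound, label); lower bound inclusive by assumption
CLASSES = [
    (2, 4, "2-4 Mb"),
    (4, 8, "4-8 Mb"),
    (8, 16, "8-16 Mb"),
    (16, float("inf"), ">16 Mb"),
]


def length_class(length_mb):
    """Label of the length class for one ROH, or None if shorter than 2 Mb."""
    for low, high, label in CLASSES:
        if low <= length_mb < high:
            return label
    return None


def froh_by_class(roh_lengths_mb, autosome_mb=AUTOSOME_MB):
    """Return (overall FROH, {class label: FROH}) for one individual."""
    totals = {label: 0.0 for _, _, label in CLASSES}
    for length in roh_lengths_mb:
        label = length_class(length)
        if label is not None:
            totals[label] += length
    per_class = {label: total / autosome_mb for label, total in totals.items()}
    overall = sum(totals.values()) / autosome_mb
    return overall, per_class
```

Averaging the per-individual values within a breed then gives breed-level summaries of the kind reported for each length category.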

# RESULTS

# Relatedness Among Ugandan Goats

Mean genomic similarities within and between breeds are shown in **Table 1**. As expected, within breed similarities (diagonal) were higher than between breed similarities (off-diagonal). Within breeds, the highest mean similarity was found in Kigezi (0.643) and the lowest in Sebei (0.623). Between breeds, the indigenous Ugandan goat breeds showed higher genomic similarity with each other than with the Boer breed.

BOE = Boer (n = 13), KIG = Kigezi (n = 29), MUB = Mubende (n = 29), SEA = Small East African (n = 29), KAR = Karamojong (n = 15), and SEB = Sebei (n = 29).

<sup>1</sup>www.ncbi.nlm.nih.gov

# Selection Signatures

## Selection Signatures – FST

There was generally a high level of differentiation between Boer on the one hand, and the indigenous Ugandan breeds on the other. For Boer, the average FST across all SNPs was 0.123, while the average FST for the indigenous breeds was less than 0.050 (**Supplementary Table S1**).

Analysis of breed specific differentiation between each of the Ugandan goat breeds, including the Boer breed, resulted in several putative regions of selection, as shown in **Supplementary Table S2** (p < 0.05, without Bonferroni correction). In Boer, 29 putative regions of selection were identified, which were spread across 17 autosomes and overlapped with 134 genes. The regions with the highest degree of differentiation were found on CHI11, 12, 14, 18, and 24 (**Figure 1**). The highest ranked SNP window (maFST-value = 0.754) was found on CHI12 in a genomic region between 60.170 and 60.711 Mb and overlapped with portions of the genes MAB21L1 and NBEA. Analysis of breed specific differentiation for the Ugandan indigenous goat breeds resulted in 394 putative regions of selection distributed across all the breeds, including some candidate genes for traits of economic importance. The number of regions varied from 66 in Mubende to 79 in the Small East African goats, distributed across most of the autosomes (**Supplementary Table S2**). The selection signature regions in the indigenous goat breeds were on average more numerous and shorter than in the Boer. There was limited overlap between the different breeds, indicating that signatures are mostly breed specific. Several genes were found spanning the selection signature regions across the autosomes (**Supplementary Table S2**). Functional analysis of some of the candidate genes (**Supplementary Figure S1**) shows that they may be involved in tropical adaptation, such as thermo-tolerance and immune response, in the indigenous breeds. These include KPNA4 (CHI1), MTOR (CHI16), SH2B1 (CHI25), and MAPK3 (CHI25) in Karamojong; IL10RB, IFNAR, DNAJC13 (CHI1) in Kigezi; PPP1R36 and HSPA2 (CHI10), DNAJC24 (CHI5) in Mubende; CD80, ADPRH, IGSF11 (CHI1), IGF1 (CHI5) in Small East African goats; and HOXC12 and HOXC13 (CHI5) in Sebei. The full gene list is found in **Supplementary Table S2**.

## Selection Signatures – hapFLK

The results of the haplotype-based differentiation with hapFLK are shown in **Figure 2**. A significance threshold of p < 0.005 was used to identify regions under selection. The hapFLK analysis resulted in six putative selection signature regions on CHI5 (116.662–118.773 Mb), CHI6 (0.005–16.337 Mb), CHI8 (7.766–7.941 Mb), CHI13 (58.709–63.989 Mb), CHI15 (14.932–23.571 Mb), and CHI16 (40.533–45.988 Mb). Some of the candidate genes identified, which may play a role in tropical adaptation, include CFI (CHI6), DEFB genes and ASIP (CHI13), MTOR and PIK3CD (CHI16), and CD44 (CHI15) (**Supplementary Table S3**).

Four of the six significant regions identified by hapFLK partially overlapped with the 394 significant selective signature regions identified by FST. Several short overlapping regions were found between hapFLK and FST, with the strongest signals detected on CHI 6, 13, and 16 (**Figure 2**). Some of the regions were breed specific and contained several genes (**Supplementary Table S4**).

# Gene Enrichment of Putative Selection Signatures

Within the putative selection regions identified, a list of genes was compiled for each of the approaches used, FST (**Supplementary Table S2**) and hapFLK (**Supplementary Table S3**), and these lists were used to perform separate functional analyses using DAVID with default settings on the human gene set (Huang et al., 2009; Jiao et al., 2012). Functional analysis of the FST gene list for each of the breeds yielded 47 significant (p < 0.05) gene ontology (GO) biological process (BP) terms (**Supplementary Table S5**), and 15 KEGG pathways were enriched (**Supplementary Table S6**). The biological processes enriched were related to cell communication, male sex differentiation, and microtubule cytoskeleton organization in the Boer; and to negative regulation of catalytic activity, homeostasis in number of cells within the tissue, TIR-domain-containing adapter-inducing interferon-β (TRIF)-dependent toll-like receptor signaling pathway (GO:0035666), positive regulation of peptidyl-tyrosine phosphorylation (GO:0050731), cytokinesis (GO:0000910), and angiogenesis (GO:0001525), among others, in the indigenous goat breeds.

The DAVID analyses based on the hapFLK gene list across the breeds resulted in 18 significant (p < 0.05) biological processes (**Supplementary Table S5**) and nine significant (p < 0.05) KEGG pathways (**Supplementary Table S6**). The genes identified were significantly involved in the defense response to bacterium (GO:0042742; p-value < 0.001), negative regulation of the apoptotic process (GO:0043066; p-value < 0.001), and positive regulation of gene expression (GO:0010628; p-value < 0.001), among others.

# Analysis of ROH and Genomic Inbreeding – FROH

A total of 1,497 ROHs were detected across all individuals. The frequency of ROHs and their length distribution differed across breeds (**Figure 3**). For all length categories, ROHs were generally more frequent in Boer (a breed selected for meat production) than in Ugandan indigenous goat breeds. Consequently, Boer showed the highest genomic inbreeding coefficients (**Table 2**). For example, the mean F<sub>ROH≥2Mb</sub> in Boer was 13.8%, while for the indigenous breeds, it ranged from 0.8% (Sebei) to 2.4% (Karamojong). Shorter ROH were more frequent than longer ROH in all breeds except Karamojong. In the latter breed, there were remarkably many ROH > 16 Mb.

The ROHs were located across the whole genome, with some regions showing a higher frequency than others (**Supplementary Figure S2**). The mean sum of ROH segment coverage was generally higher for short ROHs than for long ROHs. The highest mean ROH coverage within the short ROH category (ROH of 2–4 Mb) was found in Boer, while Sebei had the lowest mean ROH coverage. For instance, around 65% of the Boer mean sum of ROH segment coverage in this study (219.65 Mb) was within the shorter ROH category 2–8 Mb. However, for the other breeds, the coverage ranged from 4.52 Mb in Sebei to 20.38 Mb in Kigezi (**Supplementary Table S7**). In the long ROH category (ROH<sub>&gt;16 Mb</sub>), Boer and Karamojong showed higher ROH genome coverage (33.67 and 33.60 Mb, respectively), which indicates more recent inbreeding. In the remaining breeds, the coverage was between 4.44 Mb in Kigezi and 10.58 Mb in the Small East African breed. Boer showed high genome coverage with both short and long ROH, suggesting that the breed has experienced both recent and ancestral inbreeding compared to Ugandan indigenous breeds. The findings also suggest that among the Ugandan indigenous goat breeds, Karamojong has greater levels of inbreeding compared to the others. The proportion of the genome located on an ROH differed between breeds and chromosomes. The proportion ranged from 1.50% on CHI2 to 93.62% on CHI23 in Sebei (**Supplementary Table S8**). ROH segments were identified on all 29 autosomes in Boer, but the number varied in the genomes of the indigenous goat breeds, with several autosomes showing no homozygous regions (**Supplementary Table S9**).

TABLE 2 | Average percentage genomic inbreeding coefficient (FROH) for different length categories of ROH across six goat populations.

SEA = Small East African goat, N<sup>0</sup> = number of sampled individuals in which no homozygous segments (ROH ≥ 2 Mb) were detected, F<sub>ROH≥2Mb</sub> = overall percentage genomic inbreeding at ROH threshold of 2 Mb, FROH(x−y) = genomic inbreeding based on ROHs of length x–y Mb.

# DISCUSSION

In this study, we unravel selection signatures and genomic inbreeding coefficients in goat breeds of Uganda using genome-wide SNP data.

Various approaches have been implemented for the detection of selection signatures in several domestic animal species such as, cattle (Msalya et al., 2017; Taye et al., 2017), horses (Petersen et al., 2013), sheep (Kijas et al., 2012; Fariello et al., 2014; Rochus et al., 2018), and goats (Kim et al., 2016; Wang et al., 2016; Brito et al., 2017). In this study, we assessed the genome-wide differences between Ugandan indigenous goat breeds (Karamojong, Kigezi, Mubende, Small East African, and Sebei) and exotic Boer goats by using population differentiation, FST (Weir and Cockerham, 1984) and the haplotype structure in the populations, hapFLK (Fariello et al., 2013).

The statistical power to detect selection signatures may vary among approaches, which is why we combined FST and hapFLK. Using complementary methods improves the reliability of detection and reduces method-specific bias (Simianer, 2014; Ma et al., 2015).

# Selection Signatures

The genomic regions potentially under selection identified in this study spanned a myriad of candidate genes with diverse biological, molecular, and cellular functions, which could be because adaptation to environmental stressors is controlled by a complex network of genes acting together rather than by single candidate genes. For instance, adaptation to hot and arid environments was found to be mediated by a complex network of genes in Egyptian Barki goats and sheep (Kim et al., 2016), which were directly or indirectly associated with energy and digestive metabolism, autoimmunity, thermo-tolerance (melanogenesis), and muscular and embryonic development. Similarly, adaptation may also result from the interaction of several traits under the influence of several genes (Lv et al., 2014). In this study, we found putative signatures containing a complex of genes involved directly or indirectly in immune response. Moreover, selection for complex traits may leave few or no classic selection signatures because selection acts only weakly on each locus in the genome (Kemper et al., 2014).

In line with the selection signatures expected for such complex traits, the genomic regions identified in this study using the genome-wide marker-specific fixation index in the populations showed limited overlap. This suggests that selection on genes involved in adaptation to a tropical environment was breed-specific. Moreover, the selection signatures found in our study do not display classic hard sweep characteristics, which is to be expected for complex traits. This is in contrast to the findings with Valdostana goats in Italy (Talenti et al., 2017). There are two likely reasons for the difference. First, the populations in our study are very diverse, and no hard and long selection signature regions were observed within them at the 50K SNP marker density. Second, our study pooled genotypes from six different breeds, which lends itself to picking out differences between breeds, unlike the study of Talenti et al. (2017), which focused on only one breed.

We did find overlap between the selected regions identified with the hapFLK method and breed-specific FST signatures. Since hapFLK accounts for population stratification, the haplotypes in these regions are likely to be under selection in the corresponding breeds. The fact that these regions stand out in both the hapFLK and the FST results suggests that selection on those regions most closely resembles classic sweeps. Strong selection signatures were observed on CHI 6, 13, and 16, and they harbored several genes that may be important for adaptation to tropical environments, such as MTOR, which is involved in heat stress response and the heat shock family of genes (Shi and Manley, 2007), and DNAJC24, which is involved in the first apoptosis signal (FAS) pathway and regulation of stress induction of heat shock protein (hsp) in Bos taurus (Roy and Collier, 2012). Several of the genes are involved in immune response, particularly the innate immune response pathway (GO:0045087). Overall, several of the genes identified in this study are associated with tropical adaptation. Moreover, in the Boer, several candidate genes identified in the putative selection signatures are involved in production-related traits, reflecting a more modern selection regime. However, to pinpoint the exact genes involved in tropical adaptation and production in the Ugandan goat breeds, there is a need for an in-depth study at high resolution.
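The overlap check described above can be illustrated with a minimal sketch. It assumes that candidate regions from each method are available as chromosome, start, and end coordinates in Mb. The `Region` type, the function name, and the coordinates are hypothetical and are not taken from the study's pipeline or results.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A candidate region; coordinates are in Mb (hypothetical layout)."""
    chrom: str
    start_mb: float
    end_mb: float


def overlapping_regions(fst_regions, hapflk_regions):
    """Return (fst, hapflk) pairs that lie on the same chromosome and overlap."""
    pairs = []
    for a in fst_regions:
        for b in hapflk_regions:
            same_chrom = a.chrom == b.chrom
            # Two closed intervals overlap when each starts before the other ends.
            if same_chrom and a.start_mb <= b.end_mb and b.start_mb <= a.end_mb:
                pairs.append((a, b))
    return pairs


# Illustrative coordinates only; these are not regions reported in the study.
fst_hits = [Region("CHI6", 10.0, 10.5), Region("CHI13", 60.4, 60.8)]
hapflk_hits = [Region("CHI13", 60.0, 60.5), Region("CHI16", 43.0, 43.7)]
shared = overlapping_regions(fst_hits, hapflk_hits)
```

With these illustrative inputs, only the two CHI13 regions are reported as shared. The pairwise loop is adequate for the few dozen regions a scan typically yields; a sorted sweep would be preferable for much larger lists.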

Generally, most of the regions under selection were subtle and breed-specific, as expected for complex traits under selection. Therefore, the forces driving selection in the genome of the indigenous goat breeds in this study may be associated with adaptation to the African tropical environment, such as thermotolerance, disease and parasite resistance, and the ability to perform under limited (in quality and quantity) feed and water resources. The genome-wide scans identified candidate genes within the putative selection signatures associated with specific biological pathways and functions, which may be shaping the genomic architecture of Ugandan goat breeds for survival in stressful environments. Although most signatures were breed-specific, some interesting similarities could be found in the adaptive processes in which the genes in selected regions were involved.

### Thermo-Tolerance Genes

Several candidate genes associated with adaptation to thermal stress were identified. The homeobox genes HOXC12 and HOXC13 identified in Sebei are involved in anterior/posterior pattern specification (GO:0009952). The genes play a role in hair follicle differentiation, growth, and development by regulating the keratin differentiation-specific genes (Wu et al., 2009; Taye et al., 2017). The HOXC13 gene has been reported to influence skin thickness. Skin thickness and the number of hair follicles have a positive impact on thermoregulation. For instance, thicker skin is associated with thermo-tolerant cattle (Bos indicus) as opposed to heat-susceptible cattle (Bos taurus) breeds (Alfonzo et al., 2016). Relatedly, PPP1R36 and Heat Shock Protein A2 (HSPA2) (CHI10, 26.402–26.719 Mb) identified in Mubende are involved in heat stress response, and HSPA2, DNAJC24, and DNAJC13 are associated with the heat shock family of genes (Shi and Manley, 2007). The presence of multiple genes associated with heat stress suggests that the trait is under intense selection pressure in tropically adapted breeds. Genes such as KPNA4 (CHI1), MTOR (CHI16), SH2B1 (CHI25), and MAPK3 (CHI25) were also identified in Karamojong goats. They have been reported to be involved in the FAS pathway and regulation of stress induction of hsp in Bos taurus (Roy and Collier, 2012). Furthermore, we identified the gene IGF1 (CHI5, 64.576–65.310 Mb). IGF1 encodes a protein that is similar to insulin and is involved in the regulation of carbohydrate and lipid metabolism. IGF1 facilitates post-absorptive nutrient partitioning during heat stress, and accumulation of insulin is often an adaptation mechanism to heat stress (Sanz Fernandez et al., 2015).

### Adaptive and Innate Immunity Genes

Several candidate genes in the putative selection regions are involved in regulating innate and adaptive immunity in mammals. For example, we identified the diacylglycerol kinase beta gene (DGKB) in Small East African goats (CHI4, 97.794–97.991 Mb). The gene is involved in the glycerolipid, glycerophospholipid, and phosphatidylinositol metabolic pathways and has been found to be associated with a QTL for strongyles, which include Haemonchus sp. (Zvinorova, 2017). Other candidate genes identified include IL10RB and IFNLR1 on CHI1 (0.693–0.959 Mb) in Kigezi goats. These genes are involved in the type III interferon signaling pathway and confer immunity (Ferrao et al., 2016). Similarly, we also identified candidate genes in Sebei, such as BCL2L1 (CHI13, 60.489–60.748 Mb), and in Small East African goats, such as ERBB2 (CHI19, 39.703–40.129 Mb) and ENO1 (CHI16, 43.006–43.669 Mb). These genes are directly or indirectly associated with immunoregulation, e.g., ENO1 in humans (Ryans et al., 2017).

The cytokine-related genes identified in this study, such as IL17RE, IL17RC, and IL23A, may be associated with gastrointestinal parasite resistance. Some of these cytokines have been reported to be significantly upregulated in Haemonchus contortus infected sheep and are known to be involved in adaptive immune response (GO:0002250) (Guo et al., 2016). These results suggest that immunity genes are hotspots for natural selection in Ugandan goat breeds in response to the high burden of pathogen and parasite challenge in the local environment (Thumbi et al., 2014; Bahbahani and Hanotte, 2015). Indigenous goat breeds vary in their degree of response to parasite infestation (Chiejina and Behnke, 2011; Onzima et al., 2017). We hypothesize that this variation between breeds arises because these genes provide the genetic variation on which natural or artificial selection for resistance traits can act.

One of the regions in Boer on CHI3 (84.128–84.373 Mb) harbors the gene PRMT6, which is reported to influence early embryonic development in zebrafish (Zhao et al., 2016). The gene VAV3 (CHI3, 84.730–84.962 Mb) is also associated with the immune system (Shen et al., 2017). The interleukin 12A (IL12A) gene, identified on CHI1 in Karamojong, encodes another cytokine that may be associated with immune response. The gene family is reported to be involved with the immune system in humans through a series of biological processes (Reitberger et al., 2017). Moreover, IL12A is a cytokine that acts as a growth factor for activated T and Natural Killer (NK) cells, enhances the activity of NK/lymphokine-activated killer cells, and stimulates the production of IFN-gamma.

### Genes Associated With Production Traits

The candidate gene NBEA (Neurobeachin) in the region on CHI12 (maFST = 0.754) (**Supplementary Table S2**) is associated with human body weight (Fox et al., 2007). Another gene of interest that we identified is VAV3 (CHI3; 84.730–84.962 Mb) in a homozygous region in Boer. The gene has been identified as a candidate gene for efficiency of food conversion in swine (Wang et al., 2015) and in goats (Brito et al., 2017). The identification of these genes is particularly relevant in Boer goats, which have been extensively selected for high body weight and growth rate. Earlier studies have also identified this gene as a top candidate in Draa goats in Morocco (Benjelloun et al., 2015). However, that study was not conclusive as to whether the candidate gene was associated with body weight.

Other genes identified in Boer such as the gap junction protein genes GJB2, GJB6, and GJA3 belong to the family of genes involved in cell communication (GO:0007154). They encode proteins that influence body size, skeletal and embryonic development and testicular embryogenesis, and may indirectly influence traits such as growth (Kim et al., 2016). The region (41.943–42.086 Mb) on CHI13 contains another gene ACSS1 (acyl-CoA Synthetase Short-chain Family Member 1), which has been associated with body weight, food intake, post-natal growth rate and susceptibility to weight loss among others (Liu et al., 2017).

# Gene Enrichment Analysis

Our findings indicated that pathways associated with production and mechanisms of environmental adaptation, such as immune response, male reproduction, energy production and heat stress, may be under selection in Ugandan goat breeds. This is in agreement with findings in East African Short-horn Zebu cattle (Bahbahani et al., 2015, 2017, 2018), South African cattle (Makina et al., 2015), and indigenous goats in Morocco and Egypt (Benjelloun et al., 2015; Kim et al., 2016). Gene ontology analysis shows that multiple pathways are represented in the Ugandan goat breeds, which may indicate an adaptation to varied environmental conditions. This is also supported by recent studies with indigenous Sudanese goats, which similarly implicated several biological processes (Rahmatalla et al., 2017). The multiplicity of candidate regions and genes detected in the present study agrees with findings from livestock species in stressful environments (Kim et al., 2016, 2017; Mwacharo et al., 2017). These studies and the current one reaffirm that adaptation is generally a complex trait, involving several biological processes and quantitative trait loci, each contributing a small but cumulative effect to the overall phenotype.

Although the results based on raw p-values yielded very interesting biological pathways that may be overrepresented, the results were not significant after more stringent multiple testing corrections such as the Bonferroni correction. This may be attributed to the small sample size involved in this study. Nonetheless, these results provide a useful indication of mechanisms involved in environmental adaptation in the indigenous goat breeds.
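To make the contrast between raw and corrected p-values concrete, the following minimal sketch applies a Bonferroni correction to a list of p-values. The function name and the example p-values are hypothetical and do not come from the enrichment analysis in this study.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (raw p, adjusted p, significant?) for each test.

    The Bonferroni adjustment multiplies each raw p-value by the number of
    tests and caps the result at 1.
    """
    n_tests = len(p_values)
    results = []
    for p in p_values:
        adjusted = min(1.0, p * n_tests)
        results.append((p, adjusted, adjusted <= alpha))
    return results


# Hypothetical raw p-values for four enrichment terms (illustration only).
raw_p_values = [0.004, 0.03, 0.2, 0.01]
corrected = bonferroni(raw_p_values)
```

In this illustration the term with a raw p-value of 0.03 is nominally significant at the 0.05 level but is no longer significant after correction. That is the pattern described above when many pathways are tested with a small sample.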

# Genomic Inbreeding Based on ROH (FROH)

In the absence of pedigree records, ROH may be useful to infer the level of inbreeding. Computing the proportion of an individual's genome occurring as ROH of a particular length (e.g., >1, >2, or >4 Mb) provides information on the level of inbreeding relative to a population several generations ago (Curik et al., 2014; Forutan et al., 2018). At an ROH threshold of >2 Mb, the indigenous breeds showed very low levels of genomic inbreeding compared with the higher inbreeding levels found in the exotic Boer (**Table 2**). The low genomic inbreeding level reported in this study is consistent with findings in Swiss goat breeds (Burren et al., 2016) and Barki goats (Kim et al., 2016). Genomic inbreeding based on ROH provides a more accurate estimate of an individual's autozygosity than pedigree-based inbreeding, because pedigree information is often incomplete or non-existent (Ferenčaković et al., 2013a,b; Forutan et al., 2018).
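As a minimal sketch of the calculation described above, FROH for one animal can be computed as the summed length of its ROH segments within a length class divided by the autosomal genome length covered by the SNP panel. The function name, the segment lengths, and the genome length of 2,500 Mb are placeholder assumptions for illustration, not values used in this study.

```python
def f_roh(roh_lengths_mb, autosome_length_mb, min_length_mb=2.0, max_length_mb=None):
    """Fraction of the autosomal genome covered by ROH in a length class.

    Segments with min_length_mb <= length < max_length_mb are counted;
    max_length_mb=None means no upper bound.
    """
    if autosome_length_mb <= 0:
        raise ValueError("autosome_length_mb must be positive")
    covered = sum(
        length
        for length in roh_lengths_mb
        if length >= min_length_mb
        and (max_length_mb is None or length < max_length_mb)
    )
    return covered / autosome_length_mb


# Hypothetical ROH segment lengths (Mb) for one animal.
segments_mb = [2.5, 3.0, 6.0, 18.0]
# Placeholder autosomal length (Mb); the covered length must be taken
# from the assembly and SNP panel actually used.
genome_mb = 2500.0

f_all = f_roh(segments_mb, genome_mb)                        # all ROH >= 2 Mb
f_short = f_roh(segments_mb, genome_mb, 2.0, 4.0)            # 2-4 Mb class
f_long = f_roh(segments_mb, genome_mb, min_length_mb=16.0)   # > 16 Mb class
```

Multiplying the result by 100 gives the percentage scale used for FROH in **Table 2**. Averaging per-animal values within a breed gives breed-level estimates per length class.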

Runs of homozygosity usually emanate from identical haplotypes being transmitted from parents to offspring (Purfield et al., 2012; Iacolina et al., 2016). Their frequency provides a clue to the demographic history and management of the population over time (Kirin et al., 2010; Ceballos et al., 2018). The mean sum of ROH segment coverage was generally higher for short ROHs than for long ROHs (**Figure 3**). However, Karamojong showed a higher average sum of ROH > 16 Mb. The distribution of ROH coverage reported in this study is in agreement with other studies in goats (Brito et al., 2017), sheep (Purfield et al., 2017), and cattle (Ferenčaković et al., 2013a; Mastrangelo et al., 2016), in which long ROH segments were found less frequently than shorter ones. Although short ROH were more frequent in the genome of the indigenous goat breeds, their absolute contribution to the genome was low (except in Kigezi goats) (**Supplementary Table S7**). This result is consistent with the findings of Bosse et al. (2012), who reported that short ROH were abundant in the porcine genome but contributed less to the genome than large ROH (> 5 Mb). This may be due to differences in selection events in the more recent or ancestral populations. However, the short ROH in the Boer and indigenous Kigezi goats contributed more to the absolute coverage of the genome by the SNPs. The higher proportion of ROH segments within the short ROH categories indicates a relatively larger contribution of distant inbreeding, whereas the higher coverage of long ROH observed in Karamojong suggests a larger effect of more recent inbreeding. Karamojong goats are reared under pastoral production systems and may be subject to selection of the best performing males by their keepers. This, coupled with a smaller effective population size, could be contributing to the high recent inbreeding observed. On the other hand, the Kigezi goats are isolated populations that have undergone limited recent selection, which could explain the high frequency of short ROH segments attributed to more distant inbreeding.
The longer stretches of ROH in the exotic Boer goats may be due to stringent artificial selection for production traits on few selection candidates (a narrow genetic base) and may thus explain the higher levels of genomic inbreeding. Longer stretches of ROH were also observed in exotic goat breeds when compared to Barki goats in Egypt (Kim et al., 2016). Generally, shorter ROH are associated with more ancient inbreeding, while longer ROH tend to reflect more recent inbreeding (Browning and Browning, 2012, 2013).
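The link between ROH length and the age of inbreeding can be illustrated with a commonly used approximation, in which the expected length of an autozygous segment inherited from a common ancestor g generations ago is 100/(2g) cM. The sketch below inverts that relation. The function name and the uniform recombination rate of 1 cM per Mb are simplifying assumptions for illustration, not parameters estimated in this study.

```python
def generations_to_common_ancestor(roh_length_mb, cm_per_mb=1.0):
    """Approximate generations back to the common ancestor of an ROH.

    Uses the expectation E[length in cM] = 100 / (2 * g), rearranged to
    g = 100 / (2 * length in cM). cm_per_mb converts physical to genetic
    length and defaults to an assumed 1 cM per Mb.
    """
    if roh_length_mb <= 0 or cm_per_mb <= 0:
        raise ValueError("roh_length_mb and cm_per_mb must be positive")
    length_cm = roh_length_mb * cm_per_mb
    return 100.0 / (2.0 * length_cm)


# Length thresholds used in the text: 2 Mb (short) and 16 Mb (long).
g_short = generations_to_common_ancestor(2.0)    # 25.0 generations
g_long = generations_to_common_ancestor(16.0)    # 3.125 generations
```

Under these assumptions, ROH of about 16 Mb trace back roughly three generations, whereas ROH of about 2 Mb trace back roughly 25 generations. Actual segment lengths vary widely around the expectation, and recombination rates differ along the genome, so these values are only indicative.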

Although few quantitative trait loci have been reported in goats, our study provides a basis for future research in goat genomics of tropically adapted breeds. Using medium density SNPs, we could detect selection signatures associated with adaptation to tropical environmental conditions. With the release of the caprine 50K SNP chip (Tosser-Klopp et al., 2014), several efforts are underway, including improvements in the annotation of the goat genome assembly (Bickhart et al., 2017). Arguably, these developments will change the landscape of genomic research in goats, allowing for inclusion of genomic evaluations in goat breeding programs. The integration of genomic information is expected to lead to better management and sustainable utilization of genetic resources. The results of this study will advance our understanding of environmentally driven adaptation and its potential application in functional genomics and selective breeding, as well as in the design of management programs to conserve livestock genetic diversity to cope with the current and predicted future effects of climate change.

# CONCLUSION

Using genome-wide SNP data, we investigated, for the first time, selection signatures in Ugandan goat breeds that may be shaping their adaptation to varied environmental conditions. The study identified several putative genomic regions and genes in Ugandan goat populations that may underlie adaptation to local environmental conditions, such as heat tolerance and disease and parasite resistance, as well as production traits. Generally, non-classical sweeps with limited overlap were observed, which is typical of complex traits.

In the absence of pedigree data, genomic information through ROH provides a useful tool for quantifying the level of genomic inbreeding in the populations.

The study provides a foundation for detailed analysis of the identified putative selection signatures in the goat genome particularly of the tropically adapted breeds and provides an avenue for a well-structured breed improvement.

# DATA ACCESSIBILITY

Data from samples used in the present study are available from the Zenodo Digital Repository: https://doi.org/10.5281/zenodo.1184716.

# ETHICS STATEMENT

The animals sampled specifically for this study had their processes evaluated and approved by the Ethics Committee of Uganda National Council of Science and Technology (UNCST; SBLS/REC/15/131).

# AUTHOR CONTRIBUTIONS

RO, MU, and MB conceived the study. RO drafted the manuscript. HD, MU, LB, and RO participated in the data analysis. MG, EK, RC, and MB supervised the study. All the authors read and approved the manuscript.

# FUNDING

This work was financially supported by the National Agricultural Research Organization (NARO) in Uganda, through a World Bank-supported project, Agricultural Technology, and Agribusiness Advisory Services (ATAAS) (P109224).

# ACKNOWLEDGMENTS

We thank Bert Dibbits (Wageningen University and Research, Animal Breeding and Genomics) for processing genomic DNA samples for genotyping; Paul Kashaija and Dr. Benda Kirungi from the National Livestock Resources Research Institute and Kachwekano Zonal Agricultural Research and Development Institute for support in DNA extraction and field sampling, respectively; Robert Mukiibi for brainstorming discussions that helped shape the study; and the smallholder farmers for allowing their animals to be sampled for the study.


# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00318/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Onzima, Upadhyay, Doekes, Brito, Bosse, Kanis, Groenen and Crooijmans. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genetic Improvement in South African Livestock: Can Genomics Bridge the Gap Between the Developed and Developing Sectors?

Esté van Marle-Köster\* and Carina Visser

Department of Animal and Wildlife Sciences, Faculty of Natural and Agricultural Science, University of Pretoria, Pretoria, South Africa

### Edited by:

Joram Mwashigadi Mwacharo, International Center for Agriculture Research in the Dry Areas (ICARDA), Ethiopia

### Reviewed by:

Eveline M. Ibeagha-Awemu, Agriculture and Agri-Food Canada (AAFC), Canada Filippo Biscarini, Consiglio Nazionale Delle Ricerche (CNR), Italy Mizeck Chagunda, University of Hohenheim, Germany

> \*Correspondence: Esté van Marle-Köster evm.koster@up.ac.za

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 09 April 2018 Accepted: 31 July 2018 Published: 23 August 2018

### Citation:

van Marle-Köster E and Visser C (2018) Genetic Improvement in South African Livestock: Can Genomics Bridge the Gap Between the Developed and Developing Sectors? Front. Genet. 9:331. doi: 10.3389/fgene.2018.00331

South Africa (SA) holds a unique position on the African continent with a rich diversity in terms of available livestock resources, vegetation, climatic regions and cultures. The livestock sector has been characterized by a dual system of a highly developed commercial sector using modern technology vs. a developing sector including emerging and smallholder farmers. Emerging farmers typically aim to join the commercial sector, but lag behind with regard to the use of modern genetic technologies, while smallholder farmers use traditional practices aimed at subsistence. Several factors influence potential application of genomics by the livestock industries, which include available research funding, socio-economic constraints and extension services. State funded Beef and Dairy genomic programs have been established with the aim of building reference populations for genomic selection with most of the potential beneficiaries in the well-developed commercial sector. The structure of the beef, dairy and small stock industries is fragmented and the outcomes of selection strategies are not perceived as an advantage by the processing industry or the consumer. The indigenous and local composites represent approximately 40% of the total beef and sheep populations and present valuable genetic resources. Genomic research has mostly provided insight on genetic biodiversity of these resources, with limited attention to novel phenotypes associated with adaptation or disease tolerance. Genetic improvement of livestock through genomic technology needs to address the role of adapted breeds in challenging environments, increasing reproductive and growth efficiency.
National animal recording schemes contributed significantly to progress in the developed sector with regard to genetic evaluations and estimated breeding values (EBV) as a selection tool over the past three decades. The challenge remains to move the focus to novel traits for increasing efficiency and addressing welfare and environmental issues. Genetic research programs are required that are directed at bridging the gap between the elite breeders and the developing livestock sector. The aim of this review was to provide a perspective on the dichotomy in the South African livestock sector, arguing that a realistic approach to the use of genomics in beef, dairy and small stock is required to ensure sustainable long-term genetic progress.

Keywords: animal recording, developing countries, indigenous livestock, novel traits, smallholder farmers

# INTRODUCTION

The South African (SA) livestock industry is based on well-established dairy, beef and small stock industries in which selection and breeding practices have been in existence for more than four decades. These livestock species are farmed in all nine provinces of South Africa, characterized by diverse biomes ranging from sub-tropical regions with high rainfall and temperatures to more moderate regions with cold winters and snow, as well as semi-desert regions with low rainfall, high temperatures and relatively good quality grazing (Mucina and Rutherford, 2006). Of the total land available for agricultural production, 68.6% is classified as grazing land (DAFF, 2017a) and used for extensive production of meat-producing ruminants. Dairy production is either pasture-based in regions such as KwaZulu-Natal (KZN) and the coastal regions of the Eastern Cape (EC), which have sufficient rainfall for planted pastures, or based on Total Mixed Ration (TMR) systems in the remaining parts of SA (Williams et al., 2016).

The South African livestock industry contributed R127 288 million to the Gross Domestic Product in 2016–2017, with a positive growth of 11.3% and the largest contribution coming from poultry meat. Animal products contributed 46% of the income from all agricultural activities (DAFF, 2017a). It is clear that the agricultural sector has an important role, considering that sufficient food needs to be produced for a South African population of approximately 55 million (DAFF, 2016). The current trend predicts further growth of at least 10 million by 2050 (United Nations, 2012), emphasizing the pressure to increase animal-derived protein production with higher efficiency.

The SA livestock industry is characterized by a dual system of a highly developed commercial sector vs. a developing sector. In the developed sector the value chain is differentiated into stud and commercial farmers/producers, feedlots, and pig and poultry companies with good access to abattoirs, product processing and a variety of marketing opportunities. The beef industry value chain is shown as an example of this structure in **Figure 1**. Large and small livestock are primarily kept on individual farms, while poultry and pigs tend to be produced by large companies with vertical integration.

In contrast, the developing sector consists of smallholder farmers and livestock keepers within communal systems. There is also a strong presence of a group referred to as "emerging farmers" (more recently referred to as "market-orientated farmers") in this sector. This group has the potential to become part of the developed commercial sector. The dichotomy of the SA livestock industry is deeply rooted in aspects such as access to land, poor infrastructure, and a lack of well-structured livestock extension programs and markets (Mapiye et al., 2018). Development programs, such as the Land Redistribution for Agricultural Development (LRAD) and the Independent Development Corporation Nguni projects, aim to assist the emerging farmer to make the move to commercial farming (Prinsloo, 2008; De Waal, 2014).

In the developed livestock sectors the value chain tends to be fragmented, with poor integration of breeding objectives between the stud breeder, who markets the genetic material (bulls/rams/bucks), the commercial cow-calf operation, which in turn produces weaners, and the feedlots, which are responsible for finishing and slaughtering. Sheep production follows a similar pattern, but with less feedlot finishing compared to beef cattle. A similar situation has been described by Pollack (2005) for the beef industry in the United States, where it results in negative outcomes for selection and long-term genetic improvement. The breeding objectives of the stud producer are often not aligned with the needs of the commercial producer, feedlot or end-user (Garrick, 2011). In the developing sector this fragmentation is even more pronounced, with a total lack of clear breeding objectives, and is further complicated by poor infrastructure and ecological and financial challenges (Mapiye et al., 2018).

Despite a substantial growth in the developing sector over the past two decades with an estimated 1.3 million smallholder farmers, approximately 67% of these farmers are not regarded as emerging commercial operations (DAFF, 2017b; Mapiye et al., 2018). The majority of the smallholder farmers have small herds or flocks where herd sizes could be less than five cows with the majority of these herds being non-descript, crossbred or indigenous cattle, sheep and goats (Mthi et al., 2017; Nyamushamba et al., 2017). Goats for slaughter are mostly marketed directly off the veld through informal trade (Visser, 2018).

Participation in animal recording via national or private services varies significantly among different breeds and between the different livestock species. The majority of beef stud breed societies support animal recording and the use of estimated breeding values (EBVs). In dairy cattle the number of SA stud breeders has declined and commercial producers are moving to automatic recording systems rather than traditional milk recording systems. In the emerging sector the Kaonafatso ya Dikgomo (KyD) scheme was established by the Agricultural Research Council in 2007 to support emerging and smallholder farmers in taking part in animal recording. Complete phenotyping, however, remains a challenge in the developed sector and even more so in the developing sector, with significant adverse implications for genetic evaluations and sustainable genetic improvement.

In 2015 and 2016, state-funded genomic programs were established for the SA beef and dairy industries, respectively, to set up training populations for moving toward implementation of genomic selection (GS), with the majority of the beneficiaries being stud farmers in the highly developed and technology-driven commercial livestock sector (Van Marle-Köster et al., 2017). The phenotyping of hard-to-measure traits such as fertility and carcass traits, for which GS is expected to realize the most benefit, remains a major challenge (Blasco and Toro, 2014). A further pressing matter is the alignment of breeding objectives within the different sectors to ensure that the traits included in selection programs will benefit all the producers in the value chain. The breeding objectives set within the developed sector should also consider the dissemination of genetic material to the emerging and smallholder farmers in the developing sector. This paper provides a critical review of the dichotomy between the developed and developing sectors of the South African livestock industry with regard to the use of genomics in beef, dairy and small stock, with reference to the requirements for sustainable long-term genetic progress.

# HISTORICAL OVERVIEW OF LIVESTOCK IMPROVEMENT IN SOUTH AFRICA

Since the inception of national animal recording schemes for dairy, beef and small stock in the early 1950s, genetic evaluations for most of these species have been routinely performed and stud breeders have access to estimated breeding values (EBVs) as a selection tool. National milk and beef recording date back to 1917 and 1959, respectively, when national recording schemes were managed by the former Animal Improvement Institute (Bergh, 2010). National small stock recording was established in 1956 (Schoeman et al., 2010) with participation by sheep breeders. Angora goat breeders only joined the NSIS in a pilot study in 1983 (Delport and Erasmus, 1984). **Table 1** provides a summary of the most commonly recorded traits in beef cattle in South Africa.

South Africa has more than 30 registered beef breeds, with large variation among breed societies with regard to participation in recording schemes (Van Marle-Koster et al., 2013; SA Stud Book Annual Report, 2016). Only the locally developed SA Bonsmara composite breed dictates compulsory recording of a number of traits, which include fertility, growth and efficiency. **Figure 2** shows the proportion of registered beef animals in the seed stock industry in South Africa (SA Stud Book Annual Report, 2016). Furthermore, the number of traits recorded varies among the breeds, with larger numbers of phenotypes available for growth traits compared to limited numbers for fertility or hard-to-measure traits such as feed efficiency and carcass quality. For most routinely measured traits of economic importance, there has been a positive trend in the adoption of modern selection tools such as EBVs by livestock producers. Intensive feedlot testing has been popular among some beef breeds, with data generated for growth rate, feed efficiency and carcass traits.

Animal recording in the developing sector is limited to the Kaonafatso ya Dikgomo (KyD) scheme where technical advice on health, production and support with recording of animal information is provided. This scheme makes provision for weight recordings at birth, weaning, 12 and 18 months (http://www.arc.agric.za/arc-api/Pages/KyD.aspx).

The dairy industry in South Africa is dominated by the Holstein and Jersey cattle breeds with average herd sizes of approximately 400 cows (Coetzee, 2017). Participation in the national milk recording scheme among commercial producers has been declining over the past decade, with only 24% participation (Scholtz and Grobler, 2009), as the trend moves toward automatic milking systems and recording, especially in larger herds. The dairy industry in SA relies on importation of semen from the best bulls available in the world and the local dairy bull industry has declined significantly. In the developing dairy sector, the majority of farmers own between 5 and 15 cows that produce less than a total of 100 liters of milk per day (Muntswu et al., 2017).

The commercial small stock sector consists of 14 sheep breeds, 3 commercial meat goat breeds and the SA Angora goat breed. The majority of the sheep breeds are farmed under extensive commercial production systems. Participation in animal recording in this sector is limited to a small number of commercial producers (**Figure 3**), for which genetic analyses are performed.

No recording is performed in the smallholder or communal goat sector (Visser, 2018) which is alarming considering that approximately 60% of goats are kept in these systems and they make a significant contribution to household food security. No genetic improvement in terms of strategic selection or EBV estimation is performed in this sector and genomic applications have been limited to studies on genetic diversity (Mohlatlole et al., 2015; Mdladla et al., 2016).

The challenges for emerging farmers and smallholders are often beyond the scope of the animal scientist and the veterinarian. A number of socio-economic factors such as land issues, financial support and market access are primary constraints in the developing sector (Khapayi and Celliers, 2016). Extension services are also not readily available in all parts of the country to support the number of smallholders. Most of these challenges are similar to experiences reported in other developing countries where smallholders keep beef and dairy cattle (Kosgey et al., 2011). For the emerging and smallholder sectors, genetic tools such as EBVs are unfeasible due to small herds, incomplete recordings for most traits, no parentage recording and insufficient contemporary groups. Different approaches are therefore required to accommodate these farmers to ensure that they will have access to superior genetic material for genetic improvement of their livestock.

# APPLICATION OF GENOMICS IN SOUTH AFRICA

Since the completion of the sheep, beef and goat genomes in 2007, 2009, and 2013 respectively (Fan et al., 2010), followed by SNP marker discoveries (Matukumalli et al., 2009), several applications of genomics have become available for livestock farmers. Over the past two decades, both microsatellite and SNP markers have contributed to the development of diagnostic testing of genetic defects and DNA-based parentage (Van Marle-Koster et al., 2013). SNP arrays are widely applied in routine genotyping for genomic selection in several farm animal species, providing the added advantage that these genotypes can be used for detection and prediction of carriers of genetic defects (Biscarini et al., 2016). Different methods have been reported for prediction, including haplotype-based predictions (Pirola et al., 2013) and discriminant analyses (Biffani et al., 2015). Studies have shown that the accuracy of prediction for the genetic defects could be comparable when using genotypes generated with a lower density (Bovine LD) versus a higher density 54K Bovine SNP array (Biscarini et al., 2016). The availability of genotypes furthermore provides the potential for identification of beneficial genes such as the Celtic variant of the POLLED gene for homozygous polled animals (Medugorac et al., 2012).

A number of test facilities are available in South Africa for the diagnostic testing of genetic defects that are relatively cost effective for application in both the commercial and emerging farmer sector (**Table 2**). DNA technology therefore provides an accessible tool to stud breeders and livestock producers to remove affected animals from their herds. It is also a relatively affordable tool for emerging farmers to solve and manage some basic problems for genetic improvement.

For the seed stock industry, accurate pedigree information is essential. Studies performed in South African Angora herds using microsatellite markers (Visser et al., 2011; Garritsen et al., 2015) indicated incorrect and incomplete parentage recording of up to 14%. The largest impact was demonstrated in the accuracy of EBVs, with significant re-ranking of the Angora sires (Garritsen et al., 2015). DNA-based testing of Boran seed stock in Kenya indicated a 55.2% misidentification of sires and 2.3% for dams (Kios et al., 2012). This situation is not unique to South Africa and Africa, as a number of studies have reported the adverse effects of incorrect and/or incomplete pedigree information (Visscher et al., 2002; Van Eenennaam et al., 2014).

TABLE 2 | Diagnostic tests available for ruminants in South African laboratories.

\*Unistel, www.unistelmedical.co.za; Clinomics, www.clinomics.co.za; Veterinary Genetics Lab, www.up.ac.za/the-onderstepoort-veterinary-genetics-laboratory; GENEDiagnostics, www.genediagnostics.co.za.

The use of parentage testing varies among the different livestock species. Approximately 35% of cattle breeders make use of DNA parentage testing on a routine basis, especially in larger herds where multi-sire mating is performed. In the small stock industry, group mating and over-mating are commonly used, resulting in low pedigree accuracies (Visser et al., 2011). Despite the accessibility of DNA parentage testing for sheep and goats, utilization is low due to practical management challenges under extensive production systems. DNA-based parentage verification currently remains limited to the developed livestock sector, mainly due to infrastructural, logistical and financial constraints.

Since the availability of both the ISAG 100 and ISAG 200 panels for bovine parentage validation, more recent studies have highlighted the potential limitations of using a relatively small number of SNPs (Strucken et al., 2016; McClure et al., 2018). Due to large-scale genotyping in most countries, the trend is toward large numbers of SNPs in combination with different levels of quality control to ensure a high accuracy (McClure et al., 2018). The application of SNP-based parentage is only cost-effective if it forms part of routine genotyping. In developing countries such as South Africa, where routine genotyping for genomic selection is not standard practice, microsatellite markers are still used for parentage verification. Beef breeds participating in the beef genomic program (BGP) will benefit from this added advantage once they engage in routine genotyping.
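SNP-based parentage checks are often framed as an exclusion test that counts opposing homozygotes, i.e., loci where the offspring is homozygous for one allele and the putative parent is homozygous for the other. The sketch below illustrates that general idea only; the 0/1/2 genotype coding, the `MISSING` code and the 1% mismatch threshold are illustrative assumptions and are not taken from the ISAG panels or the studies cited above.

```python
from typing import Optional, Sequence

MISSING = -1  # assumed code for a failed genotype call


def opposing_homozygote_rate(offspring: Sequence[int],
                             parent: Sequence[int]) -> Optional[float]:
    """Share of jointly called SNPs where one animal is 0 and the other is 2.

    Genotypes are assumed to be coded as the count of one allele (0, 1 or 2).
    Returns None when no SNP is called in both animals.
    """
    if len(offspring) != len(parent):
        raise ValueError("genotype vectors must have the same length")
    compared = 0
    conflicts = 0
    for o, p in zip(offspring, parent):
        if o == MISSING or p == MISSING:
            continue  # skip SNPs that failed in either animal
        compared += 1
        if abs(o - p) == 2:  # 0 vs 2: no allele can have been transmitted
            conflicts += 1
    if compared == 0:
        return None
    return conflicts / compared


def parent_is_excluded(offspring: Sequence[int], parent: Sequence[int],
                       max_rate: float = 0.01) -> bool:
    """Exclude a putative parent when conflicts exceed an assumed tolerance."""
    rate = opposing_homozygote_rate(offspring, parent)
    return rate is not None and rate > max_rate
```

With only a handful of SNPs a single genotyping error can change the outcome, which is one reason larger panels combined with quality control are preferred; the tolerance used here would need to be set from the genotyping error rate of the laboratory concerned.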

Genomic technology for application in livestock in South Africa was initiated as recently as 2015 with the founding of the beef genomic program (BGP), followed by the dairy genomic program (DGP) in 2016 (http://www.livestockgenomics.co.za). Both programs are state funded but have been designed to be driven by the industry, with clear objectives toward sustainability over a 10-year period for the beef and 3 years for the dairy industry. The first 3 years for beef cattle have been completed, during which 16 breed societies participated and approximately 7,000 samples (hair/semen) were genotyped with a GGP Bovine150K SNP array. The first genomic enhanced breeding values (GEBV) were published for the SA Bonsmara in August 2017 (Van der Westhuizen et al., 2017), with accuracies improved by between 15 and 30% in traits with low heritability and hard-to-measure phenotypes, such as maternal traits and feed conversion ratio (FCR). Training populations for both dairy and beef cattle in South Africa remain small compared to first world countries, where training populations are replenished by routine genotyping and genomic information is used in breeding programs. These programs are however focussed on genomic selection for implementation in the commercial seed stock industry. Several authors have reported that the beef industry in general faces more challenges with the collection of sufficient phenotypes and genotypes than the dairy industry (Berry et al., 2016; Piccoli et al., 2017).

Besides commercial application of genomic information in the developed sector of the SA livestock industry, DNA marker technology has been applied for farm animal conservation where the focus has been on indigenous resources. In this regard a number of useful contributions have been made on genetic diversity, inbreeding and population structure of Nguni cattle ecotypes (Makina et al., 2014; Sanarana et al., 2016), Namakwa sheep (Qwabe et al., 2012) and indigenous goats (Mohlatlole et al., 2015; Mdladla et al., 2016). These are all examples of well-adapted genetic resources with unique traits that hold potential to be exploited using genomics.

# NOVEL PHENOTYPES

For many decades the primary focus in commercial livestock production systems was on selection for increased production and traits such as milk yield in dairy cows and weaning and carcass weights in meat producing animals. It is now accepted that the over-emphasis of these traits had adverse effects on health and fertility traits (Miglior et al., 2017) and recommendations to livestock breeders are toward a more balanced approach with breeding goals that include traits associated with fitness, longevity and health.

To make full use of the promise that genomics holds, novel traits have been proposed for most production systems. Dairy cattle pioneered genomic selection (GS) worldwide due to the availability of phenotypic data and DNA available via use of artificial insemination (Wiggans et al., 2011). Due to the intensive nature of dairy production, this was the first industry to recognize the importance of traits associated with sustainability. It resulted in accelerating the process of novel trait identification such as feed efficiency (FE), methane emissions, heat stress and claw health (Miglior et al., 2017; Pryce et al., 2018). Traits such as efficiency, greenhouse gas emissions, and heat tolerance are also of importance in beef cattle and small stock. Examples of novel traits to be considered in selection strategies are presented in **Table 3**.

TABLE 3 | Proposed novel traits for inclusion in selection strategies.


Greenhouse gas (GHG) emissions are closely linked to global warming, and as such have become an important area of research in all ruminant industries. Livestock produce approximately 11–14% of all anthropogenic GHG, with the most significant contribution coming from ruminants (Llonch et al., 2017; Negussie et al., 2017). It is estimated that enteric fermentation by livestock contributes more than 70% of African GHG emissions (Goopy et al., 2018). CH<sub>4</sub> emissions from developing countries are expected to rise in the next few decades, with Africa predicted to have the largest CH<sub>4</sub> emissions (48%) by 2030 (Forabosco et al., 2017). N<sub>2</sub>O emissions are expected to rise concurrently in the same period. Selection strategies to mitigate this problem include improvement of fertility, feed efficiency, and animal welfare (Llonch et al., 2017).

Several CH<sub>4</sub> phenotypes, such as CH<sub>4</sub> production and CH<sub>4</sub> intensity, have been described (Herd et al., 2013). Individual measurements of these on a large scale are however impractical and expensive. Easy-to-measure, cost-effective proxies with consistent correlations to CH<sub>4</sub> emissions have been identified to mitigate this problem. In a comprehensive review, Negussie et al. (2017) indicated that proxies related to rumen samples (e.g., rumen microbiota, volatile fatty acids) are generally poor indicators of methane emissions. Proxies related to milk yield and components (e.g., fat or protein content) were found to be accurate predictors, with milk mid-infrared (MIR) data showing the most promise.
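As a small worked illustration of how the two phenotypes named above relate, the sketch below assumes that CH<sub>4</sub> production is expressed in grams per animal per day and CH<sub>4</sub> intensity in grams per kilogram of product (here milk). The function names and the example numbers are hypothetical and are not data from the cited studies.

```python
def methane_production(total_ch4_g: float, days_measured: float) -> float:
    """Average daily CH4 production (g/day) over a measurement period."""
    return total_ch4_g / days_measured


def methane_intensity(ch4_g_per_day: float, product_kg_per_day: float) -> float:
    """CH4 emitted per unit of product (g CH4 per kg milk or meat)."""
    return ch4_g_per_day / product_kg_per_day


# Hypothetical cow: 2,800 g CH4 measured over 7 days, 25 kg milk per day.
daily_ch4 = methane_production(2800.0, 7.0)
intensity = methane_intensity(daily_ch4, 25.0)
print(f"{daily_ch4:.0f} g CH4/day, {intensity:.1f} g CH4/kg milk")
```

Two animals with the same daily production can differ considerably in intensity when their output differs, which is why the choice of phenotype matters for selection.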

Using indirect selection, it has been reported that a 24% reduction in CH<sub>4</sub> emissions can be gained, should fertility rates in dairy cattle be restored to 1995 levels (Llonch et al., 2017). Forabosco et al. (2017) concurred that including traits such as Age at First Calving (AFC), longevity and mortality could also mitigate GHG emissions, as could an increase in litter size. Although directly selecting for more productive animals could decrease GHG levels through a decline in the number of animals necessary for the same level of production, it could result in declined animal health and welfare. Care must be taken to balance selection pressure before pursuing such an option. The use of adapted, local genetic resources or crossbred animals could aid in mitigating gas emissions (Forabosco et al., 2017).

In commercial production systems emphasis is being placed on improving feed efficiency as it is a notable strategy for reducing GHG emissions (Llonch et al., 2017). Although various measures of feed efficiency are available, e.g., residual feed intake (RFI), residual gain (RG), feed conversion ratio (FCR), and Kleiber ratio (KR), recording is limited to intensive feeding systems (feedlot systems) where individual feed intake can be measured (Berry et al., 2015). Accurate measurements in grazing systems still pose several challenges, especially under extensive production systems. Sensing technologies, such as wireless sensor networks (WSN) (Greenwood et al., 2014), hold great potential for phenotyping grazing animals in their natural environment.
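The feed efficiency measures listed above can be sketched with their commonly used textbook definitions: FCR as feed consumed per unit of weight gain, KR as average daily gain (ADG) per unit of metabolic weight (body weight to the power 0.75), and RFI as the residual from a least-squares regression of feed intake on ADG and metabolic mid-test weight. The code below is a minimal illustration under those assumptions; the function names, the units and the choice of regression terms are assumptions and not the model used in any particular recording scheme.

```python
from typing import List, Sequence


def feed_conversion_ratio(feed_intake_kg: float, weight_gain_kg: float) -> float:
    """FCR: kg feed consumed per kg live weight gained (lower is better)."""
    return feed_intake_kg / weight_gain_kg


def kleiber_ratio(adg_kg: float, body_weight_kg: float) -> float:
    """KR: average daily gain per unit of metabolic weight (BW ** 0.75)."""
    return adg_kg / body_weight_kg ** 0.75


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back-substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def residual_feed_intake(dmi: Sequence[float], adg: Sequence[float],
                         mid_weight: Sequence[float]) -> List[float]:
    """RFI: observed intake minus intake predicted from ADG and BW ** 0.75.

    Fits intake = b0 + b1 * ADG + b2 * mid_weight ** 0.75 by ordinary least
    squares within one test group; negative values mean the animal ate less
    than expected for its growth and size.
    """
    rows = [[1.0, g, w ** 0.75] for g, w in zip(adg, mid_weight)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, dmi)) for i in range(3)]
    beta = _solve(xtx, xty)
    return [y - sum(b * v for b, v in zip(beta, r)) for r, y in zip(rows, dmi)]
```

Because RFI is a regression residual, it is only meaningful relative to the contemporaries tested together, which is one reason why individual intake records from a feedlot or test station are needed.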

With widespread climate change facing all aspects of agriculture, breeding of robust animals will become mandatory. High temperatures reduce animal productivity, with a simultaneous rise in parasites and disease pathogens (Taye et al., 2017; Ortiz-Colon et al., 2018). African and locally developed beef cattle have improved thermo-tolerance levels and an increased ability to regulate their body temperature (Taye et al., 2017). High-producing dairy cattle are the most susceptible of all ruminant species to high temperatures, which result in decreased milk yield (Bernabucci et al., 2014) and feed intake, as well as reduced reproductive efficiency (Garner et al., 2016). Novel traits for measuring heat tolerance are under investigation: Garner et al. (2016) demonstrated the potential for selection of dairy cattle for increased heat tolerance in a simulation experiment, and Nguyen et al. (2017) proposed the use of a breeding value for heat tolerance in Australian dairy cattle. The breeding value estimation is dependent on climatic data being known, as well as milk, protein, and fat yields. This is then enhanced with SNP effects, to produce a genomic-only breeding value. It is suggested to use this value in combination with other profit-determining traits. The slick-hair gene has been associated with heat tolerance: slick-haired Holstein calves had lower vaginal temperatures and respiration rates, mainly due to an increased ability to dissipate heat through sweating (Ortiz-Colon et al., 2018). Improved heat tolerance is most likely not due to the slick-hair gene alone, but caused by a more complex genetic mechanism.

Lameness is a significant concern in the dairy industry, due to its adverse impact on milk yield, reproductive performance and animal welfare (Randall et al., 2015). Claw health poses challenges with regard to phenotypic recording due to linear indicator traits (locomotion scores). Claw lesions are however not always associated with these type traits (Miglior et al., 2017) and recording through trimming data holds the most potential for direct genetic improvement (Heringstad et al., 2018). Body condition score (BCS) can also be used as an indicator trait of lameness, and has been proposed as a sustainable management intervention (Randall et al., 2015). Maintaining scores of ≥ 2.5 might decrease risk of lameness, especially when used in combination with other risk factors, such as higher parity.

Parasites are a major constraint for livestock production throughout the world, and especially in tropical areas. Alba-Hurtado and Muñoz-Guzmán (2013) reported that losses due to gastrointestinal nematodes (GIN) have been estimated at approximately US\$ 400 million per annum in Australia and up to US\$ 26 million, US\$ 46 million, and US\$ 103 million in Kenya, South Africa, and India respectively. The effects of nematode and parasite infection include reduced growth, compromised reproduction, and elevated mortality (Marufu et al., 2011; Guo et al., 2016). Historically, the control of GIN and ticks was largely based on the use of drugs, but the development of anthelmintic and acaricide resistance has made this practice unsustainable (Mapholi et al., 2014; McManus et al., 2014). Additionally, the use of drugs is expensive and not affordable for emerging and smallholder farmers (Mpetile et al., 2015). This calls for the development of more sustainable, realistic, long-term and cost-effective management strategies, such as breeding animals for genetic resistance to parasites (Marufu et al., 2011; Alba-Hurtado and Muñoz-Guzmán, 2013).

Selection for nematode resistance has mainly been based on the use of indicator traits such as fecal egg count (FEC; Riggio et al., 2013), FAMACHA© scoring (Van Wyk and Bath, 2002), and body condition score (BCS; Cornelius et al., 2014). The FAMACHA system is based on a standardized chart with illustrations of sheep eyes and membranes in differing hues, indicating varying levels of anemia (Van Wyk and Bath, 2002). While FEC is a difficult to measure trait, especially in rural environments, both FAMACHA and BCS can be used in resource-poor areas as efficient indicators of worm infestation. Easily measured, practical traits for tick resistance include coat characteristics such as hair length and skin thickness (Marufu et al., 2011; Mota et al., 2018). Several studies (Mapholi et al., 2014; Benavides et al., 2015; Mota et al., 2018) have indicated QTL and candidate genes that are associated with resistance to parasites, but it is unlikely that markers will be identified that can serve all breeds. The genetic mechanism for resistance is still not well-understood. Certain indigenous breeds show remarkable resistance to GIN, such as the West African Dwarf goat (Chiejina et al., 2015) and the Nguni to ticks (Marufu et al., 2011). This genetic variation should be exploited in the search for a cost-effective, practical solution to parasite infestation.

Novel traits need to adhere to basic criteria to be useful in breeding strategies. They should be economically important, heritable with sufficient variation and, lastly, practically measurable at a cost-efficient level (Miglior et al., 2017). Some of the traits discussed above may not yet meet all of the criteria. However, it is crucial to investigate novel traits to make full use of the genetic variation available in the African livestock industry.

# GENOMIC STRATEGIES FOR SUSTAINABLE GENETIC IMPROVEMENT

Genomics has resulted in substantial genetic improvement in most livestock species world-wide. Routine genotyping is performed and genetic evaluations include most traits of economic importance that have been traditionally recorded by breeders. As discussed above, the South African livestock industry is still in its infancy with regard to genomic applications, which to date have been limited to the developed livestock sector. In order to design appropriate genomic strategies for the South African livestock industries, the dichotomy between the developed and developing sectors must be addressed, as this will influence the long-term application and sustainability of genomics in the SA industry.

The commercial beef and dairy cattle industries have been using available genetic tools such as EBVs, diagnostic tests, and DNA parentage testing in selection programs for genetic improvement (Van Marle-Koster et al., 2013). Genetic improvement has been made in production traits in dairy and beef cattle and sheep breeds using these approaches. To meet the challenges of the twenty-first century with regard to GHG, feed efficiency, fertility and welfare, novel traits will require emphasis in setting breeding objectives and inclusion in current animal recording systems. Application of genomic information holds the most potential in this sector, where state funded programs have been established for genomic selection, providing SA breeders with an additional tool for improving accuracy of selection. Recording of novel traits will incur additional costs for breeders, for example using hoof trimmers on a regular basis for claw health in dairy cattle, additional labor for collection of tick counts, and using wireless sensor networks (WSN) (Greenwood et al., 2014) and GrowSafe/Calan gates technology (Berry et al., 2015) for feed intake and GHG. Although research programs are being established for these novel traits, breeders will have to invest in genomics through extensive phenotypic recordings (Berry et al., 2016) and routine genotyping to reap the benefits.

Routine SNP genotyping of livestock populations in the developing sector will remain a pipe dream for at least a few decades, in the face of more practical challenges such as land availability, droughts, and poverty. In South Africa, both phenotypic and genomic data (in terms of a sufficiently large reference population) pose a challenge for most livestock species kept in smallholder systems. Animal recording is practically non-existent in these extensive systems and measuring of basic traits such as animal weights is problematic with limited equipment and infrastructure. More advanced traits such as direct measuring of GHG emissions pose a greater challenge, due to high measuring costs and expensive infrastructure needed. In addition, most methods to estimate methane production rely on the assumption of ad libitum intake, which is often violated in African systems due to tethering and overnight holding of animals (Goopy et al., 2018).

The emerging livestock farmers are in need of good quality male and female genetic stock, which must be supplied by the seed stock breeders. Considering the progress made in the commercial sector over the past three decades, suitable animals (male and female) should already be available to contribute to genetic progress. A study by Mugwabana et al. (2018) has shown that calving rate was positively influenced by using reproductive technologies in emerging and communal farms in South Africa. The adoption of these reproductive technologies, such as artificial insemination (AI), as well as proper animal recording will be a cost consideration for these farmers. Farmer co-operatives where bulls and rams are shared, or AI technicians employed, can result in genetic improvement in the first generation progeny. In the dairy industry, share milking schemes in which commercial and emerging farmers have formed partnerships have reported successes (Strydom, 2016). Advantages reported in the study by Strydom (2016) included access to the livestock skills and technology shared by the commercial farmer, access to markets and gaining business skills. In these systems the basic constraints are overcome, and the emerging farmer can focus on the production, management and selection of the animals. Limited published literature is available on the successes of emerging farmers, especially with regard to the use of genetic tools and genetic improvement.

Most smallholder farmers make use of indigenous and nondescript crossbreds with no animal recording. The value of adapted indigenous genetic resources in South Africa, which form the basis of smallholder food security, has to a large extent been ignored in the past. Exotic improved breeds often under-perform in the harsh, extensive environments with limited supplementation (Kim et al., 2017). It is ironic that some of the novel traits, such as improved disease resistance and thermo-tolerance, that are currently explored in exotic, high-producing world breeds are already present in these local breeds (Kim et al., 2017; Nyamushamba et al., 2017). The greatest benefit of genomics to smallholder farmers might well be the characterisation of their animals, and this benefit may hold great potential in terms of gene introgression into exotic breeds. Using unique haplotypes identified in indigenous breeds, such as hypocretin receptors in trypanotolerance, the BOLA complex in tick resistance and heat shock proteins in thermotolerance (Kim et al., 2017), could ultimately benefit commercial producers. Care should however be taken to protect the scarce genetic resource against indiscriminate crossbreeding, which has eroded the unique characteristics of many indigenous breeds.

Genomic technology holds potential for South African livestock breeders. Commercial breeders are becoming aware of the benefits of complete phenotypic recording and routine genotyping. It is important that the research community addresses the novel traits in the various species to answer the challenges of sustainable livestock production. South African indigenous livestock are valuable resources with unique traits which should be investigated at a genomic level. Genomics will however not bring solutions in the short term to the developing sector, and national strategies will be required to first address socioeconomic issues including livestock extension support.

# CONCLUSION

In reviewing the development of the livestock industry in South Africa, it is clear that there is a solid foundation for genetic improvement. Genetic tools and technologies are available but are restricted to application in the commercial sector. In order to reap the full benefits of genomics, commercial breeders will have to invest in recording of novel phenotypes and routine genotyping. The emerging farmers can already benefit from the available superior genetic material, provided that socio-economic factors are addressed by a national strategy. The emerging farming sector is an important link in the dissemination of genetic resources from the commercial farmers to the smallholder farmers. In this way genomics could provide solutions to narrow the current dichotomy in the SA livestock industry.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

## REFERENCES

in livestock systems: an animal welfare perspective. Animal 11, 274–284. doi: 10.1017/S1751731116001440

markers and the impact on selection. J. Dairy Sci. 85, 2368–2375. doi: 10.3168/jds.S0022-0302(02)74317-8


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 van Marle-Köster and Visser. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Comparative Transcriptome Profiling of mRNA and lncRNA Related to Tail Adipose Tissues of Sheep

Lin Ma<sup>1</sup>, Meng Zhang<sup>1</sup>, Yunyun Jin<sup>1</sup>, Sarantsetseg Erdenee<sup>1</sup>, Linyong Hu<sup>2</sup>, Hong Chen<sup>1</sup>, Yong Cai<sup>3,4</sup>\* and Xianyong Lan<sup>1</sup>\*

<sup>1</sup> Shaanxi Key Laboratory of Molecular Biology for Agriculture, College of Animal Science and Technology, Northwest A&F University, Yangling, China, <sup>2</sup> Key Laboratory of Adaptation and Evolution of Plateau Biota, Northwest Institute of Plateau Biology, Chinese Academy of Sciences, Xining, China, <sup>3</sup> Science Experimental Center, Northwest University for Nationalities, Lanzhou, China, <sup>4</sup> College of Life Science and Engineering, Northwest University for Nationalities, Lanzhou, China

### Edited by:

Joram Mwashigadi Mwacharo, International Center for Agriculture Research in the Dry Areas (ICARDA), Ethiopia

### Reviewed by:

Alessandra Crisà, Consiglio per la Ricerca in Agricoltura e l'Analisi dell'Economia Agraria (CREA), Italy

Fuyong Li, University of Alberta, Canada

### \*Correspondence:

Yong Cai caiyong1979@163.com

Xianyong Lan lan342@126.com

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 26 January 2018 Accepted: 21 August 2018 Published: 10 September 2018

### Citation:

Ma L, Zhang M, Jin Y, Erdenee S, Hu L, Chen H, Cai Y and Lan X (2018) Comparative Transcriptome Profiling of mRNA and lncRNA Related to Tail Adipose Tissues of Sheep. Front. Genet. 9:365. doi: 10.3389/fgene.2018.00365

The Lanzhou Fat-Tail sheep (LFTS, long fat-tailed sheep) is an endangered sheep breed in China with a fat tail, in contrast to the traditional local varieties Small Tail Han sheep (STHS, thin-tailed sheep), which has a small tail, and Tibetan sheep (TS, short thin-tailed sheep), which has a very small tail. However, little is known regarding how tail fat deposition is regulated by long noncoding RNA (lncRNA). To evaluate the lncRNA and mRNA associated with tail fat deposition and development among these breeds, high-throughput RNA sequencing of three individuals each of LFTS, STHS, and TS was performed and analyzed in this study. RNA sequencing data from these three groups revealed 10 differentially expressed genes (DEGs) and 37 differentially expressed lncRNAs between the LFTS and STHS groups, 390 DEGs and 59 differentially expressed lncRNAs between the LFTS and TS groups, and 80 DEGs and 16 differentially expressed lncRNAs between the STHS and TS groups (p-value < 0.05 and fold change ≥ 2). Gene Ontology and pathway analysis of DEGs and target genes of differentially expressed lncRNAs revealed enrichment in fatty acid metabolism and fatty acid elongation-related pathways that contribute to fat deposition. Subsequently, the expression of 14 DEGs and 6 differentially expressed lncRNAs was validated by quantitative real-time PCR. Finally, two co-expression networks of differentially expressed mRNA and lncRNAs were constructed. The results suggested that some differentially expressed lncRNAs (TCONS\_00372767, TCONS\_00171926, TCONS\_00054953, and TCONS\_00373007) may play crucial roles as core lncRNAs in tail fat deposition processes. In summary, the present study extends the sheep tail fat lncRNA database, and these differentially expressed mRNA and lncRNAs may provide novel candidate regulators for future genetic and molecular studies on tail fat deposition of sheep.

Keywords: sheep, transcriptome, fat deposition, long non-coding RNA (lncRNA), fat tail

# INTRODUCTION

Lanzhou Fat-Tailed sheep (LFTS), Small Tailed Han sheep (STHS), and Tibetan sheep (TS) are well-known and distinctive sheep breeds in China. LFTS are one of the four Chinese sheep breeds mainly raised in Northwestern China, where the terrain is dry and the region is at high altitude. The best-known phenotype of LFTS is their fat tail, which can sag to the hock and accumulate a large amount of fat (Shelton, 1990; Almeida, 2011; Edea et al., 2017; Li et al., 2018a). Currently, the number of fat-tailed sheep is in sharp decline, especially LFTS, an endangered breed that needs protection. Compared with LFTS, STHS have smaller tails and less fat accumulation (Xu et al., 2017; Ma et al., 2018). STHS have a high reproductive capacity and show polyembryony; they grow fast and can be in oestrus in all seasons (Kashan et al., 2005). TS are raised in the mountainous region of the Qingzang plateau, where the average elevation is 3,500 m. Compared with LFTS and STHS, TS are relatively stronger and their tails are the smallest, with less fat accumulation (Zhu et al., 2016; Zhou et al., 2017).

Adipose tissue is one of the vital tissues involved in the regulation of fat development and lipid metabolism in domestic animals. The "fat-tail" can provide energy during migration and in seasons when the pasture is dormant or when low amounts of dry matter are available (Atti et al., 2004). The fat-tail phenotype is a trait necessary for survival in harsh environments (Pourlis, 2011). In addition, the tail fat of sheep can be used by humans as an important source of dietary fat (Kashan et al., 2005; Moradi et al., 2012). Thus, the mechanism of tail fat deposition is worth studying.

In recent years, deep sequencing of transcriptomes has been increasingly used because it promises higher sensitivity in the identification of differential expression (Jäger et al., 2011; Miao and Luo, 2013; Zhang C. et al., 2013). A few comparative transcriptome studies and whole genome studies have surveyed gene expression profiles between different sheep breeds and between different tissues in the same sheep breed (Wang et al., 2014; Miao et al., 2015b; Kang et al., 2017; Zhi et al., 2017). There are also some studies on miRNA or CNV in sheep adipose tissue (Miao et al., 2015a; Zhu et al., 2016; Zhou et al., 2017). In 2014, transcriptome sequencing was used to compare transcriptome profiles of fat between a fat-tailed sheep (Kazak sheep) and a short-tailed sheep (TS). A total of 646 genes were differentially expressed between the two breeds, and the two genes with the largest fold change (NELL1 and FMO3) may affect fat metabolism in adipose tissues of sheep (Wang et al., 2014). In 2015, 602 differentially expressed genes (DEGs) were identified in the fat of two breeds of sheep using RNA-Seq technology, and GO enrichment and KEGG pathway analysis showed that some of these genes were involved in fat metabolism processes. These genes may be involved in fat deposition in sheep (Miao et al., 2015b). miRNA were also sequenced in the fat of two breeds of sheep, and 54 differentially expressed miRNA were identified. Some miRNA and their target genes were found to be involved in tail lipid development in sheep (Miao et al., 2015a). In 2017, deep sequencing methods were used to identify miRNA and their target genes in the fat of fat-tailed sheep (Kazakhstan sheep) and thin-tailed sheep (TS). Comparison of the HiSeq data of these two breeds showed that some miRNA were involved in the development of tail fat, and integrated miRNA–mRNA analysis revealed that some miRNA and their target genes play a key role in fat deposition in sheep (Zhou et al., 2017). In the same year, 1,058 DEGs were identified by transcriptome sequencing of three different types of fat (subcutaneous fat, visceral fat, and tail fat) in Tan sheep, and it was suggested that HOXC11, HOXC12, HOXC13, HOTAIR\_2, HOTAIR\_3, and SP9 could be associated with tail fat deposition in sheep (Kang et al., 2017). Recently, transcriptome sequencing and miRNA sequencing were performed in three types of fat (subcutaneous fat, perirenal fat, and tail fat) of two sheep breeds (Guangling large-Tailed sheep and Small-Tailed Han sheep). Fat-related genes (FABP4, FABP5, ADIPOQ, and CD36) were highly expressed, and 14 genes (LOC101102230, PLTP, C1QTNF7, OLR1, SCD, UCP-1, ANGPTL4, FASD2, SLC27A6, LAMB3, LAMB4, RELN, TNXB, and ITGA8) and 9 miRNA (miR-10b, miR-29a, miR-30c, miR-155, miR-192, miR-206, novel-miR-102, novel-miR-36, and novel-miR-63) may be associated with fat deposition in sheep (Li et al., 2018b; Pan et al., 2018). However, up to now, there has been no report on long non-coding RNAs (lncRNAs) of the fat tail in sheep. Furthermore, the more complex gene networks and molecular determinants related to tail fat development remain unclear, and further studies exploring these aspects are required.

Here, in order to characterize the mRNA and lncRNA expression profiles in the tail fat of sheep, we explored the transcriptomic differences among LFTS, STHS, and TS sheep and elucidated the molecular mechanisms of tail fat deposition. Our study may provide more clues from coding and non-coding regions regarding the mechanism of fat deposition in fat-tailed sheep.

# MATERIALS AND METHODS

# Ethics Statement

All experiments performed in this study were approved by the International Animal Care and Use Committee of the Northwest A&F University (IACUC–NWAFU). Furthermore, the care and use of animals complied with the local animal welfare laws, guidelines, and policies.

Under an experimental license based on the "Experimental Animal Management Measures in Shaanxi Province" (016000291szfbgt-2011-000001), all experimental procedures were approved by the Review Committee for the Use of Animal Subjects of Northwest A&F University. Animal experimentation, including sample collection, was performed in agreement with the ethical commission's guidelines. This license covers the thesis work of LM et al. on "Comparative transcriptome profiling of mRNA and lncRNA related to tail adipose tissues of sheep." College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, China, January 26, 2018.

# Animal and Tail Fat Tissue Collection

In this study, nine unrelated individuals of LFTS (n = 3), STHS (n = 3), and TS (n = 3) breeds that were castrated at the age of 6 months were randomly selected from a sheep farm located in Gansu province, China. The appearance and shape of the sheep completely conformed to their varietal characteristics. Their body conditions were healthy and their weights were moderate. The sheep were fed in stables under natural lighting. The animals were slaughtered and the tail fat tissues collected. The fresh tissues were immediately frozen in liquid nitrogen, and then stored at −80 °C until use.

# RNA Extraction and Quality Assessment

Total RNA was extracted from tail fat tissues using RNAiso Plus (Takara, Dalian, China) following the manufacturer's specifications. Each RNA sample was dissolved in 30 µL DEPC-treated H2O. A 1 µL aliquot of RNA from each sample was used for spectrophotometric analysis. Another 1 µL aliquot of RNA mixed with loading buffer was electrophoresed on a 1.0% agarose gel for 20 min and detected by


fgene-09-00365 September 6, 2018 Time: 19:33 # 3


TABLE 2 | Primer pairs of differentially expressed lncRNAs used for qRT-PCR validation.



TABLE 3 | Reads filter and mapping summary.


∗ rRNA trimmed reads are data which non-alignment to rRNA database of sheep. Clean ratio = (clean reads/raw reads)%; rRNA ratio = [(clean reads − rRNA trimmed)/clean reads]%; Mapping ratio = mapped reads/all reads.

staining with ethidium bromide and observing under UV transillumination. The RNA concentration and quality were assessed by the Agilent 2100 bioanalyzer (Agilent Technologies, Santa Clara, CA, United States). The A260/280 ratios, 28S/18S values, and the RNA Integrity Numbers (RIN) of all samples are shown in **Supplementary Table S1**. Subsequent sequencing experiments were performed on qualified RNA. The remaining RNA samples were immediately stored at −80 °C.

# cDNA Library Construction and Illumina Sequencing

Qualified total RNA was further purified by RNAClean XP Kit (Beckman Coulter, Inc., Kraemer Boulevard, Brea, CA, United States) and RNase-Free DNase Set (Qiagen, GmBH, Germany). After the purification and ribosomal RNA removal, the rRNA-depleted samples were sheared into small fragments using divalent cations under high temperature. These RNA fragments were copied into the first strand of cDNA using random primers and reverse transcriptase. The second strand of cDNA was then synthesized using DNA Polymerase I and RNase H. These final cDNA fragments were then subjected to an end repair process where a single "A" base was added followed by ligation of the adapters. The output was then purified and enriched using PCR to create the final cDNA library.

The nine strand-specific RNA-Seq libraries were sequenced with a HiSeq 2000 Desktop Sequencer from Illumina Sequencing Technologies (Biotechnology, Shanghai, China). Sequencing was optimized to generate 150 bp paired-end reads. All datasets have been submitted to the NCBI Sequence Read Archive (SRA) database and the files can be found under the accession numbers SRR6666247, SRR6666246, SRR6666245, SRR6666244, SRR6666251, SRR6666250, SRR6666249, SRR6666248, and SRR6666243.

# Sequencing Quality Assessment, Reads Mapping, and Transcriptome Assembly

Read quality of the RNA sequencing (RNA-Seq) data was evaluated using FastQC (v0.10.1) (Andrews, 2012). Adaptor sequences and low-quality sequences were removed from the original reads with Seqtk<sup>1</sup>. The clean reads for each sample were mapped to the sheep reference genome Ovis aries v3.1 with TopHat2 (v2.0.9) using the paired-end mapping method, allowing two mismatches (Trapnell et al., 2009). Based on these alignments, the transcripts were assembled using Cufflinks (v2.2.1) with default parameters (Trapnell et al., 2012).

# Prediction of lncRNA

After annotation, the unknown transcripts were used to screen for lncRNA candidates. Transcripts shorter than 200 nucleotides or consisting of a single exon were discarded. The remaining transcripts were screened with three tools that can separate lncRNAs from putative protein-coding RNAs on the basis of open reading frame length, homology with known proteins, and coding potential: the Coding Potential Calculator (Kong et al., 2007), the Coding-Non-Coding Index (Sun et al., 2013), and the Protein Families Database (Finn et al., 2014). Transcripts in the intersection of the three methods were predicted to be lncRNA transcripts.
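The screening logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Transcript` record, the function name, and the toy ID sets are hypothetical, and the per-tool sets stand in for whatever each tool reported as non-coding (the real output formats of the three tools are not reproduced here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    tid: str       # transcript identifier (hypothetical toy IDs below)
    length: int    # transcript length in nucleotides
    n_exons: int   # number of exons


def candidate_lncrnas(transcripts, cpc_noncoding, cnci_noncoding, pfam_no_hit):
    """IDs that pass the structural filters and are non-coding by all three tools."""
    # Discard transcripts shorter than 200 nt or with a single exon.
    structural = {t.tid for t in transcripts if t.length >= 200 and t.n_exons >= 2}
    # Keep only the intersection of the three coding-potential screens.
    return structural & set(cpc_noncoding) & set(cnci_noncoding) & set(pfam_no_hit)


# Toy data, invented for illustration only.
transcripts = [
    Transcript("T1", 1500, 3),
    Transcript("T2", 150, 2),   # too short
    Transcript("T3", 900, 1),   # single exon
    Transcript("T4", 2200, 4),  # has a protein-family hit below
]
lncrnas = candidate_lncrnas(
    transcripts,
    cpc_noncoding={"T1", "T2", "T4"},
    cnci_noncoding={"T1", "T3", "T4"},
    pfam_no_hit={"T1", "T2", "T3"},
)
print(sorted(lncrnas))
```

Taking the intersection rather than the union is the conservative choice: a transcript is retained only if none of the three screens finds evidence of coding potential.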

<sup>1</sup>https://github.com/lh3/seqtk

# Screening of DEGs and Differentially Expressed lncRNAs


DEGs were identified with the edgeR package (Robinson et al., 2010). The p-value of each gene was corrected for multiple hypothesis testing using the false discovery rate (FDR) to obtain a q-value, and the q-values were then used to assess differential expression among the three groups.
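For readers unfamiliar with FDR correction, the sketch below shows how q-values can be derived from raw p-values. It assumes the Benjamini-Hochberg procedure, which is the default adjustment in edgeR's `topTags`; the text does not name the procedure that was actually used, and the function name and p-values here are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy p-values, invented for illustration.
pvals = [0.01, 0.04, 0.03, 0.005]
qvals = benjamini_hochberg(pvals)
print([round(q, 4) for q in qvals])
```

A gene is then called significant when its q-value, not its raw p-value, falls below the chosen threshold.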

We also calculated the fragments per kilobase of exon model per million mapped reads (FPKM) value of each gene using a Perl script, as follows:

$$\mathrm{FPKM} = \frac{\text{total exon fragments}}{\text{mapped reads (millions)} \times \text{exon length (kb)}}$$

FPKM values were used to calculate the fold change of DEGs among the three groups. Differentially expressed lncRNAs were analyzed by Cuffdiff to calculate the q-value and fold change (Trapnell et al., 2012). Transcript abundance of lncRNAs was measured as FPKM using Cufflinks (v2.2.1) (Trapnell et al., 2012). Genes or lncRNAs with a q-value < 0.05 and an absolute fold change ≥ 2 were considered differentially expressed. Based on the FPKM of all genes or lncRNAs from the three pairwise comparisons, volcano plots were drawn with the ggplot2 package to show the expression patterns of genes and lncRNAs.
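The FPKM definition and the selection rule can be illustrated with a small Python sketch. The authors used a Perl script that is not shown, so the function names, the example numbers, and the small pseudocount that guards against division by zero are all assumptions made for illustration; "absolute fold change ≥ 2" is interpreted here as a ratio of at least 2 in either direction.

```python
def fpkm(exon_fragments, mapped_reads, exon_length_bp):
    """Fragments per kilobase of exon model per million mapped reads."""
    return exon_fragments / ((mapped_reads / 1e6) * (exon_length_bp / 1e3))


def is_differential(fpkm_a, fpkm_b, qvalue, min_fold=2.0, max_q=0.05, pseudo=1e-9):
    """True when q-value < max_q and the FPKM ratio is >= min_fold in either direction."""
    # The pseudocount is an assumption of this sketch, not part of the described method.
    fold = (fpkm_a + pseudo) / (fpkm_b + pseudo)
    return qvalue < max_q and (fold >= min_fold or fold <= 1.0 / min_fold)


# Toy example: 500 fragments on a 2 kb exon model with 20 million mapped reads.
value = fpkm(500, 20_000_000, 2000)
print(value)
print(is_differential(value, 5.0, qvalue=0.01))
```

With these toy numbers the gene has an FPKM of 12.5, a 2.5-fold difference from the comparison group, and therefore passes both thresholds.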

# Target Gene Prediction

Differentially expressed lncRNAs were selected for target prediction via cis- or trans-regulatory effects. For cis target prediction, genes transcribed within a 10-kb window upstream or downstream of a lncRNA were considered cis target genes. RNAplex software was then used to select trans-acting target genes (Tafer and Hofacker, 2008).
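A window-based cis search of this kind can be sketched as follows. The coordinate tuples, gene names, and the decision to count any gene that overlaps the extended interval (ignoring strand) are assumptions of this illustration, not details given in the text.

```python
def cis_targets(lncrna, genes, window=10_000):
    """Names of genes overlapping the lncRNA locus extended by `window` bp on each side.

    lncrna: (chrom, start, end); genes: iterable of (name, chrom, start, end).
    """
    chrom, start, end = lncrna
    lo, hi = start - window, end + window
    return [
        name
        for name, g_chrom, g_start, g_end in genes
        # Same chromosome and interval overlap with the extended window.
        if g_chrom == chrom and g_start <= hi and g_end >= lo
    ]


# Hypothetical coordinates, invented for illustration.
lnc = ("chr1", 50_000, 52_000)
genes = [
    ("geneA", "chr1", 30_000, 41_000),  # ends inside the upstream window
    ("geneB", "chr1", 70_000, 75_000),  # beyond the downstream window
    ("geneC", "chr2", 50_500, 51_000),  # different chromosome
    ("geneD", "chr1", 61_000, 66_000),  # starts inside the downstream window
]
targets = cis_targets(lnc, genes)
print(targets)
```

The `window` parameter makes it easy to rerun the search with a different neighborhood size.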

# Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathway Analyses of DEGs and Target Genes of Differentially Expressed lncRNAs

To analyze the main functions of the genes and lncRNAs, the DEGs and the target genes were annotated with GO and KEGG. The GO database was used to predict and describe the function of gene products with respect to molecular function, biological process, and cellular component (Ashburner et al., 2000). The genes were first mapped to the GO terms in the database<sup>2</sup>. The gene numbers in every GO term were then calculated to determine the significantly enriched GO terms. KEGG<sup>3</sup> was used to perform pathway enrichment analysis (Kanehisa et al., 2016) to identify the main biochemical and signaling pathways in which the genes participate. GO terms and KEGG pathways with a corrected p-value (q-value) < 0.05 were regarded as significantly enriched in the DEGs and in the target genes of differentially expressed lncRNAs.
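The text does not state which statistical test was used for enrichment. One common choice is the one-sided hypergeometric test, sketched below together with the rich factor plotted in the enrichment figures; both the choice of test and the rich factor definition used here (DEGs in a term divided by all genes annotated to that term) are assumptions of this illustration. The resulting p-values would still need multiple-testing correction before applying the 0.05 threshold.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are annotated to the term."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def rich_factor(k, K):
    """Share of a term's annotated genes that appear in the gene list."""
    return k / K


# Toy numbers: 10 background genes, 4 in the term, 3 DEGs, 2 of them in the term.
p = hypergeom_sf(k=2, N=10, K=4, n=3)
print(round(p, 4), rich_factor(2, 4))
```

In this toy case the probability of seeing two or more term members among three DEGs by chance is one in three, so the term would not be called enriched.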

# Validation of RNA-Seq Results by Quantitative Real-Time PCR (qRT-PCR)

To quantitatively assess the reliability of our analyzed data, 14 significant DEGs and 6 differentially expressed lncRNAs were randomly selected and their expression levels were tested using qRT-PCR. Total RNA samples were reverse transcribed to cDNA using the PrimeScript™ RT reagent Kit

<sup>2</sup>http://www.geneontology.org/ <sup>3</sup>http://www.genome.jp/kegg/

with gDNA Eraser (TaKaRa, Dalian, China) according to the manufacturer's recommendations. qRT-PCR was performed using the SYBR® Premix Ex Taq™ kit (TaKaRa, Dalian, China) on the Bio-Rad CFX96 Real-Time PCR system (Hercules, CA, United States). All the primers of DEGs and differentially expressed lncRNAs used are presented in **Tables 1**, **2**, respectively. Individual samples were run in triplicate. The qRT-PCR amplification program was as follows: pre-denaturation at 95 °C for 30 s, followed by 39 cycles of 95 °C for 5 s and 60 °C for 30 s.

Relative expression was calculated using the $2^{-\Delta\Delta C_t}$ method with GAPDH as the internal control (Livak and Schmittgen, 2001). The data were compared by Student's t-test using SPSS (version 23.0) (SPSS, Inc., Chicago, IL, United States), and the results were expressed as the mean ± standard deviation of triplicate values. A p-value < 0.05 was considered statistically significant (Yang et al., 2017).
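As a worked illustration of the Livak calculation, the sketch below computes relative expression from triplicate Ct values of a target gene and a reference gene in a test group and a control group. The function name and all Ct values are invented for illustration; only the formula itself comes from the method cited above.

```python
from statistics import mean


def relative_expression(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of the target in test vs. control by the 2^(-ddCt) method."""
    # dCt: target Ct normalized to the reference gene within each group.
    d_test = mean(ct_target_test) - mean(ct_ref_test)
    d_ctrl = mean(ct_target_ctrl) - mean(ct_ref_ctrl)
    # ddCt: normalized test value relative to the normalized control value.
    return 2 ** -(d_test - d_ctrl)


# Toy triplicate Ct values, invented for illustration.
fold = relative_expression(
    ct_target_test=[24.0, 24.2, 23.8],
    ct_ref_test=[18.0, 18.1, 17.9],
    ct_target_ctrl=[26.0, 26.1, 25.9],
    ct_ref_ctrl=[18.0, 18.0, 18.0],
)
print(round(fold, 3))
```

Here the target crosses the threshold two cycles earlier in the test group after normalization, which corresponds to roughly a four-fold higher expression.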

# Construction of the lncRNA-Gene Co-expression Network

To further explore the interactions between the DEGs and differentially expressed lncRNAs, the co-expression was analyzed based on their FPKM. For each lncRNA, Pearson correlation coefficient (COR) of its expression value with that of each mRNA was calculated. The interaction network of the differentially expressed lncRNA–mRNA co-expression pairs (an absolute value of COR ≥ 0.7 and FDR < 0.01) was then constructed using Cytoscape (v3.6.0) (Shannon et al., 2003).
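The correlation step can be sketched as below. The sketch keeps lncRNA–mRNA pairs whose absolute Pearson correlation across samples is at least 0.7; the additional FDR < 0.01 filter is omitted for brevity, constant expression vectors are assumed not to occur, and all names and FPKM values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_edges(lnc_fpkm, mrna_fpkm, min_abs_cor=0.7):
    """List (lncRNA, mRNA, r) for every pair with |r| >= min_abs_cor."""
    edges = []
    for lnc, x in lnc_fpkm.items():
        for gene, y in mrna_fpkm.items():
            r = pearson(x, y)
            if abs(r) >= min_abs_cor:
                edges.append((lnc, gene, r))
    return edges


# Toy FPKM vectors over nine samples, invented for illustration.
lnc_fpkm = {"lncA": [1, 2, 3, 4, 5, 6, 7, 8, 9]}
mrna_fpkm = {
    "g1": [2, 4, 6, 8, 10, 12, 14, 16, 18],  # perfectly positively correlated
    "g2": [9, 8, 7, 6, 5, 4, 3, 2, 1],       # perfectly negatively correlated
    "g3": [3, 1, 3, 1, 3, 1, 3, 1, 3],       # uncorrelated
}
edges = coexpression_edges(lnc_fpkm, mrna_fpkm)
print([(lnc, gene, round(r, 2)) for lnc, gene, r in edges])
```

The resulting edge list is the kind of table that can be imported into Cytoscape as a network, with the sign of r distinguishing positive from negative co-expression.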


# RESULTS

# Sequencing Data Summary


Herein, a total of 60 Gb raw data were generated. In detail, 75,592,986, 88,617,414, and 83,525,778 raw reads were obtained for LFTS (LFTS-1, 2, and 3, respectively); 100,297,264, 80,848,034, and 83,364,558 raw reads were obtained for STHS (STHS-1, 2, and 3, respectively); and 78,883,006, 70,533,752, and 57,254,426 raw reads were obtained for TS (TS-1, 2, and 3, respectively) (**Table 3**). The raw reads were filtered to obtain clean reads, which were mapped to the Ovis aries v3.1 version of the sheep genome sequence, with the mapping ratio ranging from 68.26 to 81.12%. Based on these alignments, the transcripts were assembled using Cufflinks (v2.2.1) with default parameters. The mapping results for the RNA-Seq reads are shown in **Table 3**.

# Identification and Characterization of lncRNA in Tail Fat of Sheep

To study the basic features of lncRNAs in tail fat of sheep, the lncRNAs were identified and compared with mRNA. The intersection of the Coding Potential Calculator, Coding-Non-Coding Index, and Protein Families Database results finally yielded 9,082 lncRNA transcripts. The lncRNA transcripts were classified as 4,791 (52.8%) intergenic lncRNAs, 97 (1.1%) exonic sense lncRNAs, 1,398 (15.4%) exonic antisense lncRNAs, 1,167 (12.8%) intronic sense lncRNAs, 1,148 (12.6%) intronic antisense lncRNAs, and 481 (5.3%) bidirectional lncRNAs (**Figure 1**). Although the lengths of lncRNA and mRNA transcripts were comparable, their expression levels differed: lncRNAs exhibited lower expression levels than mRNA (**Figure 2**).

# Differential Expression Analysis and Target Gene Prediction


DEGs and differentially expressed lncRNAs were found through comparison between any two breeds. For the tail fat of LFTS vs. STHS, 10 genes were considered as DEGs, including 7 up-regulated and 3 down-regulated genes. For LFTS vs. TS, 390 genes were DEGs including 215 up-regulated and 175 down-regulated ones. For the comparison of STHS and TS, 40 DEGs were found of which 21 genes were upregulated and 19 were down-regulated. The two common DEGs in LFTS vs. STHS and LFTS vs. TS were FMO2 and ENSOARG00000013777. In total, 17 common DEGs were found in both LFTS vs. TS and STHS vs. TS groups, such as C1RL, DHCR7, and IGF1. There were no common DEGs in the two comparisons of LFTS vs. STHS and STHS vs. TS. We used volcano plots to explore the relationship between the fold change and the significance (**Figure 3**). To determine the primary patterns of gene expression, hierarchical clustering analysis of all DEGs was further employed based on the FPKM value (**Figure 4**).

In total, 68 differentially expressed lncRNAs were screened from the three comparisons. Among them, 37 differentially expressed lncRNAs (16 up-regulated and 21 down-regulated) were found between LFTS and STHS. Fifty-nine differentially expressed lncRNAs (31 up-regulated and 28 down-regulated) were found between LFTS and TS. There were 16 differentially expressed lncRNAs (eight up-regulated and eight down-regulated) between STHS and TS. The two differentially expressed lncRNAs common to all three comparisons were TCONS\_00297891 and TCONS\_00369087. Apart from these two lncRNAs, there were 27 common differentially expressed lncRNAs in the LFTS vs. STHS and LFTS vs. TS, 11 common differentially expressed lncRNAs in LFTS vs. TS and STHS vs. TS, and 2 common differentially expressed lncRNAs in LFTS vs. TS and STHS vs. TS. Volcano plots were used to explore the relationship between the fold change and the significance (**Figure 5**). As lncRNAs could exert effects through cis- or trans-acting target genes, the neighboring (100 kb upstream or downstream) and/or complementary protein-coding genes of the differentially expressed lncRNAs from pairwise comparisons were predicted.

# GO Analysis

The DEGs in the tail fat of LFTS vs. STHS, LFTS vs. TS, and STHS vs. TS were annotated (**Supplementary Table S2**). The top 30 GO terms (in descending order of the rich factor) of the three groups are shown in **Figure 6**. The DEGs of LFTS vs. STHS were enriched in four GO terms: organic cyclic compound binding, cell, catalytic activity, and cellular process. LFTS vs. TS DEGs were enriched in triglyceride biosynthetic process, sterol biosynthetic process, and cellular carbohydrate catabolic process. The DEGs of STHS vs. TS were mainly enriched in biological process terms, including negative regulation of cell death and developmental growth.

The target genes of differentially expressed lncRNAs in the tail fat of LFTS vs. STHS, LFTS vs. TS, and STHS vs. TS were annotated and the top 30 GO terms (in descending order of the rich factor) of the three groups are shown in **Figure 7**. The target genes of LFTS vs. STHS were significantly enriched in four GO terms: nucleoside triphosphate biosynthetic process, apical part of cell, ATP biosynthetic process, and purine ribonucleoside monophosphate biosynthetic process. LFTS vs. TS target genes were significantly enriched in 33 GO terms, such as protein–DNA complex, protein dimerization activity, and transporter activity. The target genes of STHS vs. TS were

FIGURE 7 | Top 30 of GO enrichment for target genes of differentially expressed lncRNAs from three groups of pairwise comparisons (A: LFTS vs. STHS, B: LFTS vs. TS, and C: STHS vs. TS). The x-axis presents rich factor of target genes in a category. The y-axis shows the specific GO term.

significantly enriched in 23 GO terms, which were mainly related to transporter activity and protein activity.

# Pathway Analysis


Pathway annotation of DEGs was performed using the KEGG database (**Supplementary Table S3**). Pathway enrichment analysis showed that the DEGs of LFTS vs. STHS related mainly to metabolic processes such as arachidonic acid metabolism and metabolism of xenobiotics by cytochrome P450; the DEGs of LFTS vs. TS were enriched in pathways including regulation of lipolysis in adipocytes, steroid biosynthesis, fatty acid metabolism, fatty acid elongation, and biosynthesis of unsaturated fatty acids; the pathways related to fat which the STHS vs. TS DEGs were enriched in included the adipocytokine signaling pathway, cGMP-PKG signaling pathway, and Jak-STAT signaling pathway (**Figure 8**).

Pathway annotation and enrichment of target genes of differentially expressed lncRNAs were performed using the KEGG database. The results showed that the target genes of differentially expressed lncRNAs of LFTS vs. STHS were mainly related to oxidative phosphorylation; the target genes of LFTS vs. TS were enriched in pathways including fatty acid elongation and fatty acid metabolism; and the target genes of STHS vs. TS were mainly enriched in fatty acid elongation (**Figure 9**).

# Validation of RNA-Seq Data by qRT-PCR

To validate the RNA-Seq data, DEGs and differentially expressed lncRNAs related to adipocyte accumulation were selected from each of the LFTS vs. STHS, LFTS vs. TS, and STHS vs. TS comparisons. In total, 14 DEGs and 6 lncRNAs underwent qRT-PCR analysis. The qRT-PCR results of the DEGs and differentially expressed lncRNAs were in agreement with the RNA-Seq data, indicating that the two sets of results validated each other (**Figures 10**, **11**). Compared with STHS, the DEGs FMO2 and PENK were up-regulated, whereas DPT and RASD1 were down-regulated in the LFTS, where DPT showed significant differential expression (p-value < 0.05) and RASD1 showed very significant differential expression (p-value < 0.01). Compared with TS, the DEGs MID1IP1, PRKAR2B, and ELOVL3 were up-regulated, whereas PDK4, PLIN2, and TCAP were down-regulated in the LFTS, where PLIN2 showed significant differential expression (p-value < 0.05) and PDK4 showed very significant differential expression (p-value < 0.01). Compared with TS, the DEGs SLC22A4 and LTF were up-regulated, whereas ADGRG3 and LEPR were down-regulated in the STHS, where SLC22A4 showed significant differential expression (p-value < 0.05).

For lncRNAs, compared with STHS, the differentially expressed lncRNAs ENSOART00000027984 and TCONS\_00297891 were down-regulated in the LFTS, where ENSOART00000027984 showed significant differential expression (p-value < 0.05). Compared with TS, the differentially expressed lncRNA ENSOART00000028008 was up-regulated, whereas ENSOART00000027984, ENSOART00000028118, and TCONS\_00297891 were down-regulated in the LFTS. Compared with TS, the differentially expressed lncRNAs ENSOART00000028008 and TCONS\_00303998 were up-regulated, whereas TCONS\_00303998 was down-regulated in the STHS, where ENSOART00000028008 showed significant differential expression (p-value < 0.01).

The expression levels of these genes and lncRNAs as determined by qRT-PCR were consistent with the RNA-Seq data, which validated the accuracy of the RNA-Seq data.

# Network Construction Based on DEGs and Differentially Expressed lncRNAs in Tail Fat of Sheep

Co-expression analysis of the screened differentially expressed mRNA and lncRNAs in tail fat of sheep yielded 493 significant co-expression pairs; most were positively correlated (COR ≥ 0.7, 475 pairs), while a few were negatively correlated (COR ≤ −0.7, 18 pairs). In the co-expression network built from these mRNA–lncRNA pairs, some lncRNAs were connected to more than 50 mRNA. For example, TCONS\_00372767, TCONS\_00171926, and TCONS\_00054953 were each co-expressed with 67 mRNA, and TCONS\_00373007 was co-expressed with 65 mRNA, suggesting that these are core lncRNAs that may have important regulatory effects on tail fat deposition (**Figure 12**).

# DISCUSSION

Transcriptome sequencing is a preferred technique to analyze gene expression and reveal biological characteristics. Herein, we used tail fat from LFTS, STHS, and TS, which are unique Chinese sheep breeds, to explore the mechanism underlying the different tail phenotypes. Strand-specific RNA sequencing was performed to systematically identify mRNA and lncRNAs in different tail fat tissues. In this study, 407 DEGs were identified from the three comparison pairs and were significantly enriched in 120 GO terms and pathways. Furthermore, 68 differentially expressed lncRNAs were screened and the target genes of these lncRNAs were predicted. In addition, 493 significant co-expression pairs based on DEGs and differentially expressed lncRNAs were identified to explore their function.

We identified 9,082 lncRNAs from tail fat of LFTS, STHS, and TS, and most of them are intergenic lncRNAs. LncRNAs from tail fat are relatively abundant compared with those from other tissues, such as the 6,924 and 5,602 lncRNAs from muscle and blood samples of Hu sheep, respectively (Zhang et al., 2017; Feng et al., 2018). The tail fat lncRNAs also share several typical characteristics with other mammalian lncRNAs. Compared with mRNA, the tail fat lncRNAs have relatively lower expression levels, while the length of lncRNAs was similar to that of mRNA. These similarities support the reliability of the lncRNAs identified in this study. To our knowledge, this study presents the first systematic genome-wide analysis of lncRNAs in tail fat of sheep, providing a valuable resource for functional lncRNAs associated with sheep tail fat deposition.

Of the 407 DEGs, a large proportion of key genes were involved in fat deposition, adipogenesis, and fatty acid biosynthesis, including FMO2, PLIN2, PLIN3, LEPR, PENK, ELOVL3, ELOVL5, PDK4, and SLC22A4.

Based on GO and pathway analyses of DEGs in LFTS and STHS, flavin-containing monooxygenases (FMOs) were enriched in four GO terms influencing fat metabolism. FMOs catalyze the NADPH-dependent oxidative metabolism of many structurally diverse foreign chemicals. Mice lacking FMOs 1, 2, and 4 exhibit a lean phenotype and, despite similar food intake, weigh less and store less triglycerides in their white adipose tissue compared to wild-type mice (Veeravalli et al., 2014). FMO2 and FMO3 are members of the FMO gene family, and FMO3 was identified in a recent comparative study of adipose tissues from fat- and thin-tailed sheep using RNA-Seq data (Wang et al., 2014).

Through GO enrichment of LFTS vs. TS, DEGs enriched in fatty acid elongation, biosynthesis of unsaturated fatty acids, and fatty acid biosynthesis pathways were found to be up-regulated. Previous studies have shown that breed has a significant effect on the fatty acid composition of the fat tail (Unsal and Aktas, 2003; Moharrery, 2007; Alipanah and Kashan, 2011). Four DEGs were enriched in the triglyceride biosynthetic process, including three up-regulated genes (PCK1, GPAM, and LDLR). This could indicate that fat accumulation in LFTS was greater than that in TS, leading to rapid fat metabolism. Moreover, ELOVL3, ELOVL5, PLIN2, PLIN3, NR4A1, and KLF4 were differentially expressed between LFTS and TS. The ELOVL, PLIN, and KLF gene families were identified as possibly associated with tail fat deposition (Miao et al., 2015b). NR4A1 and KLF7 were reported to be associated with adipocyte differentiation (Duszka et al., 2012; Zhang Z. et al., 2013). This suggests that these DEGs are possibly related to fat deposition in the tails of sheep.

In the comparative analysis of STHS and TS, we focused on the GO enrichment term "negative regulation of cell death." Among the DEGs, IGF1, SERP2, and CITED1 were up-regulated, whereas ALB and ACTC1 were down-regulated in STHS. The other GO term was related to growth and included genes up-regulated in STHS (NPK, SERP2, DHCR7, and IGF1). IGF1 stimulates both the proliferation and differentiation of preadipocytes in cell culture (Duffield et al., 2008). Furthermore, CITED1 promotes cell proliferation and migration, and it is also a marker gene of induced browning of white adipocytes (Choi et al., 2018; Xia et al., 2018). In addition, SLC22A4 was differentially expressed between STHS and TS, and SLC27A6 was identified as a candidate gene in tail fat development (Kang et al., 2017). SLC22A4 and SLC27A6 have similar functions. This suggests that SLC22A4 is possibly related to the fat-tail dimensions in sheep.

In this study, 68 differentially expressed lncRNAs were identified and the target genes of these lncRNAs were predicted. The results showed that the target genes were principally enriched in GO terms associated with mitochondria and transmembrane transport, such as mitochondrial inner membrane and transporter activity. The target genes were also mostly enriched in oxidative phosphorylation and non-alcoholic fatty liver disease (NAFLD). The most commonly enriched target genes were ATP6, ATP8, COIII, COX1, COX2, FHL1, SLC24A2, ALDOA, and ND1. ATP plays an important role in adipocytes. ATP releases energy when it is converted to ADP and inorganic phosphate (Pi). AMP-activated protein kinase (AMPK) maintains a constant high ratio of ATP to ADP (Hardie, 2011). The fatty acids produced by lipolysis are not usually oxidized within the adipocyte, but are released for use elsewhere. If the fatty acids generated by lipolysis are not rapidly removed from adipocytes either through export or by oxidative metabolism, they are recycled into triglycerides, an energy-intensive process in which two molecules of ATP are consumed per fatty acid (Hardie, 2012). Thus, AMPK could inhibit lipolysis and maintain the ratio of ATP to ADP. However, how tail fat is used differs with the condition of the sheep and the amount of fat deposited. Another notable target is ELOVL6, which was found in the LFTS vs. TS comparison and is associated with fatty acids. Interestingly, the DEGs of LFTS vs. TS included ELOVL3 and ELOVL5. This could indicate that the ELOVL genes are differentially expressed and regulated between the tail fat of LFTS and TS, whose tail characteristics are markedly different.

A total of 493 co-expression pairs were obtained by network construction based on DEGs and differentially expressed lncRNAs in tail fat of sheep. Most of these pairs were significantly and positively correlated, and only a small number were negatively correlated. These results indicate that these mRNA and lncRNAs may act mainly through positive regulation, that is, both are highly expressed or both are lowly expressed. It was also found that some lncRNAs were co-expressed with many mRNA, suggesting that these lncRNAs may be regulated by many mRNA.

The regulation of lipogenesis is a very complex biological process, and the tail fat of sheep is no exception. Previous studies have reported that tail fat development in sheep is associated with mRNA and miRNA (Wang et al., 2014; Miao et al., 2015a,b; Kang et al., 2017; Li et al., 2018b; Pan et al., 2018). These studies also show that tail fat deposition in sheep is not regulated by a single gene or miRNA, but more likely by many coding and non-coding RNA. Some researchers integrated the miRNA and mRNA data from Kazakhstan sheep and TS and found that miRNA can participate in the regulation of sheep fat deposition through their target mRNA (Zhou et al., 2017). As a type of non-coding RNA, lncRNA can also participate in the regulation of fat as part of a competing endogenous RNA network. From the perspective of lncRNAs, this study suggests that they regulate tail fat deposition in sheep through the lncRNA–mRNA regulatory network.

In addition, this study has some shortcomings. For example, the DEGs and differentially expressed lncRNAs were to some extent caused by breed effects. Moreover, three animals per group do not provide sufficient statistical power. Regardless of the technology used to measure expression levels and the sample size, true gene expression levels vary among individuals because expression is inherently a stochastic process (Hansen et al., 2011). Consequently, the results of the analysis may lack power. However, the influence of biological variability decreases as the number of samples increases. Hence, we hope to carry out further studies with a larger sample size in the near future.

# CONCLUSION

A total of 407 DEGs and 68 differentially expressed lncRNAs were identified between LFTS, STHS, and TS tail fat tissues (q-value < 0.05), some of which were potentially associated with tail adipose tissue enlargement. These findings contribute to a better understanding of adipose deposits in regulating the regional fat distribution and the diverse tail types in fat-tailed sheep breeds.

# AUTHOR CONTRIBUTIONS


XL and YC conceived the project and designed the experiments. HC provided suggestions for the project. LM analyzed the data and drafted the manuscript under the supervision of XL. YC and LH collected sheep tail fat tissue samples. YJ and SE performed the RNA extraction. LM and MZ performed qRT-PCR.

# FUNDING

This work was funded by the National Natural Science Foundation of China (Nos. 31660642, 31760649, and 31360529), Natural Science Foundation of Gansu Province (No. 1610RJZA103), and Central Special Funds for Basic Research in Universities Operating Expenses of "An Excellent and Three Special" Discipline Construction (No. 31920170170).

# REFERENCES


# ACKNOWLEDGMENTS

We greatly thank the staff of Lanzhou Fat-Tail sheep, Small Tail Han sheep, and TS elite reservation farm (Gansu) for collecting samples.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00365/full#supplementary-material

TABLE S1 | Quality assessment reports of all RNA samples.

TABLE S2 | The GO annotation of DEGs from three groups of pairwise comparisons.

TABLE S3 | The KEGG annotation of DEGs from three groups of pairwise comparisons.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ma, Zhang, Jin, Erdenee, Hu, Chen, Cai and Lan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Microsatellite-Based Genetic Structure and Diversity of Local Arabian Sheep Breeds

Raed M. Al-Atiyat<sup>1,2</sup>\*, Riyadh S. Aljumaah<sup>1</sup>, Mohammad A. Alshaikh<sup>1</sup> and Alaeldein M. Abudabos<sup>1</sup>

<sup>1</sup> Animal Production Department, King Saud University, Riyadh, Saudi Arabia, <sup>2</sup> Animal Production Department, Mutah University, Karak, Jordan

The genetic diversity of sheep breeds in the Arab countries might be considered a mirror of the ecology of the region. In this study, the genetic structure and diversity of sheep breeds from Saudi Arabia (Harri, Najdi, Naemi, Arb, and Rufidi) and Awassi sheep from Jordan as an out-group were investigated using 19 microsatellites. All the breeds had high intra-population genetic diversity, expressed as allelic number (7.33), allelic richness (2.9), and expected heterozygosity (0.77). Structure analysis revealed three main gene pools underlying the ancestral genetic diversity of the study populations. The first pool comprised the Harri, Najdi, and Rufidi breeds; the second comprised the Naemi and Awassi breeds; and the third comprised the Arb breed, which was significantly differentiated from the other breeds. Factorial correspondence analysis lent further support to the presence of the three gene pools. Although the outgroup Awassi sheep was more clearly differentiated, it was still genetically close to Naemi sheep. The differentiation of the Arb breed could have resulted from geographic and reproductive isolation. On the other hand, the genetic structure of the other two gene pools could be the result of past and recent gene flow between individuals reared in a region that has long been, and remains, a center for animal husbandry and trading.

Keywords: Ovis aries, gene flow, admixture, ancestry, biodiversity

# INTRODUCTION

The ecological diversity of the Arabian Peninsula is reflected in the large number of sheep breeds found in the region (ACSAD (The Arab Center for the Studies of Arid Zones Dry lands), 1997). The number of indigenous sheep breeds in the Arab countries has been estimated at between 46 and 49, and they are classified as fat-tailed, thin-tailed wool sheep and fat-tailed hairy sheep (FAO, 1995; ACSAD (The Arab Center for the Studies of Arid Zones Dry lands), 1997). In fact, all three types may be found in a single country. For example, the Kingdom of Saudi Arabia (KSA) has six breeds of sheep named Harri (Habsi), Najdi, Naemi (Awassi), Arb, and Rufidi (ACSAD (The Arab Center for the Studies of Arid Zones Dry lands), 2011; Aljumaah et al., 2014; Adam et al., 2015). Jordan, however, has only one indigenous breed of sheep, the Awassi (Al-Atiyat et al., 2014), although sheep breeds from neighboring countries, such as the Naemi and Najdi, have been reported to be present in Jordan (Jawasreh et al., 2011). The Awassi has the widest geographic distribution of any sheep breed in the Arabian Peninsula; it is found in

### Edited by:

Johann Sölkner, Universität für Bodenkultur Wien, Austria

### Reviewed by:

Kwan-Suk Kim, Chungbuk National University, South Korea Filippo Biscarini, Consiglio Nazionale Delle Ricerche (CNR), Italy

### \*Correspondence:

Raed M. Al-Atiyat raedatiyat@gmail.com; ratiyat@mutah.edu.jo

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 12 November 2017 Accepted: 04 September 2018 Published: 25 September 2018

### Citation:

Al-Atiyat RM, Aljumaah RS, Alshaikh MA and Abudabos AM (2018) Microsatellite-Based Genetic Structure and Diversity of Local Arabian Sheep Breeds. Front. Genet. 9:408. doi: 10.3389/fgene.2018.00408


Saudi Arabia, Jordan, Palestine, Syria, Lebanon, Iraq, Turkey, and Egypt (Galal et al., 2008). In general, most sheep breeds in the Arabian Peninsula have been phenotypically described and characterized (ACSAD (The Arab Center for the Studies of Arid Zones Dry lands), 2011). They have been, and still are, raised under either nomadic pastoral or transhumant production systems across three geographic areas: the Horn of Africa, North Africa, and the Middle East (FAO, 1995). Some of the breeds are also raised under a pastoral transhumance system but over a limited geographic range within a country. This is the current situation under which most breeds of sheep are reared in the KSA. Given that the KSA shares a land border with Jordan, the likelihood of gene flow from Jordan Awassi sheep into the KSA has been reported (Rischkowsky and Pilling, 2007). It is also important to note that the KSA has been at the center of historical animal exchange networks along the active ancient trade routes of the Incense and Silk Roads (Christian, 2000). Currently, owners of the different sheep breeds in both countries often face major threats to the genetic diversity of their breeds, resulting from uncontrolled mating and from regional, climatic, and global economic forces.

It is commonly assumed that gene flow is influenced by the landscape and topographical features of a region (Taylor et al., 1993). Therefore, sheep flock dynamics and gene flow within and between KSA regions might have shaped the genetic diversity of the breeds. Some recent studies have highlighted the genetic diversity of KSA sheep (Aljumaah et al., 2014; Adam et al., 2015). In addition, some studies have shown their differentiation from Jordan Awassi (Al-Atiyat and Aljumaah, 2014) and from Egyptian and worldwide sheep (Peter et al., 2007). However, none of these studies provided information on the genetic structure of the local KSA sheep populations. It is worth noting that genetic diversity is the total amount of variation in a population, whereas population structure describes how that variation is distributed and how it originated. Recently, Elbeltagy et al. (2015) reported that the genetic diversity and structure of Egyptian indigenous sheep reflect historical and recent anthropological interaction. It has been suggested that Saudi local domestic breeds have ancestors originating from within Saudi Arabia or nearby countries (Galal et al., 2008). The advent of molecular DNA technologies has provided great potential for investigating genetic diversity and structure as well as unravelling the common genetic history of livestock populations. The aim of the present work was to investigate the genetic diversity, structure, and common ancestry of sheep breeds found in the KSA through the analysis of genetic variation at microsatellite markers.

# MATERIALS AND METHODS

# Sheep Populations

Six sheep breeds/populations from different geographic regions in the KSA, including the southern, northern, and eastern parts, were used in the present study (**Figure 1**). Blood samples were collected from unrelated adult males (rams) and females (ewes) of five KSA breeds: Harri (29), Najdi (31), Naemi (31), Arb (37), and Rufidi (6). In addition, 6 unrelated adult males of the Jordanian Awassi were sampled and used in the study as an outgroup breed. The rams were sampled on their farms, with sampling limited to two rams per farm per village or rural region. The sampled animals were selected according to their known history or origin and predefined morphological characteristics (**Table S1**). The morphological characteristics were predefined following the atlas of farm animals in the Arab countries reported by ACSAD (The Arab Center for the Studies of Arid Zones Dry lands) (2011). The populations were then characterized by their distinctive phenotypes, as can be seen from **Table S1**. For example, Najdi sheep are tall and black-coated, with a white face, a Roman nose, drooping ears, and silky hair. Naemi and Awassi are brown-faced, white-skinned sheep, whereas Harri and Rufidi are white-faced, white-skinned sheep. The Arb, on the other hand, has a black body color. All studied populations are fat-tailed sheep (ACSAD (The Arab Center for the Studies of Arid Zones Dry lands), 2011).

Blood sampling and animal handling were carried out with the permission of, and in accordance with the guidelines of, the Ethics Committee of King Saud University and the Saudi Arabia National Committee of Bio Ethics (No. RG-1435-064). For each animal, 5 mL of blood was drawn from the jugular vein into EDTA tubes. The tubes were placed immediately in iceboxes and shortly afterwards stored at −20 °C until DNA extraction.

# DNA Extraction and Genotyping

Genomic DNA was extracted from 1 mL blood aliquots using a commercially available DNA extraction kit (E.Z.N.A.® MicroElute Genomic DNA extraction Kit; OMEGA-Bio-Teck, 2010). DNA concentration and purity were determined, and all samples were then standardized to 10 ng/µL for genotyping. Nineteen microsatellite (MS) markers recommended by the FAO/ISAG panel (FAO, 2011) were used for genotyping (**Table 1**). Microsatellites are highly polymorphic markers consisting of short sequence repeats of 1–6 base pairs (FAO, 2011). In brief, thermal cycling was performed on a GeneAmp® PCR System 9700 (Applied Biosystems). The PCR cocktail was prepared in a volume of 10 µL. The amplification conditions were an initial denaturation of 5 min at 94 °C, followed by a denaturation step at 95 °C for 45 s, an annealing step at the recommended temperature of each primer for 1 min, and an extension step at 72 °C for 1 min. The denaturation, annealing, and extension steps were repeated for amplification, and a final extension step at 72 °C for 10 min was included. The amplified PCR products were sized by fragment analysis on a 3130 Genetic Analyzer (Applied Biosystems®), and the microsatellite alleles were scored using GeneMapper® software.

# Analyses of Genetic Diversity and Structure

The number of alleles (A), allelic richness (AR), and expected heterozygosity (He) for each locus and breed were estimated using FSTAT software (Goudet, 1995). Estimating A is complicated by the effect of sample size, because larger samples are expected to contain more alleles. To correct for differences in sample size among the studied populations, and so avoid the resulting bias, estimates of AR were used (Kalinowski, 2004). The small sample sizes of the Awassi and Rufidi breeds were specifically taken into account in the AR analysis. Analysis of molecular variance (AMOVA), including the F-statistics, the pairwise differentiation coefficient (Fst), and the within-population inbreeding coefficient (Fis) (Hedrick, 2000), was computed under Hardy-Weinberg equilibrium (HWE) (Nei, 1987) using ARLEQUIN software (Excoffier et al., 2005). Population structure was analyzed using STRUCTURE (version 2.3.3) software (Pritchard et al., 2000), assuming an admixture model with correlated allele frequencies between breeds. The lengths of the burn-in and the Markov chain Monte Carlo (MCMC) simulations were 200,000 and 100,000, respectively, with 50 runs for each number of clusters (K) ranging between 2 and 6. The log probability of the data (L[K]) was estimated for each value of K. The results were exported to STRUCTURE HARVESTER (Earl and von Holdt, 2012) to plot the DeltaK values and so determine the most likely number of clusters. Finally, GENETIX® software was used to perform factorial correspondence analysis, a multidimensional statistical method for evaluating the number of genetic groups (Belkhir et al., 2000).
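The reason allelic richness is preferred over the raw allele count when sample sizes differ can be illustrated with a short sketch. The functions below assume the standard rarefaction formula for allelic richness (the expected number of distinct alleles in a random subsample of `g` gene copies) and Nei's unbiased gene diversity for He; the exact settings used in FSTAT are not given in the text, and the allele labels and counts are hypothetical.

```python
from collections import Counter
from math import comb


def expected_heterozygosity(alleles):
    """Nei's unbiased gene diversity at one locus.

    `alleles` is a flat list of allele labels, two per diploid animal.
    """
    n = len(alleles)
    counts = Counter(alleles)
    sum_p_squared = sum((c / n) ** 2 for c in counts.values())
    return (n / (n - 1)) * (1.0 - sum_p_squared)


def allelic_richness(alleles, g):
    """Expected number of distinct alleles in a subsample of g gene copies."""
    n = len(alleles)
    if g > n:
        raise ValueError("g cannot exceed the number of sampled gene copies")
    counts = Counter(alleles)
    # For each allele, add the probability that it appears at least once
    # when g copies are drawn without replacement from the n sampled.
    return sum(1.0 - comb(n - c, g) / comb(n, g) for c in counts.values())


# Hypothetical locus: a well-sampled breed and a poorly sampled breed.
large = ["A"] * 10 + ["B"] * 6 + ["C"] * 3 + ["D"] * 1  # 20 gene copies
small = ["A"] * 3 + ["B"] * 2 + ["C"] * 1               # 6 gene copies

g = min(len(large), len(small))  # rarefy both to the smaller sample
print("raw allele counts:", len(set(large)), len(set(small)))
print("allelic richness:", allelic_richness(large, g), allelic_richness(small, g))
print("He:", expected_heterozygosity(large), expected_heterozygosity(small))
```

Rarefying both samples to the same number of gene copies puts the two breeds on a common scale, which is why the raw count A and the richness AR can rank breeds differently when one breed is represented by only a few animals.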

# RESULTS

# Genetic Diversity

Within-population genetic variation was assessed from allelic diversity (A and AR) and gene diversity (He) (**Table 1**). The mean A was 9.1, 9.1, 8.8, 5.4, 7.3, and 4.2 for the Harri, Najdi, Naemi, Awassi, Arb, and Rufidi breeds, respectively. In general, the majority of alleles were found in all the breeds except Awassi and Rufidi, in which the small sample sizes might explain the comparatively small mean A. The A per locus ranged from 2 in all breeds except Naemi (A = 3) to 14 in both the Harri and the Najdi breeds (**Table 1**). At the locus level, the lowest A (2) was found at locus MAF214 in all the breeds except Naemi, whereas the highest A (14) was found at locus ILSTS044 in the Harri and the Najdi, at OARFCB226 in the Najdi, and at HSC in the Harri breed. The Harri and the Najdi had the highest, and similar, A at most of the studied loci, along with the same mean A (**Table 1**). This result might indicate that both breeds have a similar genetic background. The mean AR values were 2.9, 2.9, 2.8, 3.0, 2.7, and 2.8 for the Harri, Najdi, Naemi, Awassi, Arb, and Rufidi breeds, respectively, with an overall mean of 2.9. The average AR per breed was lowest (2.7) in the Arb and highest (3.0) in Awassi sheep (**Table 1**). Notably, the average AR for Harri and Najdi was the same, as was observed for their A. The lowest AR (1.8) was found at locus MAF214 in Arb sheep and at BM8125 in Najdi sheep, whereas the highest AR (3.6) was observed at DYMS1 in the Awassi breed and at OARFCB226 in the Rufidi sheep (**Table 1**).

The average He was 0.77, 0.75, 0.74, 0.80, 0.73, and 0.74 for the Harri, Najdi, Naemi, Awassi, Arb, and Rufidi breeds, respectively (**Table 1**). He, like AR, was slightly higher in the Awassi sheep than in the other breeds (**Table 1**). Overall, the average He across the 19 MS loci ranged from 0.73 to 0.80, a narrow range that indicates high genetic variation in all breeds. On the other hand, the Arb breed had the lowest He, although it was still at least 0.73. The AMOVA showed that 2.79, 7.85, and 89.45% of the genetic variation was found between the breeds, among individuals within the breeds, and within individuals, respectively (**Table S2**).

The AMOVA results also showed a significant positive inbreeding coefficient (Fis), indicating lower heterozygosity than expected under HWE, in four sheep breeds: 0.096, 0.091, 0.077, and 0.045 (P < 0.002, 0.001, 0.002, and 0.027) for Harri, Najdi, Naemi, and Arb, respectively (**Table 1**). The Fis values for the other two breeds, Awassi and Rufidi, were not significant (**Table 1**). The Fis values at individual loci varied from −0.33 at SCRCRSP09 in the Naemi to 0.91 at MAF14 in the Rufidi breed. Seven MS loci (TGLA53, DYMS1, ILSTS05, MAF214, OARJMP29, BM1329, and SRCRSP5) showed positive Fis values in all the breeds, indicating fewer heterozygotes than expected under HWE. At the remaining loci, the Fis value was either negative or positive depending on the breed (**Table 1**).
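The link between a positive Fis and a shortage of heterozygotes can be seen in a small worked example. The sketch below uses the simple definition Fis = 1 − Ho/He for a single locus in a single breed, with He taken as the uncorrected 1 − Σp²; this is a simplification of the estimators implemented in ARLEQUIN, and the genotypes are hypothetical.

```python
from collections import Counter


def fis(genotypes):
    """Fis = 1 - Ho/He for one locus in one population.

    `genotypes` is a list of (allele_1, allele_2) tuples, one per animal.
    He here is the uncorrected expected heterozygosity (1 - sum of p^2).
    """
    n_animals = len(genotypes)
    observed_het = sum(1 for a, b in genotypes if a != b) / n_animals
    alleles = [allele for pair in genotypes for allele in pair]
    counts = Counter(alleles)
    n_copies = len(alleles)
    expected_het = 1.0 - sum((c / n_copies) ** 2 for c in counts.values())
    return 1.0 - observed_het / expected_het


# Hypothetical genotypes: mostly homozygotes, so heterozygotes are scarce.
deficit = [("A", "A"), ("A", "A"), ("B", "B"), ("A", "B")]
# Hypothetical genotypes: every animal is heterozygous.
excess = [("A", "B")] * 4

print("Fis with heterozygote deficit:", fis(deficit))  # positive
print("Fis with heterozygote excess:", fis(excess))    # negative
```

A value above zero means fewer heterozygotes were observed than HWE predicts from the allele frequencies, which is the pattern reported for Harri, Najdi, Naemi, and Arb.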

The differentiation coefficients (Fst), based on the distance method of different allele numbers, were significant for all pairwise comparisons except that between Naemi and Awassi. The pairwise Fst values varied from the lowest (0.006) between Naemi and Awassi to the highest (0.104) between the Arb and the Rufidi breeds (**Figure 2**). The Fst values showed high differentiation between Rufidi and the other breeds. The next highest level of differentiation was between Arb and the other populations. Furthermore, **Figure 2** shows lower differentiation between the Harri and the Najdi, and higher differentiation between the Harri and the Arb. Awassi and Najdi each showed the lowest genetic differentiation from the rest of the populations.
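For readers unfamiliar with how a pairwise differentiation coefficient behaves, here is a minimal sketch. It is not the distance-based estimator used in ARLEQUIN; it assumes the simpler Nei-style definition Fst = (Ht − Hs)/Ht computed from allele frequencies, summed over loci, with two equally weighted populations. All frequencies are hypothetical.

```python
def pairwise_fst(freqs_a, freqs_b):
    """Nei-style Fst between two populations.

    Each argument is a list with one dict per locus, mapping an allele
    label to its frequency in that population. Hs is the mean
    within-population expected heterozygosity and Ht the expected
    heterozygosity of the pooled (averaged) allele frequencies.
    """
    numerator = 0.0
    denominator = 0.0
    for locus_a, locus_b in zip(freqs_a, freqs_b):
        het_a = 1.0 - sum(p * p for p in locus_a.values())
        het_b = 1.0 - sum(p * p for p in locus_b.values())
        hs = 0.5 * (het_a + het_b)
        alleles = set(locus_a) | set(locus_b)
        ht = 1.0 - sum(
            (0.5 * (locus_a.get(a, 0.0) + locus_b.get(a, 0.0))) ** 2
            for a in alleles
        )
        numerator += ht - hs
        denominator += ht
    return numerator / denominator


# Hypothetical single-locus allele frequencies.
similar_1 = [{"A": 0.8, "B": 0.2}]
similar_2 = [{"A": 0.6, "B": 0.4}]
fixed_1 = [{"A": 1.0}]
fixed_2 = [{"B": 1.0}]

print("modest frequency difference:", pairwise_fst(similar_1, similar_2))
print("fixed for different alleles:", pairwise_fst(fixed_1, fixed_2))
```

The statistic is 0 when two populations share identical allele frequencies and 1 when they are fixed for different alleles, so values such as 0.006 and 0.104 both sit near the low end of the scale.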

# Genetic Population Structure

The genetic population structure of each breed was determined from the admixture level of each individual, using the correlated allele frequencies model implemented in STRUCTURE. The Delta K results indicated that the optimal number of genetic clusters, representing the most likely ancestral populations, was K = 3 (**Figure 3A**). This value suggests that the studied sheep were better described by three genetic clusters/backgrounds than by six breeds (**Figure 3**). The three clusters comprised Harri, Najdi, and Rufidi in the first, Naemi and Awassi in the second, and Arb in the third (**Figure 3**). In **Figure 3**, each individual is represented by a single vertical line divided into K colored segments. The lengths of the colored segments are proportional to the admixture level for the predefined populations at K between 3 and 6. The first gene pool contained individuals of Harri, Najdi, and Rufidi sheep with varying assignment probabilities (∼60%) (**Figure 3**). Similarly, many individuals of this gene pool showed a substantial admixed proportion, mainly in blue. Some individuals of Najdi had high assignment probabilities to the second cluster (Naemi and Awassi). Notably, Najdi individuals carried a considerable proportion of admixture from the second and third gene pools. A few individuals were assigned to the green cluster with probabilities of ∼70%; these might be better considered members of the second cluster. In contrast, most individuals of the second pool (blue color; K = 3) were assigned solely to the cluster shared by Naemi and Awassi. The third gene pool contained the Arb breed, with very few individuals showing a limited admixture proportion (<20%) from the second gene pool. The proportion of the second gene pool shared with the other two pools indicates a common ancestral origin.
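How a Delta K value picks out the most likely number of clusters can be sketched as follows. The code assumes the Evanno-style statistic in the simplified form ΔK = |L(K+1) − 2L(K) + L(K−1)| / sd[L(K)], where L(K) is averaged over replicate runs; it is an illustration and not the STRUCTURE HARVESTER implementation, and the log-probability values are hypothetical (three runs per K instead of the 50 used in the study).

```python
from statistics import mean, stdev


def delta_k(log_prob_by_k):
    """Delta K for every K that has both neighbors K-1 and K+1.

    `log_prob_by_k` maps K to a list of L(K) values from replicate runs.
    """
    means = {k: mean(values) for k, values in log_prob_by_k.items()}
    result = {}
    for k in sorted(log_prob_by_k):
        if k - 1 in means and k + 1 in means:
            second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_difference / stdev(log_prob_by_k[k])
    return result


# Hypothetical L(K) values for K = 2..6.
runs = {
    2: [-5000.0, -5010.0, -4990.0],
    3: [-4800.0, -4802.0, -4798.0],
    4: [-4780.0, -4790.0, -4770.0],
    5: [-4770.0, -4800.0, -4740.0],
    6: [-4765.0, -4805.0, -4725.0],
}

scores = delta_k(runs)
best_k = max(scores, key=scores.get)
print(scores)
print("most likely number of clusters:", best_k)
```

Because the statistic needs both K − 1 and K + 1, testing K from 2 to 6 yields Delta K values only for K = 3, 4, and 5. The peak marks the K at which the log probability stops rising steeply while replicate runs still agree closely.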

# Correspondence Analysis

The correspondence analysis further clarified the genetic admixture and differentiation among individuals within and between the breeds (**Figure 4**). The results are represented in three-dimensional factorial graphs in which the first, second, and third factors (axes 1, 2, and 3) accounted for 33.92, 25.39, and 17.02% of the total variation, respectively. The analysis clearly distinguished Arb individuals from those of the other breeds. Furthermore, Awassi individuals were more clearly distinguished from the other breeds, but were closer to Naemi (**Figure 3**). Most individuals clustered into groups corresponding to their predefined breed rather than forming mixed populations. However, Harri and Najdi individuals showed admixture, as was observed in the structure analysis.

# DISCUSSION

The KSA imports millions of sheep every year for local consumption and for sacrifice during the Eid Al-Adha religious festival. This importation constitutes a current animal exchange network that spans countries as far away as Australia and nearer ones such as those of the Horn of Africa, Yemen, the Gulf, and the Middle East. Consequently, the genetic structure of indigenous sheep in the KSA could have been influenced by demographic events such as gene flow arising from this exchange network. Indeed, the main question driving our study was whether the genetic structure of KSA sheep was influenced more by internal gene flow, breeding practices, and geographical features. The high genetic variation observed within and between sheep breeds, as indicated by A, AR, and He, could be the result of one or more past evolutionary events. Considering that the transhumant system still predominates in all regions of the KSA, the most likely explanation for the high genetic variation is gene flow. This could have involved past gene flow among breeds reared in the same and adjacent regions. The best evidence for this is found in the Harri and Najdi breeds, which are kept in the same flocks, reared from the southern to the central regions, with many crossbreds. The A and He values were high at most of the loci studied in all the breeds. In particular, the Harri and Najdi had the highest A as well as He, indicating that they are the most genetically variable breeds. Generally, if recipient populations have different allele frequencies and selection is not operating, migration alone might be expected to generate genetic variation rapidly (Ridley, 2004). Our findings show that the Awassi was the most variable breed in terms of He. Earlier reports showed that He was 0.696 for Jordan Awassi (Al-Atiyat, 2015) and 0.667 for Turkish Awassi (Soysal et al., 2005). It seems that high values of He are not uncommon in the Awassi sheep, the most common breed in the Middle East.

The overall Fis value for each breed was positive, indicating a certain level of heterozygote deficiency. Positive Fis values indicate that individuals in a population are more closely related than expected under random mating and suggest that the sheep breeds had a higher level of inbreeding. This could be due to small population sizes, selection pressure, or population subdivision (Hedrick, 2000). The last of these is known as the Wahlund effect, a reduction in heterozygosity as a result of population subdivision (Hedrick, 2000). The Wahlund effect may apply to the Naemi sheep, which are subdivided into different populations across several regions. On the other hand, the heterozygote deficiency may arise because a small number of breeding males are used for mating or because, over the last few decades, mating has occurred among closely related animals. This is observed mainly in Najdi and Harri flocks. Even though gene flow into these two breeds was detected, it was not enough to produce an excess of heterozygosity. The low Fst value between these two breeds provided further evidence that they are closely related genetically. In a wider study, sheep of the world were found to be differentiated at the national and international levels (Kijas et al., 2012). For instance, Awassi from different countries were highly differentiated from the Australian Merino sheep (Al-Atiyat et al., 2014), the Spanish Merino (Arranz et al., 1998), Turkish sheep (Ozdemir et al., 2011), the Middle East fat-tailed sheep (Rocha et al., 2011), and Egyptian sheep (Elbeltagy et al., 2015).
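The Wahlund effect mentioned above can be checked with simple arithmetic. The sketch below assumes a biallelic locus and equally sized subpopulations that are each in Hardy-Weinberg equilibrium; the allele frequencies are hypothetical. Pooling such subpopulations produces an apparent heterozygote deficit even though mating is random within each of them.

```python
def wahlund(p_values):
    """Apparent heterozygote deficit after pooling subpopulations.

    `p_values` holds the frequency of one allele at a biallelic locus in
    each subpopulation. Subpopulations are assumed equal in size and each
    in Hardy-Weinberg equilibrium. Returns (observed heterozygosity in
    the pooled sample, expected heterozygosity from the pooled allele
    frequency, apparent F = 1 - observed / expected).
    """
    k = len(p_values)
    p_mean = sum(p_values) / k
    observed = sum(2 * p * (1 - p) for p in p_values) / k
    expected = 2 * p_mean * (1 - p_mean)
    return observed, expected, 1 - observed / expected


# Hypothetical flocks of one breed kept in two separate regions.
observed, expected, apparent_f = wahlund([0.2, 0.8])
print("observed heterozygosity:", observed)
print("expected heterozygosity:", expected)
print("apparent F:", apparent_f)
```

The gap between expected and observed heterozygosity equals twice the variance of the allele frequency among subpopulations, so the more the regional flocks differ, the larger the positive Fis obtained when they are analyzed as one breed.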

Structure and admixture analyses have been used in earlier studies of different sheep populations and provide an appropriate approach for identifying ancestral, pure, and hybrid populations (Alvarez et al., 2004; Ligda et al., 2009). Although the results of STRUCTURE showed admixture at the individual level in each sheep breed, the six breeds could be clustered into three gene pools. All individuals of the Arb sheep were assigned to a separate gene pool, with a few individuals showing a small fraction of admixture deriving from common ancestry (**Figure 3**). The results also showed that, at the individual level, Naemi and Awassi had mixed ancestry as a result of sharing a fraction of their genome inherited from common ancestors, whereas this fraction was much smaller for individuals of the other breeds. Further work is needed to establish whether the observed admixture is related to breeding practices, geographical isolation, and/or common ancestry. It is widely accepted that the world's sheep breeds reflect high levels of historical admixture and strong recent selection (Kijas et al., 2012). On the other hand, the clear admixture found between Najdi and the gene pool of Naemi and Awassi possibly reflects shared ancestry and past migration of individuals within the same geographical regions. In fact, the regions from which these four breeds (Harri, Najdi, Naemi, and Awassi) were sampled are considered to be the major livestock husbandry regions, where the transhumant production system is common and recent crossing has been observed. The observed genetic structure might therefore be related to the geographical features of the regions from which the breeds were sampled. This result was probably due, first, to shared ancestry and, second, to gene flow between populations reared in neighboring geographic areas. In contrast, Arb sheep were geographically isolated in the eastern region of the KSA, with very limited dispersal to other regions of the KSA. The indigenous nomadic people strongly favored pure breeding of the breed and objected to any crossbreeding strategy. Therefore, the genetic structure of Arb sheep could have been influenced by a founder effect, because the breed has been confined geographically to the eastern region of the KSA. As a consequence, the graphic representation of the correspondence analysis (**Figure 4**) showed a clear separation of Arb individuals. Clearly, the study populations are subdivided into three groups, matching the results of the structure analysis. Although the Awassi sheep were located far from the other studied groups, they were clearly closest to the Naemi sheep. The three groups matched the geographic distribution of their sampling locations. These results agree with the known history of the breeds with regard to their geographic locations and their long evolutionary history associated with past common ancestors. In general, the result was similar to previous findings that showed a close genetic relationship between the four KSA sheep breeds (Aljumaah et al., 2014) and the native Jordan Awassi sheep (Al-Atiyat et al., 2014; Al-Atiyat, 2015). Furthermore, Turkish Awassi sheep, a fat-tailed breed, were separated from other Turkish sheep breeds on the basis of correspondence analysis (Ozdemir et al., 2011). Indeed, the Near East region is considered to be the main center of origin of fat-tailed sheep specifically (Rocha et al., 2011). In agreement, the Jordan Awassi shows no common genetic structure with the Australian Merino, most likely because of geographic isolation (Al-Atiyat, 2015). On the other hand, evidence of gene exchange between Egyptian sheep breeds was reported for flocks reared in the same region (Elbeltagy et al., 2015). Furthermore, Kijas et al. (2012) reported that the world's sheep breeds reveal high levels of historic admixture and strong recent selection.

# CONCLUSION

The sheep breeds of the KSA showed high genetic diversity, considering that they are reared in geographic regions that are far apart and differ in their features. The Arb sheep was the most differentiated breed, whereas Jordan Awassi was least differentiated from Naemi sheep, indicating their common ancestry. The population structure analysis identified three main gene pools underlying the ancestral genetic diversity. The first comprised Harri, Najdi, and Rufidi; the second, Naemi and Awassi; and the third, the Arb breed. Accordingly, the factorial correspondence analysis distributed the individuals among the three genetic groups. The resulting genetic structure showed that all gene pools had a limited shared genetic makeup arising from common ancestry. Furthermore, the first and second gene pools could have arisen from past and recent gene flow between individuals. Gene flow was evident between flocks in which two or more breeds are reared under the transhumant production system. The third pool might have resulted from geographical separation/isolation. These results agree with the known history of the breeds with regard to their geographical location and their expected common evolutionary history.

# ETHICS STATEMENT

Standard techniques were used to collect blood. The procedure was reviewed and approved by the University of Edinburgh Ethics Committee (reference number OS 03-06) and also by the Institute Animal Care and Use Committee of the International Livestock Research Institute, Nairobi.

# REFERENCES


# AUTHOR CONTRIBUTIONS

RA-A and RA conceived and designed the experiment. RA and MA performed the experiment. RA-A and RA analyzed the data. RA-A performed the bioinformatics analysis. MA and AA contributed data. RA-A, RA, and AA wrote the manuscript. All authors have agreed on the contents of the manuscript.

# ACKNOWLEDGMENTS

The authors extend their sincere appreciation to the Deanship of Scientific Research at King Saud University for funding the research group under grant No. RGP-42. The authors also thank the International Livestock Research Institute, Nairobi, for providing laboratory facilities and help.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00408/full#supplementary-material


Hedrick, P. W. (2000). Genetics of Populations. Sudbury, MA: Jones and Bartlett.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Al-Atiyat, Aljumaah, Alshaikh and Abudabos. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Use of High Density Single Nucleotide Polymorphism (SNP) Arrays to Assess Genetic Diversity and Population Structure of Dairy Cattle in Smallholder Dairy Systems: The Case of Girinka Programme in Rwanda

Mizeck G. G. Chagunda<sup>1</sup>\*, Fidalis D. N. Mujibi<sup>2,3</sup>, Theogene Dusingizimana<sup>4</sup>, Olivier Kamana<sup>4</sup>, Evans Cheruiyot<sup>2</sup> and Okeyo A. Mwai<sup>5</sup>

### Edited by:

Joram Mwashigadi Mwacharo, International Center for Agricultural Research in the Dry Areas (ICARDA), Ethiopia

### Reviewed by:

Kwan-Suk Kim, Chungbuk National University, South Korea; Junhong Xia, Sun Yat-sen University, China

\*Correspondence:

Mizeck G. G. Chagunda Mizeck.chagunda@uni-hohenheim.de

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 19 April 2018 Accepted: 14 September 2018 Published: 10 October 2018

### Citation:

Chagunda MGG, Mujibi FDN, Dusingizimana T, Kamana O, Cheruiyot E and Mwai OA (2018) Use of High Density Single Nucleotide Polymorphism (SNP) Arrays to Assess Genetic Diversity and Population Structure of Dairy Cattle in Smallholder Dairy Systems: The Case of Girinka Programme in Rwanda. Front. Genet. 9:438. doi: 10.3389/fgene.2018.00438

<sup>1</sup> Animal Breeding and Husbandry in the Tropics and Subtropics, University of Hohenheim, Stuttgart, Germany, <sup>2</sup> Usomi Limited, Nairobi, Kenya, <sup>3</sup> The Nelson Mandela African Institution of Science and Technology, Arusha, Tanzania, <sup>4</sup> College of Agriculture, Animal Sciences and Veterinary Medicine, University of Rwanda, Kigali, Rwanda, <sup>5</sup> International Livestock Research Institute, Nairobi, Kenya

In most smallholder dairy programmes, farmers are not fully benefitting from the genetic potential of their dairy cows. This is in part due to the mismatch between the available genotypes and the environment, including management, in which the animals perform. With sparse performance and pedigree records in smallholder dairy farms, the true degree of baseline genetic variability and breed composition is not known, which renders any genetic improvement initiative difficult to implement. Using the Girinka programme of Rwanda as an exemplar, the current study aimed to better understand the genetic diversity and population structure of dairy cattle in the smallholder dairy farm setting. Further, the association between farmer self-reported cow genotypes and genetically determined genotypes was investigated. The average heterozygosity estimates were highest (0.38 ± 0.13) for Rwandan dairy cattle and lowest for Gir and N'Dama (0.18 ± 0.19 and 0.25 ± 0.20, respectively). Systematic characterization of the genetic variation and diversity available may inform the formulation of sustainable improvement strategies, such as targeting and matching the genotype of cows to productivity goals and farmer profile, and hence reduce the negative impact of genotype by environment interaction.

Keywords: genetic diversity, population structure, dairy cattle, smallholder, SNP arrays

# INTRODUCTION

Smallholder dairying has the potential to lift people out of poverty, provide sustainable livelihoods and enhance household food and nutritional security. In different countries in Sub-Saharan Africa, a variety of dairy development initiatives are being implemented either by national governments or Non-Governmental Organisations (NGOs) (Chagunda et al., 2016). An example of such initiatives is the "One Cow per Poor Family Programme" in Rwanda. This programme, which is locally known as "Girinka," is a country-wide initiative to provide poor households with dairy cattle, targeting in particular areas where the cattle population is currently low. The Girinka programme was launched in 2006 with the overall objective of increasing agricultural productivity through application of cow manure in crop fields and also through increased dairy production. This in turn would drive improvements in human nutrition and household income and reduce poverty. According to the Rwandan Ministry of Agriculture, a total of 249,000 cows of different breeds had been distributed by June 2016. In addition to cattle of known breeds such as Ankole, Jersey, Ayrshire, and Holstein Friesian, cross-bred cows of different grades have also been distributed to farmers. Some of the animals were sourced from within the country, while the majority were imported from countries such as Kenya, Uganda, Tanzania, South Africa, and the Netherlands. Such an importation strategy not only changes the genotypic frequency at population level, but also increases the genetic diversity of the base population. The Girinka programme is a classic example of the different variants of smallholder dairy programme development in Sub-Saharan Africa. Key to any future improvement initiatives is the use of breed composition information to target and match genetics to productivity goals. The challenge, though, is that with sparse performance and pedigree records in smallholder dairy farms, the true degree of baseline genetic variability and breed composition is not known, and hence any meaningful genetic improvement initiative is difficult to implement.

The objective of the current study was to better understand the genetic diversity and population structure of dairy cattle under the Girinka programme through use of high density single nucleotide polymorphism (SNP) arrays. This approach has the potential to clearly inform the formulation of sustainable improvement strategies.

# MATERIALS AND METHODS

# Ethics Statement

All procedures performed in the study involving human participants and the protocol for animal hair sample collection were reviewed and approved by the Ethics Committee of the University of Rwanda's Research and Postgraduate Studies (RPGS) Unit and the National Institute of Statistics of Rwanda (NISR) based on the guidelines provided by the Rwanda National Ethics Committee and in accordance with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Animal handling was done by knowledgeable personnel to ensure maximum comfort and minimal injury at all stages of the research.

# Farmer Survey and Animal Samples

This study was conducted as a survey that combined socio-economic data, data on indicator traits for cow productivity, and biological data in the form of animal hair samples. All the enumerators were properly trained to conduct the survey, and all standard biosecurity and institutional safety procedures were adhered to under the supervision of an expert from the University of Rwanda. A total of 1564 smallholder dairy farmers from the South and North provinces of Rwanda were interviewed. The respondents were beneficiaries of the Girinka programme. Socio-economic and productivity data that were collected included information on gender issues, production systems, and access to relevant dairy production inputs such as fodder, water, labour, and animal health services. Animal hair samples were collected from the tail switch, taking care to avoid faecal contamination, following a protocol provided by the International Livestock Research Institute (ILRI). A total of 2717 cows were sampled from smallholder dairy farms, consisting of 1492 samples from the North province and 1225 samples from the South province. Due to budget limitations, 150 random samples were selected from each of the provinces and shipped for genotyping. Samples were heat treated at 70°C for 2 h in preparation for shipping and genetic analysis. Of the 300 submitted samples, genotyping results were obtained from 299 samples. The rest of the samples have been safely stored in a biorepository at ILRI for future use. Results from the socio-economic survey are beyond the scope of the current paper.

# Reference Dataset

A panel of genotypes from commercial international taurine dairy breeds was used as a reference for breed composition assignment. These included Friesian (n = 28 samples), Holstein (n = 63), Norwegian Red (n = 17), Jersey (n = 36), and Guernsey (n = 21) breeds. To capture genetic signatures representative of African cattle, an African taurine breed, N'Dama (n = 24), and two indicine breeds, the East African Shorthorn Zebu (EASZ) (n = 50) and Gir (n = 30), were also included in the analysis.

# Genotyping and Quality Control

Samples were genotyped at Geneseek (Neogen Corporation, Nebraska, United States) using the Geneseek Genomic Profiler (GGP) High Density (HD) SNP array consisting of 150,000 SNPs, while the reference breeds had been genotyped with the Illumina HD Bovine (777K SNPs) array. The SNPs in the GGP array were optimised for use in dairy cattle, comprising the most informative SNPs from the Illumina Bovine 50k and 770k chips and additional variants known to have a large effect on disease susceptibility and performance. Genotype data quality control and checks were carried out using PLINK v 1.9 (Purcell et al., 2007) and included removal of SNPs with less than 90% call rate or less than 5% minor allele frequency (MAF), and of samples with more than 10% missing genotypes. Additional removal of SNPs not mapped to any chromosome left a total of 120,591 SNPs for analysis. Of the 299 animals, 12 failed the quality checks outlined above and were removed from the analysis. The total genotyping rate in the remaining samples was 0.991. The 120,591 SNPs used in the analysis covered 2516.25 Mb with an average distance of 22.67 kb between adjacent SNPs. The chromosomal length ranged between 42.8 Mb on BTA 25 and 158.86 Mb on BTA 1. The mean distance between adjacent SNPs per chromosome ranged between 18.67 and 23.89 kb on BTA 14 and BTA 29, respectively. The linkage disequilibrium (LD) across the genome averaged 0.41. Private alleles, defined as variants which are segregating in only one population when evaluating multiple populations, were identified using a custom script in R. A total of 143 private variants were detected, most (132) of which originated from the Rwanda cattle population.
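The filtering thresholds and the private-allele definition described above can be illustrated with a small NumPy sketch. This is not the PLINK command line or the custom R script used in the study; the genotype coding (0/1/2 copies of one allele, -1 for a missing call), the function names `qc_filter` and `private_alleles`, and the choice to evaluate all filters on the unfiltered matrix are assumptions made for illustration only.

```python
import numpy as np

MISSING = -1  # assumed code for a missing genotype call


def qc_filter(geno, snp_call_rate=0.90, min_maf=0.05, max_sample_missing=0.10):
    """Return boolean masks (keep_samples, keep_snps) for a samples x SNPs matrix.

    All three filters are evaluated on the unfiltered matrix, which is a
    simplification of how a tool such as PLINK orders its filters.
    """
    geno = np.asarray(geno)
    called = geno != MISSING

    # Per-SNP call rate and minor allele frequency
    n_called = called.sum(axis=0)
    call_rate = n_called / geno.shape[0]
    allele_count = np.where(called, geno, 0).sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        freq = allele_count / (2 * n_called)
    freq = np.nan_to_num(freq)  # SNPs without any call get frequency 0
    maf = np.minimum(freq, 1 - freq)
    keep_snps = (call_rate >= snp_call_rate) & (maf >= min_maf)

    # Per-sample proportion of missing genotypes
    sample_missing = 1 - called.mean(axis=1)
    keep_samples = sample_missing <= max_sample_missing
    return keep_samples, keep_snps


def private_alleles(geno, pop_labels):
    """Map each population to the indices of SNPs segregating in it alone."""
    geno = np.asarray(geno)
    pop_labels = np.asarray(pop_labels)
    pops = sorted(set(pop_labels.tolist()))
    segregating = []
    for pop in pops:
        sub = geno[pop_labels == pop]
        called = sub != MISSING
        n = called.sum(axis=0)
        count = np.where(called, sub, 0).sum(axis=0)
        # Segregating: both alleles are observed in this population
        segregating.append((count > 0) & (count < 2 * n))
    segregating = np.array(segregating)
    only_one = segregating.sum(axis=0) == 1
    return {pop: np.flatnonzero(segregating[i] & only_one)
            for i, pop in enumerate(pops)}
```

In practice the masks would be applied as `geno[keep_samples][:, keep_snps]` before any downstream analysis, and the private-allele count per population is simply the length of each returned index array.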

# Minor Allele Frequency, Inbreeding and Heterozygosity Estimates

Minor allele frequencies (MAF) were estimated using PLINK. The distribution of MAF in each subpopulation (i.e., European taurine, African taurine, Indicine breeds and Rwandan crossbred cattle) was represented as the proportion of all the SNPs used in the analysis, grouped into five classes as follows: [0.0,0.1], [0.1,0.2], [0.2,0.3], [0.3,0.4], and [0.4,0.5]. The results were plotted for comparison between subpopulations using R (R Core Team, 2016). The observed heterozygosity estimates for each population were calculated from observed genotype frequencies obtained from PLINK (Purcell et al., 2007) using the Hierfstat package (Goudet, 2005) in R (R Core Team, 2016). Inbreeding coefficient estimates were also calculated using Hierfstat. To obtain confidence intervals, 100,000 permutations were performed after pruning the markers so that they were in approximate linkage equilibrium. Pruning was carried out in PLINK using the --indep-pairwise option (50 5 0.3). The pruning proceeded by calculating LD for sliding windows of 50 markers, with a new window obtained by shifting 5 markers along the length of the chromosome. A marker was pruned when the LD (r²) between a pair of markers was 0.3 or above. Consequently, 33,208 markers were removed, leaving a total of 87,383 markers that were used for the inbreeding analysis. Negative FIS values were set to zero because such inbreeding coefficient estimates reflect sampling error (Purcell et al., 2007).
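As a rough illustration of the quantities described in this section, the sketch below bins MAF values into the five classes and computes observed heterozygosity, expected heterozygosity and FIS from a genotype matrix. It is not the Hierfstat implementation, which applies sample-size corrections that are omitted here; the function names, the 0/1/2 genotype coding and the absence of missing data are assumptions. Note that `numpy.histogram` treats every class as half-open except the last, so a MAF of exactly 0.1 falls in the second class.

```python
import numpy as np


def maf_class_proportions(maf, edges=(0.0, 0.1, 0.2, 0.3, 0.4, 0.5)):
    """Proportion of SNPs in each MAF class (last class includes 0.5)."""
    maf = np.asarray(maf, dtype=float)
    counts, _ = np.histogram(maf, bins=edges)
    return counts / len(maf)


def heterozygosity_and_fis(geno):
    """Mean Ho, mean He and FIS for one population.

    geno: samples x SNPs matrix coded 0/1/2 without missing values, with
    at least one polymorphic SNP (simplifying assumptions).
    """
    geno = np.asarray(geno, dtype=float)
    p = geno.mean(axis=0) / 2                 # allele frequency per SNP
    ho = (geno == 1).mean(axis=0).mean()      # observed heterozygosity
    he = (2 * p * (1 - p)).mean()             # expected under Hardy-Weinberg
    fis = 1 - ho / he
    # Negative estimates are set to zero, as described in the text
    return ho, he, max(fis, 0.0)
```

An excess of heterozygotes relative to Hardy-Weinberg expectation, which is plausible in a recently crossbred population, yields a negative raw FIS and is therefore reported as zero under this convention.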

# Admixture and Principal Component Analysis

Principal component analysis (PCA) was used to describe the genetic structure of the crossbred cattle population. It was run in PLINK (Purcell et al., 2007), which uses a variance-standardised relationship matrix for dimension reduction. The PCA results were visualised using the Genesis package (Buchmann and Hazelhurst, 2014) in R. The unsupervised model-based clustering method implemented in the programme ADMIXTURE v. 1.3.0 (Alexander et al., 2009) was used to estimate the breed composition of individual animals using 111,836 markers. The analysis was undertaken with K (number of distinct breeds) ranging from 2 to 9 to reflect the genetic background of the cattle under study, starting with the basic cross (indicine and taurine) and continuing up to the total number of populations in the analysis, given the 8 reference breeds. Ten-fold cross-validation (CV = 10) was specified, and the resulting error profile was used to explore the most probable number of clusters (K), as described by Alexander et al. (2009). Graphical display of the admixture output was done using the Genesis package (Buchmann and Hazelhurst, 2014) in the R statistical programme (R Core Team, 2016).
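To make the idea of a PCA on a variance-standardised relationship matrix concrete, the following sketch standardises each SNP by its allele frequency, forms the sample-by-sample relationship matrix and extracts its leading eigenvectors. It is a minimal NumPy illustration rather than the PLINK routine; the function name `grm_pca`, the 0/1/2 coding and the exclusion of missing and monomorphic SNPs are assumptions.

```python
import numpy as np


def grm_pca(geno, n_components=2):
    """Leading principal components of a standardised relationship matrix.

    geno: samples x SNPs matrix coded 0/1/2, with no missing values and
    no monomorphic SNPs (simplifying assumptions).
    Returns (components, explained) where components is samples x n_components.
    """
    geno = np.asarray(geno, dtype=float)
    p = geno.mean(axis=0) / 2
    # Centre by 2p and scale by the binomial standard deviation of each SNP
    z = (geno - 2 * p) / np.sqrt(2 * p * (1 - p))
    grm = z @ z.T / geno.shape[1]
    eigvals, eigvecs = np.linalg.eigh(grm)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    explained = eigvals[order] / eigvals.sum()      # share of total variation
    return eigvecs[:, order], explained
```

Plotting the first column of `components` against the second, coloured by population, gives the kind of scatter in which crossbred animals would be expected to fall between their parental groups.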

# Phylogeny and Pairwise Fst

In order to understand the relationships between the populations, the Euclidean distance between populations was evaluated using the dartR package (Gruber et al., 2018) in R. A neighbour-joining (NJ) relationship tree was then constructed using the APE package (Paradis et al., 2004). Pairwise population differentiation (Fst) was calculated using Hierfstat. Confidence intervals were obtained after 100,000 permutations.
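The two between-population measures mentioned here can be sketched as follows. The example computes Euclidean distances between population allele-frequency vectors and a simple Nei-type Fst, (HT − HS)/HT, summed over loci. The estimators actually used by dartR and Hierfstat may differ (for example through sample-size corrections), so the function names and formulas below are illustrative assumptions, not a reimplementation of those packages.

```python
from itertools import combinations

import numpy as np


def allele_freqs(geno, pop_labels):
    """Allele frequencies per population from a 0/1/2 genotype matrix."""
    geno = np.asarray(geno, dtype=float)
    pop_labels = np.asarray(pop_labels)
    pops = sorted(set(pop_labels.tolist()))
    freqs = np.array([geno[pop_labels == pop].mean(axis=0) / 2 for pop in pops])
    return pops, freqs


def pairwise_fst(p1, p2):
    """Uncorrected Nei-type Fst between two allele-frequency vectors."""
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    return (ht.sum() - hs.sum()) / ht.sum()


def distance_and_fst(geno, pop_labels):
    """Symmetric matrices of Euclidean distance and Fst between populations."""
    pops, freqs = allele_freqs(geno, pop_labels)
    n = len(pops)
    dist = np.zeros((n, n))
    fst = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        dist[i, j] = dist[j, i] = np.linalg.norm(freqs[i] - freqs[j])
        fst[i, j] = fst[j, i] = pairwise_fst(freqs[i], freqs[j])
    return pops, dist, fst
```

The distance matrix is the kind of input a neighbour-joining routine takes, and the Fst matrix is what a heatmap of population differentiation would display.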

# RESULTS

Farmer self-reported information showed that the predominant genotype (45%) used for milk production in the Girinka programme was the cross between Holstein-Friesian and Zebu (**Table 1**). Ten percent of the farmers received pure Holstein-Friesian cattle while 6% of the farmers received Jersey cattle. Other farmers received local Zebu (20%). A substantial proportion (18%) of farmers did not know the genotype of the cow that they received. From the genetic analysis, the majority (87%) of the cows were determined to be cross-breeds between exotic dairy breeds, such as Holstein Friesian, Jersey and Ayrshire, and local zebu type animals. The rest were either pure exotic breeds (7%) or local zebu breeds (6%). There was 46.2% agreement and 29.4% disagreement between the farmer-reported genotypes and the genetically determined genotypes. For the remaining animals (24.4%), the owners reported that they did not know the genotype at all. The majority of the farmers received the animals as either calves (66%) or growing heifers (24%).
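The agreement figures above amount to a three-way tally of paired labels. A minimal sketch of such a tally is given below; the function name `concordance`, the label strings and the use of the string "unknown" for farmers who could not name a genotype are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def concordance(reported, genomic, unknown="unknown"):
    """Percentage of animals whose reported genotype agrees with, disagrees
    with, or is missing relative to the genomically determined genotype."""
    if len(reported) != len(genomic):
        raise ValueError("reported and genomic must have the same length")
    tally = Counter()
    for rep, gen in zip(reported, genomic):
        if rep == unknown:
            tally["unknown"] += 1
        elif rep == gen:
            tally["agree"] += 1
        else:
            tally["disagree"] += 1
    n = len(reported)
    return {key: 100 * tally[key] / n for key in ("agree", "disagree", "unknown")}
```

By construction the three percentages sum to 100, as do the 46.2, 29.4 and 24.4% reported here.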

# Genetic Diversity

The distributions of average minor allele frequencies for all populations under study (European taurine, African taurine, Indicine, and Rwandan crossbred cattle) are shown in **Figure 1**. Indicine (EASZ and Gir) and African taurine (N'Dama) breeds had the highest proportion of SNPs in the lowest MAF class ([0.0,0.1]) compared to European taurine (ET) breeds. The Rwandan crossbred cattle had a relatively high proportion of SNPs with high MAF (mostly [0.3,0.4] and [0.4,0.5]). The observed heterozygosity estimates for the study populations are given in **Table 2**. The average heterozygosity estimates were highest for the Rwanda cattle (0.38 ± 0.13) and lowest for Gir and N'Dama (0.18 ± 0.19 and 0.25 ± 0.20, respectively). Heterozygosity estimates for European taurine breeds used as references ranged between 0.30 ± 0.19 and 0.37 ± 0.12 for Jersey and Holstein breeds, respectively.

FIGURE 1 | Minor allele frequency distributions for Rwanda cattle and reference breeds. AT, African taurine; ET, European taurine; Indicine, East African Shorthorn Zebu and Gir; Rwanda, Girinka cattle population.

TABLE 2 | Average inbreeding coefficient, observed and expected heterozygosity estimates. Values are means ± SD.


The study populations showed low detectable levels of inbreeding for both Rwanda cattle and the reference samples (**Table 2**). The values obtained were not significantly different from zero.

**Figure 2** shows a heatmap of population differentiation for the Rwanda cattle and the reference populations. For Rwanda cattle, the Fst-values were small ranging from 0.07 to 0.19 for Friesian and Gir, respectively. Large differentiation ranging from 0.35 to 0.43 was observed between Gir and Taurine breeds, reflecting the historical divergence between these breeds (Loftus et al., 1994).

# Principal Coordinate Analysis

The first principal coordinate vector accounted for 12% of the total variation and separated European taurine breeds from non-European breeds as shown in **Figure 3**. The second vector accounted for 3.3% of total variation and separated the African taurine breeds (N'Dama) from the indicine breeds. The Rwandan samples dispersed intermediate between EASZ and the Taurine breeds. A significant number of the Rwandan samples dispersed close to the N'Dama breed, suggesting a significant contribution of the breed in some of the animals in the population.

# Admixture Analysis and Relationship Among the Studied Breeds

ADMIXTURE analysis results are presented in **Figure 4**. Each animal is represented by a vertical line divided into K coloured segments representing the estimated fraction belonging to each cluster. Short vertical lines at the bottom of each horizontal bar delimit individuals of different populations. Reference breeds are labelled as Guernsey (GN), Norwegian Red (NR), Friesian (FR), Holstein (HO), Jersey (JE), N'Dama (ND), East African Shorthorn Zebu (ZB) and Gir (GI). Based on visual inspection of the admixture plot, scrutiny of the separate CV error plots and the PCoA plots, K = 8 represented the most appropriate population number for the dataset. Importantly, increasing K above 8 did not reveal any detectable population substructure and the breed clusters remained the same. Based on results obtained with K = 8, most animals were crosses of Holstein-Friesian breeds, which contributed on average 58.3% of the total genes in the crossbred animals. The predicted absolute exotic breed gene content in the crossbred cattle ranged from 12 to 100% (Huson and Bryant, 2006). The phylogenetic tree showing the relationships among the studied breeds is presented in **Figure 5**. The phylogeny confirms that the majority of the cows in the Girinka programme are crosses between the African indicine breeds and the European taurine breeds.

# DISCUSSION

The Girinka programme was introduced by the government of Rwanda as a means of enhancing food and nutritional security for rural poor households. Based on the national poverty assessment, every poor family is mandated to have a dairy cow which provides milk for household nutrition and extra milk is sold to supplement other income streams. Dairy farming lends itself as a pathway out of poverty given its ability to generate a daily household cash flow while keeping the animal alive. However, for the programme to be sustainable, there is need to ensure that farmers access the right animals for their specific production environments. Dairy farmers in the tropics, and specifically in smallholder farms, face many challenges including disease pressure, poor feed availability, high temperatures and generally inappropriate management strategies. A better understanding of the genetic diversity of the population under study is not only important for maximising productivity but also provides a means to evaluate the germplasm supply chains. This would ensure that appropriate animals are sourced for any rural development initiative as well as for any genetic improvement programme. This is vital, not only for enhanced food and nutritional security but also for improved animal welfare.

The results from the current study indicate low genetic diversity in indicine (EASZ and Gir) and African taurine (N'Dama) breeds compared to European taurine (ET) breeds. This result is consistent with the design of the genotyping array used, which targets Bos taurus breeds and has low representation of indicine breeds (Bovine HapMap Consortium, 2009). This ascertainment bias causes the disproportionate distribution of MAF among the subpopulations, such that indicine and African breeds had lower diversity measures. The Rwanda population had a relatively large proportion of SNPs with high MAF given their frequent crossbreeding events, predominantly with breeds of high European taurine ancestry. Typically, the study animals are sourced from many smallholder farmers in diverse countries in the region (Hahirwa and Karinganire, 2017). This is because the demand for high quality heifers in East Africa is high compared to the available supply (Staal et al., 1996; Muriuki and Thorpe, 2001). There are no large breeders to fill this gap. As such, a few animals are sourced from small herds, which are predominantly kept by smallholder farmers (Muriuki and Thorpe, 2001). The high genetic variability observed in the current populations presents an opportunity for implementation of genetic improvement programmes to facilitate adaptation to local production environments, which are constantly changing due to continuous environmental perturbations, the capacity of farmers to manage the animals and the availability of feed resources (Thornton, 2010). The relatively low heterozygosity estimates for indicine and African taurine breeds observed in this study are due to poor representation of SNPs for non-European taurine cattle. It is interesting to note that the Rwanda cattle population had a large proportion of African taurine breed (N'Dama) signature. This reflects significant crossbreeding with Ankole cattle, which are a Sanga type cattle breed with 50% African taurine and 50% Zebu ancestry. The Rwanda cattle population therefore consists of a unique genepool that can be harnessed to develop a synthetic breed with the best attributes of all cattle breeds in East Africa. This would have the potential to contribute not only to higher production potential, but also to adaptability to heat and disease stress.

The results also showed minimal differences in inbreeding coefficient estimates between European taurine and the Rwanda population. Given the extensive admixture observed for the Rwanda population, this was expected. To accurately assess the population structure of the study populations, we chose the PCoA method to assess dissimilarity between populations. The PCoA plot illustrates the wide range of genetic composition and breed contribution. The Rwanda cattle in the Girinka programme are not only highly admixed but also mainly crosses of Holstein Friesian, African taurine (N'Dama) and the East African Zebu. The dispersion pattern observed in this study reflects the practice of indiscriminate crossbreeding, where farmers continually upgrade their animals to high exotic levels in a bid to increase productivity. ADMIXTURE results agree with the PCoA results and demonstrate the wide range of breed types that constitute the Rwanda Girinka cattle. The dominance of Holstein-Friesian breeds over other cattle breeds reflects the goals of the Girinka programme in terms of maximising milk yields.

Farmers' ability to identify the genotype of their animals was limited. This implies that farmers either have poor knowledge of dairy breeds or the animals are not performing as expected. Based on the phenotypic performance of their animals, farmers may not have been convinced that the breed they were told they would receive is the one they have when it does not perform at the level they expected. This could be in terms of both underperforming and overperforming. This mismatch between the breed that farmers have and what they believe they have also reflects poor pedigree record keeping and poor access to breed choices. Currently, there are no large farms that would provide large numbers of suitable animals when needed. A scheme for appropriate sire selection and animal identification ought to be instituted across East Africa. In the meantime, handlers of the Girinka programme need to start instituting a breed composition profiling campaign after they purchase the animals so that they can match animals to specific farmer production systems. Farmers who have the capacity to provide the right inputs, such as animal feed and proper health management, and who have access to markets should receive the animals with the highest taurine composition, while those farmers with low capacity to provide inputs ought to receive animals with a composition consistent with their production system. To ensure that the Girinka programme fulfils its goal, farmer education on dairy best practices, with consideration of cow genetic diversity, must precede farmer acquisition of the cattle. This will ensure that farmers are well prepared with regard to the demands of rearing dairy cattle and have the requisite knowledge and inputs. The low dairy productivity reported in different countries in Sub-Saharan Africa reflects the inappropriateness of the breed allocation programmes and also a general lack of proper preparatory work prior to breed allocation.

# CONCLUSION

This study has demonstrated that a substantial number of farmers in the Girinka programme did not know the real breed of their cow. This would be a major bottleneck in any efforts for breed improvement. High density SNP markers can be applied in smallholder production settings to inform decision making and offer insightful options in breed development and distribution among smallholder farmers. Such information is vital in developing future breed sourcing strategies and development efforts among governments and NGOs targeting smallholder farmers. Further, the diversity of breed types used and the wide admixture spread present the Rwandan dairy population with the opportunity for in-depth studies to identify the appropriate breed types and admixture level for different production systems.

# DATA AVAILABILITY

The data supporting the conclusions of this manuscript has been uploaded to Figshare by the authors at https://doi.org/10.6084/m9.figshare.7046768. Requests for the reference genotypes must be made directly to the owners of this data, as indicated.

# AUTHOR CONTRIBUTIONS

MC, TD, and OM conceived and designed the study. TD and OK oversaw the data and sample collection. FM and EC conducted the data analysis. MC, FM, TD, OK, EC, and OM wrote the manuscript.

# FUNDING

We thank the Bill and Melinda Gates Foundation for funding through the Programme for Emerging Agricultural Research Leaders (PEARL) grant number OPP1112621.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Chagunda, Mujibi, Dusingizimana, Kamana, Cheruiyot and Mwai. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome-Wide Variation, Candidate Regions and Genes Associated With Fat Deposition and Tail Morphology in Ethiopian Indigenous Sheep

Abulgasim Ahbara<sup>1,2</sup>\*, Hussain Bahbahani<sup>3</sup>, Faisal Almathen<sup>4</sup>, Mohammed Al Abri<sup>5</sup>, Mukhtar Omar Agoub<sup>6</sup>, Ayelle Abeba<sup>7</sup>, Adebabay Kebede<sup>8,9</sup>, Hassan Hussein Musa<sup>10</sup>, Salvatore Mastrangelo<sup>11</sup>, Fabio Pilla<sup>12</sup>, Elena Ciani<sup>13</sup>, Olivier Hanotte<sup>1,9</sup> and Joram M. Mwacharo<sup>14</sup>\*

### Edited by:

Peter Dovc, University of Ljubljana, Slovenia

### Reviewed by:

Kwan-Suk Kim, Chungbuk National University, South Korea; Marco Milanesi, São Paulo State University, Brazil

### \*Correspondence:

Abulgasim Ahbara Abulgasim.ahbara@nottingham.ac.uk; abulgasim68@gmail.com Joram M. Mwacharo j.mwacharo@cgiar.org

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 21 April 2018 Accepted: 13 December 2018 Published: 09 January 2019

### Citation:

Ahbara A, Bahbahani H, Almathen F, Al Abri M, Agoub MO, Abeba A, Kebede A, Musa HH, Mastrangelo S, Pilla F, Ciani E, Hanotte O and Mwacharo JM (2019) Genome-Wide Variation, Candidate Regions and Genes Associated With Fat Deposition and Tail Morphology in Ethiopian Indigenous Sheep. Front. Genet. 9:699. doi: 10.3389/fgene.2018.00699

<sup>1</sup> School of Life Sciences, University of Nottingham, Nottingham, United Kingdom, <sup>2</sup> Department of Zoology, Faculty of Sciences, Misurata University, Misurata, Libya, <sup>3</sup> Department of Biological Sciences, Faculty of Science, Kuwait University, Safat, Kuwait, <sup>4</sup> Department of Veterinary Public Health and Animal Husbandry, College of Veterinary Medicine, King Faisal University, Al-Ahsa, Saudi Arabia, <sup>5</sup> Department of Animal and Veterinary Sciences, College of Agriculture and Marine Sciences, Sultan Qaboos University, Muscat, Oman, <sup>6</sup> Agricultural Research Center, Misurata, Libya, <sup>7</sup> Debre Berhan Research Centre, Debre Berhan, Ethiopia, <sup>8</sup> Amhara Regional Agricultural Research Institute, Bahir Dar, Ethiopia, <sup>9</sup> LiveGene, International Livestock Research Institute, Addis Ababa, Ethiopia, <sup>10</sup> Faculty of Medical Laboratory Sciences, University of Khartoum, Khartoum, Sudan, <sup>11</sup> Dipartimento di Scienze Agrarie e Forestali, Viale delle Scienze, Università Palermo, Palermo, Italy, <sup>12</sup> Dipartimento Agricoltura, Ambiente e Alimenti, Università degli Studi del Molise, Campobasso, Italy, <sup>13</sup> Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari "Aldo Moro," Bari, Italy, <sup>14</sup> Small Ruminant Genomics, International Center for Agricultural Research in the Dry Areas (ICARDA), Addis Ababa, Ethiopia

Variations in body weight and in the distribution of body fat are associated with feed availability, thermoregulation, and energy reserve. Ethiopia is characterized by distinct agro-ecological and human ethnic farmer diversity of ancient origin, which have impacted on the variation of its indigenous livestock. Here, we investigate autosomal genome-wide profiles of 11 Ethiopian indigenous sheep populations using the Illumina Ovine 50 K SNP BeadChip assay. Sheep from the Caribbean, Europe, Middle East, China, and western, northern and southern Africa were included to place the genetic variation and history of Ethiopian populations in a global context. Population relationship and structure analysis separated Ethiopian indigenous fat-tail sheep from their North African and Middle Eastern counterparts. It indicates two main genetic backgrounds and supports two distinct genetic histories for African fat-tail sheep. Within Ethiopian sheep, our results show that the short fat-tail sheep do not represent a monophyletic group. Four genetic backgrounds are present in Ethiopian indigenous sheep but at different proportions among the fat-rump and the long fat-tail sheep from western and southern Ethiopia. The Ethiopian fat-rump sheep share a genetic background with Sudanese thin-tail sheep. Genome-wide selection signature analysis identified eight putative candidate regions spanning genes influencing growth traits and fat deposition (NPR2, HINT2, SPAG8, INSR), development of limbs and skeleton, and tail formation (ALX4, HOXB13, BMP4), embryonic development of tendons, bones and cartilages (EYA2, SULF2), regulation of body temperature (TRPM8), body weight and height variation (DIS3L2), control of lipogenesis and intracellular transport of long-chain fatty acids (FABP3), the occurrence and morphology of horns (RXFP2), and response to heat stress (DNAJC18). Our findings suggest that Ethiopian fat-tail sheep represent a uniquely admixed but distinct genepool that presents an important resource for understanding the genetic control of skeletal growth, fat metabolism and associated physiological processes.

Keywords: admixture, Africa, fat-tail, Ovis aries, thin-tail

# INTRODUCTION

African indigenous sheep originated in the Near East. They arrived, in the first instance, in North Africa via the Isthmus of Suez by the seventh millennium before present (BP) (Marshall, 2000). These sheep were of thin-tail type and their dispersion southwards into East Africa followed possibly the Nile river valley and the Red Sea coastline (Blench and MacDonald, 2006; Gifford-Gonzalez and Hanotte, 2011). The second wave brought fat-tail sheep into North and Northeast Africa via two entry points, the Isthmus of Suez and the Horn of Africa across the straits of Bab-el-Mandeb, respectively. The fat-rump sheep are a recent introduction and represent the third wave of arrival and dispersal of the species into eastern Africa (Epstein, 1971; Ryder, 1983; Marshall, 2000).

Sheep fulfill important socio-cultural and economic roles in the Horn of Africa. In Ethiopia they provide a wide range of products, including meat, milk, skin, hair, and manure, and are a form of savings and investment (Assefa et al., 2015). Ethiopia hosts many indigenous breeds of sheep, with currently 14 recognized populations/breeds, which are defined based on their geographic location and/or the ethnic communities managing them (Gizaw, 2008). Based on structure analysis, Edea et al. (2017) showed that the five Ethiopian indigenous sheep populations they analyzed clustered together based on their geographic distribution and tail phenotypes.

Fat depots act as an energy reserve that allows sheep to survive extreme environments and conditions such as prolonged droughts, cold, and food scarcity (Atti et al., 2004; Nejati-Javaremi et al., 2007; Moradi et al., 2012). Based on the combination of tail type and length, Ethiopian indigenous sheep can be classified as short fat-tail, long fat-tail, thin-tail, and fat-rump sheep. The short fat-tail sheep inhabit sub-alpine mountainous regions, the long fat-tail sheep predominate in mid- to high-altitude environments and the fat-rump sheep occur in semi-arid and arid environments (Gizaw et al., 2007). These populations are considered to be adapted to their production environments and they represent an important model for investigating and enhancing our knowledge of the genome profiles of environmental adaptation, tail morphology, and fat localization.

Different approaches that contrast groups of fat- and thin-tail sheep have been used to identify candidate regions and genes associated with tail formation and morphotypes. Moradi et al. (2012) identified three regions on chromosomes 5, 7 and X associated with tail fat deposition in Iranian breeds. Using two fat-tail (Laticauda and Cyprus fat-tail) and 13 Italian thin-tail breeds, Moioli et al. (2015) identified BMP2 and VRTN as the likely candidate genes explaining the fat-tail phenotype in the studied populations/breeds. Zhu et al. (2016) detected several copy number variations intersecting genes (PPARA, RXRA, and KLF11) associated with fat deposition in three Chinese native sheep (Large-tail Han, Altay, and Tibetan). Several candidate genes with possible links to fat-tail development, i.e., HOXA11, BMP2, PPP1CC, SP3, SP9, WDR92, PROKR1, and ETAA1, were identified using genome scans that contrasted fat- and thin-tail Chinese sheep (Yuan et al., 2017). Whole genome sequencing of extremely short-tail Chinese sheep revealed the T gene as the best possible candidate, among nine other genes, influencing tail size, following its association with vertebral development (Zhi et al., 2018). There is, so far, no information on the genetic basis of variation in tail fat distribution and size in African indigenous sheep.

In this study, using Ovine 50 K SNP BeadChip genotypes, we investigated (i) the genetic relationships and structure within and between Ethiopian indigenous sheep of different fat-tail morphotypes alongside other sheep populations and breeds from the Caribbean, Europe, the Middle East, China and Africa, and (ii) candidate genome regions and genes associated with tail morphology, fat deposition and possible eco-climatic adaptation in African indigenous sheep. For the latter, 11 Ethiopian indigenous sheep populations of different fat-tail morphotypes and two populations of thin-tail sheep from Sudan were analyzed.

# MATERIALS AND METHODS

# DNA Samples and SNP Genotyping

The sampling strategy targeted breeds of indigenous sheep from different geographic regions in Ethiopia (**Table 1** and **Figure 1**). Global Positioning System (GPS) coordinates were recorded for all the populations. We used altitude to determine the agro-eco-climatic zones of the geographic locations where the sheep were sampled. All efforts were made to include populations representing the different tail phenotypes found in Ethiopia. Twenty DNA samples from two thin-tail sheep populations, Hammari and Kabashi, were obtained from Sudan. Genomic DNA was extracted from 146 ear tissue samples, collected from 11 Ethiopian indigenous sheep populations, using the NucleoSpin® Tissue Kit (www.mn-net.com) following the manufacturer's protocol. All 166 genomic DNA samples were genotyped using the Ovine 50 K SNP BeadChip assay. The assay includes 54,240 SNPs, comprising 52,413 autosomal, 1449 X-chromosome and 378 mitochondrial SNPs.

Ovine 50 K SNP BeadChip genotypes of Caribbean, European, Middle East and Chinese, as well as western, northern and southern African sheep, were obtained from the Sheep HapMap database (http://www.sheephapmap.org/hapmap.php, **Supplementary Table 1**) and included in the study. The aim was to provide a global context of the genetic origins, trajectories of introduction, and dispersal of sheep into Ethiopia.

FIGURE 1 | The locations where the Ethiopian and Sudanese sheep populations used in this study were sampled.

# Quality Control and Genetic Diversity Analyses

The Sheep HapMap genotypes were merged with those generated from Ethiopian and Sudanese sheep using PLINK v1.9 (Purcell et al., 2007). The data files for final analysis were generated by removing from the merged dataset SNPs that did not map to an autosome, SNPs with a minor allele frequency (MAF) ≤0.01, and animals and markers with ≥10% and ≥5% missing genotypes, respectively. This generated a dataset of 45,102 SNPs, which was further pruned, using PLINK v1.9, to be in approximate linkage equilibrium to avoid the possible influence of clusters of SNPs on population genetic relationship and structure analysis (Yuan et al., 2017). Following the latter pruning, 34,088 SNPs were retained for population relatedness and structure analysis.
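The filtering thresholds described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the genotype coding (0/1/2 copies of one allele, `None` for a missing call), the function names, and the order in which animals and SNPs are filtered are all assumptions made for illustration.

```python
def missing_rate(calls):
    """Fraction of missing (None) calls; an empty list counts as fully missing."""
    if not calls:
        return 1.0
    return sum(c is None for c in calls) / len(calls)


def minor_allele_frequency(snp_calls):
    """MAF of one SNP from 0/1/2 genotype calls, ignoring missing calls."""
    called = [g for g in snp_calls if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def quality_filter(genotypes, max_animal_missing=0.10,
                   max_snp_missing=0.05, min_maf=0.01):
    """Return (kept_animal_indices, kept_snp_indices).

    genotypes: list of animals, each a list of SNP calls.
    Animals with >= 10% missing calls are dropped first (assumed order),
    then SNPs with >= 5% missing calls or MAF <= 0.01.
    """
    keep_animals = [i for i, row in enumerate(genotypes)
                    if missing_rate(row) < max_animal_missing]
    keep_snps = []
    for j in range(len(genotypes[0])):
        column = [genotypes[i][j] for i in keep_animals]
        if missing_rate(column) >= max_snp_missing:
            continue
        if minor_allele_frequency(column) <= min_maf:
            continue
        keep_snps.append(j)
    return keep_animals, keep_snps
```

Restricting the input to autosomal SNPs and the subsequent LD pruning are not covered by this sketch.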

To minimize the possible loss of informative SNPs for selection signature analysis, the data for Ethiopian and Sudanese sheep were extracted from the dataset of 45,102 autosomal SNPs obtained prior to LD pruning.

The proportion of polymorphic SNPs (Pn), expected (He), and observed (Ho) heterozygosity and inbreeding coefficient (F) were estimated for each population and across all populations using PLINK v1.9, to evaluate the levels of genetic diversity present in Ethiopian and Sudanese sheep, respectively.
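As a rough illustration of these diversity indices, the sketch below computes Pn, He, Ho and a population-level F for one population. The estimator F = 1 − Ho/He is a simplifying assumption and is not necessarily the one PLINK v1.9 applies; the genotype coding (0/1/2, no missing calls) and the function name are also hypothetical.

```python
def diversity_summary(genotypes):
    """Summarize within-population diversity.

    genotypes: list of animals, each a list of 0/1/2 calls (no missing data).
    Returns a dict with Pn, He, Ho and F (F = 1 - Ho/He, an assumed estimator).
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    polymorphic = 0
    he_sum = 0.0
    ho_sum = 0.0
    for j in range(n_snps):
        calls = [row[j] for row in genotypes]
        p = sum(calls) / (2 * n_animals)          # allele frequency
        if 0.0 < p < 1.0:
            polymorphic += 1
        he_sum += 2 * p * (1 - p)                 # expected heterozygosity
        ho_sum += sum(1 for g in calls if g == 1) / n_animals  # observed
    he = he_sum / n_snps
    ho = ho_sum / n_snps
    return {
        "Pn": polymorphic / n_snps,
        "He": he,
        "Ho": ho,
        "F": 1 - ho / he if he > 0 else 0.0,
    }
```

A negative F from this estimator indicates an excess of heterozygotes relative to Hardy-Weinberg expectation.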

# Population Genetic Analyses

Principal component analysis (PCA) was performed using PLINK v1.9 to investigate the genetic structure and relationships among the studied breeds based on genetic correlations between individuals. A graphical display of the first two principal components (PC1 and PC2) was generated using GENESIS (Buchmann and Hazelhurst, 2014). Admixture analysis implemented in ADMIXTURE v1.3 (Alexander et al., 2009) was used to investigate underlying genetic structure and estimate the proportion of shared genome ancestry between the study populations. A 5-fold cross-validation procedure, following Lawal et al. (2018), was used to determine the optimal number of ancestral genomes (K) and the proportions of genome ancestry shared among the study populations.

To further evaluate historical relationships and interactions (gene flow) within and between Ethiopian and Sudanese populations, we used the maximum likelihood tree-based approach implemented in TreeMix (Pickrell and Pritchard, 2012) and included the Soay sheep as an out-group. The number of migration events (m) varied between 1 (migration between two populations) and 15 (migration between all the populations). The value of "m" with the highest reproducibility and consistency among the 15 tested, and which also had the highest log-likelihood value following six replicate runs of the analysis, was chosen as the optimal value.

The f3 and f4 tests implemented in TreeMix were also performed. The f3-statistic (A; B, C) was used to test whether A was derived from the admixture of populations B and C; a significantly negative value of the f3-statistic would suggest that population A is admixed. The f4-statistic (A, B; C, D) was used to test the validity of the hierarchical clustering pattern in four-population trees. Significant deviations of the f4-statistic from zero for the three possible topologies of the four-population trees would provide evidence of gene flow between the populations tested. A significantly positive Z-score indicates gene flow between populations that are related to either A and C or B and D, while a significantly negative Z-score indicates gene flow between populations that are related to A and D or B and C. Standard errors were estimated using blocks of 500 SNPs.
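The logic of the f4 test can be sketched as follows, taking f4(A, B; C, D) as the average over SNPs of (pA − pB)(pC − pD) and obtaining the standard error with a delete-one-block jackknife. This is a simplified illustration under assumed inputs (lists of population allele frequencies in genome order); it omits the finite-sample corrections used by TreeMix, and the function names are hypothetical.

```python
import math


def f4_statistic(pa, pb, pc, pd):
    """Mean over SNPs of (pA - pB) * (pC - pD)."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]
    return sum(products) / len(products)


def f4_z_score(pa, pb, pc, pd, block_size=500):
    """Return (f4, Z) using a delete-one-block jackknife over SNP blocks."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(pa, pb, pc, pd)]
    n = len(products)
    blocks = [products[i:i + block_size] for i in range(0, n, block_size)]
    if len(blocks) < 2:
        raise ValueError("need at least two blocks for the jackknife")
    total = sum(products)
    estimate = total / n
    # Recompute the statistic with each block left out in turn.
    leave_one_out = [(total - sum(b)) / (n - len(b)) for b in blocks]
    g = len(blocks)
    mean_loo = sum(leave_one_out) / g
    variance = (g - 1) / g * sum((x - mean_loo) ** 2 for x in leave_one_out)
    se = math.sqrt(variance)
    z = estimate / se if se > 0 else math.nan
    return estimate, z
```

With this sign convention, a positive Z arises when allele frequency differences between A and B covary with those between C and D, which matches the interpretation given in the text.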

# Analysis of Signatures of Selection

For this analysis, we separated 12 of the 13 Ethiopian and Sudanese populations into four genetic groups based on the population clusters revealed by PCA. The four groups comprised western (Bonga, Kido, Gesses) and southern (Loya, ShubiGemo, Doyogena) long fat-tail sheep and fat-rump sheep (Kefis, Adane, Arabo) from Ethiopia, and thin-tail sheep (Hammari, Kabashi) from Sudan. One short fat-tail sheep population (Molale) was included with the fat-rump sheep and the other (Gafera), which appeared to be genetically distinct, was dropped from further analysis. Equal numbers of samples were chosen at random to represent each genetic group. Three comparisons, which contrasted the fat-rump (E1), western (E2) and southern (E3) long fat-tail sheep with the thin-tail sheep (S) from Sudan, were performed. The selection signature analysis involved three approaches: FST, hapFLK and Rsb.

A sliding window approach was used to perform the FST analysis using the HIERFSTAT package (Goudet, 2005) of R (R Core Team, 2012). A window of 200 Kb was allowed to slide along the genome in steps of 60 Kb. The window size and slide distance were determined based on linkage disequilibrium (LD) decay analysis (**Supplementary Figure 1**). The pairwise FST values (Weir and Cockerham, 1984) for each SNP in each window and between the genetic groups being tested were estimated as follows:

$$F_{ST} = 1 - \frac{p_1 q_1 + p_2 q_2}{2 p_r q_r}$$

Where p1, p2 and q1, q2 are the frequencies of alleles A and a in the first and second group of the test populations, respectively, and pr and qr are the frequencies of alleles A and a, respectively, across the tested groups (Zhi et al., 2018). The FST values were standardized into Z-scores as follows:

$$ZF_{ST} = \frac{F_{ST} - \mu F_{ST}}{\sigma F_{ST}}$$

Where µFST is the overall average value of FST and σFST is the standard deviation derived from all the windows tested for a given comparison. **Supplementary Figure 2** shows the distribution of the ZFST values. We set the value of ZFST ≥ 4 as the threshold to identify candidate genomic regions under selection.
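The windowed FST and its standardization can be sketched as below. The per-SNP formula follows the expression given in the text; taking pr as the mean of the two group frequencies (reasonable only for equal group sizes), averaging per-SNP values within a window, using the population standard deviation, and treating positions as lying on a single chromosome are all assumptions of this sketch. It does not reproduce the HIERFSTAT estimator.

```python
import statistics


def snp_fst(p1, p2):
    """FST = 1 - (p1*q1 + p2*q2) / (2*pr*qr) for one SNP and two groups."""
    q1, q2 = 1 - p1, 1 - p2
    pr = (p1 + p2) / 2            # assumed: equal group sizes
    qr = 1 - pr
    if pr * qr == 0:              # monomorphic across both groups
        return 0.0
    return 1 - (p1 * q1 + p2 * q2) / (2 * pr * qr)


def windowed_fst(positions, p1, p2, window=200_000, step=60_000):
    """Return [(window_start, mean FST)] for windows holding at least one SNP."""
    values = [snp_fst(a, b) for a, b in zip(p1, p2)]
    windows = []
    start = 0
    last = max(positions)
    while start <= last:
        inside = [v for pos, v in zip(positions, values)
                  if start <= pos < start + window]
        if inside:
            windows.append((start, sum(inside) / len(inside)))
        start += step
    return windows


def z_transform(windows):
    """Standardize window FST values: (FST - mean) / standard deviation."""
    fst = [v for _, v in windows]
    mu = statistics.mean(fst)
    sigma = statistics.pstdev(fst)
    if sigma == 0:
        return [(start, 0.0) for start, _ in windows]
    return [(start, (v - mu) / sigma) for start, v in windows]
```

Windows whose standardized value reaches the chosen threshold (ZFST ≥ 4 in this study) would then be retained as candidates.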

The hapFLK approach was implemented with hapFLK package v1.2 (Fariello et al., 2013) to detect selection signatures based on differences in haplotype frequencies between groups of populations. Reynolds genetic distances were converted into a kinship matrix using an R script supplied with the package. The hapFLK values and kinship matrix were calculated assuming 15 clusters in the fastPHASE model (-K 15). The hapFLK statistic was then computed as the average value across 40 expectation maximization (EM) runs to fit the LD model. The P-values were obtained by running a python script "Scaling\_chi2\_hapFLK.py" available at (https://forge-dga.jouy.inra.fr/documents/588) which fits a chi-squared distribution to the empirical distribution. As with the FST calculations, the hapFLK statistics were also standardized using the formula:

$$\text{hapFLK}_{adj} = \frac{\text{hapFLK} - \text{mean}(\text{hapFLK})}{\text{sd}(\text{hapFLK})}$$

The calculation of the raw P-values was based on the null distribution of empirical values (Fariello et al., 2013; Kijas, 2014). The P-values were plotted in a histogram to assess their distribution pattern and the cut-off value to determine significance was set at –Log10 (P-value) ≥ 3 (**Supplementary Figure 2**).

Using haplotype information, we performed the Rsb analysis implemented in the rehh package (Gautier and Vitalis, 2012) of R. Haplotypes were estimated with SHAPEIT (Delaneau et al., 2014). To identify loci under selection, the Rsb values were transformed into PRsb (PRsb = –Log10 [1 – 2|Φ(Rsb) – 0.5|]), where Φ(x) represents the Gaussian cumulative distribution function (Gautier and Vitalis, 2012). Assuming that the Rsb values are normally distributed (under neutrality), the PRsb can be interpreted as –Log10 (P-value), where P is the two-sided P-value associated with the neutral hypothesis. For each comparison, SNPs that exhibited PRsb ≥ 3 (P-value = 0.001) were taken to be under selection (de Simoni Gouveia et al., 2017). The hapFLK and Rsb analyses were also performed using window sizes of 200 Kb sliding along the genome by a distance of 60 Kb.
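The PRsb transformation can be written out in a few lines. The sketch below reads the transform with an absolute value around Φ(Rsb) − 0.5, so that the resulting P-value is two-sided, and assumes the Rsb values have already been standardized; the function names are hypothetical and this is not the rehh code.

```python
import math


def gaussian_cdf(x):
    """Standard normal cumulative distribution function, Phi(x)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def p_rsb(rsb):
    """PRsb = -log10(1 - 2*|Phi(Rsb) - 0.5|), i.e. -log10 of a two-sided P-value."""
    two_sided_p = 1 - 2 * abs(gaussian_cdf(rsb) - 0.5)
    if two_sided_p <= 0:          # numerical underflow for extreme Rsb
        return math.inf
    return -math.log10(two_sided_p)


def under_selection(rsb, threshold=3.0):
    """Apply the PRsb >= 3 cut-off (P-value = 0.001) used in the text."""
    return p_rsb(rsb) >= threshold
```

Under this reading, a standardized Rsb of about ±3.29 corresponds to the PRsb = 3 threshold.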

# Gene Annotation

Candidate regions that overlapped between FST, hapFLK, and Rsb were identified and compared using the intersectBed function of BEDTools (Quinlan and Hall, 2010). Considering an average marker distance of between 60 and 200 Kb (Moioli et al., 2015) and the observed LD decay pattern (**Supplementary Figure 1**), candidate regions under selection were identified by exploring the SNPs found up- and downstream of, and within, the most significant windows. The Oar v3.1 Ovine reference genome assembly (Jiang et al., 2014) was used to annotate the candidate regions. Functional enrichment analysis was performed using the functional annotation tool in DAVID (Huang et al., 2008) using Ovis aries as the background species. Gene functions were determined using the NCBI (http://www.ncbi.nlm.nih.gov/gene/) and OMIM databases (http://www.ncbi.nlm.nih.gov/omim/) and a review of literature.
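The overlap step can be illustrated with a small interval-intersection sketch. It mimics only the basic behaviour of an interval intersection (reporting the shared segment of two regions on the same chromosome) and is not the BEDTools implementation; the half-open BED-style coordinates and the function name are assumptions.

```python
def intersect_regions(regions_a, regions_b):
    """Return the segments shared by two lists of candidate regions.

    Each region is a (chromosome, start, end) tuple with half-open
    coordinates [start, end), as in BED files.
    """
    overlaps = []
    for chrom_a, start_a, end_a in regions_a:
        for chrom_b, start_b, end_b in regions_b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:       # non-empty overlap
                overlaps.append((chrom_a, start, end))
    return overlaps
```

For example, regions flagged by ZFST and by hapFLK could be passed as the two lists, and any returned segment would count as supported by both methods.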

# RESULTS

# Genetic Diversity and Population Structure

The average values of Pn, He, Ho, and F, as indicators of within-breed genetic diversity, are shown in **Table 2** and **Supplementary Figure 3**. The lowest values of Pn, He, and Ho were observed in Bonga while the highest values were observed in Molale-Menz, Hammari and Kabashi, and Arabo, respectively.


The PCA plot incorporating the global populations, constructed using five animals selected at random per population, is shown in **Figure 2**. We used a uniform sample size of five animals since differences in sample sizes may influence clustering patterns on the PCA. The choice of five samples per population was based on the smallest sample size of five individuals genotyped for the Sidaoun and Berber breeds. Despite the sample size rebalancing, the population cluster patterns did not differ from those observed when the PCA was performed using unequal sample sizes (**Supplementary Figures 4**, **5**). Generally, PC1 separates Ethiopian and South African fat-tail sheep, Sudanese thin-tail sheep, West African Djallonke and Algerian Sidaoun from the other sheep populations. Sheep from the Middle East and North Africa occur at the center of the PCA plot and, together with the Cyprus fat-tail and Chinese sheep (which cluster close together), are separated by PC2 from African Dorper, Barbados Blackbelly and European sheep. The two populations of Ethiopian short fat-tail sheep diverge from each other; Gafera-Washera clusters near Ethiopian long fat-tail sheep while Molale-Menz clusters together with the Ethiopian fat-rump sheep. The West African Djallonke clusters close to the two South African breeds (Ronderib and Namaqua). Sidaoun and Berber (both from Algeria) cluster separately, while the Cyprus fat-tail clusters close to the Chinese sheep (**Figure 2**).

To obtain a clearer picture of the variation within the fat-tail sheep, we performed the PCA excluding the thin-tail sheep (**Figure 3**). PC1 separates the Ethiopian fat-tails from their Middle East, North African, Mediterranean and Chinese counterparts. PC2 differentiates the South African breeds from the Ethiopian ones. As in the global PCA, one Ethiopian short fat-tail sheep (Gafera-Washera) clusters with the Ethiopian long fat-tail sheep and the other (Molale-Menz) forms a cluster with the Ethiopian fat-rump sheep. Middle East sheep cluster together with the North African ones while the Mediterranean sheep unexpectedly cluster with the Chinese sheep despite the large geographic distance separating them.

To further illustrate the distribution of genetic variation among East African populations, we performed the PCA with only the Ethiopian and Sudanese thin-tail sheep (**Figure 4**). PC1 separates Ethiopian fat-rump, Molale-Menz (Ethiopian short fat-tail) and thin-tail sheep from the Ethiopian long fat-tail and Gafera-Washera (Ethiopian short fat-tail) sheep. Generally, PC1 separates the fat-rump sheep from the fat-tail ones derived from western and southern Ethiopia. PC2 reveals further separation of the Ethiopian sheep: (i) Molale-Menz, Adane and some Arabo individuals from Kefis, and the remaining Arabo individuals, and (ii) Gafera-Washera, Kido and Gesses from Doyogena, ShubiGemo, Bonga and Loya.

Admixture analysis on the global dataset separates the study populations following their geographic origins (**Figure 5**). The cross-validation error registered the lowest value at K = 9, suggesting this to be the optimal number of clusters explaining the variation in this dataset (**Supplementary Figure 6a**). Chinese sheep separate from the other populations at K ≥ 3. Among African breeds, the South African ones (Namaqua, Dorper, Ronderib) and the West African Djallonke show a distinct but common genetic ancestry with the Ethiopian and Sudanese sheep for 3 ≤ K ≤ 6.

Two to six hypothetical ancestral clusters (K) were tested with ADMIXTURE on the East African dataset. The lowest cross-validation error suggests K = 4 (**Supplementary Figure 6b**) as the optimal number of ancestral clusters present in Ethiopian and Sudanese thin-tail sheep. The proportion of each ancestral cluster (referred to as A, B, C, and D) in each population at K = 4 is shown in **Figure 6** and **Supplementary Table 2**. The clusters occur at their highest proportions (>90%) in Loya (cluster A), Bonga, Kido and Gesses (cluster B), Molale-Menz and a few individuals of Adane (cluster C) and in thin-tail sheep (cluster D). Clusters A, B, and C are observed in ShubiGemo and Doyogena; B and C in Gafera-Washera and Molale-Menz; B, C, and D in some individuals of Adane, while Arabo and Kefis had C and D clusters. The analysis also shows that Gafera-Washera, Adane, Molale-Menz, Arabo, and Kefis share cluster C, while Hammari and Kabashi share the D cluster with Arabo and Kefis. ShubiGemo, Loya and Doyogena, all long fat-tail sheep from southern Ethiopia, share cluster A.

TreeMix revealed possibilities of gene flow between East African sheep. The f index, representing the fraction of the variance in the sample covariance matrix (Ŵ) accounted for by the model covariance matrix (W), was used to identify the information contribution of each migration vector added to the tree. Up to 15 possible migration vertices were computed. The first eight migration edges (gene flow) accounted for more than half of the total model significance explained by the f statistic, with the first migration edge having an f value of 0.51. We therefore chose m = 8 as the best predictive value for the migration model. Vectors from 9 to 15 resulted in only small incremental changes in the f value (**Figures 7A,B**). The eight migration events were Loya and ShubiGemo (both long fat-tail); Arabo and Adane (both fat-rump); Gafera-Washera, Molale-Menz (both short fat-tail) and Adane (fat-rump); Molale-Menz (short fat-tail) and Adane (fat-rump) with ShubiGemo (long fat-tail); Bonga with ShubiGemo, Doyogena and Loya (all long fat-tail sheep); Molale-Menz (short fat-tail) and Arabo (fat-rump); ShubiGemo (long fat-tail) with Arabo (fat-rump) and Kefis (fat-rump); Gesses (long fat-tail) with Kabashi and Hammari (thin-tail).

FIGURE 6 | Admixture analysis involving Ethiopian indigenous sheep populations (K = 4 had the lowest cross-validation error). For brevity the four genetic clusters are designated (A)–(D), respectively.

The f4-statistics also highlighted possibilities of gene flow among various breeds. The highest Z values (>|50|) were observed between Hammari and Kabashi (thin-tails) and Arabo and Kefis (fat-rump) (**Supplementary Table 3**). The f3-statistics, however, did not reveal any likelihood of gene flow between the breeds analyzed (**Supplementary Table 4**). This could be due to a complex pattern of gene flow between the study populations, which may not be accounted for by a three-way model.

## Signatures of Selection

The ADMIXTURE, TreeMix and PCA results (**Figures 6**, **7**; **Supplementary Figure 4**) revealed three genetic groups in Ethiopian sheep, viz. fat-rump (E1), and long fat-tail from western (E2) and southern (E3) Ethiopia. The two short fat-tail sheep (Molale-Menz and Gafera-Washera) analyzed here were separated from each other (**Figure 4**), with Molale-Menz showing close genetic affinity to fat-rump sheep while Gafera-Washera appeared genetically distinct. The three groups are distinct from thin-tail (S) sheep (**Figure 4**). For selection signature analysis, we included Molale-Menz with the fat-rump sheep but excluded Gafera-Washera from the analysis due to its low sample size. We selected, at random, 20 samples to represent each of the four genetic groups and performed the selection signature analysis. We contrasted the three groups of Ethiopian sheep (E1, E2, and E3) with the thin-tail sheep (S). The top windows (**Supplementary Table 5**) that passed the significance threshold for each method (hapFLK ≥ 3, ZFST ≥ 4, Rsb ≥ 3) were used to define candidate regions under selection.

For the E1<sup>∗</sup> S comparison, the fat-rump sheep were differentiated from the thin-tail sheep in 23 candidate regions that overlapped between at least two selection signature methods and which spanned 86 genes (**Figure 8**, **Table 3**). Similarly, a total of 65 genes were present across 18 candidate regions that overlapped between at least two approaches in the E2<sup>∗</sup> S (western Ethiopia long fat-tail versus thin-tail) comparison (**Figure 9**, **Table 4**). Furthermore, 10 genes that seemed to be highly selected were identified by Rsb in three candidate regions on Oar8, Oar14, and Oar18, respectively (**Figure 9**, **Table 4**). Twelve overlapping candidate regions spanning 36 genes were observed in the southern Ethiopian fat-tail versus thin-tail sheep (E3<sup>∗</sup> S) comparison (**Figure 10**, **Table 5**). There were also 16 genes found across 1 (Oar26, 3 genes), 1 (Oar3, 1 gene), and 12 (Oar2, 1 gene; Oar3, 9 genes; Oar10, 2 genes) candidate regions that were identified by hapFLK, ZFST, and Rsb, respectively (**Figure 10**, **Table 5**).

We performed gene ontology (GO) enrichment analysis for the candidate genes revealed in each pairwise comparison (**Supplementary Table 6**). The five topmost GO terms associated with the candidate genes from the E1<sup>∗</sup> S comparison include embryonic skeletal system morphogenesis (GO:0009952, GO:0048704, GO:0030224, GO:0048706), response to cold (GO:0009409), innervation (GO:0060384), stem cell maintenance (GO:0019827) and positive regulation of cell adhesion (GO:0045785). The top GO terms associated with the E2<sup>∗</sup> S candidate genes include cellular response to heat (GO:0034605), lipid binding (GO:0008289), magnesium ion binding (GO:0000287) and response to gamma radiation (GO:0000287). The GO terms for the genes from the E3<sup>∗</sup> S comparison included skin development (GO:0043588), regulation of actin cytoskeleton reorganization (GO:2000249) and wound healing (GO:0042060).

TABLE 3 | Candidate regions and genes identified to be under selection by a combination of at least two methods in the Ethiopian fat-rump vs. Sudanese thin-tail sheep.

# DISCUSSION

In this study, we used Ovine 50 K SNP BeadChip generated genotype data to investigate autosomal genetic diversity in Ethiopian indigenous sheep. Including populations from other regions of the world and the African continent allowed us to assess this diversity in a global geographic context. Our findings showed that the Ethiopian indigenous sheep are genetically differentiated from the other populations including other African fat-tail sheep (**Figures 2**, **3**). The finding that the Ethiopian fat-tail sheep are distinct from those found in North Africa supports the presence of at least two genetic groups of fat-tail sheep in the continent and two separate introduction events, one via Northeast Africa and the Mediterranean Sea coastline, and the other via the Horn of Africa crossing through the strait of Bab-el-Mandeb. The distinct clustering of the thin-tail sheep suggests its independent introduction into the continent. The fact that the South African Ronderib and Namaqua sheep occur on the same PC planar axis with Ethiopian sheep (**Figure 2**) may suggest a common genetic heritage between the two rather than with the North African breeds. The movement of sheep southwards remains speculative; some linguistic evidence suggests movement of Bantu-speaking populations from West Africa to South Africa through central Africa, following a western route rather than the more traditionally postulated eastern routes from East to South Africa (Newman, 1995). In this context, the close clustering of the thin-tail West African sheep with some fat-tail southern African sheep breeds, such as the Namaqua from Namibia studied here, is worth mentioning as it offers some possible insights. This, however, will require further investigation beyond the scope of this study.

TABLE 4 | Candidate regions and genes identified to be under selection by a combination of at least two methods in the Ethiopian western long fat-tail vs. Sudanese thin-tail sheep.

Our results agree with previous findings that were arrived at using microsatellite loci (Muigai, 2003) and 50 K SNP genotype data (Mwacharo et al., 2017). They are also in line with archaeological and anthropological evidence indicating the introduction first of thin-tail sheep into the continent, followed by fat-tail sheep, initially through the Sinai Peninsula and later the Horn of Africa region (Gifford-Gonzalez and Hanotte, 2011; Muigai and Hanotte, 2013).

Interestingly, the PCA results involving Ethiopian and Sudanese sheep separate the Ethiopian populations into three groups while ADMIXTURE revealed four genetic clusters in Ethiopian sheep irrespective of their geographic origins in the country. TreeMix revealed extensive gene flow between populations of different geographic origins and tail-types. These results suggest, most likely, current and historical intermixing of sheep following human socio-cultural and economic interactions. This appears to be a common feature in Ethiopia and most likely the Northeast and eastern Africa region as it was also observed in Ethiopian goats by Tarekegn et al. (2018). We propose here that the common D genetic background present in short fat-tail and fat-rump sheep may represent historical introgression of the thin-tail gene pool into short fat-tail and fat-rump genepool. This result calls for further investigation.

Our findings on the genetic relationships and differentiation between Ethiopian sheep populations agree with findings of previous studies, which were performed using either microsatellites (Gizaw, 2008) or 50 K SNP genotype data (Edea et al., 2017) and which indicated a grouping of Ethiopian indigenous sheep populations based on their tail phenotypes. However, uniquely in our study, the long fat-tail populations were further subdivided into two secondary groups representing sheep populations from western and southern Ethiopia (**Figure 4**). These two groups were also defined by different genetic backgrounds by ADMIXTURE (**Figure 6**) and they clustered separately in TreeMix (**Figure 7**). In addition, although they are defined by the same tail phenotype, the two populations of Ethiopian short fat-tail sheep did not cluster together. Geographic isolation coupled, most likely, with adaptation to different eco-climates, as well as ethnic, cultural and religious practices and differences that can impede gene flow, may have shaped this genetic sub-structuring (Madrigal et al., 2001; Gizaw et al., 2007).

In selection signature analysis, we contrasted groups of Ethiopian indigenous sheep that showed variation in the size of the fat-tail with thin-tail sheep. Our results identified several genes as potential candidates controlling tail morphotype and fat localization in the study populations. Several genes occurred within candidate regions that overlapped between at least two of the three approaches used to detect signatures of selection (hapFLK, FST, Rsb). The FST approach detects signatures arising from an increase or decrease in allele frequency differentiation between populations/breeds; hapFLK detects the same but based on increase/decrease in haplotype frequency differentiation between populations while accounting for hierarchical population structure (Kijas, 2014); and Rsb detects signatures associated with the patterns of linkage disequilibrium between loci across the genome (Oleksyk et al., 2010; de Simoni Gouveia et al., 2014). Since these methods are based on different algorithms and assumptions, detection of a common signature by at least two of the methods suggests good reliability of the results while reducing the likelihood of interpreting false positives. They also detect signatures spanning different time periods; FST and hapFLK detect signatures arising from long-term differential selection while Rsb detects ongoing signatures of selection including those that arise in the short to medium term (Oleksyk et al., 2010).

TABLE 5 | Candidate regions and genes identified to be under selection by a combination of at least two methods in the southern Ethiopia long fat-tail vs. thin-tail sheep.

In the E1<sup>∗</sup> S comparison, three genes associated with growth traits were present on the candidate region on Oar2, i.e., histidine triad nucleotide binding protein 2 (HINT2), sperm associated antigen 8 (SPAG8) and natriuretic peptide receptor 2 (NPR2). Previous studies reported these genes to be associated with birth and carcass weights, and fat depth, respectively, in cattle (Casas et al., 2000; McClure et al., 2010) and sheep (Moradi et al., 2012; Wei et al., 2015). We also identified two genes on Oar5 (ANGPTL8, INSR), which might be responsible for fat accumulation in adipose tissues. Angiopoietin-like 8 (ANGPTL8), when induced by insulin receptor (INSR), inhibits lipolysis and controls post-prandial fat storage in white adipose tissue and directs fatty acids to adipose tissue for storage during the fed state (Mysore et al., 2017). The ADAMTS3 (ADAM metallopeptidase with thrombospondin type 1 motif 3) gene was present in the region identified on Oar6. This gene is expressed in cartilage, where collagen II is a major component, as well as in embryonic bone and tendon, suggesting that it could be a major procollagen processing enzyme in musculoskeletal tissues (Dubail and Apte, 2015). The homeobox B13 (HOXB13) and ALX homeobox 4 (ALX4) were identified on the candidate region on Oar11 and Oar15, respectively. Mutations in the former result in overgrowth of caudal spinal cord and tail vertebrae in mice (Economides et al., 2003), while the latter is involved in the development of limbs and skeleton (Fariello et al., 2014).

Our enrichment analysis for the E1<sup>∗</sup> S genes revealed a cluster of genes (BMP4, MED1) with functions that could possibly be related to tail formation. Bone Morphogenetic Protein 4 (BMP4) was revealed by Rsb and FST to be on a candidate region on Oar7 and it has been implicated in tail formation (Moioli et al., 2015). Peroxisome Proliferator Activated Receptor Gamma (PPARG) expression has been associated with back-fat thickness in sheep (Dervish et al., 2011). Ge et al. (2008) reported Mediator Complex Subunit 1 (MED1) is an essential protein for the optimal functioning of PPARG. Despite this association, our analysis did not reveal any signals spanning PPARG, but two of our methods (Rsb and FST) revealed a signature on Oar20 that spanned the PPARD gene, a paralogue to PPARG.

In the same comparison (E1<sup>∗</sup> S), we identified a cluster of genes (CDH8, ADRB3, THRA, TRPM8, PLAC8) that are associated with the GO biological process, response to cold. This is not surprising considering that three out of the four E1 populations live at high altitude and therefore in a relatively cold habitat. Indeed, adrenoceptor beta 3 (ADRB3) plays a major role in energy metabolism and in the regulation of lipolysis and homeostasis (Wu et al., 2012). It is also associated with birth weight, growth rate, carcass composition and survival in various breeds of sheep (Horrell et al., 2009). The ion channel TRPM8 has been reported to play a major role in eliciting cold defense thermoregulation, metabolic and defense immune responses in humans (Kozyreva and Voronova, 2015).

Several other genes occurring in the E1<sup>∗</sup> S candidate regions and which are associated with the GO term embryonic skeletal system development (GO:0048706) included HOXC6, SULF2, WNT11, and HOXB9. WNT11 was identified by ZFST on Oar15 while HOXC6 and HOXB9 were revealed by hapFLK on Oar3 and Oar13, respectively. The WNT gene family and the T gene have been implicated in vertebral development in laboratory mice (Greco et al., 1996), and with the short-tail phenotype in sheep (Zhi et al., 2018). In addition, the roles of the WNT gene family in lipid metabolic processes in fat-tail sheep have also been reported (Kang et al., 2017). The HOX genes represent transcriptional regulatory proteins that control axial patterning in bilaterians (Garcia-Fernàndez, 2005), where the inactivation of one of the HOX genes often causes transformations in the identity of vertebral elements (Mallo et al., 2010). The HOX genes are able to control morphologies along the anteroposterior axis (Lewis, 1978). Furthermore, HOXC11, HOXC12, and HOXC13 developmental genes were found to be expressed in the tail region indicating their possible associations with tail size and fat development in fat-tail sheep (Kang et al., 2017).

The candidate regions revealed by the E2<sup>∗</sup> S comparison spanned 65 candidate genes. Three genes of the BPI fold containing family B (BPIFB3, BPIFB4, and BPIFB6) were present in a candidate region on Oar13. These, along with other paralogs (BPIFB1, BPIFA3, BPIFB2, BPIFA1), formed a cluster of functional genes related to the GO term lipid binding (**Supplementary Table 6**). In contrast to the E1<sup>∗</sup> S comparison, the cluster of genes identified in the E2<sup>∗</sup> S comparison was associated with the GO terms magnesium ion binding, response to gamma radiation and cellular response to heat. This most likely reflects adaptation of this group of sheep to the eco-climatic conditions prevailing in their home-tract, and is consistent with the humid highland and moist lowland conditions of the geographic area where the populations representing the E2 group (Bonga, Gesses, Kido) were sampled. High fecundity and prolificacy are reproductive traits preferred by farmers in the Bonga sheep (field observations by the last author). This may explain the occurrence of CIB4 and PRKAA1 in a candidate region in the E2<sup>∗</sup> S comparison. The CIB4 gene was suggested to be linked, in some way, to high fecundity in the Small Tail Han sheep (Yu et al., 2010) and PRKAA1 is involved in follicular development in ewes (Foroughinia et al., 2017).

The third comparison (E3<sup>∗</sup> S) resulted in 36 genes that occurred in candidate regions that were revealed by at least two methods used to detect selection signatures. Fatty acid binding proteins FABP3 and FABP1 found on candidate regions on Oar2 and Oar3, respectively are the genes that relate most closely to fat deposition. SREBF1 along with PPARG are the main transcription factors controlling lipogenesis in adipose tissue and skeletal muscle (Ferré and Foufelle, 2010), and are mainly regulated by fatty acid-binding proteins (FABP) (Lapsys et al., 2000). Recently, Bahnamiri et al. (2018) evaluated the effects of negative and positive energy balances on the expression pattern of these genes in fat-tail and thin-tail lambs. They observed differential transcriptional regulation of lipogenesis and lipolysis during periods of negative and positive energy balances in the two groups of lambs. In general, the cluster of genes identified in this comparison was significantly enriched for GO terms relating to skin development, wound healing and regulation of actin cytoskeleton reorganization (**Supplementary Table 6**).

The genes that overlap between comparisons are shown in **Supplementary Figure 7**. The genes common to all three comparisons are TSPAN8, RXFP2, and RIN2. TSPAN8 (Tetraspanin 8) occurred in the candidate region on Oar3; it is among the genes that are reported to be associated with insulin release and sensitivity, and obesity in humans (Grarup et al., 2008), while the relaxin family peptide receptor 2 (RXFP2) has been associated with horn morphology (Johnston et al., 2011; Wiedemar and Drögemüller, 2015).

Twelve genes (MELK, RNF38, GNE, CLTA, CCIN, RECK, HINT2, SPAG8, NPR2, FAM221B, MSMP, RGP1) were common between the E1<sup>∗</sup> S and E2<sup>∗</sup> S comparisons. On Oar2, three genes were identified within the overlapping candidate region, i.e., CLTA, which is associated with prion protein deposition in sheep (Filali et al., 2014), GNE, which is important for the metabolism of sialylated oligosaccharides in bovine milk (Wickramasinghe et al., 2011), and RECK, which encodes an inhibitor of angiogenesis, invasion and metastasis, DNA methylation, and increased mRNA in cell lines in humans (Su, 2012). Other genes (i.e., HINT2, SPAG8, and NPR2) are associated with fat deposition in sheep as herein discussed for each of the three comparisons.

Furthermore, one gene (DIS3L2) was in a candidate region that overlapped between the E1<sup>∗</sup> S and E3<sup>∗</sup> S comparisons. DIS3 like 3'-5' exoribonuclease 2 (DIS3L2) has also been identified, among genes involved in cancer, cellular function and maintenance, and neurological disease, in a candidate region under selection in cattle (Gautier et al., 2009). In sheep, using FST, iHS, and Rsb, de Simoni Gouveia et al. (2017) indicated that DIS3L2 is among genes associated with height variation. In addition, DIS3L2 has reportedly been associated with the Perlman syndrome, which is characterized by overweight in humans (Astuti et al., 2012).

Finally, seventeen genes (PKD2L2, FAM13B, WNT8A, NME5, BRD8, KIF20A, CDC23, GFRA3, CTNNA1, LRRTM2, SIL1, SPATA24, DNAJC18, SMIM33, TMEM173, FRY, ATP10A) were in candidate regions that overlapped between the E2<sup>∗</sup> S and E3<sup>∗</sup> S comparisons. Among these, DnaJ heat shock protein family (HSP40) member C18 (DNAJC18) and spermatogenesis associated 24 (SPATA24) on Oar5 were reported among genes involved in heat stress tolerance and male reproductive function, respectively, in East African Shorthorn Zebu cattle (Bahbahani et al., 2015).

# CONCLUSION

Overall, our results revealed four distinct autosomal genomic backgrounds (A, B, C, D) in Ethiopian indigenous sheep. The genotypes of most of the individuals analyzed were made up of at least two genetic backgrounds which could be accounted for by some level of current or historical admixture between populations. Selection signature analysis identified several putative candidate regions spanning genes relating to skeletal structure and morphology, fat deposition and possibly adaptation to environmental selection pressures. Our results indicate that Ethiopian indigenous sheep could be a valuable animal genetic resource that can be used to understand genetic mechanisms associated with body fat metabolism and distribution. This is especially important because fat deposits are a crucial component of adaptive physiology and excessive fat deposition in adipose tissue can result in obesity and overweight, and energy metabolism disorders in humans.

# DATA ACCESSIBILITY

Genotypic data of 160 animals representing eleven Ethiopian and two Sudanese sheep populations are deposited and available at https://www.animalgenome.org/repository/pub/NOTT2018.0423/.

# ETHICS STATEMENT

The animals used in this study are owned by farmers. Prior to sampling, the objectives of the study were explained to them in their local languages so that they could make an informed decision regarding giving consent to sample their animals. Government veterinary, animal welfare and health regulations were observed during sampling. The procedures involving sample collection followed the recommendation of directive 2010/63/EU. Skin tissues importation and/or exportation was permitted by the Ethiopian Ministry of Livestock and Fisheries under Certificate No: 14-160-401-16.

# AUTHOR CONTRIBUTIONS

AbA, JM, and OH conceived and designed the study. AbA analyzed the data and together with JM wrote the manuscript. JM and OH revised the manuscript. HB provided support in data analysis. SM, FP, and EC contributed to genotyping and genotype data of non-Ethiopian breeds (Najdi, Omani, and Libyan Barbary) and provided critical inputs on data analysis and in writing the manuscript. FA, MA, and MOA supported the sampling and genotyping of Najdi, Omani and Libyan sheep. AK and AyA led and coordinated the sampling of Ethiopian sheep. HM led and coordinated the sampling of Sudanese sheep. All authors contributed to the interpretation of the results based on their knowledge of the local indigenous sheep genetic resources of their respective countries. All the authors read and approved the final manuscript.

# ACKNOWLEDGMENTS

This study was conducted during AbA's PhD study which is sponsored by the Libyan Ministry of Higher Education and Scientific Research and the University of Misurata. Sampling of Ethiopian sheep was supported by the CGIAR Research Program on Livestock (Livestock CRP) and accordingly, ICARDA and ILRI wish to thank the donors supporting the Livestock CRP. This study forms part of our on-going efforts to understand the adaptation of local indigenous livestock to improve their productivity.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2018.00699/full#supplementary-material

Supplementary Figure 1 | Patterns of linkage disequilibrium (LD) calculated within the Ethiopian (ET) and Sudanese (SD) sheep populations.

Supplementary Figure 2 | Distribution of the standardized Z-score values for (A) hapFLK and (B) FST for the autosomal markers.

Supplementary Figure 3 | Distribution of genetic diversity indices within each breed. (A) SNP displaying polymorphism (Pn), (B) Expected heterozygosity (He); Observed heterozygosity (Ho); (C) Inbreeding coefficient (F).

Supplementary Figure 4 | Genetic variation among the Ethiopian sheep populations in a global geographic context (all animals included for each population).

Supplementary Figure 5 | Distribution of genetic variation among the worldwide fat-tail sheep (all animals included for each population).

Supplementary Figure 6 | Cross-validation error plot for admixture analysis of the studied populations (A) in the national and (B) in the global context.

Supplementary Figure 7 | Venn diagram showing the distribution and number of genes shared between the three groups of sheep (E1, E2, E3, S) used in the analysis of selection signatures.

Supplementary Table 1 | Description of the world-wide breeds of sheep used in the study.

Supplementary Table 2 | Proportion of the genetic backgrounds in each study population as identified by Admixture analysis.

Supplementary Table 3 | Results of f4 statistics for the study breeds as generated with TreeMix.

Supplementary Table 4 | Results of f3 statistics for the study breeds as generated with TreeMix.

Supplementary Table 5 | Results of signature of selection between the three population groups.

Supplementary Table 6 | Enriched functional term clusters and their enrichment scores following DAVID analysis for genes identified in the candidate regions under selection.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ahbara, Bahbahani, Almathen, Al Abri, Agoub, Abeba, Kebede, Musa, Mastrangelo, Pilla, Ciani, Hanotte and Mwacharo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genomic Selection and Use of Molecular Tools in Breeding Programs for Indigenous and Crossbred Cattle in Developing Countries: Current Status and Future Prospects

### Raphael Mrode<sup>1,2</sup>\*, Julie M. K. Ojango<sup>1</sup>, A. M. Okeyo<sup>1</sup> and Joram M. Mwacharo<sup>3</sup>

*<sup>1</sup> Animal Biosciences, International Livestock Research Institute, Nairobi, Kenya, <sup>2</sup> Animal and Veterinary Science, Scotland Rural College, Edinburgh, United Kingdom, <sup>3</sup> Small Ruminant Genomics, International Centre for Agricultural Research in the Dry Areas (ICARDA), Addis Ababa, Ethiopia*

Edited by: *Ino Curik, University of Zagreb, Croatia*

# Reviewed by:

*Laercio R. Porto-Neto, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia*

*Gregor Gorjanc, University of Edinburgh, United Kingdom*

\*Correspondence: *Raphael Mrode, r.mrode@cgiar.org*

### Specialty section:

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

Received: *08 May 2018* Accepted: *11 December 2018* Published: *09 January 2019*

### Citation:

*Mrode R, Ojango JMK, Okeyo AM and Mwacharo JM (2019) Genomic Selection and Use of Molecular Tools in Breeding Programs for Indigenous and Crossbred Cattle in Developing Countries: Current Status and Future Prospects. Front. Genet. 9:694. doi: 10.3389/fgene.2018.00694*

Genomic selection (GS) has resulted in rapid rates of genetic gain, especially in dairy cattle in developed countries, resulting in a higher proportion of genomically proven young bulls being used in breeding. This success has been undergirded by well-established conventional genetic evaluation systems. Here, the status of GS in breeding programs of developing countries, where smallholder systems predominate and the basic components for conventional breeding are mostly lacking, is examined in terms of the structure of the reference and validation populations, response variables, genomic prediction models, validation methods, and imputation efficiency. The application of genomic tools and the identification of genome-wide signatures of selection are also reviewed. Studies on genomic prediction in developing countries are mostly in dairy and beef cattle, usually with small reference populations (500–3,000 animals) consisting mostly of cows. The input variables tended to be pre-corrected phenotypic records, and the small reference populations have made the implementation of various Bayesian methods feasible in addition to GBLUP. Multi-trait single-step has been used to incorporate genomic information from foreign bulls; thus, GS in developing countries would benefit from collaborations with developed countries, as many dairy sires used are from developed countries where they may have been genotyped and phenotyped. Cross validation approaches have been implemented in most studies, resulting in accuracies of 0.20–0.60. Genotyping animals with a mixture of HD and LD chips, followed by imputation to HD, has been implemented, with imputation accuracies of 0.74–0.99 reported. This increases the prospects of reducing genotyping costs and hence of improving the cost-effectiveness of GS. Next-generation sequencing and associated technologies have allowed the determination of breed composition, parent verification, genome diversity, and genome-wide selection sweeps. This information can be incorporated into breeding programs aiming to utilize GS. Cost-effective GS in beef cattle in developing countries may involve the use of reproductive technologies (AI and *in-vitro* fertilization) to efficiently propagate superior genetics from the genomics pipeline. For dairy cattle, sexed semen of genomically proven young bulls could substantially improve profitability, thus increasing the prospects of smallholder farmers buying into genomic breeding programs.

Keywords: genomic selection, indicus cattle, GBLUP, sexed semen, accuracy

# INTRODUCTION

Genomic selection (GS) has resulted in rapid rates of genetic gain, especially in dairy cattle in developed countries, with the consequence that a higher number of the currently active artificial insemination (AI) sires in the USA are genomically proven young bulls (Hutchison et al., 2014). The authors reported that young bulls accounted for 28 and 25% of Holstein and Jersey inseminations in 2007, respectively. These percentages increased to 51 and 52%, respectively, in 2012 due to the use of genomically proven young bulls. Well-established conventional genetic evaluation systems have provided the strong foundation for the success of GS in these countries. Furthermore, the existence of well-developed breeding structures, particularly breeding companies, has made an enormous contribution to this success. In the dairy and beef industry, for example, the genotyping infrastructure for bulls and the associated costs have mainly been borne by AI companies such as CRV in the Netherlands (https://www.crv4all.com/), ABS in the USA (http://www.absglobal.com/us/) and Semex in Canada (http://www.semex.com/). In addition, these companies provide an efficient system for delivering superior genetics from the genomics pipeline.

In developing countries, especially in Africa and Asia, most of the production occurs in smallholder systems, which are characterized by small herd sizes and a lack of performance and pedigree recording, and therefore by the absence of conventional genetic evaluation systems (Kosgey and Okeyo, 2007). However, in some countries, such as Brazil in Latin America, the existence of breed associations has resulted in the establishment of some degree of data and pedigree recording and genetic evaluation (Silva et al., 2016; Boison et al., 2017), but there is still a lack of breeding structures, such as AI companies, to drive breed improvement programs. Therefore, in the era of genomics, most genotyping activities in developing countries are undertaken by breed organizations or associations, such as in Brazil (Carvalheiro, 2014; Silva et al., 2016), or are a result of several development projects, such as the East Africa Dairy Development Project (Brown et al., 2016) and the African Dairy Genetic Gains Cattle project (https://www.ilri.org/node/40458). Consequently, the number of genotyped animals tends to be limited and most are females, which has a major influence on both the size and structure of the reference and validation populations.

Given these characteristics, this paper examines the current status of GS and the use of molecular tools in breeding programs for dairy and beef cattle in developing countries and offers some future perspectives. The basic principle of GS is that single nucleotide polymorphisms (SNPs) are assumed to be in linkage disequilibrium (LD) with QTLs in the genome. Therefore, the use of SNPs as markers enables all QTLs in the genome to be indirectly identified through the mapping of chromosome segments defined by adjacent SNPs. The implementation of GS usually involves estimating the SNP effects in a reference population, which consists of individuals with phenotypic records and genotypes. This is then followed by prediction of genomic estimated breeding values (GEBV) for selection candidates (validation data set) with no phenotypes of their own (Meuwissen et al., 2001). Therefore, the current status of GS in developing countries is presented under the broad subtitles of the stages involved in the implementation of GS, such as structure of the reference and validation populations, definition of input variables, genomic prediction models, validation methods, imputation efficiency, genotyping strategies, and routine genomic evaluation. A section on the use of molecular genomic tools and identification of genome-wide signatures of selection is then presented.
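The two steps just described can be written compactly for one common formulation, SNP-BLUP (ridge regression on marker genotypes). This is a sketch under stated assumptions rather than the model of any particular study reviewed here: the symbols are introduced only for illustration, all SNP effects are assumed to share a common variance, the variance components are treated as known, and the columns of the genotype matrix are assumed to be centred so that the overall mean can be estimated by the sample mean of the response.

$$
\begin{aligned}
\mathbf{y} &= \mathbf{1}\mu + \mathbf{Z}\mathbf{g} + \mathbf{e}, \qquad \mathbf{g} \sim N(\mathbf{0}, \mathbf{I}\sigma_g^2), \quad \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2) \\
\hat{\mathbf{g}} &= (\mathbf{Z}'\mathbf{Z} + \lambda\mathbf{I})^{-1}\mathbf{Z}'(\mathbf{y} - \mathbf{1}\hat{\mu}), \qquad \lambda = \sigma_e^2/\sigma_g^2 \\
\widehat{\mathrm{GEBV}}_c &= \mathbf{Z}_c\hat{\mathbf{g}}
\end{aligned}
$$

Here $\mathbf{y}$ holds the response variable of the reference animals, $\mathbf{Z}$ their centred SNP genotypes, $\mathbf{g}$ the SNP effects, and $\mathbf{Z}_c$ the genotypes of the selection candidates centred with the same marker means. The first two lines correspond to estimating SNP effects in the reference population and the third to predicting GEBV for candidates without phenotypes.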

# STRUCTURE OF THE REFERENCE AND VALIDATION POPULATIONS

As indicated earlier, the lack of major AI companies to drive the initial breed improvement and genotyping activities in developing countries has meant that smaller numbers of animals are genotyped and most of these are females. As a result, it becomes very difficult to clearly define separate reference and validation populations, and studies have been designed to make optimal use of the available information. In general, the reported studies on genomic prediction in dairy and beef cattle are characterized by small reference populations (500–3,000 animals, **Table 1**), and most validations are undertaken in test data sets created by either random or structured sampling from all genotyped animals. A few of these reference populations are a combination of both bulls and cows (Boison et al., 2017) but most consist of cows (Brown et al., 2016; Silva et al., 2016). This has implications for the accuracy of genomic prediction, which has tended to be lower than that obtained in developed countries, given the limited information in the response variable when using cow records.

However, the inclusion of cows in the reference population has resulted in up to a 5-fold increase in the size of the reference population in some cases and increases of up to 12% in accuracy compared to using only bulls (Boison et al., 2017). In some of the studies (Neves et al., 2014; Silva et al., 2016; Boison et al., 2017), the accuracy of genomic prediction was assessed in validation sets consisting of young bulls born in more recent years. Thus, the accuracy of genomic prediction was evaluated in future selection candidates (forward validation) and therefore better reflects the accuracy that will be obtained when selecting young animals based only on their genotypes.

In the other studies, validation sets were created from all genotyped animals by either random or structured sampling, such as clustering (Ding and He, 2004), sampling based on the genomic relationship matrix (Cardoso et al., 2014) or sampling based on breed composition (Brown et al., 2016). In such cross validation studies, the validation sets tend to be contemporaneous with the reference animals to some degree. Consequently, the extent to which such estimates of accuracy are realized when selecting younger animals for breeding will be influenced by the degree of relationship between the reference and the sampled validation sets. Cross validation may therefore not necessarily give the best indication of the accuracy of selecting the youngest animals for breeding.
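The difference between forward validation and random cross validation comes down to how genotyped animals are assigned to the reference and validation sets. The sketch below contrasts the two partitioning schemes; it is illustrative only, the record layout (`id`, `birth_year`) and both helper functions are hypothetical, and structured sampling by clustering or breed composition is not shown.

```python
import random

def forward_split(animals, cutoff_year):
    """Forward validation: older animals train, younger animals validate."""
    reference = [a["id"] for a in animals if a["birth_year"] < cutoff_year]
    validation = [a["id"] for a in animals if a["birth_year"] >= cutoff_year]
    return reference, validation

def kfold_splits(animals, k=5, seed=1):
    """Random k-fold cross validation: each fold validates exactly once."""
    ids = [a["id"] for a in animals]
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    return [
        ([x for j, f in enumerate(folds) if j != i for x in f], folds[i])
        for i in range(k)
    ]

# Toy, made-up records (hypothetical IDs and birth years).
animals = [{"id": f"A{i}", "birth_year": 2005 + i % 10} for i in range(20)]
ref, val = forward_split(animals, cutoff_year=2012)
splits = kfold_splits(animals, k=5)
print(len(ref), len(val), [len(v) for _, v in splits])
```

In the forward split every validation animal is younger than every reference animal, whereas the random folds mix birth years, so validation animals can have contemporaries and close relatives in the reference set.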

The influence of the relationships between various validation sets derived by sampling and the reference set on the accuracy of genomic prediction has been examined in a few studies. Boison et al. (2017) observed that the average genomic relationship for the five top individuals with highest relationships in the reference and validation data sets varied from 0.321 to 0.410. Corresponding ranges of estimates considering the top 10 individuals were 0.262–0.362. These values were higher than estimates reported in other populations (Clark et al., 2012; Neves et al., 2014). Boison et al. (2017) indicated that an increase of 0.1 in the average genomic relationship for the top five individuals in the reference and validation sets (roughly equivalent to adding the sire of a selection candidate to the reference population) resulted in a substantial increase in accuracy of prediction of about 0.05. Similarly, Fernandes Júnior et al. (2016) used the genomic relationship matrix to examine the relationship between the reference and 5-fold validation sets. The average of the maximum relationship was about 0.25, and the averages for the top five and top 10 individuals with the highest genomic relationships were 0.19 and 0.17, respectively. These values are much lower than those reported by Boison et al. (2017) and approximately correspond to the average value of 0.125 for distant relationships computed from pedigree information by Clark et al. (2012). In contrast, Silva et al. (2016) examined the relationship between the reference animals and three sampled validation sets (random, young, unrelated) using the pedigree relationship matrix. The random set had the highest relationship between the reference and validation sets, with 2.14% of the animals having relationship coefficients ranging from 0.25 to 0.50 in both datasets. Corresponding estimates were 1.17 and 0.53% for the young and unrelated validation sets, respectively. As expected, the mean accuracy of genomic predictions reported by Silva et al. (2016) for the young validation set was intermediate between those for the unrelated and random data sets, with the latter being the highest.

Clark et al. (2012) indicated that the best predictor of accuracy was an animal's mean top 10 relationships with the reference followed by its highest relationship to the reference. Habier et al. (2010) reported that maximum relationship values of 0.6–0.49 between reference and validation sets gave the best estimates of accuracy of predictions. In general, the relationship between the training and validation sets in the genomic prediction models implemented in developing countries will fall within the categories of close relationships (0.5) and distant relationships (0.125) (Clark et al., 2012).
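The relationship summaries used in these studies (maximum, mean of the top five, mean of the top 10) can be computed directly from a relationship matrix. The following is a minimal sketch, assuming the matrix is available as a nested list and that the positions of reference and validation animals are known; the function name and the toy values are hypothetical.

```python
def mean_top_k_relationship(G, reference_idx, validation_idx, k=10):
    """For each validation animal, average its k largest relationships
    with reference animals.

    G is a square relationship matrix given as a list of lists;
    reference_idx and validation_idx are row/column positions in G.
    """
    result = {}
    for v in validation_idx:
        rels = sorted((G[v][r] for r in reference_idx), reverse=True)
        top = rels[:k]  # fewer than k values if the reference set is small
        result[v] = sum(top) / len(top)
    return result

# Toy 4-animal matrix with made-up values:
# animals 0-2 form the reference, animal 3 is a selection candidate.
G = [
    [1.00, 0.25, 0.10, 0.50],
    [0.25, 1.00, 0.05, 0.25],
    [0.10, 0.05, 1.00, 0.00],
    [0.50, 0.25, 0.00, 1.00],
]
top2 = mean_top_k_relationship(G, reference_idx=[0, 1, 2], validation_idx=[3], k=2)
print(top2)
```

Setting `k=1` gives the maximum relationship of each validation animal with the reference set, the second predictor mentioned by Clark et al. (2012).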

The small reference populations call for collaboration between developed and developing countries, given that some of the sires used in the latter could have been imported from the former. The benefits of including foreign genotypes for the accuracy of genomic prediction for milk, fat, and protein yields in Brazilian Holsteins were examined by Li et al. (2015) by including information from Nordic and French Holsteins. None of the Brazilian bulls and cows were genotyped, but a bivariate ssGBLUP approach was implemented incorporating genotypes of 5,244 and 5,088 Nordic and French bulls, respectively, that were genotyped with the Illumina 50 K chip, with their de-regressed breeding values (dEBVs) expressed on a Nordic scale. The first-lactation 305-day yields of the Brazilian cows were used in the analysis, with 115 of the Nordic and 19 of the French bulls represented as sires of these cows. The inclusion of only the Nordic sires resulted in increases in sire accuracies from a cross validation approach of 13, 64, and 4% for milk, fat, and protein yields, respectively, from the genomic prediction compared to using the pedigree relationship matrix. Including both French and Nordic bulls resulted in increases of 2 and 45% in reliability for milk and fat, respectively, but none for protein. While the expression of the dEBVs for French bulls on the Nordic scale simplified the analysis to a bivariate model, it could have limited the realization of all possible benefits of including information from the French bulls. However, the increases in cow reliabilities from using foreign genotypes were rather marginal. While the study demonstrated possible benefits from incorporating foreign genotypes, especially for the Brazilian bulls, it also stressed the need to undertake some genotyping in the developing countries, especially if the accuracy of cow evaluations is to increase substantially. Similarly, Haile-Mariam et al. (2015) demonstrated the benefits of incorporating foreign information in the genomic prediction for the Jersey breed, which has a small reference population of about 784 Australian bulls. The inclusion of about 2,000 foreign bulls with only daughter information in the Netherlands and New Zealand increased the genomic accuracy by 5% on average across 6 main dairy traits in the validation bulls relative to the use of only Australian information. The increase in accuracy resulting from the use of bulls with foreign information was relatively higher when bulls and cows in the validation sets were less related to the reference set.

The small reference populations indicate the need for across-region genomic prediction systems, where possible, with data pooled across nearby countries, especially in sub-Saharan Africa, where dairy systems tend to be similar. Several procedures and approaches for combining data across breeding programs or countries have been developed, ranging from post-evaluation blending procedures to the application of appropriate linear models or Bayesian methods (Vandenplas and Gengler, 2015; Vandenplas et al., 2018). Mrode et al. (2018) analyzed pooled data for milk yield from crossbred cattle in Kenya and Tanzania. The number of cows with genotypes was 539 in Tanzania and 1,034 in Kenya. The joint genomic prediction increased the accuracy of genomic prediction in Tanzania by more than 20% for most categories of cows, with substantial improvement in the predictive ability of the model for Tanzania. However, there was not much gain in accuracy for Kenyan animals from the joint analysis compared to the within-country analysis, as the Tanzanian data were very limited and the average relationship between the two populations was rather low.

# GENOMIC PREDICTION MODELS AND RESPONSE VARIABLES

The large data sets of genotyped bulls available for dairy cattle in developed countries have influenced the choice of models implemented for genomic prediction in developing countries. In addition, the complex models, such as the random regression models in dairy cattle and multi-trait models in beef cattle, implemented for conventional genetic evaluation at the national level in most developed countries have given rise to two-step genomic prediction systems, especially for dairy cattle (http://www.interbull.org/ib/nationalgenoforms). This implies running conventional evaluations to compute EBVs, which are subsequently de-regressed (dEBV) and used as input variables for SNP-BLUP or GBLUP genomic predictions (http://www.interbull.org/ib/nationalgenoforms). Recently, some developed countries have implemented single-step genomic evaluations (ssGBLUP), mostly in beef cattle (Moore et al., 2018), for the evaluation of fertility and calf traits.

However, in developing countries, the small data sets of genotyped individuals, in addition to either no or less complicated conventional genetic evaluation systems, have resulted in the implementation of GBLUP and various Bayesian methods; a summary is presented in **Table 1**. GBLUP has been commonly utilized, with **G** usually computed by method 1 of VanRaden (2008). Importantly, the computation of **G** has made it possible to estimate genetic relationships between different groups of animals and to undertake genetic evaluations in the absence of pedigree information (Mrode et al., 2018). The availability of genotypic information on only a limited proportion of animals has promoted the implementation of ssGBLUP (Misztal et al., 2009), enabling the combination of pedigree and genotypic information in the prediction of genetic merit, usually resulting in higher accuracy due to the utilization of all available data (Cardoso et al., 2014; Silva et al., 2016). A variety of Bayesian methods (**Table 1**) have been utilized, possibly due to the limited data size. However, no clear advantage of these methods over GBLUP or ssGBLUP has been demonstrated. It could be inferred that developing countries do not lag behind developed countries in terms of the models used in genomic prediction of genetic merit.
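For readers unfamiliar with it, method 1 of VanRaden (2008) builds **G** from centred allele counts scaled by the expected heterozygosity summed over markers. The sketch below uses plain Python lists for clarity. It assumes genotypes coded as 0, 1 or 2 copies of one allele with no missing values, and it estimates allele frequencies from the genotyped sample itself, which is a simplifying assumption of this example; the function name and toy data are hypothetical.

```python
def vanraden_g(genotypes):
    """Genomic relationship matrix, method 1 of VanRaden (2008).

    genotypes: list of animals, each a list of allele counts (0, 1 or 2)
    with no missing values. Allele frequencies are estimated from the
    genotyped sample itself (an assumption of this sketch).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    p = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    # Centre each marker by twice its allele frequency: Z = M - 2p
    Z = [[row[j] - 2 * p[j] for j in range(m)] for row in genotypes]
    # Scale by 2 * sum of p(1 - p) so that G is comparable to pedigree A
    scale = 2 * sum(pj * (1 - pj) for pj in p)
    return [
        [sum(Z[i][j] * Z[k][j] for j in range(m)) / scale for k in range(n)]
        for i in range(n)
    ]

# Toy genotypes (made up): 3 animals x 4 SNPs.
M = [
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
]
G = vanraden_g(M)
for row in G:
    print([round(x, 3) for x in row])
```

With thousands of animals and markers, the same computation would normally be done with matrix libraries or dedicated genetic evaluation software rather than nested loops.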

The availability of genotypic data, mostly on females, has influenced the response variable used in genomic prediction models in developing countries. Most studies have therefore used corrected phenotypic records of genotyped cows as input variables for genomic prediction (Brown et al., 2016; Fernandes Júnior et al., 2016; Silva et al., 2016). This usually involves an initial genetic evaluation, using either the pedigree or the genomic relationship matrix, to obtain the fixed-effect solutions for adjusting the phenotypic records. Where the amount of phenotypic information available on each cow varies, weights are sometimes computed to account for the varying accuracy associated with each record (Brown et al., 2016). As in developed countries, dEBVs from conventional genetic evaluations have been used as response variables in genomic prediction (Cardoso et al., 2014; Boison et al., 2017). The dEBVs in the study of Boison et al. (2017) were weighted in the analysis based on the reliability of the dEBV and the heritability of the trait. In some studies, because limited information resulted in poor de-regression (Morota et al., 2014), EBVs have been used as response variables (**Table 1**); in most of these studies the use of EBVs has resulted in lower accuracy of genomic prediction compared to the use of adjusted phenotypes (Fernandes Júnior et al., 2016; Silva et al., 2016). EBVs are rarely used as the response in developed countries, where the tendency is to use dEBVs, especially for traits that have well-established conventional evaluations. In some cases, especially for novel or difficult-to-measure traits that are recorded mostly on cows, such as feed intake, cow phenotypes have been utilized (de Haas et al., 2014).

Ideally the dEBVs used as response for the genomic prediction in the reference should not include information from the validation data set, otherwise the contribution of the information from the validation animals could lead to inflated estimates of reliabilities. However, this could not be achieved in the study by Boison et al. (2017) and so estimates of reliabilities were reported to be inflated.

# ACCURACY OF GENOMIC PREDICTIONS

The accuracy of genomic prediction is usually based on correlations between the direct genomic breeding values and the dEBVs or adjusted phenotypes in the validation data set. When adjusted phenotypes are used, the correlation coefficient is divided by the square root of the heritability of the trait to approximate the correlation between predicted and true breeding values (Legarra et al., 2008; Pryce et al., 2012). A similar approach has been employed in most of the studies in developing countries (Brown et al., 2016; Silva et al., 2016). However, in the studies of Terakado et al. (2014) and Boison et al. (2017), the estimation of accuracy of genomic prediction was based on prediction error variances estimated from the inverse of the mixed model equations. These are usually termed the theoretical or expected estimates of accuracy (VanRaden, 2008) and tend to be higher than the estimates obtained from correlations because they ignore changes in genetic variance due to drift or selection (Gorjanc et al., 2015). Further, the theoretical accuracy is based on the assumption that the statistical model used is the true genetic model. Taken together, the theoretical accuracies may often be inflated. The large number of animals in the reference populations of the genomic prediction systems of many developed countries means that it is not feasible to obtain the inverse of the mixed model equations; hence theoretical estimates of accuracy are not usually computed routinely, although this has been implemented in Canada based on a reduced set of SNPs (http://www.interbull.org/ib/nationalgenoforms).
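The correlation-based measure described above can be sketched as follows. This is a minimal illustration, assuming adjusted phenotypes are used as the response in the validation set; the function names and the numbers are hypothetical and do not come from any of the cited studies.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def validation_accuracy(dgv, adjusted_phenotype, h2):
    """Correlation of DGV with adjusted phenotypes, divided by sqrt(h2)."""
    return pearson(dgv, adjusted_phenotype) / math.sqrt(h2)


# Hypothetical validation animals.
dgv = [0.2, -0.1, 0.5, -0.4, 0.0, 0.3]
adjusted = [0.5, -0.3, 0.6, -0.9, 0.2, 0.1]
accuracy = validation_accuracy(dgv, adjusted, h2=0.36)
```

With a small validation set the raw correlation is noisy, and dividing by the square root of a low heritability can push the estimate above 1, so such values are best read with their sampling error in mind.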

The accuracy of genomic prediction for dairy traits in developed countries ranges from 0.50 to 0.85 for production traits with medium to high heritability, and from about 0.20 to 0.50 for fertility and survival traits with lower heritability (Moser et al., 2010; Wiggans et al., 2017). Accuracies for beef traits are slightly lower (0.33–0.55), due mainly to smaller reference populations (Saatchi et al., 2011; Lu et al., 2016). In developing countries, the accuracies of genomic predictions have been low to medium, in the range of 0.21–0.60. The major factors behind these ranges are the small size of the reference populations and their composition, being mostly cows whose phenotype data have lower accuracy than those of the progeny-tested bulls used in developed countries. The deterministic prediction equations for genomic accuracy by Goddard (2009) and Daetwyler et al. (2013) could help explain such lower accuracies arising from having mainly cows in the reference population. Assuming traits are influenced by a large number of QTL, Daetwyler et al. (2013) gave the following formula to predict genomic prediction accuracy, defined as the Pearson correlation ($r$) of true and predicted observed values: $r = \sqrt{N_p h^2 \, (N_p h^2 + M_e)^{-1}}$, where $N_p$ is the number of individuals with phenotypes and genotypes in the reference population, $h^2$ is the heritability of the trait or the reliability of breeding values in the reference population, and $M_e$ is the number of independent chromosome segments. $M_e$ can be computed as $M_e = 2 N_e L$, where $N_e$ is the effective population size and $L$ is the genome length in morgans. 
Using typical values of 100 and 30 for $N_e$ and $L$, respectively (Daetwyler, 2009), and assuming $N_p$ of 1,000 and reliabilities of about 0.80 and 0.3 for de-regressed breeding values (dEBV) of progeny-tested bulls and individual cows, respectively, the formula indicates that about 4–5 cows would be needed to provide information equivalent to one progeny-tested bull. Compared to specialized dairy breeds in developed countries, effective population size is likely to be higher in indigenous dairy cattle and crossbreds reared in smallholder systems. Increasing the value of $N_e$ to 200 in the above formula to account for this indicates that the ratio of about 4–5 cows providing information equivalent to one bull still holds.
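To make the formula concrete, the sketch below evaluates it for the input values quoted above (Ne = 100, L = 30 morgans, Np = 1,000, reliabilities of 0.80 and 0.3). The function names are hypothetical. The sketch only evaluates the expression as written; it does not re-derive the cow-to-bull equivalence reported in the text, which rests on the authors' own calculation.

```python
import math


def independent_segments(ne, genome_length_morgans):
    """Me = 2 * Ne * L."""
    return 2.0 * ne * genome_length_morgans


def expected_accuracy(n_ref, h2, me):
    """r = sqrt(Np * h2 / (Np * h2 + Me))."""
    information = n_ref * h2
    return math.sqrt(information / (information + me))


me = independent_segments(ne=100, genome_length_morgans=30)  # 6000 segments

# Reference population of 1,000 progeny-tested bulls (reliability 0.80).
r_bulls = expected_accuracy(n_ref=1000, h2=0.80, me=me)

# Reference population of 1,000 cows (reliability 0.3).
r_cows = expected_accuracy(n_ref=1000, h2=0.3, me=me)

print(f"Me = {me:.0f}, r (bulls) = {r_bulls:.3f}, r (cows) = {r_cows:.3f}")
```

For these inputs the expected accuracy is about 0.34 with bulls and about 0.22 with cows, which shows how strongly the reliability of the response variable limits what a reference population of a given size can deliver.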

Also, the lack of a proper breeding program in most developing countries means there is no breeding structure to ensure that younger animals in the validation set are well related to animals in the reference population. The levels of accuracy reported in most of the studies are nevertheless higher than would be obtained from the parental average, although they are lower than those estimated in developed countries, thus providing a basis for the selection of good bulls that can be used as parents of the next generation.

As with the accuracy of genomic predictions, the regression of the response variable on direct genomic breeding values in the validation set, a measure of the calibration (inflation or deflation) of GEBV, has shown great variation (**Table 1**). In some of the studies, the regression coefficients were in general close to 1, as expected, for traits of higher heritability, whereas for lowly heritable traits they were over 1 in most analyses, meaning that predictions were underestimated (Fernandes Júnior et al., 2016; Silva et al., 2016; Boison et al., 2017). The Bayesian methods (BayesC, BayesCπ, and Bayesian Lasso) have resulted in underestimated predictions compared to GBLUP in several of these studies (Neves et al., 2014; Boison et al., 2017). However, some of the regression coefficients were rather low, below 0.5 (**Table 1**), due mainly to the small size of the reference population. An improvement in calibration is expected as more animals are genotyped.
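The calibration measure discussed above is the slope of a simple linear regression of the response on the direct genomic values. A minimal sketch follows, with a hypothetical function name and made-up validation data; a slope near 1 indicates well-calibrated predictions, a slope above 1 indicates underestimated predictions, and a slope below 1 indicates inflated ones.

```python
def regression_slope(response, dgv):
    """Slope of the regression of the response variable on DGV."""
    n = len(dgv)
    mean_x = sum(dgv) / n
    mean_y = sum(response) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dgv, response))
    sxx = sum((x - mean_x) ** 2 for x in dgv)
    return sxy / sxx


# Hypothetical validation animals: DGV and the corresponding response
# (dEBV or adjusted phenotype).
dgv = [-1.0, -0.5, 0.0, 0.5, 1.0]
response = [-1.4, -0.9, 0.1, 0.8, 1.4]
slope = regression_slope(response, dgv)  # 1.46 for these toy values
```

Here the slope exceeds 1, so the toy predictions are less variable than the response they are meant to predict.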

# UTILIZING LOW DENSITY CHIP AND IMPUTATION

A major issue with the implementation of GS is the cost of genotyping, which constitutes one of the obstacles to GS in developing countries. Several studies have therefore examined the effect on the accuracy of genomic prediction of using cheaper low-density chips, or low numbers of SNPs accompanied by imputation.

Boison et al. (2017) examined the use of several LD chips, using common SNPs between the HD and the Illumina 50K, GeneSeek super genomic profiler (SGGP-20Ki), and GeneSeek genomic profiler (GGP-75Ki) in genomic prediction. The accuracy of genomic prediction they reported when only bulls were used in the reference population was similar in the LD chips compared to the HD. However, with a larger reference population consisting of bulls and cows, they reported an average increase in reliability of 3.3% across all traits with the HD marker panel compared with SGGP-20Ki. In addition, Boison et al. (2017) examined the impact of using un-imputed HD genotypes in the validation datasets compared to the use of HD genotypes imputed from LD chips. The imputation accuracy was high (about 0.96 on average) and the use of imputed genotypes had no effect on the accuracy of estimates.

Aliloo et al. (2018) investigated the efficacy of imputation in East African crossbred dairy cattle, in terms of the accuracy of imputation and of genomic prediction, using four different commercial chips [Illumina BovineLD v2, BovineSNP50 v3, GeneSeek-Genomic-Profiler (GGP) Bovine 50 K, and Indicus 35 k v1.03 (Neogen Corporation, Lincoln, NE, USA)] with different reference populations and three different imputation algorithms [FImpute v2.2 (Sargolzaei et al., 2014), Beagle v4.1 (Browning and Browning, 2016), and Minimac v3 (Das et al., 2016)]. The highest imputation accuracy was obtained with a reference population consisting of a mixture of crossbred and ancestral purebred animals and using Minimac. The accuracies of imputation, measured as the correlation between real and imputed genotypes, were around 0.76 and 0.94 for 7 and 40 K SNPs, respectively, when imputed up to a 770 K panel. In general, the accuracy of imputation from LD chips to HD genotypes increased as the genomic relationships between target and reference animals increased.
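Imputation accuracy measured as the correlation between real and imputed genotypes can be sketched as below. This is an illustrative simplification, assuming genotypes coded 0/1/2 and a single correlation computed over all masked genotypes pooled across animals and SNPs; studies may instead average per-animal or per-SNP correlations. The function names and data are hypothetical.

```python
import math


def genotype_correlation(true_genotypes, imputed_genotypes):
    """Pearson correlation between true and imputed genotypes.

    Both arguments are lists of rows (one per animal); all genotypes
    are pooled into a single correlation in this sketch.
    """
    x = [g for row in true_genotypes for g in row]
    y = [g for row in imputed_genotypes for g in row]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def concordance(true_genotypes, imputed_genotypes):
    """Fraction of genotypes imputed exactly right."""
    pairs = [
        (t, i)
        for t_row, i_row in zip(true_genotypes, imputed_genotypes)
        for t, i in zip(t_row, i_row)
    ]
    return sum(1 for t, i in pairs if t == i) / len(pairs)


# Hypothetical masked genotypes for 2 animals at 4 SNPs.
true_g = [[0, 1, 2, 1], [0, 2, 1, 1]]
imputed_g = [[0, 1, 2, 2], [0, 2, 1, 0]]
r = genotype_correlation(true_g, imputed_g)
rate = concordance(true_g, imputed_g)  # 6 of 8 genotypes match
```

The correlation is often preferred to the concordance rate because it is less flattering for SNPs with a low minor allele frequency, where always imputing the common homozygote already gives high concordance.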

In addition to examining the efficiency of imputation from different commercial chips, the study of Aliloo et al. (2018) also examined the efficiency of several methods for creating low density SNP chip panels of varying sizes (3,757 to 37,8216) from the HD Illumina chip. The methods examined for SNP selection included using MAF within intervals, random selection within intervals, random selection across chromosome, MAF across chromosome and the covariance method (which accounted for the covariance between adjacent SNPs and the MAF of SNPs). The efficiency of each method was determined by the accuracy of imputing the created LD chips to the HD and the accuracy of using the imputed HD in genomic prediction. The covariance method performed best compared to the other methods. The accuracies of imputation from the 7 and 40 K panels selected using the covariance method were around 0.80 and 0.94, respectively. It also resulted in higher accuracy of genomic prediction at lower densities of selected SNPs.
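One of the simpler selection strategies listed above, choosing SNPs by MAF within intervals, might look like the sketch below. The exact rule used by Aliloo et al. (2018) is not given here, so this assumes that a chromosome is split into fixed-width intervals and the SNP with the highest MAF is kept in each; the function name, interval width, and SNP records are all hypothetical.

```python
def select_by_maf_within_intervals(snps, interval_bp):
    """Keep the highest-MAF SNP in each fixed-width interval.

    snps: list of (name, position_bp, maf) tuples on one chromosome.
    Returns the selected SNP names ordered by position.
    """
    best = {}
    for name, position, maf in snps:
        interval = position // interval_bp
        # Keep the first SNP seen, or replace it if this one has higher MAF.
        if interval not in best or maf > best[interval][2]:
            best[interval] = (name, position, maf)
    return [best[interval][0] for interval in sorted(best)]


# Hypothetical HD SNPs on one chromosome.
hd_snps = [
    ("s1", 1_000, 0.10),
    ("s2", 40_000, 0.45),
    ("s3", 60_000, 0.30),
    ("s4", 90_000, 0.35),
    ("s5", 150_000, 0.05),
]
panel = select_by_maf_within_intervals(hd_snps, interval_bp=50_000)
# panel == ["s2", "s4", "s5"]
```

Intervals with no SNP contribute nothing, and a lone low-MAF SNP such as `s5` is still selected because it is the only candidate in its interval; a production panel design would likely add a minimum MAF threshold.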

The influence of foreign genotypes on imputation accuracy, when imputing from 6, 9, 50, and 77 K chips to the 45 K markers used in the USA genomic evaluations in 2014, was examined by García-Ruiz et al. (2014) in Mexican Holstein under three scenarios: (i) using only 2,018 Mexican genotyped animals; (ii) animals from scenario (i) plus 886 related North American animals; and (iii) animals from scenario (i) and 338,073 North American genotyped animals. High imputation accuracies were obtained (96, 96, 99, and 99% when imputing from 6, 9, 50, and 77 K chips, respectively) when using only local genotypes [scenario (i)]. With scenario (ii), the imputation accuracy increased by almost 1% for the 6 and 9 K chips and by half a percentage point for the 77 K chip. Under scenario (iii), relative to scenario (i), there was an increase of ∼2% for the 6 and 9 K chips and 1 percentage point for the 77 K chip. However, no increase in accuracy was observed for the 50 K chip in any scenario, because the large number of SNPs common to both chips meant that only a small number of SNPs were actually imputed. Generally, high imputation accuracies have been reported in developing countries although the reference populations are smaller than those in developed countries. This may be because the imputation involves mostly cows and only a limited number of sires may be used in these populations, resulting in a higher degree of relatedness. However, collaboration between developed and developing countries could be beneficial in further increasing imputation accuracies (García-Ruiz et al., 2014).

An LD SNP chip purpose-built for GS in crossbred populations (Hidalgo et al., 2016) has recently been developed by the National Dairy Development Board (NDDB) of India (https://www.nddb.coop/services/animalbreeding/geneticimprovement/genomic). The chip, called the INDUSCHIP and consisting of 45,700 SNPs, was developed from HD genotypes of mainly four indicus breeds (Gir, Sahiwal, Kankrej, Red Sindhi) and their taurine crosses, mostly with Holstein and Jersey, in India, and has been employed for the determination of breed composition and genomic prediction for milk yield.

# ROUTINE GENOMIC EVALUATIONS

The basis of tremendous genetic progress from GS in developed countries has been underpinned by routine genomic predictions several times in a year. Although several studies have been undertaken in several breeds (see **Table 1**) in developing countries, routine genomic prediction is undertaken in only a few breeds. Several parallel breeding improvement programs exist in the Nellore beef cattle in Brazil and some GS is currently being undertaken in some of these breeding programs (Carvalheiro, 2014). The author indicated that several independent Nellore breeding programs have already developed prediction equations for usual and difficult/expensive to measure traits, however some of the programs are using genomic predictions more as a marketing than a selection tool. Carvalheiro (2014) summarized the two business models driving GS in the Nellore cattle. In the first scenario, the breeders or the breeding programs do not have access to the genotypes and genomic prediction equations are regarded as intellectual property of the multinational private companies that invested in their development. Under this model the genomic breeding values (GEBVs) are produced, for example, by combining genomic predictions and regular EBVs as correlated traits in a multi-trait mixed animal model analyses (Garrick, 2011). Therefore, the breeding programs become dependent on the company that sells the GEBVs and its sustainability depends on the interest of the commercial company to constantly invest in recalibrating the prediction equations. The second model he described involves breeding programs and the breeders have full access to the genotypes. He considered this a very attractive model because no dependencies exist between any two segments, enabling breeding programs to change their service providers without any prejudice if they are not satisfied, for example, with the genotyping cost or with the quality of the genetic evaluations.

The African Dairy Genetic Gains (ADGG) project in Tanzania and Ethiopia is currently establishing a pipeline for routine genetic evaluation using the genomic relationship matrix, in addition to screening and selecting young bulls using genomic predictions (https://www.slideshare.net/ILRI/mrodewcap). The absence of AI companies to drive genetic improvement programs implies that genetic and genomic evaluations would inevitably be linked to either National Artificial Insemination Centers or breed societies to help deliver the superior genetics. This is the approach currently being exploited by ADGG, while encouraging public-private partnership in the space. The beef breeds in South Africa are in the process of implementing GS, but current activities are still limited to defining the reference population and understanding the population structure.

# MOLECULAR GENOMIC TOOLS AND IDENTIFICATION OF GENOME-WIDE SIGNATURES OF SELECTION

Genome sequencing and SNP genotyping technologies, together with new statistical tools, have prompted a transition from studies focusing on the analysis of neutral variation to studies of functional variation. These developments have led to new tools for addressing fundamental and applied questions in evolutionary and developmental biology, and animal breeding. Sequencing of full genomes and the development of SNP chip sets have led to studies on the identification and mapping of genes and QTLs, genome-wide association analysis (GWAS) and genome-wide signatures of selection, introgression, and/or admixture. These studies have led to the identification of many genes, some of which have been incorporated into selection schemes. In developed countries, whole genome sequence analysis and GS are being applied in breeding schemes of major food animals (cattle, sheep, goats, chicken, pigs). In developing countries, genomic technologies are applied to assess genetic diversity, admixture, and signatures of selection in order to identify genomic regions and variants contributing to variation.

Genomic technologies have shown that indigenous cattle in developing countries have high levels of genome diversity compared to commercial breeds (Kim et al., 2017) due to their different breeding history (Freeman et al., 2004; Decker et al., 2014; Flori et al., 2014; Edea et al., 2015). Kim et al. (2017) also revealed that the genomes of indigenous breeds are admixed, which suggests genomic diversity as an efficient adaptation strategy. SNP genotyping and whole-genome sequencing have shown that the genome admixture is of both ancient and recent origin. An analysis of zebu cattle from Kenya, Uganda, and Nigeria revealed an even admixed autosomal Asiatic indicine × African taurine genome composition as well as European taurine ancestry (Mbole-Kariuki et al., 2014; Bahbahani et al., 2017), confirming previous findings (Hanotte et al., 2002; Decker et al., 2014). The Asian indicine × African taurine composition is ancient and decreases westwards and southwards from the Horn of Africa (Hanotte et al., 2002; Decker et al., 2014), while the European taurine background arises from recent crossbreeding of local cattle with European Bos taurus breeds. For example, the Borgou cattle of West Africa is a stabilized admixed breed with genetic contributions from four African taurine (Baoulé, Somba, Lagune, N'Dama) and two African Zebu (Fulani, Bororo) cattle, whose origin traces back to about 130 years ago (Flori et al., 2014). The genomes of Kenyan local cattle have contributions from several B. taurus breeds including Guernsey, Norwegian Red, and Holstein, with the contribution of Holstein-Friesians being the most substantial (Kim and Rothschild, 2014). The authors postulate that the admixture occurred in recent times. Admixed genomes are also a common feature of indigenous and locally developed breeds of cattle in South Africa (Makina et al., 2014). 
Admixed genomes have also been observed in Asian (India, Pakistan, China, and Indonesia) Bos indicus cattle, which show evidence of Bos javanicus ancestry (Decker et al., 2014). Kumar et al. (2003) reported an ancestral influence from taurine cattle, probably of Near Eastern origin, in South Asian Bos indicus cattle, and Wangkumhang et al. (2015) observed a Southeast Asian indicine ancestry in the genomes of Thailand cattle.

Written pedigree records are lacking on most smallholder farms in developing countries, which makes it almost impossible to make informed breeding decisions. Genomic technologies can be valuable in this case for assessing breed composition and assigning parentage (Werner et al., 2004; Weerasinghe, 2014). Recently, Strucken et al. (2017) demonstrated such an application using crossbred cattle in East Africa (Kenya, Uganda, Ethiopia, Tanzania). The authors identified two marker panels with 200 SNPs each. One panel gave the best prediction of dairy breed composition and the other resulted in accurate parentage assignment. A composite panel incorporating the 400 SNPs achieved sufficient accuracy in estimating breed admixture proportions but not in parentage identification.

The development of new technologies that assess genome architecture at high resolution (full genome sequences, HD chips, etc.) has resulted in a large number of studies investigating genome-wide signatures of selection in indigenous cattle in developing countries, especially in African cattle. For instance, 18 candidate regions under selection, intersecting genes and QTLs associated with production and reproduction performance and adaptation to environmental stress (e.g., immunity and heat stress), were identified in East African cattle from the analysis of SNP genotype datasets (Bahbahani et al., 2017). Bahbahani et al. (2018) found several dairy trait QTLs overlapping candidate selection regions in Kenana and Butana cattle based on the analysis of SNP genotype data. Using whole genome scans, Gautier et al. (2009) identified 53 genomic regions that spanned 42 genes with functions related to immune response, nervous system, and skin and hair properties in West African cattle. Makina et al. (2015) identified 47 candidate selection regions which also spanned genes associated with adaptation to tropical environments, nervous system, immune response, production and reproductive performance in South African cattle. In a study that analyzed genome sequences of indigenous breeds of cattle from East, West and Southern Africa, Kim et al. (2017) identified signatures of selection including genes and/or pathways controlling anemia, feeding/drinking behavior and circadian rhythm in the N'Dama, coat color and horn development in Ankole, and heat tolerance/thermoregulation and tick resistance in Boran, Ogaden, and Kenana cattle. The findings from the selection signature studies, which span genes with functions related to production, reproduction and adaptation, suggest that the genomes of African indigenous cattle have been uniquely selected to maximize hybrid fitness and the ability to reproduce and perform in stressful environments.

# FUTURE PROSPECTS

The major factors limiting the application of GS in developing countries are the poor breeding infrastructure that is fundamental to conventional breeding, the lack of routine recording of reliable phenotypes, and the lack of good analytical tools to synthesize the data and provide timely feedback to help improve farmer management and husbandry techniques. The ADGG has sought to address some of these major bottlenecks in East Africa by employing recent developments in information and communication technology (ICT). In addition, as Ribaut et al. (2010) indicated, the revolution in ICT has also created opportunities to counter some of the shortcomings in resources through the establishment of global virtual platforms. For example, the Bill & Melinda Gates Foundation and the CGIAR Generation Challenge Program have established a public molecular breeding platform (https://www.gatesfoundation.org/Media-Center/Press-Releases/2010/02/GCP-launches-Molecular-Breeding-Platform) as a one-stop shop that centralizes functional access to modern breeding technologies, a marker service laboratory, and data management and analysis for crops. A similar initiative for livestock, or the incorporation of livestock requirements into such a center, would boost genomic activities and increase cost efficiency. The rapid developments in marker technologies have led to high-throughput platforms for SNP genotyping and hence reduced costs. However, even in the absence of such centers as described above, good, cost-effective, and easily accessible outsourced genotyping services are now available. This provides opportunities to increase the efficiency of implementing advanced genomics in developing countries.

The provision of bundled services beyond just GS will accelerate the adoption and use of molecular tools including GS. Programs for genetic improvement utilizing genomic approaches should include the development of tools for parentage verification, breed composition determination, mating decisions that exploit genomic information, traceability, breed characterization, ready computation of genomic inbreeding, and issues relating to sustainable utilization. Such an approach maximizes the benefits of genotyping and increases cost-efficiency.

Generally, in the beef industry, GS is expected to generate a more modest increase in genetic gain for regular traits than in dairy cattle, partly because of the breeding structure and the relatively limited use of AI. Strategies for optimizing the cost-benefit of applying GS in beef cattle in developing countries are still being investigated. Carvalheiro (2014) compared several scenarios for the application of GS in Nellore cattle, using the current breeding scheme for Nellore as the base standard. In brief, this consisted of a breeding program with half of its calves born from AI proven bulls and the other half from natural mating sires, with an estimated annual genetic gain of 0.134 genetic standard deviations for growth traits. When only genotyped young sires were used for a fixed time in AI, annual genetic progress increased by about 58% compared to the base situation. When a scheme that incorporated GS together with in vitro fertilization (IVF), with embryos produced by genotyped donors accounting for 5% of the calves, was investigated, the annual increase in genetic gain was 79% relative to the base situation. Carvalheiro (2014) concluded that more pronounced genetic gains can be realized if GS is applied in combination with reproductive technologies, which agrees with the observations of García-Ruiz et al. (2014). Carvalheiro (2014) further indicated that the production of embryos through IVF is becoming very accessible in Brazil, at a cost of about US\$150 per calf born. This would indicate that much higher returns from the application of GS in beef cattle in developing countries would involve pronounced usage of reproductive technologies, incorporating to some degree the widespread use of both AI and IVF. 
Even in dairy cattle, future investments in the production of high-quality genomically proven embryos for use in medium- to large-scale farms could become routine for the rapid dissemination of superior genetics, leading to more benefits from GS.

Another development in reproductive technologies that is likely to have a profound effect on the dairy cattle industry is the use of sexed semen. In the smallholder system, the cost of purchasing a replacement heifer constitutes a major capital investment not easily affordable to most farmers. Also, the milking of the dairy cow constitutes the main source of income for the dairy farmer in India, given the sacred status of cattle. The use of sexed semen from genomically proven young bulls, with a very high probability of a female calf, could substantially improve the productivity and profitability of smallholder farmers and therefore offers an incentive for farmers to buy into genomic breeding programs. Thus, continuous improvement in semen sorting technologies and in methods to enhance conception rates with sexed semen opens up future prospects for the application of GS.

Collaboration between developing and developed countries will be important in implementing genomic breeding technologies in the former, especially in dairy cattle, where there has been a large importation of bulls. It is likely that most of these bulls have been genotyped in the developed countries, and a willingness to share genotypes and other relevant performance data will help to enlarge the reference population and hence increase the accuracy of genomic predictions in developing countries. Some of the possible impacts have been demonstrated by Li et al. (2015).

The ability of governments to put in place enabling policies and statutory and regulatory frameworks that encourage private-public partnerships will be crucial in the long term for sustaining breeding programs based on conventional or genomic approaches. Also, the limited genomic data in each country calls for pooling of data across multiple countries or geographic regions to maximize the benefits of GS. Initial increases in accuracy resulting from pooling data across two countries have been demonstrated (Mrode et al., 2018). However, pooling data across countries could be a sensitive issue in terms of who has access to the data from other countries. Thus, different government bodies in developing countries need to come up with proper and well-defined protocols that guide and govern data sharing with adequate confidentiality.

# AUTHOR CONTRIBUTIONS

RM and JM undertook most of the work in this manuscript and contributed equally. JO and AO contributed as part of the team that generated data for the initial work on genomic prediction in East Africa and part of the current ADGG project that is stimulating GS in East Africa.

# ACKNOWLEDGMENTS

We sincerely acknowledge the Bill and Melinda Gates Foundation for funding the Dairy Genetics for East Africa (DGEA) project and the African Dairy Genetic Gains (ADGG) project that forms the basis for stimulating GS in dairy cattle in East Africa. The authors are particularly grateful to the ADGG for sponsoring this manuscript.

# REFERENCES


recent admixture history of East African Shorthorn Zebu from Western Kenya. Heredity 113, 297–305. doi: 10.1038/hdy.2014.31


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Mrode, Ojango, Okeyo and Mwacharo. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Detection of Selection Signatures Among Brazilian, Sri Lankan, and Egyptian Chicken Populations Under Different Environmental Conditions

Muhammed Walugembe<sup>1</sup> \*, Francesca Bertolini<sup>1</sup> , Chandraratne Mahinda B. Dematawewa<sup>2</sup> , Matheus P. Reis<sup>3</sup> , Ahmed R. Elbeltagy<sup>4</sup> , Carl J. Schmidt<sup>5</sup> , Susan J. Lamont<sup>1</sup> and Max F. Rothschild<sup>1</sup>

<sup>1</sup> Department of Animal Science, Iowa State University, Ames, IA, United States, <sup>2</sup> Department of Animal Science, Faculty of Agriculture, University of Peradeniya, Kandy, Sri Lanka, <sup>3</sup> Department of Animal Science, College of Agricultural and Veterinary Sciences, São Paulo State University, Jaboticabal, Brazil, <sup>4</sup> Department of Animal Biotechnology, Animal Production Research Institute, Giza, Egypt, <sup>5</sup> Animal and Food Sciences, University of Delaware, Newark, DE, United States

### Edited by:

Peng Xu, Xiamen University, China

### Reviewed by:

Yu Jiang, Northwest A&F University, China

Keliang Wu, China Agricultural University, China

> \*Correspondence: Muhammed Walugembe mwalugem@iastate.edu

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 13 April 2018 Accepted: 22 December 2018 Published: 14 January 2019

### Citation:

Walugembe M, Bertolini F, Dematawewa CMB, Reis MP, Elbeltagy AR, Schmidt CJ, Lamont SJ and Rothschild MF (2019) Detection of Selection Signatures Among Brazilian, Sri Lankan, and Egyptian Chicken Populations Under Different Environmental Conditions. Front. Genet. 9:737. doi: 10.3389/fgene.2018.00737

Extreme environmental conditions are a major challenge in livestock production. Changes in climate, particularly those that contribute to weather extremes like drought or excessive humidity, may result in reduced performance and reproduction and could compromise the animal's immune function. Animal survival within extreme environmental conditions could be in response to natural selection and to artificial selection for production traits that over time together may leave selection signatures in the genome. The aim of this study was to identify selection signatures that may be involved in the adaptation of indigenous chickens from two different climatic regions (Sri Lanka = Tropical; Egypt = Arid) and in non-indigenous chickens that derived from human migration events to the generally tropical State of São Paulo, Brazil. To do so, analyses were conducted using the fixation index (Fst) and hapFLK. Chickens from Brazil (n = 156), Sri Lanka (n = 92), and Egypt (n = 96) were genotyped using the Affymetrix Axiom® 600k Chicken Genotyping Array. Pairwise Fst analyses among countries did not detect major regions of divergence between chickens from Sri Lanka and Brazil, with ecotypes/breeds from Brazil appearing to be genetically related to Asian-Indian (Sri Lanka) ecotypes. However, several differences were detected in comparisons of Egyptian with either Sri Lankan or Brazilian populations, and common regions of difference on chromosomes 2, 3 and 8 were detected. The hapFLK analyses for the three separate countries suggested unique regions that are potentially under selection on chromosome 1 for all three countries, on chromosome 4 for Sri Lankan, and on chromosomes 3, 5, and 11 for the Egyptian populations. Some of the regions identified as under selection by the hapFLK analyses contained genes such as TLR3, SOCS2, EOMES, and NFAT5, whose biological functions could provide insights into adaptation mechanisms in response to arid and tropical environments.

Keywords: chickens, environment, selection signatures, adaptation, immune system

# INTRODUCTION


Extreme environmental conditions are a major challenge in livestock production. Changes in climate, particularly those that contribute to weather extremes like drought, extreme temperatures, or extreme humidity, may result in reduced performance and reproduction and could compromise the animal's immune function (St-Pierre et al., 2003). In chickens, extreme environmental temperatures lead to the generation of reactive oxygen species (ROS), causing oxidative stress and lipid peroxidation (Altan et al., 2003). However, chickens, particularly the local (indigenous) breeds, often adapt over time to tolerate extremely challenging environments. Local chicken populations are typically kept with limited management and veterinary services but are considered important genetic resources. They are reported to have arisen through many hundreds of years of successful adaptation to extreme environments (Hall and Bradley, 1995). In Egypt, there is undisputed evidence that chickens (domestic fowl) have been kept since 1840 B.C. (Coltherd, 1966), and Egypt was a major entry point of Indian chickens to the African continent (Eltanany and Hemeda, 2016; Osman et al., 2016). Egyptian local breeds are generally classified into three groups: the first group comprises native breeds such as Fayoumi and Dandarawi, the second group includes the Baladi and Sinai strains, and the third group results from crosses between exotic and local strains accompanied by selection for various traits (Osman et al., 2016). The native/local breeds/ecotypes have been kept as backyard or free-range chickens and could have developed adaptation mechanisms to their respective climates. Despite these successful adaptations, there is limited knowledge about the genomic regions involved in the adaptation of local village chickens to specific environmental conditions.
There is also uncertainty about whether the geographical locations of local chicken populations could be the cause of their genetic differentiation (Mahammi et al., 2016). Domestication by humans and subsequent breed formation have led to chickens being adapted in physiology, morphology, fertility, and behavior to increase production (Ericsson et al., 2014). Selection pressure, natural or artificial, has been influential in enabling chickens to adapt to their environments and may leave signatures of selection in chicken population genomes. Signatures of selection, or selective sweeps as they are sometimes called, are particular patterns of DNA identified in regions of the genome that carry a mutation and are, or have been, under selection pressure in a population (Qanbari and Simianer, 2014). Whenever there is positive selection for a particular allele, such regions exhibit larger stretches of homozygosity than expected under Hardy-Weinberg equilibrium. These regions may contain genes with functional importance in particular processes and reflect allelic selection under differing environmental conditions.

Many methods are used to detect selection signatures in the genome. These methods are classified into intra-population and inter-population statistics. Inter-population statistical analyses can be categorized into single-site or haplotype differentiation analyses (Qanbari and Simianer, 2014). To detect regions of divergence or similarity, most studies have used the single-site differentiation statistic commonly known as the fixation index, Fst (Elferink et al., 2012; Gholami et al., 2015; Fleming et al., 2017), and hapFLK (Gholami et al., 2015) analyses to detect selection signatures in both commercial and non-commercial breeds. Inter-population statistics are reported to have more statistical power to detect selection signatures in recently diverged populations (Yi et al., 2010). The major concern with Fst is that it assumes that the populations have the same effective population size and are derived independently from one ancestral population (Price et al., 2010). HapFLK is based on an extension of the FLK statistic and accounts for both the hierarchical structure of populations and haplotype information; its use greatly improves detection power, and it can detect signatures of selection that may be occurring across several populations (Fariello et al., 2013).
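To make the single-site differentiation statistic concrete, the following minimal sketch computes Wright's Fst for one biallelic SNP from the allele frequencies in two populations, as the proportional reduction in expected heterozygosity within populations (H_S) relative to the pooled population (H_T). The function name `wright_fst` and the equal weighting of the two populations are assumptions made for illustration; the estimator implemented in genotype-analysis software may include sample-size corrections that are not shown here.

```python
def wright_fst(p1: float, p2: float) -> float:
    """Fst for one biallelic SNP from allele frequencies in two populations.

    Assumes both populations are weighted equally (illustrative simplification).
    """
    p_bar = (p1 + p2) / 2.0                      # pooled allele frequency
    h_total = 2.0 * p_bar * (1.0 - p_bar)        # expected heterozygosity, pooled (H_T)
    if h_total == 0.0:
        return 0.0                               # monomorphic in both populations
    # mean expected heterozygosity within populations (H_S)
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # identical frequencies give no differentiation; fixed differences give Fst = 1
    print(wright_fst(0.5, 0.5), wright_fst(0.2, 0.8), wright_fst(1.0, 0.0))
```

Values near 0 indicate little differentiation at the SNP, whereas values approaching 1 indicate that the populations are close to fixation for different alleles.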

In this study we applied both Fst and hapFLK statistical analyses to indigenous chicken breed/ecotype populations from three countries with different climates (Brazil and Sri Lanka = Tropical; Egypt = Arid) to search for regions where selection may have taken place and shaped the genome to enable the chickens to adapt to different environments.

# MATERIALS AND METHODS

Chicken blood sample collection procedures in Brazil were approved by the Animal Care and Use Committee of São Paulo State University (Process 009999/14; approved on 06 June 2014). Chicken blood samples from Egypt and Sri Lanka were collected in accordance with local veterinary guidelines.

# Sample Collection

Blood samples were collected from 156 Brazilian, 92 Sri Lankan, and 96 Egyptian chickens under veterinary supervision in the home countries and according to accepted animal care practices. The Brazilian chickens represented eleven ecotypes/breeds (Sedosa, Cochinchina, Ketros Oceania, Suri, Backyard Giant Indian, Shamo, Brahma, Backyard, Bantham, Brazilian Musician, and Bakiva) and were sampled from different farms outside Porto Ferreira in the State of São Paulo. A total of 92 samples were collected from three Sri Lankan ecotypes: 27, 34, and 31 samples from Gannoruwa (GN) town and Karuwalagaswewa (KR) and Uda Peradeniya (UPA) villages, respectively. A total of 95 samples were collected from one Egyptian ecotype and two breeds: 31 Baladi (Bal, ecotype) from 3 villages in Qalyubia, 31 Fayoumi (Fay) from 4 villages in Mid-Egypt, and 33 Dandarawi (Dan) from 4 villages in Southern Egypt.

# Genotyping and Quality Control

Genotyping for all samples was conducted at GeneSeek (Lincoln, NE, United States) using the Affymetrix Axiom® 600k Array. SNP (single nucleotide polymorphism) genotype data quality was assessed with PLINK 1.9 software (Chang et al., 2015), and only autosomal SNPs were retained, filtered on a call rate of >90% (--geno 0.1) and a minor allele frequency (MAF) of >0.02. In total, 523,186 SNPs were utilized for downstream analysis.
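The two per-SNP quality filters described above can be illustrated with a short sketch. It assumes genotypes coded as counts of one allele (0, 1, or 2) with `None` for a missing call; the function name `passes_qc` and the handling of values exactly at the thresholds are assumptions for illustration and are not taken from the study's pipeline.

```python
from typing import Optional, Sequence


def passes_qc(
    genotypes: Sequence[Optional[int]],
    max_missing: float = 0.1,   # corresponds to a call rate of at least 90%
    min_maf: float = 0.02,      # minor allele frequency threshold
) -> bool:
    """Return True if one SNP passes the call-rate and MAF filters."""
    called = [g for g in genotypes if g is not None]
    if not genotypes or not called:
        return False                                  # nothing genotyped at this SNP
    missing_rate = 1.0 - len(called) / len(genotypes)
    allele_freq = sum(called) / (2.0 * len(called))   # frequency of the counted allele
    maf = min(allele_freq, 1.0 - allele_freq)
    # boundary handling (<= and >=) is an assumption of this sketch
    return missing_rate <= max_missing and maf >= min_maf


if __name__ == "__main__":
    print(passes_qc([0, 1, 2, 1, 0, 1, 2, 0, 1, 1]))            # polymorphic, fully called
    print(passes_qc([0] * 10))                                  # monomorphic
    print(passes_qc([1, None, None, 1, 0, 2, 1, 1, 0, 2]))      # 20% missing calls
```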

# Population Stratification Analyses

Multi-dimensional scaling (MDS) was performed in two dimensions to examine population structure and stratification, using the clustering algorithm in PLINK v1.9 (Chang et al., 2015). Shared ancestry, with no prior knowledge of the origin of the breeds, was explored using the Admixture software (Alexander et al., 2009) for K-values ranging from 1 to 12, where K is the number of expected subpopulations. The optimum value of K = 10 was determined based on the lowest cross-validation error.
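The rule used to choose K can be written as a one-line selection over the cross-validation errors of the candidate values. In the sketch below, the function name `best_k` is hypothetical and the error values are invented purely to show the mechanics; they are not the errors obtained in this study.

```python
from typing import Dict


def best_k(cv_errors: Dict[int, float]) -> int:
    """Return the K with the lowest cross-validation error."""
    return min(cv_errors, key=cv_errors.get)


if __name__ == "__main__":
    # hypothetical cross-validation errors for K = 1..12 (illustrative values only)
    example_errors = {
        1: 0.70, 2: 0.66, 3: 0.63, 4: 0.61, 5: 0.60, 6: 0.595,
        7: 0.592, 8: 0.590, 9: 0.589, 10: 0.588, 11: 0.5885, 12: 0.589,
    }
    print(best_k(example_errors))
```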

# Fst Analyses

The Fst statistic is a widely used approach and was applied to determine genetic differentiation between populations (Barreiro et al., 2008; Bonhomme et al., 2010; Fariello et al., 2013). Three pairwise comparisons were performed in PLINK v1.9 (Purcell et al., 2007), namely Brazil vs. Egypt, Sri Lanka vs. Egypt, and Brazil vs. Sri Lanka ecotypes, to identify genomic regions of elevated differentiation using an overlapping sliding window approach. The populations were designated as case or control based on the hypothesized proxy climatic phenotype of tropical (Brazil and Sri Lanka) vs. arid (Egypt) conditions. For each comparison, the mean Fst (mFst) value was calculated in 100 kb sliding windows with a step size of 50 kb (50% overlap) using an in-house script (Karlsson et al., 2007). Genomic regions with the highest peaks, those in the top 0.2% of the empirical distribution of mFst values, were considered for downstream analyses.
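The windowing step can be sketched as follows. This is not the in-house script used in the study; it is a minimal illustration, under the assumption that per-SNP Fst values for one chromosome are available as (position in bp, Fst) pairs. The function names `window_mean_fst` and `top_windows`, the treatment of windows as half-open intervals, and the rounding used for the top 0.2% cut-off are all assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def window_mean_fst(
    snps: List[Tuple[int, float]], window: int = 100_000, step: int = 50_000
) -> Dict[int, float]:
    """Mean Fst per overlapping window [start, start + window) on one chromosome."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for pos, fst in snps:
        k_hi = pos // step                              # last window starting at or before pos
        k_lo = max(0, (pos - window) // step + 1)       # first window still covering pos
        for k in range(k_lo, k_hi + 1):
            sums[k * step] += fst
            counts[k * step] += 1
    return {start: sums[start] / counts[start] for start in sums}


def top_windows(means: Dict[int, float], fraction: float = 0.002) -> List[Tuple[int, float]]:
    """Windows in the top `fraction` of the empirical mFst distribution (at least one)."""
    n_keep = max(1, math.ceil(len(means) * fraction))
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n_keep]


if __name__ == "__main__":
    toy = [(10_000, 0.1), (60_000, 0.3), (120_000, 0.5)]   # invented values
    mfst = window_mean_fst(toy)
    print(mfst)
    print(top_windows(mfst))
```

With a 50 kb step and 100 kb windows, every SNP contributes to at most two windows, which is what produces the 50% overlap described in the text.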

# HapFLK Analyses

The hapFLK statistic accounts for varying effective population sizes and the haplotype structure of the populations using a multipoint linkage disequilibrium model (Scheet and Stephens, 2006; Bonhomme et al., 2010; Fariello et al., 2013). This approach was used to identify possible regions under selection across chicken breeds/ecotypes within each country. It requires estimation of a neighbor-joining tree and a kinship matrix based on the matrix of Reynolds' genetic distances between ecotypes/breeds (Bonhomme et al., 2010). A phylogenetic tree was constructed among the populations from the three countries: Sri Lanka (KR, UPA, and GN), Brazil (Sedosa, Cochinchina, Ketros Oceania, Suri, Backyard Giant Indian, Shamo, Brahma, Backyard, Bantham, Brazilian Musician, Bakiva), and Egypt [Baladi (Bal), Fayoumi (Fay), and Dandarawi (Dan)]. To identify regions under selection, analyses were performed separately across breeds/ecotypes within each climatic region (country). The number of haplotype clusters per chromosome was determined in fastPHASE using cross-validation-based estimation and was set at 15 (Scheet and Stephens, 2006). The hapFLK values were generated for each SNP, and P-values were computed using a chi-square distribution with a Python script provided on the hapFLK webpage<sup>1</sup>. A q-value threshold of 0.05 was applied to limit the number of false positives.

FIGURE 2 | The admixture plot showing mixed ancestry among individuals and populations. The Brazilian breeds/ecotypes, from left to right: Shamo, Brahma, Cochinchina, Bakiva, Sedosa, Bantham, Suri, Brazilian Musician, Ketros Oceania, Backyard Giant Indian, and Backyard.

<sup>1</sup>https://forge-dga.jouy.inra.fr/documents/588
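The false-positive control applied to the hapFLK P-values can be illustrated with a false discovery rate adjustment. The text does not state which procedure was used to obtain q-values, so the sketch below uses Benjamini-Hochberg adjusted P-values as a stand-in; the function name `bh_adjust` and the example P-values are hypothetical.

```python
from typing import List


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices by ascending P-value
    adjusted = [0.0] * m
    running_min = 1.0
    # walk from the largest P-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    example_p = [0.01, 0.04, 0.03, 0.5]                  # invented values
    adj = bh_adjust(example_p)
    significant = [q <= 0.05 for q in adj]               # threshold of 0.05 as in the text
    print(adj, significant)
```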


# Gene Annotation

Gene annotation of the identified regions under possible selection was completed using NCBI's Genome Data Viewer<sup>2</sup> on the chicken genome version Gallus gallus 5.

# RESULTS

# Population Stratification

The MDS plot in **Figure 1** shows distinct separation among ecotypes from the three countries, with the Brazilian and Sri Lankan ecotypes separated from the Egyptian ecotypes. The Brazilian breeds Cochinchina and Brahma (black circled) and Sedosa (red circled) are separated from the rest of the Brazilian breeds/ecotypes but lie closer to the Sri Lankan ecotypes. The admixture analysis based on the SNP genotype calls showed evidence of shared ancestry among breeds/ecotypes within each country and limited shared ancestry across countries (**Figure 2**). Although the Brazilian breeds/ecotypes were sampled from one location, the admixture results revealed limited crossover among breeds/ecotypes. The phylogenetic tree based on Reynolds' distances, built with all the SNPs that passed quality control, is shown in **Figure 3**. Here, the Sri Lankan ecotypes were separated from the Egyptian breeds, and some Brazilian breeds/ecotypes grouped in sub-trees, consistent with the MDS plot. The Brazilian breeds Cochinchina and Brahma, which are historically known to originate from Asia, are grouped in one sub-tree with the Sri Lankan ecotypes.

# Fst Analyses

The Fst analyses for the comparisons of Brazil or Sri Lanka vs. Egypt generally showed the strongest peaks on chromosomes 2, 3, and 8 (mFst > 0.28) (**Figure 4**). Two regions were detected only in the Brazil vs. Egypt comparison, on chromosomes 2 (71.85–71.95 Mb) and 8 (10.45–10.55 Mb), which contained the MicroRNA 6545 and TRMT1L (tRNA methyltransferase 1 like) genes, respectively. For the Sri Lanka vs. Egypt comparison, a region on chromosome 3 (64.65–64.75 Mb) was detected and contained the HS3ST5 gene. There were also regions common to the two comparisons of Brazil or Sri Lanka vs. Egypt. Three common regions were identified on chromosome 2 (25.25–25.35 Mb; 25.35–25.45 Mb; and 26.15–26.25 Mb), with 38, 40, and 45 SNPs, respectively. Chromosomes 3 and 8 each had one common region, at 111.25–111.35 Mb and 650–750 kb, with 4 and 44 SNPs, respectively. The Brazil vs. Sri Lanka comparison generally had lower mFst values.

<sup>2</sup>https://www.ncbi.nlm.nih.gov/genome/gdv/

# Genes Under Putative Selection Within Egyptian, Sri Lankan, and Brazilian Populations

The hapFLK statistic is an extension of FLK that accounts for haplotype information and hierarchical population structure (Fariello et al., 2013; Servin et al., 2013) and greatly improves the power to detect selection signatures that may be occurring across several populations. HapFLK analyses revealed significant, unique selection signals within the Sri Lankan, Egyptian, and Brazilian chicken populations. Eight significant regions on chromosomes 1 (1.71–2.72 Mb; 43.05–46.79 Mb), 2 (38.74–38.96 Mb), 3 (102.39–103.09 Mb), 4 (71.24–71.34 Mb), 5 (28.61–29.14 Mb), 10 (14.06–14.09 Mb), and 11 (18.79–20.20 Mb) were detected as strong selection signatures across the Egyptian breeds (**Figure 5A**). Multiple genes were identified within the regions under selection (**Tables 1**, **2**), a majority of which, such as Suppressor of cytokine signaling 2 (SOCS2), Eomesodermin (EOMES), and Nuclear factor of activated T-cells 5 (NFAT5), are involved in the immune system; to date, however, there are no annotated genes within the regions on chromosomes 4 and 10. Two regions with strong selection signals were detected on chromosomes 1 (34.44–34.53 Mb) and 4 (61.18–62.15 Mb) across the Sri Lankan chicken ecotypes (**Figure 5B**). One gene was identified within the chromosome 1 region, while 18 genes, including genes involved in the immune system such as Toll like receptor 3 (TLR3) and Nuclear factor kappa B subunit 1 (NFKB1), were identified within the chromosome 4 region (**Tables 3**, **4**). In addition to immune response genes, hapFLK analyses revealed genes associated with production traits in the regions under selection across the Egyptian and Sri Lankan chicken populations. Genes such as SNRPF, MRPL42, and ACSF3 on chromosomes 1 and 11 (**Table 2**) were identified across the Egyptian populations, whilst MTNR1A and CYP4V2 on chromosome 4 (**Table 4**) were identified across the Sri Lankan populations.

There were no strong selection signals across the eleven Brazilian breeds/ecotypes, but two regions with strong signals were detected on chromosomes 1 and 14 across the two Brazilian breeds with Asian ancestry, Cochinchina and Brahma (**Figure 5C**). Three genes were identified within the selection signature region on chromosome 1, and there were no annotated genes within the chromosome 14 region (**Tables 5**, **6**). No selection signals were detected across the remaining nine Brazilian breeds/ecotypes (results not shown). None of the selection signature regions detected by hapFLK in any of the three countries (Egypt, Sri Lanka, and Brazil) overlapped with those from the Fst analyses.

TABLE 3 | Putative selection signatures identified across Sri Lanka ecotypes in the hapFLK analysis.

TABLE 4 | List of genes in the identified putative selection signatures among Sri Lanka ecotypes.

# DISCUSSION

The admixture of populations in the three countries indicates mixed genetic backgrounds of the chickens (**Figure 2**). The overlap across ecotypes/breeds within individual countries could be due to unrestricted inter-mating among chickens of different genetic backgrounds, resulting in chickens with ancestors from different groups that eventually contribute to the shared ancestry. Another factor that might contribute to the admixture within and across the respective countries is the movement of birds through trading. Although the Brazilian chickens were sampled from one location, Porto Ferreira, there was surprisingly more admixture and more discrete breeds in the Brazilian population than in the Egyptian and Sri Lankan populations. Moreover, the Brazilian breeds/ecotypes clustered closer to the Sri Lankan ecotypes (**Figures 1**, **3**). This is, however, not surprising because chickens in Brazil are not indigenous and are reported to have been imported from Asia (Komiyama et al., 2004). The population tree based on Reynolds' genetic distances complements the stratification shown by the MDS plot and the admixture of the populations. The Egyptian breeds are within their own sub-tree and appear to have some shared ancestry with some Asian breeds, as revealed by the admixture plot. The indication of shared ancestry is in agreement with previous findings that Egyptian local/native breeds/ecotypes originated from Asia or the Indian sub-continent (Elferink et al., 2012; Elkhaiat et al., 2014; Eltanany and Hemeda, 2016).

TABLE 5 | Putative selection signatures identified across Cochinchina and Brahma Brazilian breeds in the hapFLK analysis.

The MDS results allowed the analyses to be performed on a case/control basis, with the environmental/climatic conditions of the three countries as the proxy phenotype, so that the results can be viewed as regions of the genome under possible selection for environmental tolerance/adaptation in the local chicken populations of each country. The Fst results indicated possible selection signatures on chromosomes 2 and 8 for the Brazil vs. Egypt comparison and on chromosome 3 for the Sri Lanka vs. Egypt comparison, as well as common differences between Arid (Egypt) and Tropical (Sri Lanka and Brazil) populations. The two genes detected in regions for the Brazil vs. Egypt comparison, TRMT1L and MicroRNA 6545, could suggest chicken adaptation and survival in hot conditions. TRMT1L-catalyzed tRNA modification is required for redox homeostasis to ensure proper cellular proliferation and oxidative stress survival. Cells that are deficient in TRMT1L exhibit decreased proliferation rates, altered protein synthesis, and perturbed redox homeostasis, including hypersensitivity to oxidizing agents (Dewe et al., 2017). The second gene, MicroRNA 6545, is reported to be involved in reproductive processes and embryogenesis, including the TGF-β and Wnt pathways that specify the neural fate of the blastodermal cells (Shao et al., 2012). For the Sri Lanka vs. Egypt comparison, a gene that could be important in immune response, HS3ST5, was detected. HS3ST5 is involved in immunity and defense molecular functions (Szauter et al., 2011). Although we did not detect annotated genes in the regions common to the two comparisons of Brazil or Sri Lanka vs. Egypt, these regions could represent recent, important selection signatures that enable chicken survival in either tropical or arid conditions. The genomic regions shared by chickens from Sri Lanka and Brazil when compared to Egypt could indicate that chickens from Sri Lanka and those from Porto Ferreira (Brazil) have been exposed to the same environmental conditions and may have evolved similar selection signatures for adaptation and survival.

The identification of genomic regions that may be under both artificial and natural selection could help identify possible selection signatures across breeds/ecotypes within a country. Several genomic regions with putative selection were identified in the current study using the hapFLK method across the Egyptian breeds and the Sri Lankan ecotypes. The hapFLK analyses identified several regions under selection on chromosomes 1, 2, 3, 4, 5, 10, and 11 across the three Egyptian breeds, Fayoumi, Dandarawi, and Baladi (**Figure 5A** and **Table 1**). Some genes detected in the genomic regions under selection across the Egyptian chickens are reported to be involved in the modulation of growth (Bolamperti et al., 2013) and the immune system (Szczesny et al., 2014; Zhang et al., 2018), and others could possibly be important in thermal/heat tolerance. These genes could be relevant to the adaptation of the Egyptian chickens to hot, dry, arid conditions. One notable gene in a region under selection on chromosome 2 is SOCS2. Suppressor of cytokine signaling (SOCS) proteins generally play vital roles in the feedback inhibition of cytokine receptor signaling (Larsen and Röpke, 2002). The SOCS2 gene encodes a multifunctional protein that is involved in growth hormone signaling through cytokine-dependent pathways and the JAK/STAT pathway (Metcalf et al., 2000; Rico-Bautista et al., 2006). This gene is important in the regulation of several biological processes that control growth, development, immune function, and homeostasis (Rico-Bautista et al., 2006) and has been hypothesized to have an effect on breast meat yield during heat stress (Van Goor et al., 2015). The region on chromosome 2 under selection contains two genes, and one of them, EOMES, is also important in the immune system.
EOMES is one of the two T-box proteins expressed in the immune system that are responsible for driving the differentiation and function of cytotoxic innate lymphocytes such as natural killer (NK) cells. NK cells are endowed with cytotoxic properties and contribute to the early defense against pathogens and the immunosurveillance of tumors (Zhang et al., 2018). The region under selection on chromosome 11 contains 66 annotated genes, some of which are involved in immune response. One of these genes, NFAT5, is required for TLR-induced responses to pathogens, and previous studies have shown that TLR-induced, NFAT5-regulated genes such as TNF-α play a vital role in inflammatory responses (Buxadé et al., 2012; Tellechea et al., 2017). We have reported only a few genes and their associated roles/functions for the regions under selection across the Egyptian breeds. Most of the genes in these regions on the different chromosomes (1, 2, 3, 5, and 11) could play vital roles in the adaptation mechanisms that enable the survival of the Egyptian chicken breeds in hot arid climatic conditions. Although we did not detect any annotated genes in the regions under selection on chromosomes 4 and 10, it is important to note that these could be recent selection signatures of adaptation of the Egyptian breeds to their climate. In parallel studies, it has been shown that domesticated animals often develop physiological and genetic adaptations when confronted with harsh or new environments such as hypoxia (Ramirez et al., 2007; Storz et al., 2010). A study conducted on Tibetan chickens, which primarily live at high altitudes of between 2,200 and 4,100 m, revealed several candidate genes involved in the calcium signaling pathway that possibly enable them to adapt to hypoxia (Wang et al., 2015).

There were two regions under selection, on chromosomes 1 and 4, across the Sri Lankan ecotypes. As in the Egyptian breeds, the region under selection on chromosome 4 of the Sri Lankan ecotypes contains several genes, and two of them, Toll like receptor 3 (TLR3) and Nuclear factor kappa B subunit 1 (NFKB1), are important in the immune system. A TLR signaling pathway is an innate immune defense mechanism against pathogen attack in both vertebrates and invertebrates. TLR3 in chickens is orthologous to its mammalian counterpart (Kannaki et al., 2010), and together with TLR7 it is known to recognize RNA virus-encoded pathogen-associated molecular patterns (PAMPs) (Akira, 2001). TLR3 is able to recognize and bind to double-stranded RNA intermediates that are produced during viral replication (Iqbal et al., 2005), and the end product of its signaling pathway is the production of the antiviral type I interferons (IFN)-α and -β (Guillot et al., 2005). Another important gene, NFKB1, could also be of importance to the survival of Sri Lankan chicken ecotypes in the hot, humid tropical climate of Sri Lanka. NFKB transcription factors are important in immunity and inflammation (Hayden and Ghosh, 2008).
TLRs are activated by binding to PAMPs, which in turn initiates MAPK- or nuclear factor kappa B (NFKB)-dependent cascades that lead to a proinflammatory response, resulting in the secretion of antibacterial substances such as β-defensins and cytokines (Kogut et al., 2006). NFKB proteins are also involved in a wide range of processes, including cell development, growth, survival, and proliferation, and in many pathological conditions (Morgan and Liu, 2011). Sri Lanka has hot, humid climatic conditions that, besides favoring infection of livestock by pathogens, also present challenges such as heat stress, especially during drought, to which the animals must adapt. Challenges like heat stress result in the production of ROS, which are generated by a variety of cellular processes. NFKB-regulated genes are vital in regulating the amount of ROS in cells (Morgan and Liu, 2011). ROS have several stimulatory and inhibitory roles in NFKB signaling.

Chicken survival in challenging environments involves different adaptation mechanisms, among which is the ability to perform under harsh conditions. The current study indicated selection signatures containing genes associated with production traits in both the Egyptian and Sri Lankan populations. For the Egyptian populations, we identified MRPL42, a candidate gene associated with breast yield in heat-stressed chickens. The MRPL42 gene is vital in DNA synthesis, transcription, RNA processing, and translation (Van Goor et al., 2015). Another gene, ACSF3, belonging to the ACSF gene family, is reported to be correlated with egg-laying performance in chickens (Tian et al., 2018). For the Sri Lankan chicken populations, the CYP4V2 gene, associated with control of fat deposition in chickens, was identified in the region under selection on chromosome 4 (Claire D'Andre et al., 2013). Because local chickens are mostly free range and exposed to hot, highly humid conditions in developing countries such as Sri Lanka, controlling fat deposition could be a vital adaptation mechanism.

There were no regions of selection across all eleven Brazilian breeds/ecotypes, but we detected possible regions of selection on chromosomes 1 and 14 across two breeds known to have Asian ancestry, Cochinchina and Brahma. However, these regions did not overlap with the regions under selection across the Asian Sri Lankan ecotypes. This could be because chickens were introduced to Brazil from Asia a few hundred years ago, and possibly because of the differences in climatic conditions between Porto Ferreira, São Paulo, and Sri Lanka. The chicken genomes from these locations could have been modified to enable adaptation and survival in their respective changing climates.

There is clear evidence that chickens, particularly the domestic fowl, have been kept in Egypt for thousands of years, dating back to 1840 B.C. (Coltherd, 1966). For traditional breeds such as Fayoumi and Dandarawi, studies based on mitochondrial DNA (mtDNA) sequence variation have shown that these Egyptian indigenous breeds could have roots in the Indian subcontinent and southwest Asia (Elkhaiat et al., 2014; Eltanany and Hemeda, 2016), because Egypt was an entry route of Indian chickens to Africa. Although Egyptian chicken breeds might have an Asian origin, none of the regions under selection was shared between the Egyptian breeds and the Sri Lankan ecotypes. Asian chicken breeds could have been imported to Egypt thousands of years ago, and because of the difference in climatic conditions (hot and arid in Egypt, hot and humid in Sri Lanka), chickens in the two climates developed different adaptation mechanisms to survive.

The two methods, Fst and hapFLK, did not detect any overlapping regions, and we noted that hapFLK detected more selection signals, with several important genes, than Fst. The hapFLK approach has been reported in previous simulation studies to greatly increase the power to detect selection signatures occurring across several populations (Bonhomme et al., 2010; Fariello et al., 2013). As a result, we were able to detect several regions under selection within the Egyptian and Sri Lankan populations with hapFLK that were not detected by the Fst analyses. HapFLK considers the hierarchical structure of the populations, and this improves the detection power for soft sweeps.


# CONCLUSION

There is evidence of stratification and admixture, particularly among breeds/ecotypes within each country's populations. The Fst differences between the Sri Lankan and Egyptian populations could indicate differences in chicken adaptation due to the different climatic conditions in the two countries. The low Fst values between Sri Lanka and Brazil could be due to shared ancestry of Asian origin a few hundred years ago rather than to climate. This might change with continuing changes in climatic conditions, as local Brazilian chickens from Porto Ferreira in the São Paulo region might develop genome modifications to adapt to the climate. For the hapFLK analyses, there were no common regions under selection among breeds/ecotypes across the populations from the three countries. This could indicate climate-specific selection signals that have enabled these chickens to develop adaptation mechanisms in response to their respective climatic conditions. In that regard, Sri Lankan and Egyptian chicken ecotypes/breeds have developed mechanisms to survive in their hot humid and hot dry climates, respectively.

# DATA AVAILABILITY STATEMENT

The data can be accessed at https://www.animalgenome.org/repository/pub/ISU2018.0416/ in the NRSP-8 Bioinformatics data repository.

# AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

# FUNDING

Funding for this research was provided by the Ensminger Endowment, State of Iowa and Hatch funding and by USDA-NIFA-AFRI Climate Change Award #2011-67003-30228.

# ACKNOWLEDGMENTS

We thank the Egyptian Breeders Association for providing the Egyptian chicken samples and co-authors CD, MPR, and CS for samples from Sri Lanka and Brazil, respectively.






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Walugembe, Bertolini, Dematawewa, Reis, Elbeltagy, Schmidt, Lamont and Rothschild. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Livestock Genomics for Developing Countries – African Examples in Practice

Karen Marshall<sup>1,2</sup>\*, John P. Gibson<sup>3</sup>, Okeyo Mwai<sup>1</sup>, Joram M. Mwacharo<sup>4</sup>, Aynalem Haile<sup>4</sup>, Tesfaye Getachew<sup>4</sup>, Raphael Mrode<sup>1,5</sup> and Stephen J. Kemp<sup>1,2</sup>

<sup>1</sup> Livestock Genetics Program, International Livestock Research Institute, Nairobi, Kenya, <sup>2</sup> Centre for Tropical Livestock Genetics and Health, Nairobi, Kenya, <sup>3</sup> School of Environmental and Rural Science, University of New England, Armidale, NSW, Australia, <sup>4</sup> Small Ruminant Breeding and Genomics Group, International Center for Agricultural Research in the Dry Areas, Addis Ababa, Ethiopia, <sup>5</sup> Scotland's Rural College, Edinburgh, United Kingdom

African livestock breeds are numerous and diverse, and typically well adapted to the harsh environmental conditions under which they perform. They have been used over centuries to provide livelihoods as well as food and nutritional security. However, African livestock systems are dynamic, with many small- and medium-scale systems transforming, to varying degrees, to become more profitable. In these systems the women and men livestock keepers are often seeking new livestock breeds or genotypes – typically those that increase household income through having enhanced productivity in comparison to traditional breeds while maintaining adaptedness. In recent years genomic approaches have started to be utilized in the identification and development of such breeds, and in this article we describe a number of examples to this end from sub-Saharan Africa. These comprise case studies on: (a) dairy cattle in Kenya and Senegal, as well as sheep in Ethiopia, where genomic approaches aided the identification of the most appropriate breed-type for the local production systems; (b) a cross-breeding program for dairy cattle in East Africa incorporating genomic selection as well as other applications of genomics; (c) ongoing work toward creating a new cattle breed for East Africa that is both productive and resistant to trypanosomiasis; and (d) the use of African cattle as resource populations to identify genomic variants of economic or ecological significance, including a specific case where the discovery data was from a community-based breeding program for small ruminants in Ethiopia. Lessons learnt from the various case studies are highlighted, and the concluding section of the paper gives recommendations for African livestock systems to increasingly capitalize on genomic technologies.

Keywords: livestock, Africa, genomics, smallhold, SNP, breeding program, genetic improvement strategy

### Edited by:

Tad Stewart Sonstegard, Recombinetics, United States

### Reviewed by:

Filippo Biscarini, National Research Council (CNR), Italy

Gábor Mészáros, University of Natural Resources and Life Sciences, Vienna, Austria

\*Correspondence: Karen Marshall kmarshall@cgiar.org

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 11 July 2018 Accepted: 19 March 2019 Published: 24 April 2019

### Citation:

Marshall K, Gibson JP, Mwai O, Mwacharo JM, Haile A, Getachew T, Mrode R and Kemp SJ (2019) Livestock Genomics for Developing Countries – African Examples in Practice. Front. Genet. 10:297. doi: 10.3389/fgene.2019.00297

# INTRODUCTION

In developing countries, the livestock sector plays a key role in the provision of livelihoods as well as food and nutrition security. The majority of livestock are kept by the rural poor, where they serve multiple functions. These include: savings and insurance, food security (meat and milk), income, livelihood diversification and thus risk reduction (such as in mixed crop-livestock systems), inputs to crop production (draft power, manure as fertilizer), transportation, various uses of hides and skin (such as for housing), allowing households to benefit from common-property resources (such as communal grazing areas), and fulfilling social obligations (such as being used in special ceremonies or for dowry), amongst others (Herrero et al., 2013; Marshall, 2014; Marshall et al., 2014; ILRI, 2019). The livestock sector also benefits other actors in the associated value chains, such as input providers, traders, processors and retailers, through the provision of employment and income. Critically, animal source foods – consumed in even small amounts – play a key role toward food and nutritional security of the poor, as they provide quality protein and micronutrients essential for normal development and good health (Grace et al., 2018; Smith et al., 2013).

The demand for animal source foods is rapidly increasing in developing countries: for example, in low income countries the demand in 2030 for beef, milk, poultry and eggs is predicted to be 124, 136, 301, and 208% higher, respectively, than in 2000 (FAO, 2011). This demand increase has been largely attributed to population growth, income growth and increasing urbanization (Delgado, 2005; Thornton, 2010). To ensure this demand is met, large increases in livestock production within developing countries will be required (Delgado et al., 2001; Steinfeld et al., 2006; Thornton, 2010). Achieving this in a sustainable manner is expected to be challenging, and increasing livestock productivity (output per unit of input) is recognized as a key component.

Increasing livestock productivity in developing countries generally requires simultaneous interventions in the areas of animal feed, health and genetics. In many livestock development programs these interventions take the form of capacity building of the livestock keepers and other value chain actors, ensuring the availability and accessibility of inputs, provision of new technologies or customization of existing technologies, support to private and/or public sector involvement, and advocacy for supportive policies. The provision of incentives for increased productivity can also be important, for example in some smallholder and pastoral sectors where livestock are kept primarily for savings and insurance purposes, so that maintaining a livestock asset base is more important to the household than improving livestock productivity. Such incentives could be provided by, for example, increasing livestock income through facilitating access to strong and stable markets, or ensuring that intra-household benefit from the livestock enterprise is equitable. In addition, attention to other issues that can be affected by increased livestock productivity, such as equality, food safety and environmental sustainability, is also commonly part of livestock development programs. As livestock systems within developing countries are both diverse and dynamic, intervention packages typically need to be customized for each livestock sector.

To date, the majority of African livestock systems have not benefited from livestock technologies to the extent that developed countries have, including in relation to genetic improvement strategies (Marshall, 2014). Currently, there are few examples of sustainable breeding programs, and the use of reproductive technologies, such as artificial insemination, is limited to specific livestock sectors. Contributing factors include: the lack of public and private sector investment; absent or weak supportive policies and institutional arrangements; the heterogeneity of livestock systems, farm-scales, management practices, and needs and preferences of livestock keepers; poor infrastructure; and limited capacity in the field of animal breeding and reproduction, amongst others (Kosgey and Okeyo, 2007; Rege et al., 2011; Marshall, 2014). The potential of genetic improvement to increase livestock productivity is, however, increasingly being recognized by decision makers, with many African countries now explicitly including genetic improvement within their national livestock development plans.

The types of structured genetic improvement programs being implemented in Africa vary by system. These include: breed-substitution with other African breeds, breeds from other tropical countries such as India and Brazil, as well as breeds from elsewhere; cross-breeding, most commonly where a highly adapted but less productive indigenous breed is crossed with a poorly adapted but highly productive exotic breed; and, less commonly, within-breed improvement (FAO, 2015). Increasingly, explicit attention is being paid to the development of working models to ensure sustainability of these programs, as it has been well demonstrated that the models implemented in developed countries cannot be directly applied. The application of genomics – ranging from the determination of breed composition of animals in the absence of pedigree data for in situ comparison studies, to the application of genomic selection in breed improvement programs – is just beginning to emerge, often overcoming a constraint that would otherwise exist, such as lack of recorded pedigree.

In this article we describe several examples of the use of genomics in sub-Saharan African livestock systems, draw lessons learnt from these, and give recommendations for African livestock systems to increasingly capitalize on genomic technologies. The paper proceeds as follows. The subsequent section 'case studies' describes the case studies grouped by application, namely the use of genomic information to: (1) identify the most appropriate breed or cross-breed type for different livestock production systems; (2) enable or enhance breeding programs; (3) create new breed-types; and (4) discover genetic variants of economic and ecological significance. A discussion follows, first addressing current developments on livestock genomics in Africa, drawing on the case studies, and secondly describing the future outlook for livestock genomics in Africa.

# CASE STUDIES

# Use of Genomic Approaches to Aid Identification of the Most Appropriate Breed or Cross-Breed for Different Livestock Production Systems

Identification of the most appropriate livestock breed or crossbreed type in a particular livestock production system is typically the starting point of a genetic improvement strategy. In African livestock systems that are undergoing intensification this is particularly relevant (Marshall, 2014). To date there are few studies to this end due to lack of investment in this area plus, in the case of cross-breeds, the inability to assign breed-type to animals in the field, which is necessary for in situ comparisons (see Marshall, 2014 for a review). The latter stems from the lack of pedigree information in most African livestock systems and the near impossibility of assigning breed-type based on phenotype, particularly in systems where unstructured crossbreeding is prevalent. The use of genomic approaches to assign breed composition to individual animals can overcome this constraint (Kuehn et al., 2011; Ojango et al., 2014). Here we discuss examples for dairy cattle systems in Senegal and Kenya, and sheep systems in Ethiopia.

### Kenya Dairy Cattle

In Kenya the large majority of milk is produced by smallholder farmers who typically milk 1–5 cows. Smallholders mostly keep crosses between indigenous cattle and exotic dairy breeds such as Holstein, Friesian, Ayrshire, and Jersey. There is no systematic breeding of crossbred cattle and farmers rarely keep pedigree or performance records. Most mating events involve local crossbred or indigenous bulls, where the crossbred bulls are of unknown breed composition. Farmer production environments vary greatly and this translates into a wide range of production output per cow, from less than 1,000 l milk per annum to more than 5,000 l, with the large majority likely in the range 1,000 to 3,000 l milk. There is no information about which breed composition works best for different production environments, other than the general observation that high grade exotics (cows with a very high proportion of exotic dairy breed composition) can do well in very good environments. The likelihood is that the intermediate grades do better in poorer production environments but, given the lack of evidence, most advice provided to farmers is that they should upgrade to higher grade exotic animals by using AI.

The Dairy Genetics East Africa project set out to determine what grade of crossbred (i.e., what percentage of exotic dairy breed composition in a crossbred cow) worked best for different production environments. The project worked with farmers to collect performance data, including on milk yields, reproduction events, and disease incidence, for 18–24 months. Further, the recorded animals were genotyped using the Illumina bovine high density (HD, 780 k) single nucleotide polymorphism (SNP) assay, and the HD SNP data were used to perform admixture analyses, using the ADMIXTURE software (Alexander and Novembre, 2009), to generate an estimate of the ancestral breed composition of each animal. This allowed, for the first time, accurate information on breed composition to be combined with in situ performance data to determine what breed composition worked best in different smallholder environments. By comparing farmer and enumerator (field staff) assessments of breed composition, based mostly on phenotypic appearance and farmer recollections of cows' origins, with the admixture determinations of actual breed composition, it was confirmed that phenotype-based assessments were very poor predictors of actual breed composition (R<sup>2</sup> = 0.16). The results showed that intermediate to low grade (<50% exotic breed ancestry) cows performed best in the majority of the smallholder farms, while animals with higher grades (>50%) only performed better than lower grades in the best environments (those supporting >1800 l/cow/annum: Ojango et al., 2014).
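To make the idea of assigning breed composition from SNP genotypes concrete, the following is a minimal sketch. It is not the ADMIXTURE algorithm used in the project: it assumes a simplified two-way cross in which the allele frequencies of the exotic and indigenous ancestral populations are already known, and it finds the exotic fraction that maximizes a binomial likelihood by grid search. The function name, the grid resolution, and the toy frequencies are illustrative assumptions only.

```python
import math

def estimate_exotic_fraction(genotypes, p_exotic, p_indigenous, steps=100):
    """Grid-search maximum-likelihood estimate of the exotic ancestry fraction.

    genotypes    : allele counts (0, 1 or 2) per SNP for one animal
    p_exotic     : reference allele frequency per SNP in the exotic population
    p_indigenous : reference allele frequency per SNP in the indigenous population
    Assumes independent SNPs and exactly two ancestral populations.
    """
    best_q, best_loglik = 0.0, -math.inf
    for i in range(steps + 1):
        q = i / steps  # candidate exotic fraction
        loglik = 0.0
        for g, p_exo, p_ind in zip(genotypes, p_exotic, p_indigenous):
            # Expected allele frequency in an animal with exotic fraction q
            p = q * p_exo + (1 - q) * p_ind
            p = min(max(p, 1e-6), 1 - 1e-6)  # guard against log(0)
            # Binomial log-likelihood (constant term omitted)
            loglik += g * math.log(p) + (2 - g) * math.log(1 - p)
        if loglik > best_loglik:
            best_q, best_loglik = q, loglik
    return best_q

# Toy reference frequencies (hypothetical, strongly differentiated SNPs)
P_EXOTIC = [0.9, 0.9, 0.1, 0.1]
P_INDIGENOUS = [0.1, 0.1, 0.9, 0.9]

first_cross = estimate_exotic_fraction([1, 1, 1, 1], P_EXOTIC, P_INDIGENOUS)
exotic_like = estimate_exotic_fraction([2, 2, 0, 0], P_EXOTIC, P_INDIGENOUS)
```

In practice tools such as ADMIXTURE estimate the ancestral frequencies and the individual proportions jointly from many thousands of SNPs, so this sketch only illustrates the underlying likelihood idea.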

A surprising result of this study was that the average production level (approximately 1,500 l/cow/annum) of the cows in the study, which were randomly sampled based on location in order to achieve a representative sample, was much lower than the 3,000–5,000 l/cow/annum typically assumed in most development projects and extension programs. The highest yielding cow in the study only achieved around 2,400 l/cow/annum. This meant that it could not be inferred at what production level high grade exotic crossbreds or purebred exotics would become the best performing breed type. The results also mean that most development and extension programs are making unrealistic assumptions about smallholder production environments and are likely, therefore, to be offering suboptimal or unrealistic interventions and advice. This is mentioned here because studies such as Dairy Genetics East Africa have multiple objectives in studying what are highly complex systems. As such, genomics is a powerful tool for better understanding system function, and it should be incorporated into multidisciplinary studies rather than used to tackle isolated (genetic) issues. In the case of Dairy Genetics East Africa, the results enabled by genomic testing provided much of the baseline information that demonstrated the value and feasibility of establishing long-term genetic improvement programs, beyond the provision of the most appropriate breed cross. This led to the establishment of the African Dairy Genetic Gains (ADGG) program, which appears as another case study later in this review.

### Senegal Dairy Cattle

In Senegal, dairy production is mainly from cattle kept in low input systems, with domestic production unable to meet national demand. To increase national dairy production the Senegalese government has implemented a number of initiatives, including the introduction of exotic high-yielding dairy breeds through public artificial insemination campaigns. However, at the time of these campaigns there was no evidence base for Senegalese cattle keepers and other stakeholders to make informed decisions on which dairy breed or crossbreed to use. This knowledge gap was addressed by a project termed "Senegal Dairy Genetics" which aimed to identify the most-appropriate dairy cattle breed/cross-breed for Senegalese production systems.

Project data was obtained by monitoring 220 rural or peri-urban dairy cattle keeping households, with collectively more than 3,200 cattle, over an almost 2-year period. Data collected included that on animal performance, economics of the household dairy enterprise, social issues including on gender, and dairy cattle feed and milk safety, amongst others. The aim was to collect a range of data such that different household dairy systems (defined as a combination of breed-type kept and level of animal management) could be compared from multiple perspectives including milk-yields, household profit and cost:benefit ratio, and food safety (Marshall et al., 2016b, 2017; Salmon G. et al., 2018).

The main breeds and cross-breeds of cattle kept by the project households comprised pure indigenous Zebu, indigenous Zebu crossed with Guzerat, indigenous Zebu crossed with Bos taurus breeds (such as Montbéliard and Jersey) and pure (or almost pure) Bos taurus breeds. With the exception of the indigenous Zebu, the breed-type of individual animals could not be determined based on phenotype, and none of the cattle keepers kept pedigree records. Thus breed composition of a subset of the study animals, those with the most informative records, was determined using a genomic approach. Specifically, genotyping was performed using the Bovine 50 K SNP assay and admixture analysis was performed using the Bayesian Analysis of Population Structure software (Corander et al., 2008). Animals were each assigned proportions of ancient Zebu, recent Zebu, ancient Taurine and recent Taurine, and from these were assigned to breed-groups: see Tebug et al. (2016) for more details. In comparing breed-group assignment from the genomic analysis to that based on farmer-stated breed-type, there was a match in only 32% of the cases.

Following breed-composition assignment of the study animals, trade-off analysis proceeded for the various household dairy systems (Marshall et al., 2017; Salmon G. et al., 2018). Notably, it was found that cross-bred indigenous Zebu by Bos taurus dairy cattle kept under better management produced up to 7.5-fold higher milk-yields, 8-fold higher household profit, and 3-fold lower greenhouse gas emission intensity, per cow per annum, in comparison to indigenous Zebu kept under poorer management, for a typical herd size of eight animals (Marshall et al., 2016b; Salmon G.R. et al., 2018). Trade-offs to this were that the cross-bred cattle consumed more supplementary feed, some of which was aflatoxin contaminated, which can result in milk unfit for human consumption (Marshall et al., 2016a), and that as the household dairy enterprises commercialized (associated with the keeping of cross-bred dairy cattle) there was a partial shift in the control of income from milk sale from women to men (Walugembe et al., 2016).

Results of the study were shared with decision makers on dairy in Senegal, including women and men dairy cattle keepers, other value-chain actors, and policy makers, for better-informed decision making. Discussions with these stakeholders are currently underway to implement a livestock development program aimed at increasing the availability and accessibility of cross-bred animals, whilst addressing the known trade-offs. Similar to the Kenya Dairy case study above, this highlights the use of genomics in multi-objective studies.

### Ethiopia Sheep

Crossbreeding local sheep with exotic breeds, which are usually much bigger, has been common practice in many countries of Africa over the last five decades (Getachew et al., 2016). Generally, the performance and adaptability of crossbreds have varied greatly by location, management and exotic inheritance level (Getachew et al., 2013, 2016). In Ethiopia, the common approach is to upgrade local breeds by repeatedly back crossing to high level exotic sires, mainly of the Awassi and Dorper breeds. However, it is difficult for farmers and other stakeholders to make informed decisions on which level of cross (in terms of local versus exotic contribution) to aim for, due to lack of evidence to this end. This was addressed in the highlands of Ethiopia by a project aimed at associating cross-breed type with performance, as described here.

Study data was obtained from an on-going crossbreeding program being implemented in the Amhara region of the Ethiopian highlands (Gizaw and Getachew, 2009). This crossbreeding program has been ongoing since 1998 and involves crossing of the local Menz and Wollo breeds to the exotic Awassi breed, with a wide range of crossbreeds produced. Phenotype data collection on lamb growth and ewe reproduction was routine in the breeding program. However, the breed composition of the animals was unknown as pedigree had not been recorded (due to the practice of communal grazing).

Genomics helped to estimate breed proportion in the absence of pedigree recording: specifically, breed composition was assigned to individual animals using a reduced set of ancestry informative markers (AIM). The AIM were selected from Ovine SNP50K data from the Menz, Wollo, and Awassi breeds. A total of 74 SNPs that showed large differentiation between the local Menz and Wollo breeds and the Awassi breed were selected based on their F<sub>ST</sub> values. These accurately (r = 0.98) identified the breed proportion of reference samples (which comprised pure Awassi, 75% Awassi and 50% Awassi), as did sub-sets of 65, 55, and 45 SNPs selected on high or low F<sub>ST</sub> values (with correlations of 0.9996 to 0.969 between breed estimates from these subsets and the 74 SNPs; Getachew et al., 2017). The small number of AIM required is consistent with studies in human populations (Halder et al., 2008).
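The marker selection step described above can be sketched as follows. This is a simplified illustration, not the procedure of Getachew et al. (2017): the source does not state which F<sub>ST</sub> estimator was used, so a basic heterozygosity-based F<sub>ST</sub> for two equally weighted populations is assumed here, the local breeds are treated as a single pooled population, and only the most differentiated SNPs are retained. The SNP names and frequencies are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Heterozygosity-based F_ST for one biallelic SNP in two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled population
    if h_total == 0:
        return 0.0  # monomorphic SNP carries no ancestry information
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return (h_total - h_within) / h_total

def select_aims(freq_local, freq_exotic, n_markers):
    """Return the n_markers SNP names with the highest F_ST between the two populations."""
    scores = {snp: fst_two_pops(freq_local[snp], freq_exotic[snp]) for snp in freq_local}
    return sorted(scores, key=scores.get, reverse=True)[:n_markers]

# Hypothetical allele frequencies for three SNPs
local_freqs = {"snp_a": 0.1, "snp_b": 0.5, "snp_c": 0.2}
awassi_freqs = {"snp_a": 0.9, "snp_b": 0.5, "snp_c": 0.6}
panel = select_aims(local_freqs, awassi_freqs, n_markers=2)
```

A real panel design would also need to consider spacing of markers along the genome and sample-size corrections in the F<sub>ST</sub> estimate, which this sketch ignores.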

More than 700 animals, presumed to have a wide range of breed compositions, were genotyped using the selected AIMs. The breed proportion of individual animals was then determined and related to ewe productivity expressed as 8 months lamb weight per year (considered a useful combined trait comprising growth, reproduction and lamb survival). The most productive breed compositions were then identified as 37.5–50% Awassi in the first study site, and 12.5–25% Awassi in the second study site, where ewes produced (on average) 26.5 and 19.5 kg lamb, respectively, at 8 months (Getachew et al., 2017). Findings of this project were shared with various local research centers, and recommendations from the project were adopted. Accordingly, crossbreeding in the first study site is moving toward synthetic breed development, whilst cross-breeding in the second study site was discontinued due to perceived unfavorable economic benefits (i.e., high cost:benefit ratio).

The AIM approach is considered a great opportunity to estimate the level of admixture (breed proportion) in a cost-effective way. Currently, the cost per SNP is in the range of about €0.04–0.15 for low density panels, highly dependent on the method and number of samples to be genotyped at a time. It is of note that information on ram breed composition (based on visual assessment and in some cases partial pedigree) is currently used in ram marketing, and that many farmers within the study site expressed interest in paying for breed composition information. If an affordable tool (based on a low-cost SNP chip) was available for this purpose, ram sellers would be better placed to take advantage of the market opportunity for rams of known breed-type.

# Use of Genomic Approaches to Enable or Enhance Breeding Programs

In intensive livestock systems, genomic data enhances existing genetic improvement programs by increasing the accuracy of estimates of relationships among animals, and hence increasing accuracy of estimated breeding value (EBV), and in some cases also revealing functional variants which can be selected for directly using genotype data. The big immediate advantage of genomic data in Africa is to enable rapid implementation of genetic improvement where pedigree information is lacking, which is commonly the case. In such cases genomic data can be used to build a genetic relationship matrix among animals in a new recording program, so that EBV can be generated almost immediately. Where genetic relationships are based on pedigree recording, EBV cannot be generated until the next generation of animals have been born and recorded. Similarly, once phenotype and pedigree recording programs are in place, genomic data allows rapid expansion of recording to include animals with no previous pedigree information. Where genetic improvement systems are well established in Africa, genomic data potentially offers the same technical benefits as in intensive livestock systems. An additional advantage in crossbred populations is that genomic data can be used to accurately determine breed composition of individual animals and this information can be used to increase the accuracy of genetic evaluations and breed effects, in addition to being used directly to select animals of desired breed composition. In the case of pure breed populations, estimates of breed composition can also be used to ensure the purity of the breed. The case study below is an excellent example highlighting how genomics has facilitated a breeding program in an African livestock system.
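As an illustration of how a genetic relationship matrix can be built from SNP genotypes alone, the sketch below uses one widely used construction: genotypes coded 0/1/2 are centred on twice the allele frequency, and the cross-products are scaled by 2Σp(1−p). The source does not specify which construction the cited programs used, so this choice, the function name, and the use of sample allele frequencies are assumptions made for illustration.

```python
def genomic_relationship_matrix(genotypes):
    """Build a genomic relationship matrix G from 0/1/2 genotype codes.

    genotypes : list of animals, each a list of allele counts per SNP
    Allele frequencies are estimated from the genotyped sample itself.
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    # Sample allele frequency at each SNP
    freqs = [sum(animal[j] for animal in genotypes) / (2 * n_animals)
             for j in range(n_snps)]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("all markers are monomorphic; G is undefined")
    # Centre each genotype on its expected value 2p
    centred = [[animal[j] - 2 * freqs[j] for j in range(n_snps)]
               for animal in genotypes]
    # G = Z Z' / (2 * sum of p(1 - p))
    return [[sum(centred[a][j] * centred[b][j] for j in range(n_snps)) / scale
             for b in range(n_animals)]
            for a in range(n_animals)]

# Three animals, three SNPs (toy data)
G = genomic_relationship_matrix([[0, 1, 2], [1, 1, 1], [2, 1, 0]])
```

Such a matrix can stand in for the pedigree-based relationship matrix in a mixed model, which is why breeding values can be estimated for recorded animals without waiting for a generation of pedigree to accumulate.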

### East Africa Dairy Cattle

In the smallholder, crossbred dairy systems that dominate sub-Saharan milk production, the lack of performance and pedigree recording means that there are no conventional genetic evaluation systems (Kosgey and Okeyo, 2007). In addition, indiscriminate crossbreeding has been undertaken, with no clear goal in mind, leading to populations of highly varied breed composition and no information about the breed composition of individual animals. Two initiatives in East Africa, the Dairy Genetics East Africa project (described above) and the African Dairy Genetic Gains program, funded by the Bill and Melinda Gates Foundation, have explored routes to establishing relevant and sustainable genetic improvement programs by combining genotype information with the establishment of effective performance and pedigree recording.

Genotype data from high-density SNP assays can offer quick wins in smallholder systems. SNP data can be used to assign parentage where pedigree data is not available. Knowledge of breed composition of bulls allows farmers to use bulls of the breed composition they desire, and having cows with known breed composition allows farmers to determine what breed composition of bulls is required to produce progeny of the desired breed type. Further, knowing the breed composition of cows and bulls allows purchasers of animals to obtain animals with the breed composition required for their production environment. As illustrated in the case studies described above, the same approach can be used to determine breed composition in studies that determine the optimum breed composition for different production environments, thereby informing farmers what breed composition of cow or bull they should be aiming to purchase or to produce through breeding.
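One simple way SNP data can support parentage assignment is by counting opposing homozygotes: barring genotyping error, a true parent and its offspring cannot carry opposite homozygous genotypes at the same SNP. The sketch below illustrates that idea only. It is not the method behind the panels of Strucken et al. (2017), and the function names, the 1% conflict threshold, and the animal identifiers are hypothetical.

```python
def opposing_homozygotes(offspring, candidate):
    """Count SNPs where offspring and candidate are opposite homozygotes (0 vs 2)."""
    return sum(1 for o, c in zip(offspring, candidate)
               if o is not None and c is not None and {o, c} == {0, 2})

def assign_parent(offspring, candidates, max_conflict_rate=0.01):
    """Return the candidate with the lowest conflict rate, or None if all exceed the threshold.

    offspring  : list of genotype codes (0, 1, 2, or None for a missing call)
    candidates : dict mapping candidate ID to a genotype list of the same length
    """
    best_id, best_rate = None, None
    for cand_id, genotype in candidates.items():
        # Only SNPs called in both animals are informative
        called = sum(1 for o, c in zip(offspring, genotype)
                     if o is not None and c is not None)
        if called == 0:
            continue
        rate = opposing_homozygotes(offspring, genotype) / called
        if best_rate is None or rate < best_rate:
            best_id, best_rate = cand_id, rate
    if best_rate is not None and best_rate <= max_conflict_rate:
        return best_id
    return None

calf = [0, 1, 2, 2, 0, 1]
bulls = {"bull_a": [0, 1, 2, 1, 1, 2], "bull_b": [2, 1, 0, 2, 2, 0]}
sire = assign_parent(calf, bulls)
```

With only a handful of SNPs many unrelated animals would show no conflicts by chance, which is why parentage panels of a few hundred informative markers are needed in practice.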

Commercially available SNP assays are currently too expensive to allow their routine commercial use in parentage assignment and determination of breed composition in East African dairy systems. However, using the Dairy Genetics East Africa high-density genotype data on 2940 crossbred cattle in East Africa (Kenya, Uganda, Ethiopia, and Tanzania), Strucken et al. (2017) developed reduced SNP panels consisting of 200–400 SNPs each; one set of panels for the accurate determination of breed composition and the other set for accurate parentage verification. These assays will soon be tested in the field to determine the feasibility of delivery on a large scale at a price farmers and others are willing to pay, with a target of \$10–\$20 for laboratory costs. If smartly and widely used, these tools will enable almost immediate genetic improvement through targeting of the best genotypes to different production environments, which in turn will allow the formation of synthetic dairy breeds in which long-term genetic improvement can be practiced.

The availability of genotypic data has enabled the estimation of genetic parameters and of genomic breeding values for milk yield in these populations using the G matrix obtained from SNP genotype data (Brown et al., 2016; Mrode et al., 2018). Using milk test day records on 1034 cows and genotypes from the Dairy Genetics East Africa project, Brown et al. (2016) applied genomic best linear unbiased prediction (GBLUP) and Bayes C models to examine the accuracy of genomic predictions for cows of different breed composition. The study reported accuracies of genomic prediction varying from 0.30 to 0.40. Using the same dataset, Mrode et al. (2018) examined models with dominance effects and a multi-trait approach fitting breed proportion as separate traits. Although the dominance effects were essentially zero, possibly due to the small size of the dataset, the multi-trait approach resulted in a slight improvement in the predictive ability of the model, although not in accuracy of prediction, compared to the results of Brown et al. (2016). While the accuracies reported in these studies in East Africa are lower than estimates from developed countries (Wiggans et al., 2017), they are very promising given the limited data sets and the fact that there is no existing breeding program with which these genomic EBV (gEBV) for crossbred performance have to compete. The results highlight the need for more data and the consequent advantage of pooling data across countries in future (Mrode et al., 2018). The Bill and Melinda Gates Foundation-funded African Dairy Genetic Gains (ADGG) project is generating more data across two countries and should offer more opportunity to examine the application of genomic selection (GS) in smallholder systems (Mrode et al., 2018). The intention is to initiate routine genomic evaluations, and selection and recruitment of young bulls for use in the National Artificial Insemination Centers (NAIC) in Tanzania, Ethiopia, and Kenya. In addition, genome wide association studies (GWAS) are planned to determine whether genes or genetic regions controlling production and reproduction traits can be identified that can be used to further enhance genetic improvement in these populations.

To improve the cost-effectiveness of applying genomic selection in East Africa, the feasibility of developing a reduced (i.e., cheaper) chip for genomic prediction was examined using the 3,513 animals with high density genotypes in the Dairy Genetics East Africa data (Aliloo et al., 2018). Various methods were examined for selecting panels with a reduced number of SNPs for imputation and genomic prediction within the crossbred populations. It was found that a specially developed (co)variance method, which accounted for the covariance between adjacent SNPs and the minor allele frequency of SNPs, out-performed other approaches such as using the minimum minor allele frequency or random SNP selection. High imputation accuracies of about 0.80 and 0.94 were observed when imputing from optimized 7 K and 40 K panels, respectively, to HD. The use of these low-density (LD) data imputed to HD was accompanied by a high accuracy of genomic prediction of about 0.98 compared to the use of unimputed HD data. The highest imputation accuracies were obtained with a reference population consisting of a mixture of crossbred and ancestral purebred animals. As the cost of existing commercial genotyping assays continues to fall, the value of having smaller customized assays is reducing and, with current technologies, they may well become more expensive than commercial assays that are used globally, because of their more limited use. Innovative applications of genomic technology or tools for breed composition and parentage determination, and genomic prediction, if accompanied by sound business models for their delivery, hold great potential for impact in Africa.
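For readers unfamiliar with how imputation accuracy can be quantified, the sketch below shows two common measures computed on genotypes that were masked and then imputed: the correlation between true genotypes and imputed allele dosages, and the proportion of genotype calls that match after rounding. The source does not state which measure underlies the figures quoted above, so both functions, their names, and the toy data are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)

def imputation_correlation(true_genotypes, imputed_dosages):
    """Correlation between true 0/1/2 genotypes and imputed dosages, pooled over animals."""
    flat_true = [g for animal in true_genotypes for g in animal]
    flat_imputed = [d for animal in imputed_dosages for d in animal]
    return pearson(flat_true, flat_imputed)

def genotype_concordance(true_genotypes, imputed_dosages):
    """Fraction of masked genotypes recovered exactly after rounding dosages to 0, 1 or 2."""
    matches = total = 0
    for true_row, imputed_row in zip(true_genotypes, imputed_dosages):
        for g, d in zip(true_row, imputed_row):
            called = min(2, max(0, round(d)))
            matches += int(called == g)
            total += 1
    return matches / total

truth = [[0, 1, 2], [2, 1, 0]]
imputed = [[0.1, 0.9, 1.2], [2.0, 1.4, 0.2]]
r = imputation_correlation(truth, imputed)
concordance = genotype_concordance(truth, imputed)
```

Correlation-based measures are less sensitive to allele frequency than raw concordance, which is one reason the two can give different impressions of the same panel.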

# Use of Genomic Approaches in the Creation of New Breed-Types

The most appropriate breed-types for African livestock systems are typically considered to be those which are both productive and adapted/resilient. Genomics and its associated technologies/techniques (transgenesis, cloning, gene/genome editing etc.) offer opportunities for creating such breed-types. The case study below is one example of this.

### Trypanosome Resistant Cattle

Animal trypanosomiasis, caused by a group of extracellular protozoan parasites and transmitted by the tsetse fly (Glossina spp.), is a major constraint to livestock production across much of the African continent, with massive economic consequences (Kristjanson et al., 1999; Shaw et al., 2014). Attempts to develop vaccines against this pathogen have largely failed due to its ability to rapidly change its highly antigenic surface glycoprotein (La Greca and Magez, 2011). The alternative prevention measure, tsetse vector control, has proved expensive and difficult to sustain, with adverse environmental consequences (Tirados et al., 2015). However, some African Bos taurus cattle breeds, such as N'dama, are tolerant of infection with trypanosomes, remaining healthy and productive and without the anemia that is characteristic of infection in susceptible breeds. This phenomenon has been termed trypanotolerance. Importantly, trypanotolerant animals continue to harbor parasites and can succumb to pathology under physiological stress (Murray et al., 1984).

Because of the difficulty in conventional control methods, there has been significant research into a genetic approach to enabling livestock production under trypanosome challenge. In a series of studies, quantitative trait loci influencing response to trypanosome challenge were mapped in a mouse model (Kemp et al., 1996) and in N'dama cattle (Hanotte et al., 2003). Eventually, a combination of linkage mapping, expression analysis, candidate gene sequencing, population analysis and in vitro studies allowed candidate genes and variants to be identified with some confidence (Noyes et al., 2011). However, no genes of large effect were identified and the mechanism of tolerance remains unclear.

An alternative genetic-based approach is currently under investigation that attempts to exploit the resistance to infection with some trypanosome species shown by most primates. Resistance in primates is mediated by a subset of high-density lipoproteins (HDLs) called trypanosome lytic factors (TLFs), which kill many trypanosome species (Thomson et al., 2009). The active component of TLF has been shown to be apolipoprotein L-1 (apoL-1) which, following endocytosis by the trypanosome, is activated within the acidic lysosome to form membrane pores, resulting in parasite swelling and lysis (Molina-Portela Mdel et al., 2005; Thomson and Finkelstein, 2015). Primate TLF has been shown to kill the cattle-infective trypanosome, Trypanosoma congolense, as well as the human-infective trypanosome, T. brucei rhodesiense. Furthermore, susceptible mice have been shown to become fully resistant to infection with these trypanosomes following transfection with primate-derived APOL1 (Thomson et al., 2009). There is thus good reason to believe that transgenic cattle that are fully resistant to trypanosomes could be constructed. This could potentially allow Bos indicus cattle breeds that are well adapted to the African environment, except for susceptibility to trypanosomes, to become sustainably resistant without the use of toxic drugs or environmentally damaging insecticides. Research to explore this possibility in East Africa is currently underway (Lukeš and Raper, 2010; Yu et al., 2016).

# African Indigenous Livestock as Resource Populations for Discovery of Genetic Variants of Economic and Ecological Significance

African livestock populations are rich resources for discovery of genetic variants, and many efforts are underway to this end. The first case study below describes a breeding program for small ruminants (sheep and goats) which, whilst currently not using genomics as part of the breeding program itself, is using the breeding program data for genetic variant discovery purposes. Following this a second 'case study' illustrates other efforts toward genetic variant discovery: unlike the other cases described here which are specific initiatives/projects, this draws on numerous studies to showcase the various types of activities occurring in this space.

### Ethiopia Small Ruminants


In small ruminants, centralized breeding schemes, entirely managed and controlled by governments – with minimal, if any, participation by farmers – were developed and implemented in many developing countries. Such programs have generally failed to sustainably provide the desired genetic improvements to smallholder livestock keepers. Community-based breeding programs have been suggested as an alternative and are being implemented in a few pilot countries. Programs that adopt this strategy consider the farmers' needs, views, decisions, and active participation, from inception through to implementation, and their success is based upon proper consideration of farmers' breeding objectives, infrastructure, participation, and ownership (Sölkner et al., 1998; Mueller et al., 2015). Community-based breeding programs in Ethiopia started in 2009 and currently cover 3,200 households keeping more than 48,000 sheep and goats. The goal of the program is to improve the productivity and income of these small-scale resource-poor sheep and goat producers by providing access to improved animals that respond to improved feeding and management, facilitating the targeting of specific market opportunities (Haile et al., 2011, 2018).

A study using selected animals recorded as part of the community-based breeding program was performed to identify genes for prolificacy. Here, 84 sheep giving single, twin, triplet, quadruplet or higher-order births were used in a signatures of selection study to identify candidate genes for prolificacy. Animals giving single births (20) were taken as controls while those giving multiple births (64) formed the cases. FST analysis revealed two candidate regions, one on chromosome 5 and the other on the X chromosome. The latter was the most significant. hapFLK identified the region on the X chromosome only. The candidate region on chromosome 5 was adjacent to GDF9 and the region on the X chromosome spanned the BMP15 (GDF9B) gene. These two genes are expressed in oocytes and have been shown to be essential for ovulation rate, normal follicular growth and maturation of preovulatory follicles (McNatty et al., 2004). From examination of inherited patterns of ovulation rates in other sheep, point mutations have been identified in both genes. Animals heterozygous for any of these mutations have higher ovulation rates (that is, +0.8–3) than wild-type contemporaries, whereas those homozygous for each of the mutations are sterile, with ovarian follicular development disrupted during the preantral growth stages. The genes are being sequenced to identify the point mutations and, once these are confirmed, strategies to introgress the alleles conferring prolificacy into other, nonprolific, populations will be designed.
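The per-SNP quantity behind an FST scan of this kind can be sketched as follows. The estimator actually used in the study is not stated here, so the sample-size-weighted form of Wright's FST, (HT - HS)/HT, is an assumption, as are the function name and the example allele frequencies. Only the group sizes (20 controls, 64 cases) come from the text.

```python
import numpy as np


def fst_per_snp(p_controls, p_cases, n_controls=20, n_cases=64):
    """Wright's FST per SNP for two groups, weighted by sample size.

    p_controls, p_cases: allele frequencies of the same allele in each group.
    Returns 0.0 for SNPs that are monomorphic across both groups.
    """
    p1 = np.asarray(p_controls, dtype=float)
    p2 = np.asarray(p_cases, dtype=float)
    w1 = n_controls / (n_controls + n_cases)
    w2 = 1.0 - w1

    p_bar = w1 * p1 + w2 * p2
    h_total = 2.0 * p_bar * (1.0 - p_bar)            # expected heterozygosity, pooled
    h_within = w1 * 2.0 * p1 * (1.0 - p1) + w2 * 2.0 * p2 * (1.0 - p2)

    with np.errstate(divide="ignore", invalid="ignore"):
        fst = (h_total - h_within) / h_total
    return np.where(h_total > 0, fst, 0.0)


# Hypothetical allele frequencies at three SNPs: identical in both groups,
# fixed for opposite alleles, and monomorphic in both groups.
fst_values = fst_per_snp([0.3, 0.0, 0.0], [0.3, 1.0, 0.0])
print(fst_values)
```

Candidate regions would then be runs of adjacent SNPs with unusually high values relative to the genome-wide distribution. Haplotype-based methods such as hapFLK use additional information and are not reproduced by this sketch.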

### Other Initiatives on Genetic Variant Discovery

Post domestication, livestock genomes have continuously been modified through selective breeding for economically or otherwise important traits, and natural selection for adaptation to local agro-environments. Africa has diverse agro-environments and a predominantly tropical environment that is characterized by harsh and extreme climatic conditions, seasonal feed and water scarcity, heat stress, high solar radiation, widespread pathogens, parasitic infections and disease epidemics. These present the main evolutionary forces shaping Africa's livestock genomes. Accordingly, African livestock display unique adaptive traits including enhanced disease resistance, superior innate immunity and greater ability to thrive, produce and reproduce in unfavorable environments. Some of the adaptive traits in African livestock, such as resistance to gastro-intestinal parasites in small ruminants, are of global significance.

There are numerous African livestock populations already identified as of interest for gene-discovery studies. These include, as examples: breeds that are highly resistant/tolerant to gastrointestinal nematodes, such as the Red Maasai sheep and Small East African goats of East Africa, and the West African Dwarf sheep and goat (Preston and Allonby, 1978; Baker et al., 1999, 2003; Goossens et al., 1999; Behnke et al., 2006); breeds from West Africa that exhibit strong trypanotolerance, such as the N'dama, Somba, Baoulé, Lagune and Muturu cattle, and West African Dwarf sheep and goat (Agyemang, 2005; Geerts et al., 2009; Berthier et al., 2016); cattle breeds that produce "robust" milk yields in harsh conditions, such as the Butana and Kenana of Sudan (Peters et al., 2005; Salim et al., 2014); Zebu cattle that demonstrate an innate ability to regulate body temperature under heat stress by maintaining lower metabolic rates and rectal temperatures, lower respiratory rates and lower water requirements (Gaughan et al., 1999; Hansen, 2004); and breeds that are highly prolific, such as the sheep breeds of Djallonké from West Africa (Tuah and Baah, 1985), Bonga, Horro, and Arsi-bale from Ethiopia (Rekik et al., 2015), D'Man from Morocco (Aherrahrou et al., 2015), and Barbarine from Tunisia (Lassoued et al., 2017).

There are an increasing number of examples of African livestock populations being used in studies aimed at identifying the genes or gene-pathways and genomic variants underpinning economically or ecologically important traits. These include a number of studies that have detected putative signatures of selection for a variety of traits including feeding/drinking behavior, heat tolerance/thermoregulation, tick resistance, milk production under harsh environments, immune response, meat quality, and reproductive performance, amongst others (Makina et al., 2015; Mwacharo et al., 2017; Taye et al., 2017; Bahbahani et al., 2018). There are additionally some reports of GWAS, such as for tick and gastrointestinal parasite resistance (Benavides et al., 2015; Mapholi et al., 2016), though these are rarer due to the lack of datasets with both phenotypes and genotypes recorded on sufficient animals. Some genetic mapping studies targeting QTL identification, such as for resistance to gastrointestinal nematodes and trypanotolerance (Hanotte et al., 2003; Marshall et al., 2013), have also been reported. In some cases, candidate genes have also been identified within the genomic regions of interest, for instance genes likely associated with trypanotolerance (Berthier et al., 2016). Should this work be extended to the identification of refined genomic regions and/or validated functional mutations and variants, there is potential for it to be fed into genetic improvement strategies, either via breeding programs incorporating the use of genomic/genetic data or through the creation of new breeds via either introgression or genome modification approaches.

An exciting possibility in crossbred dairy cattle populations such as those in the Dairy Genetics East Africa project is that, as data increase, it will be possible to undertake GWAS to identify genetic regions, and potentially the genes, controlling genetic variation in milk production and adaptation traits. The differences between exotic dairy breeds and indigenous breeds in their genetic potential for milk production and in adaptation traits are larger than for any other crosses of livestock. For example, the genetic potential for milk production of Holstein cattle is about 10-fold higher than that of indigenous breeds such as the Small East African Zebu. GWAS may be able to identify the genetic regions that control these massive genetic differences between breeds. However, GWAS in crossbred cattle presents some challenges. In a purebred population, GWAS is based on population-wide linkage disequilibrium (LD) between SNP and functional genetic variants. In a crossbred population there are at least three forms of LD: the LD coming from within the indigenous ancestors; the LD from within the exotic ancestors; and the between-population LD, which is the LD generated within the crossbreds due to segregation of loci that were fixed for opposite alleles in the exotic vs. indigenous ancestors. In practice the problem is even more complicated because most LD is not conserved between exotic or between indigenous breeds, so each of the various ancestral dairy breeds and indigenous breeds injects different amounts and phases of LD, reducing further the average LD observed in crossbred populations. It is not yet clear whether existing SNP assays provide sufficient information to track inheritance of segments of the genome back to their diverse origins with sufficient accuracy to undertake GWAS that separates the different forms of LD in the population. Assuming that it will prove possible, the between-population LD is potentially of greatest interest given the very large genetic differences between ancestral breeds and the potential to apply gene-based selection for suitable combinations of productivity and adaptation traits. Very low density, and hence potentially cheap, assays of a few hundred SNP might be developed and applied to widely test animals and select those with optimum combinations of productivity and adaptation variants, even if genotyping with commercial assays proves too expensive for routine use in genetic improvement in these systems.
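The three forms of LD described above can be made concrete with the standard two-population admixture result: if a fraction m of the gamete pool comes from population 1 and 1 - m from population 2, the LD between loci A and B in the pooled gametes is D = m D1 + (1 - m) D2 + m (1 - m) (pA1 - pA2) (pB1 - pB2). The first two terms carry the within-ancestor LD and the last term is the between-population LD. The Python sketch below evaluates this expression; the function names and the example values are illustrative assumptions, and real crossbred populations with several ancestral breeds and ongoing admixture are more complex than this single-pulse, two-population case.

```python
def admixture_ld(m, p_a_1, p_b_1, p_a_2, p_b_2, d_1=0.0, d_2=0.0):
    """LD (D) between loci A and B in a pooled gamete population.

    m            : proportion of gametes from population 1 (e.g., exotic)
    p_a_*, p_b_* : allele frequencies at loci A and B in populations 1 and 2
    d_1, d_2     : LD within populations 1 and 2
    Returns (within_component, between_component, total).
    """
    within = m * d_1 + (1.0 - m) * d_2
    between = m * (1.0 - m) * (p_a_1 - p_a_2) * (p_b_1 - p_b_2)
    return within, between, within + between


def ld_after_random_mating(d_0, recombination_rate, generations):
    """Expected decay of D under random mating: D_t = D_0 * (1 - r)^t."""
    return d_0 * (1.0 - recombination_rate) ** generations


# Illustrative case: both loci fixed for opposite alleles in the two ancestors,
# equal contributions, and no LD within either ancestor.
within, between, total = admixture_ld(0.5, 1.0, 1.0, 0.0, 0.0)
print(within, between, total)
print(ld_after_random_mating(total, recombination_rate=0.01, generations=5))
```

Because the between-population term scales with the product of the allele-frequency differences at the two loci, it is largest where the ancestral breeds differ most, which is why this form of LD is of particular interest for the large exotic versus indigenous differences.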

An additional body of work has focused on characterizing genetic diversity, population structure and relationships in African livestock (Hanotte, 2002; Missohou et al., 2006; Muigai and Hanotte, 2013; Decker et al., 2014). Such studies are useful in understanding their evolutionary history as well as identifying appropriate populations for the identification of genomic variants.

# DISCUSSION

# Current Developments

The case studies presented showcase a number of livestock genomic technologies currently being applied or piloted in livestock systems of sub-Saharan Africa. These include those aimed at identifying the most appropriate breed-type for particular production systems/environments, a breeding program incorporating genomic selection as well as parentage and breed composition determination, an initiative aimed at creating a new breed-type, and efforts toward discovery of genetic variants. Other examples outside of those presented here also exist within sub-Saharan Africa, with, in particular, major efforts in South Africa to incorporate genomic selection into established breeding programs for a number of species (van Marle-Köster et al., 2013; Cloete et al., 2014; Westhuizen and van der Marle-Köster, 2014; Mohlatlole et al., 2015; Prescilla et al., 2015). These examples are all fairly recent, mostly emerging within the last 5 years, and highlight the developing use of genomics in African livestock systems.

It is of note that the differences in livestock production systems, and in the type of genetic improvement strategy used within them, between developed countries and Africa (as discussed in the introductory section of the paper) have led to different emphases in how genomics is currently being applied. In developed countries the most suitable animal genetic resource for a particular production system is usually well established, whereas in many African production systems, and particularly those undergoing change such as through intensification, there is generally little evidence on which to base such recommendations (Marshall, 2014). The use of genomic data to determine the breed-type of admixed animals monitored in situ (i.e., kept by farmers) has been transformational to this end, as it has removed the high error of assigning the breed-type of admixed animals based on observation (phenotype) or farmer recall. Genomic selection is now commonplace in many developed country livestock breeding programs, whereas in Africa it is in its infancy. This principally stems from the lack of breeding programs into which genomic selection can be implemented, with some notable exceptions including the African Dairy Genetics Gains initiative described here and various breeding programs in South Africa, many of which are working on developing sufficiently sized reference populations to incorporate genomic selection (van Marle-Köster et al., 2013; Cloete et al., 2014; Westhuizen and van der Marle-Köster, 2014; Mohlatlole et al., 2015; Prescilla et al., 2015). In the case of African Dairy Genetics Gains, the use of genomic information has overcome the constraint of lack of pedigree data, enabling a breeding program where it would previously have been difficult, if not impossible. African Dairy Genetics Gain is also piloting the use of genomic technologies for parentage verification as well as breed composition determination (particularly for cross-bred bulls) with a view to potential commercialization, the success of which will likely depend on whether there is sufficient market demand, in turn linked to whether the technologies can be sold at a price affordable to African livestock keepers.

Using genomics to aid the development of new breed-types for African livestock systems has received limited attention to date. Given the high interest in developing new breeds that have the adaptation and resilience of indigenous breeds combined with the productivity of exotic breeds, and the difficulty in many systems of maintaining a structured cross-breeding program, the cost:benefit of using genomic approaches to create an adapted and productive synthetic breed, in comparison to traditional approaches, is worth exploring in the African context. On the creation of new breed-types via transgenic or gene-edited approaches, few validated genes of interest currently exist. One notable exception to this is the gene conferring resistance to trypanosomiasis, as described in the case study presented here. Other variants of potential interest are those underpinning the slick hair phenotype, given this phenotype's association with heat tolerance and tropical adaptation (Mariasegaram et al., 2007; Dikmen et al., 2014; Littlejohn et al., 2014; Porto-Neto et al., 2018). As in many countries, a concern here is public and government acceptance of the new products.

As also described, significant efforts are ongoing aimed at discovering genetic variants of economic and ecological significance, primarily using a signature of selection approach. Given the current emphasis on incorporating traits conferring adaptation to harsh (including hotter) environments into breeding programs, both within developed and developing countries, this body of work may gain momentum. In one of the case studies presented, the signature of selection study utilized data from an African breeding program, which adds value to the performance data collected. Whilst GWAS are currently few, additional studies using this approach are also expected as data-sets build up, such as those that will be available via the African Dairy Genetics Gain project. Besides feeding into the discovery of genomic variants, GWAS can provide useful QTL information for use in genetic improvement programs. The evolutionary history of African indigenous livestock species makes African populations a particularly powerful resource for gene discovery (for example, Mwai et al., 2015; Kim et al., 2017), and if genes of significant effect are discovered they could be highly valuable. However, moving from initial results to confirmation of associations and then on to gene discovery requires substantial resources and time. Substantial investment will be required to move from genetic associations to applications in African livestock.

An important issue related to the use of African animal genetic resources is the fair and equitable sharing of benefits arising from their utilization. The Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization to the Convention on Biological Diversity (the Nagoya Protocol<sup>1</sup>) is the critical guiding document to this end. This protocol is a 2010 supplementary agreement to the 1992 Convention on Biological Diversity and entered into force in October 2014. The protocol defines obligations of the providers and users of all genetic resources in relation to access, benefit sharing (ABS) and monitoring the compliance of users with the legal ABS requirements of the provider country. By default, and due to the lack of any specialized international instrument, access to animal genetic resources for food and agriculture for R&D activities falls under national ABS regulation, unless the country determines otherwise. According to information provided in the Access and Benefit Sharing (ABS) Clearing House, a web platform aimed at supporting implementation of the Nagoya Protocol<sup>2</sup>, 43 African countries are currently parties to the Nagoya Protocol, though many of these are still developing the related policies and laws as well as implementing procedures and practices. The implementation of the Nagoya Protocol provides both opportunities and challenges for African countries, as discussed in AU-IBAR (2016). To help capitalize on these opportunities, whilst reducing the challenges, additional capacity building of both non-African and African actors on the Nagoya Protocol and the implementing legal framework in its Member States is urgently required.

# Future Outlook

Looking to the future, for African livestock systems to better capitalize on the potential of livestock genomics, several key issues need to be addressed. Critically, these include: establishment of sustainable genetic improvement strategies into which the use of genomic technologies can be embedded; enhanced phenomic capabilities; new genomic tools and/or algorithms designed for application in African livestock population structures; and enhanced capacity in animal breeding, genomics, and genetics. These are discussed in more detail in the following paragraphs.

Few genetic improvement strategies, i.e., breeding programs linked to multiplication and delivery systems that have the potential to produce impact at scale, exist in Africa, with the notable exception of South Africa (which has a highly developed economy and livestock infrastructure, as well as high capacity in animal breeding). For the majority of countries, significant further investment in identifying and establishing context-specific and sustainable genetic improvement strategies is required; genomic technologies can help enable such strategies or be embedded within them. Excellent guidelines to this end are given in FAO (2010), and other useful experiences have also been shared (Kosgey et al., 2011; Philipsson et al., 2011; Haile et al., 2013, 2016; Mueller et al., 2015; Bruno et al., 2016; Mrode et al., 2016; Ojango et al., 2016). Some elements promoted as being key to the success and sustainability of a genetic improvement strategy within Africa are: supportive policy and institutional arrangements; close engagement with all stakeholders to ensure their needs and preferences are met, including in the design stage; incorporation of the private sector; providing incentives for farmer participation, such as timely feedback on their own animals for enhanced farm-management decision making; ensuring equality of access to the breeding technologies and information, including from a gender perspective; and awareness raising of livestock keepers and other stakeholders on the value of genetic improvement, particularly when packaged with other interventions, such as animal health-care and feeding, that allow the improved genetics to be expressed.

In initiatives where both phenotypes and genotypes are required, the phenotypic information is usually more expensive and difficult to obtain than the genotypic information, particularly as the cost of genotyping declines (Biscarini et al., 2015). To this end, phenotyping tools that are cheap, reliable and easy to use are required. One such example is the use of weigh-bands (tape measures placed around the girth of an animal from which the animal's weight can be read) in cases where farmers do not have access to weighing scales (for example, Tebug et al., 2016). Whilst many other 'higher tech' examples exist, such as wearable devices for remote recording of livestock health, movement and reproductive status (Rutten et al., 2013; Egger-Danner et al., 2015), and several are being tested in African systems, these are currently not affordable by the majority of livestock keepers in Africa.

<sup>1</sup>https://www.cbd.int/abs/text/default.shtml

<sup>2</sup>https://absch.cbd.int/

Phenomic tools extend beyond recording into methods of analysis. Production systems, population structures and data quality of many African livestock populations differ markedly from the intensive systems in which most existing phenotype and genetic data analysis methods have been developed and tested. It can be expected that this will, at the very least, often lead to very different phenotype and genetic parameter values than typically seen in intensive systems. In many cases, statistical models will need to be developed that are appropriate for the population. For example, for smallholder dairy systems, typical herd-year-season effects cannot be applied (because of the very small number of cows per herd), methods of fitting lactation curves may not be appropriate to lactations that do not exhibit a classical lactation curve and/or where the shape of the curve is highly dependent on production level, and variation across the lactation may be high due to short-term fluctuations in feed availability. Additionally, factors such as genotype by environment (GxE) interaction that typically have modest effects in intensive systems, where environmental differences between farms are typically relatively small, may be much more important in African livestock systems. For example, smallholder crossbred dairy farms in East Africa range from under 1,000 l milk/cow/annum to over 5,000 l milk/cow/annum. There is massive GxE in terms of breed composition (high grade or pure exotics do best in the best environments while low-grade exotics perform best in the poorest environments) and hence it should be expected that there will be large GxE when undertaking genetic evaluations in such populations. It will be important to ensure that existing methods of analysis drawn from the global literature are properly tested and adapted where needed to provide appropriate analyses for African livestock populations.
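The re-ranking of breed-types across environments described above can be pictured as crossing reaction norms. The Python sketch below uses a linear reaction norm per breed-type; the intercepts and slopes are hypothetical values chosen only so that yields fall within the range of under 1,000 to over 5,000 l milk/cow/annum quoted above. They are not estimates from any dataset, and the function names are likewise illustrative.

```python
# HYPOTHETICAL linear reaction norms, in litres of milk/cow/annum:
# (yield in the poorest environment, extra yield gained in the best environment).
REACTION_NORMS = {
    "low_grade_exotic": (1200.0, 1500.0),
    "high_grade_exotic": (500.0, 4500.0),
}


def expected_yield(breed_type, environment_level):
    """Yield at an environment level scaled from 0 (poorest) to 1 (best)."""
    intercept, slope = REACTION_NORMS[breed_type]
    return intercept + slope * environment_level


def best_breed_type(environment_level):
    """Breed-type with the highest expected yield at this environment level."""
    return max(REACTION_NORMS, key=lambda b: expected_yield(b, environment_level))


def crossover_level(breed_a, breed_b):
    """Environment level at which the two reaction norms intersect."""
    (intercept_a, slope_a) = REACTION_NORMS[breed_a]
    (intercept_b, slope_b) = REACTION_NORMS[breed_b]
    return (intercept_a - intercept_b) / (slope_b - slope_a)


for level in (0.0, 0.5, 1.0):
    print(level, best_breed_type(level))
print(crossover_level("low_grade_exotic", "high_grade_exotic"))
```

When reaction norms cross in this way, a single ranking of breed-types (or of animals within a genetic evaluation) cannot hold across all farms, which is the practical reason GxE needs to be modeled rather than ignored.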

African livestock genetic research, development and application has a huge advantage in being able to utilize the wide range of genomic tools that have been developed for use globally. Most notably the existing genome sequence assemblies and associated annotations coupled with the various commercially available SNP genotyping assays provide immediate tools for analyses of genetic diversity, genetic evaluations, signatures of selection, GWAS and gene discovery. However, all of these tools were developed with little or no information from African livestock populations. It is not yet known whether updated or customized assays will be required to obtain the maximum utility in African populations, though in the case of cattle work is being undertaken to this end (ILRI, 2016). As a precursor to the work on imputation of SNP data in East African crossbred cattle populations (Aliloo et al., 2018) it was shown that the bovine high density assay with 777,000 SNP was highly informative for African indigenous cattle populations, in the sense that it has more than 190,000 markers with high minor allele frequency for most cattle populations tested. However, it also showed that the existing commercial 7 k SNP assay had low power for imputation in crossbred populations (Aliloo et al., 2018). Related to this, imputation algorithms will need to be developed for African pure and crossbred populations, as Aliloo et al. (2018) did for the East African crossbred dairy cattle. The degree of shared LD between African indigenous populations is not yet known but, as is the case for developed world breeds, it is not expected to be high. So imputation algorithms will need to be trained for each population separately or trained on a population of animals sampled from a variety of breeds, as has worked well for some minor breeds in developed countries. 
Although the existing high density (>600 k) SNP assays are expected to work well for basic GWAS in all populations, they may remain suboptimal on two levels: (a) we lack the sequence information to impute up to full sequence data for African populations, and the assays may not have an ideal SNP set to allow imputation to sequence variants that exist in African populations; (b) the information content of the existing SNP may not allow accurate separation of the indigenous versus exotic versus between-breed LD and hence may not allow an advanced (and hence accurate) GWAS to be performed in crossbred populations. As more information accumulates it will become clear how much value improved assays will add for each of the livestock species in Africa. Given that current genotyping platforms have a strong negative relationship between volume of sales and price, this value can be assessed against the cost relationship to determine the cost-benefits of developing customized assays for each species.

Building human capacity in animal breeding, genetics and genomics within Africa, such that appropriate expertise exists to design and support implementation of the genetic improvement strategies and linked genomic technologies, is required. Suggestions on how to strengthen developing country higher education systems in animal breeding are given by Ojango et al. (2008). These include concerted efforts in training of trainers, co-operation among higher education institutes within regions (South–South collaboration) in order to improve the quality of training offered, and collaboration with institutes in more developed countries. A formal on-line discussion forum revealed that the needs for human capacity development in livestock genetics and breeding go far beyond expanding postgraduate training (Chagunda et al., 2015). Principal among the needs was the current lack of effective career and mentoring structures for post-graduates trained in livestock genetics and breeding, such that most such graduates end up working in other disciplines or lack the support to evolve from a trained post-graduate into an expert practitioner. Sharing of lessons learned across genetic improvement initiatives within Africa would also be extremely valuable, and additional efforts to this end are warranted.

# Concluding Comments

In conclusion, genomic applications are currently benefiting African livestock systems in a variety of ways, including through genetic improvement and, more broadly, by assisting in system characterization. This has emerged relatively recently, largely within the last 5 years. The expectation for the future is that African livestock systems will increasingly benefit from genomics, particularly if the various issues constraining this (as discussed in this paper) are addressed. The rate at which this will occur will largely depend on the level of investment in African livestock genetic improvement.

# AUTHOR CONTRIBUTIONS


KM and JG were the main authors of the manuscript, with information on the case studies supplied by JG, OM, RM: Kenya dairy cattle and East Africa dairy cattle, KM: Senegal dairy cattle, TG: Ethiopia sheep, SK: trypanosome resistant cattle, AH: Ethiopia small ruminants, and JM: other initiatives on genomic variant discovery.

# ACKNOWLEDGMENTS

In relation to the case studies, we gratefully acknowledge the funders of the work, and our numerous partners both within and outside Africa, and particularly the national agricultural research systems and the women and men livestock keepers, with whom we closely work. Funders include the Bill & Melinda Gates Foundation (Kenya dairy cattle and East Africa dairy cattle), the Finnish Ministry of Foreign Affairs via the FoodAfrica program (Senegal dairy cattle), and the Livestock and Fish, and Livestock, CGIAR Research Programs (most case studies). We would particularly like to acknowledge the Tanzania Livestock Research Institute and Ethiopia's National Animal Improvement Institute (East Africa dairy cattle), the Interstate School of Veterinary Science and Medicine (Senegal dairy cattle), and Jayne Raper from Hunter College, New York and Harry Noyes from the University of Liverpool (trypanosome resistant cattle). The Centre for Tropical Livestock Genetics and Health is a research alliance of The International Livestock Research Institute, Roslin Institute at the University of Edinburgh, and Scotland's Rural College.

# REFERENCES






**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer GM declared a past co-authorship with several of the authors TG, AH, and OM to the handling Editor.

Copyright © 2019 Marshall, Gibson, Mwai, Mwacharo, Haile, Getachew, Mrode and Kemp. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Leveraging Available Resources and Stakeholder Involvement for Improved Productivity of African Livestock in the Era of Genomic Breeding

Eveline M. Ibeagha-Awemu<sup>1</sup> \*, Sunday O. Peters<sup>2</sup> , Martha N. Bemji<sup>3</sup> , Matthew A. Adeleke<sup>4</sup> and Duy N. Do<sup>1</sup>

<sup>1</sup> Sherbrooke Research and Development Centre, Agriculture and Agri-Food Canada, Sherbrooke, QC, Canada, <sup>2</sup> Department of Animal Science, Berry College, Mount Berry, GA, United States, <sup>3</sup> Department of Animal Breeding and Genetics, Federal University of Agriculture, Abeokuta, Abeokuta, Nigeria, <sup>4</sup> School of Life Sciences, University of KwaZulu-Natal, Durban, South Africa

### Edited by:

Peter Dovc, University of Ljubljana, Slovenia

### Reviewed by:

Mekonnen Haile-Mariam, Department of Economic Development, Jobs, Transport and Resources, Australia; Xiangdong Ding, China Agricultural University, China

\*Correspondence:

Eveline M. Ibeagha-Awemu eveline.ibeagha-awemu@canada.ca

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 28 August 2018 Accepted: 03 April 2019 Published: 24 April 2019

### Citation:

Ibeagha-Awemu EM, Peters SO, Bemji MN, Adeleke MA and Do DN (2019) Leveraging Available Resources and Stakeholder Involvement for Improved Productivity of African Livestock in the Era of Genomic Breeding. Front. Genet. 10:357. doi: 10.3389/fgene.2019.00357

The African continent is home to diverse populations of livestock breeds adapted to harsh environmental conditions, with more than 70% under traditional systems of management. Animal productivity is less than optimal in most cases and faces numerous challenges, including limited access to adequate nutrition and disease management, poor institutional capacities, and lack of adequate government policies and funding to develop the livestock sector. Africa is home to about 1.3 billion people and, with increasing demand for animal proteins by an ever-growing human population, the current state of livestock productivity creates a significant yield gap for animal products. Although a large section of the population, especially those living in rural areas, depends largely on livestock for their livelihoods, the potential of the sector remains underutilized and it is therefore unable to contribute significantly to the economic development and social wellbeing of the people. With current advances in livestock management practices, breeding technologies and health management, and with inclusion of all stakeholders, African livestock populations can be sustainably developed to close the animal protein gap that exists on the continent. In particular, advances in gene technologies and the application of genomic breeding in many Western countries have resulted in tremendous gains in traits like milk production, suggesting that implementation of genomic selection and other improved practices (nutrition, healthcare, etc.) can lead to rapid improvement in traits of economic importance in African livestock populations. The African livestock populations in the context of this review are limited to cattle, goat, pig, poultry, and sheep, which are mainly exploited for meat, milk, and eggs. This review examines the current state of livestock productivity in Africa, the main challenges faced by the sector and the role of various stakeholders, and discusses in depth strategies that can enable the application of genomic technologies for rapid improvement of livestock traits of economic importance.

Keywords: Africa, breed improvement, genomic breeding, stakeholders, sustainable livestock development

# INTRODUCTION


The African continent is home to diverse populations of livestock breeds adapted to their local environments in diverse agroecological zones. The diversity of the various cattle, sheep, goat, pig, and chicken breeds since their introduction or domestication has been shaped by a delicate balance between human and natural selection, and environmental adaptation. Livestock are central to African society and economy and serve diverse roles such as: (1) source of food (meat and milk in the diet); (2) income generation through sale of meat, milk, and hides; (3) savings and insurance; (4) source of draft power and manure in crop production; (5) a means of transportation; (6) use in festivals and traditional ceremonies (marriage, birth, death, coronation, and initiation ceremonies); and (7) source of power, pride, prestige, and status. Despite these benefits, livestock productivity is less than optimal, not sustainable and unable to match demand and population growth.

The Food and Agriculture Organization of the United Nations (FAO) estimates the total African population at 1.3 billion in 2017 (FAOSTATS, 2018), with rural and urban populations of 717 million and 505 million, respectively. Furthermore, the urban population has increased steadily by 3.59% per year since 2010, compared with a 1.74% annual increase in the rural population, and these increases are accompanied by increased demand for animal products. To meet this demand in the face of low livestock productivity, African governments have increased imports of cattle meat from 482,111 tons in 2012 to 612,353 tons in 2016 and of pig meat from 184,322 tons in 2012 to 252,611 tons in 2016 (**Table 1**). The populations of cattle, sheep, goat, pig, and chicken in the African continent and its various regions in 2016 are shown in **Table 2**. Similarly, the statistics on livestock productivity (meat, milk, and eggs) from 2010 are shown in **Figure 1**. To position the livestock sector to contribute adequately to the food supply and economic development of the continent, measures must be taken to ensure sustainability in African livestock production systems, which forms part of FAO's strategic objectives<sup>1</sup>. The livestock sector has the potential to enhance the livelihoods of Africa's rural poor. Given that the human population of the continent is growing faster than its food protein production, the need for targeted action to increase livestock productivity has never been greater, and genomic selection may play a significant role.
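To put the figures quoted above on a common scale, the following minimal sketch converts the 2012 and 2016 import volumes into overall and compound annual growth, and converts the quoted urban and rural growth rates into doubling times. It assumes constant compounding between the two data points and an unchanged growth rate into the future, which the source does not claim; the function names are illustrative only.

```python
import math

# Import volumes (tons) for 2012 and 2016, as quoted in the text (Table 1).
IMPORTS_TONS = {
    "cattle meat": (482_111, 612_353),
    "pig meat": (184_322, 252_611),
}
# Annual population growth rates since 2010, as quoted in the text.
ANNUAL_POPULATION_GROWTH = {"urban": 0.0359, "rural": 0.0174}


def total_growth(start, end):
    """Overall relative change between two values, as a fraction."""
    return (end - start) / start


def compound_annual_rate(start, end, years):
    """Constant yearly rate that takes `start` to `end` in `years` steps (assumption)."""
    return (end / start) ** (1 / years) - 1


def doubling_time_years(annual_rate):
    """Years needed to double if the annual rate stayed unchanged (assumption)."""
    return math.log(2) / math.log(1 + annual_rate)


for product, (tons_2012, tons_2016) in IMPORTS_TONS.items():
    overall = total_growth(tons_2012, tons_2016)
    yearly = compound_annual_rate(tons_2012, tons_2016, years=2016 - 2012)
    print(f"{product}: {overall:.1%} overall, about {yearly:.1%} per year")

for group, rate in ANNUAL_POPULATION_GROWTH.items():
    print(f"{group} population doubles in about {doubling_time_years(rate):.0f} years")
```

Under these assumptions, meat imports grew several times faster per year than either population segment, which is consistent with the widening supply gap described in the text.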

# AFRICAN LIVESTOCK PRODUCTIVITY IN THE ERA OF GENOMIC BREEDING

# State of Genomic Breeding Application in Western Countries

Genomic breeding, in simple terms, refers to the use of deoxyribonucleic acid (DNA) or genomic information to select superior animals as parents of the next generation (Meuwissen et al., 2001, 2016). Thus, genomic selection is the application of knowledge of genetic variations found in the genome and their relationship with traits or phenotypes (e.g., milk yield, body weight, egg size, litter size, etc.) in selection for improved productivity. The application of quantitative genetic theories, statistical approaches, artificial insemination and organized breeding practices resulted in rapid gains in livestock traits in the last nine decades (Blasco, 2013; Hill, 2014; Oldenbroek and Waaij, 2014; Weller, 2016). For the most part, the exact mechanisms behind these gains were not known, but with the discovery of the DNA structure and developments in DNA sequencing and genotyping techniques, knowledge of the association between DNA variations and livestock traits began to emerge. Thus, with increasing demand for animal products by an ever-growing population and changing societal needs, the practice of animal breeding needed to evolve to incorporate genomic information in order to speed up response to selection and increase productivity.

Genomic breeding started with the application of marker assisted selection considering a few markers at a time and has evolved to the use of thousands of markers and even whole genome data (Hayes and Goddard, 2001; Hayes et al., 2013). Genomic selection entails the estimation of breeding values from markers spanning the entire genome. The estimation of marker effects is carried out within a reference population (a population of individuals with phenotype and marker genotype information). These effects are then applied to selection candidates that have marker genotype information but no phenotypes in order to estimate their genomic breeding values (GEBV). The reliability and accuracy of this approach depend on many factors, including the number of individuals genotyped, the density of the markers on the genome, effective population size, the genetic relationship between the reference and predicted populations, the nature of the traits and the applied methods (Habier et al., 2007, 2011; Bolormaa et al., 2013a,b, 2014; Meuwissen et al., 2016). Genomic selection for milk and beef traits has been successfully implemented in several countries including Australia, Canada, France, Germany, Great Britain, Ireland, the Netherlands, New Zealand, the Scandinavian countries and the United States of America (Silva et al., 2014; Weller et al., 2017).
Genomic breeding application in these countries is facilitated by many factors including: (1) large populations of animals; (2) specialized farms; (3) comprehensive data on animals; (4) access to genotyping platforms which makes single nucleotide polymorphism (SNP) genotyping more cost effective; (5) available resources (finance, technical knowhow) to accomplish genotyping; (6) existence of breed associations; (7) application of artificial insemination; (8) development of breeds for specific purposes; (9) large scale international breeding companies that sell semen from high performing males for use in breeding for specific traits; (10) creation of farmer organizations; (11) implementation of national evaluation schemes; (12) development of statistical models to handle large data; and (13) computing infrastructure to deliver genomics information, which helps to facilitate genetic gains in livestock.
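The two-step logic described above (estimate marker effects in a reference population, then apply them to genotyped candidates without phenotypes) can be illustrated with a toy ridge-regression (SNP-BLUP style) sketch in pure Python. This is only one of several possible approaches and is not the method of any specific study cited here. The genotype coding (0/1/2 allele counts), the tiny data set, the true effects used to simulate phenotypes and the ridge value are all hypothetical assumptions chosen for illustration.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def estimate_marker_effects(genotypes, phenotypes, ridge):
    """Step 1: solve (Z'Z + ridge * I) a = Z'y on centered reference data."""
    n_animals, n_markers = len(genotypes), len(genotypes[0])
    marker_means = [sum(row[j] for row in genotypes) / n_animals
                    for j in range(n_markers)]
    pheno_mean = sum(phenotypes) / n_animals
    z = [[row[j] - marker_means[j] for j in range(n_markers)] for row in genotypes]
    y = [p - pheno_mean for p in phenotypes]
    lhs = [[sum(zi[j] * zi[k] for zi in z) + (ridge if j == k else 0.0)
            for k in range(n_markers)] for j in range(n_markers)]
    rhs = [sum(zi[j] * yi for zi, yi in zip(z, y)) for j in range(n_markers)]
    return marker_means, solve(lhs, rhs)


def predict_gebv(genotypes, marker_means, effects):
    """Step 2: GEBV of each candidate = sum of centered genotype * marker effect."""
    return [sum((g - m) * a for g, m, a in zip(row, marker_means, effects))
            for row in genotypes]


# Hypothetical reference population: 8 animals x 3 SNPs (allele counts 0/1/2).
REFERENCE_GENOTYPES = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 2, 1],
                       [1, 1, 1], [2, 0, 1], [0, 0, 0], [2, 2, 2]]
TRUE_EFFECTS = [2.0, -1.0, 0.5]  # used only to simulate noise-free phenotypes
REFERENCE_PHENOTYPES = [10.0 + sum(g * a for g, a in zip(row, TRUE_EFFECTS))
                        for row in REFERENCE_GENOTYPES]
# Hypothetical selection candidates: genotyped, but no phenotype recorded.
CANDIDATE_GENOTYPES = [[2, 0, 2], [0, 2, 0]]

marker_means, effects = estimate_marker_effects(
    REFERENCE_GENOTYPES, REFERENCE_PHENOTYPES, ridge=1.0)
gebv = predict_gebv(CANDIDATE_GENOTYPES, marker_means, effects)
for animal, value in zip(CANDIDATE_GENOTYPES, gebv):
    print(f"candidate {animal}: GEBV = {value:+.2f}")
```

The ridge term shrinks marker effects toward zero, which is what keeps the equations solvable when markers far outnumber phenotyped animals, as is typical with real SNP chips. In practice the shrinkage is tied to the trait's variance components rather than fixed by hand as it is here.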

<sup>1</sup>http://www.fao.org/3/MW154en/mw154en.pdf

<sup>∗</sup>FAOSTATS, 2018, http://www.fao.org/faostat/. ∗∗Percent average annual increase in livestock populations from 2011 to 2016.

In addition, several initiatives have been undertaken to make available data on all sources of genomic variation in livestock genomes to further increase the success of genomic breeding. Such efforts include, but are not limited to, the 1000 Bull Genomes Project and the international consortium for Functional Annotation of Animal Genomes (FAANG<sup>2</sup>). The 1000 Bull Genomes Project aims to re-sequence the whole genomes of 1,000 bulls and has already made available about 84 million SNPs and 2.5 million small insertions/deletions (Hayes and Daetwyler, 2019). FAANG was established to provide the infrastructure to detect and proficiently analyze genome wide functional regulatory elements (DNA methylation, histone modifications, chromatin remodeling, noncoding RNA) in animal genomes (cattle, chicken, goat, pig, and sheep). This information is necessary to understand how variation in gene sequences and functional components determines phenotypic diversity and how this is translated into complex phenotypes, and thus to fill the genotype-to-phenotype gap in current livestock improvement programs (Andersson et al., 2015; Tuggle et al., 2016). Major gains achieved with the use of genomic information and implementation of genomic selection include a higher rate of genetic gain, increased reliability of predicted breeding values, higher intensity of selection, a shortened generation interval, the possibility of selecting animals at an early age, and rapid genetic improvement in lowly heritable traits (e.g., fertility, lifespan, health, etc.) (reviewed by Hayes et al., 2009; Ibeagha-Awemu and Khatib, 2017; Weller et al., 2017; Mrode et al., 2018).
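How selection intensity, accuracy and generation interval combine into a rate of gain can be seen in the breeder's equation from standard quantitative genetics: annual gain = (selection intensity × accuracy × additive genetic standard deviation) / generation interval. The sketch below applies it to two scenarios. The equation is textbook material rather than something derived in this review, and every input value is a hypothetical illustration, not an estimate for any real population.

```python
def annual_genetic_gain(intensity, accuracy, genetic_sd, generation_interval_years):
    """Breeder's equation: expected genetic gain per year, in trait units."""
    if generation_interval_years <= 0:
        raise ValueError("generation interval must be positive")
    return intensity * accuracy * genetic_sd / generation_interval_years


# Hypothetical scenarios, in units of one additive genetic standard deviation.
# Progeny testing: accurate breeding values, but bulls are old when proven.
progeny_test = annual_genetic_gain(
    intensity=1.76, accuracy=0.80, genetic_sd=1.0, generation_interval_years=6.0)
# Genomic selection: somewhat lower accuracy, but selection at a young age.
genomic = annual_genetic_gain(
    intensity=1.76, accuracy=0.60, genetic_sd=1.0, generation_interval_years=2.0)

print(f"progeny testing:   {progeny_test:.3f} genetic SD per year")
print(f"genomic selection: {genomic:.3f} genetic SD per year")
print(f"ratio:             {genomic / progeny_test:.2f}x")
```

With these illustrative inputs the shorter generation interval more than offsets the lower accuracy, which is the mechanism behind the gains listed above.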

The application of genomic selection in Western countries and the advances that have been made in breeding (e.g., dairy traits) have been driven by the economic needs of the producers. However, challenges regarding sustainability of livestock production necessitate consideration of economic, societal and environmental factors. For example, a focus on increased milk production and intensive selection for this trait over several decades resulted in a deterioration of many traits such as fertility, udder width/circumference and disease resistance (e.g., mastitis, metabolic diseases), and an increase in the ecological footprint of dairy production (e.g., greenhouse gas emission) (Boichard and Brochard, 2012; Egger-Danner et al., 2015). These factors, together with growing consumer demand for animal safety, warrant that successful programs for sustainable animal improvement create a balance between selection for traits of economic value, animal health, conformation traits, adaptation traits, animal welfare and environmental footprint.

The successes of genomic selection in Western countries mentioned above were possible through organized and sustained breeding practices supported by government regulations, finance and involvement of private companies. The picture for the majority of African countries is different: most livestock are kept for multiple purposes (meat, milk, traction, hides/wool, as a savings account, social status, cultural reasons, etc.), in small herds and flocks, under small to mid-scale, low-performing and low-input systems, and without enabling government policies and financial support. Thus, procedures to increase livestock productivity in Africa in the era of genomic breeding must take into consideration the different production systems, ecological zones and participation of all stakeholders.

# African Livestock Production Systems

In the majority of African countries, livestock production is managed under small to large scale systems (**Table 3**). Small scale production systems include pastoral, agro-pastoral and mixed smallholder farming. Large scale systems include ranching, large scale commercial farming, cooperative farming and state owned farms. About 70% of livestock production occurs under the small scale systems characterized by small animal population sizes, low inputs and outputs, etc. Devising appropriate policies for such systems with the right government support is of utmost importance in increasing livestock productivity for food production and income generation. The characteristics of the various systems of livestock production are summarized in **Table 3**. Under predominantly small scale farming systems, it is important to determine whether or not such systems are ready for genomic breeding. Genomic breeding implementation relies on available genetic resources/diversity, and genomic variation and its association with desired traits.

<sup>2</sup>www.faang.org

∗Some of the materials were obtained from Ibrahim (1998), Catley et al. (2016), and Majekodunmi et al. (2016).

### African Livestock Genetic Resources, Diversity and Genomic Variation

Effective management of farm animal genetic resources requires adequate information on population size and structure, geographical distribution, the production environment, and within- and between-breed genetic diversity (Groeneveld et al., 2010). Assessment of diversity levels in breeds is necessary because husbandry systems may affect diversity levels through inbreeding and high gene flow between breeds (Ibeagha-Awemu et al., 2004). Information on biodiversity is necessary for the preparation of national action plans for improvement of animal genetic resources (Manirakiza et al., 2017). Meanwhile, inclusion of genetic information in breed improvement requires knowledge of genomic variation and its relationship with traits of interest.

### Cattle

Africa is home to about 150 cattle breeds distributed across the continent, with the exception of the Sahara and the river Congo basin (Mwai et al., 2015), the majority of which are uncharacterized (Nyamushamba et al., 2017). Various categories of cattle are present on the continent, including zebu or Bos indicus breeds (African humped cattle), taurine or Bos taurus breeds (African humpless cattle), hybrids between humpless and humped cattle (e.g., sanga), and sanga and zebu backcrosses (e.g., zenga). The highest population of cattle and products from cattle are found in the East African region (**Table 2** and **Figure 1**). Clear genetic divergence was revealed between B. taurus cattle and zebu breeds of West/Central Africa (Ibeagha-Awemu et al., 2004), and between South African indigenous and locally developed cattle breeds (Makina et al., 2014). However, the breed status of African cattle populations is in danger of disappearing rapidly following uncontrolled crossbreeding and breed replacements with exotic breeds (Ibeagha-Awemu et al., 2004; Mwai et al., 2015; Traoré et al., 2017).

Using microsatellite markers, candidate gene and genome wide approaches, genomic variation in some African cattle populations has been assessed and in some cases associated with production traits. Using 28 autosomal markers, Ibeagha-Awemu et al. (2004) revealed that zebu breeds in Cameroon and Nigeria are highly diverse as well as closely related. Whole genome SNP panels indicated close relationships between South African indigenous and locally developed cattle breeds (Makina et al., 2014) as well as between pure and crossbred cattle in Burundi (Manirakiza et al., 2017). Genome characterization by sequencing of five indigenous African cattle breeds representative of the cattle diversity of the continent [namely N'Dama (West African taurine), Ankole (African sanga cattle), Boran (East African zebu), Kenana (East African zebu), and Ogaden (East African zebu)] revealed a high number of SNPs in the breeds as well as breed specific SNPs (Kim et al., 2017). On a genome-wide window scale of 10 Mb, all indigenous African breeds had higher levels of nucleotide diversity compared to commercial European breeds (Angus, Jersey, and Holstein), which have been subjected to intensive artificial selection over generations (Kim et al., 2017).
Genome wide characterization with the Illumina BovineHD or BovineSNP50 Genotyping BeadChip of cattle breeds from East Africa, North Africa, South Africa, and West Africa revealed: positional candidate regions of positive selection encompassing genes and quantitative trait loci (QTL) for milk traits, reproduction and environmental stress (immunity and heat stress); candidate genes associated with biological pathways important for adaptation to marginal environments, such as immunity, reproduction, development, and heat tolerance; copy number variations enriched for a number of biological processes, molecular functions and cellular components; and the potential to improve some of the breeds for dairy traits through breeding (Bahbahani et al., 2015, 2017, 2018; Pierce et al., 2018). Moreover, footprints of adaptive selection at the whole genome level (genotyping with 36,320 SNPs) were identified in nine West African cattle populations, including 53 genomic regions and 42 candidate genes enriched in physiological functions such as immune response, nervous system, and skin and hair properties (Gautier et al., 2009). From these data, high levels of genetic diversity are evident within African cattle populations, which have been attributed to domestication, a long history of migrations, selection and adaptation (Luikart et al., 2001; Groeneveld et al., 2010; Kim et al., 2017). Due to exposure to strong environmental pressures (hot, dry, or humid tropical climate conditions), diverse disease and nutritional challenges and water shortages, African livestock populations display unique adaptive traits (**Table 4**) which are necessary to support productivity and survivability in the different ecological zones.
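The windowed nucleotide diversity comparison mentioned above can be sketched as follows. This is a simplified illustration, not the pipeline used by Kim et al. (2017): it assumes biallelic SNPs with known alternate allele frequencies on a single chromosome, applies the usual n/(n-1) small-sample correction to 2p(1-p) per site, and divides by the full window length, which implicitly treats every base in the window as callable. The SNP positions and frequencies are hypothetical.

```python
from collections import defaultdict

WINDOW_BP = 10_000_000  # 10 Mb windows, as in the comparison described above


def site_diversity(alt_freq, n_chromosomes):
    """Per-site nucleotide diversity for a biallelic SNP, sample-size corrected."""
    correction = n_chromosomes / (n_chromosomes - 1)
    return correction * 2.0 * alt_freq * (1.0 - alt_freq)


def windowed_diversity(snps, n_chromosomes, window_bp=WINDOW_BP):
    """Return {window_index: diversity per bp} for (position_bp, alt_freq) pairs."""
    totals = defaultdict(float)
    for position_bp, alt_freq in snps:
        totals[position_bp // window_bp] += site_diversity(alt_freq, n_chromosomes)
    return {window: total / window_bp for window, total in sorted(totals.items())}


# Hypothetical SNPs on one chromosome, sampled from 10 diploid animals.
example_snps = [(1_000, 0.5), (5_000_000, 0.5), (12_000_000, 0.1)]
diversity = windowed_diversity(example_snps, n_chromosomes=20)
for window, value in diversity.items():
    start_mb = window * WINDOW_BP // 1_000_000
    print(f"window starting at {start_mb} Mb: pi = {value:.3e} per bp")
```

Comparing such per-window values between breeds is one way a statement like "higher nucleotide diversity in indigenous breeds" can be made quantitative.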

### Goat

The domestic goat, Capra hircus, is an important livestock species that is well suited to small-holder production systems throughout the entire African continent. Unique to West Africa is a great genetic diversity of goat types: the long-legged and trypanosusceptible types (e.g., Sahel and Red Sokoto goats) found in tsetse free areas and the trypano-tolerant type (West African Dwarf goat) found in the humid zone. According to Missohou et al. (2011), different ecotypes have emerged under varying selection pressures and diversified climate and topography in different countries. The largest goat populations are found in the Eastern and Western African regions (**Table 2**). Genetic diversity studies on African goats are generally limited compared to those on other continents (Groeneveld et al., 2010). Microsatellite studies revealed a substantial amount of within breed diversity based on the mean number of alleles observed (Muema et al., 2009; Missohou et al., 2011; Traore et al., 2012; Murital et al., 2015). Using genome-wide SNP data, Mdladla et al. (2016) reported a high level of genetic diversity in South African indigenous goats, including three locally developed meat type breeds (Boer, Savanna and Kalahari Red), a feral breed (Tankwa) and unimproved non-descript village ecotypes. Some African goats have been characterized for polymorphisms in genes that control economically important traits (milk traits and litter size) (Bemji et al., 2006; Missohou et al., 2006; Caroli et al., 2007; Isa et al., 2017; Bemji et al., 2018), pointing to their potential application for genetic improvement of these traits.

### TABLE 4 | Adaptive characteristics of some African livestock breeds.

### Sheep

Diverse populations of sheep are found on the African continent, with about 170 breeds of domestic sheep found in sub-Saharan Africa (Kemp et al., 2007). The present-day African sheep population is about 352 million (FAOSTATS, 2018), of which ∼62% are found in Northern and Western Africa (**Table 2**). Investigations from different African countries based on microsatellite markers (Gaouar et al., 2015), mitochondrial DNA (Agaviezor et al., 2012; Brahi et al., 2015) and a genome-wide SNP chip (Edea et al., 2017) revealed higher within-breed than between-breed genetic diversity, with clear evidence of admixture between breeds of sheep. The latter authors further observed that North African sheep breeds showed higher levels of within-breed diversity but were less differentiated than breeds from Eastern and Southern Africa, confirming previous reports that sheep from South Africa showed low to moderate genetic diversity (Qwabe et al., 2013). The initially domesticated sheep breeds in West Africa have also been genetically mixed with European breeds (Brahi et al., 2015). Using the OvineSNP50 beadchip, Molotsi et al. (2017) reported that the smallholder Dorper sheep was introgressed with Namaqua Afrikaner, South African Mutton Merino and White Dorper genes. They further reported that the smallholder Dorper population was more genetically diverse than the pure-bred Dorper, South African Mutton Merino and Namaqua Afrikaner. Sheriff and Alemayehu (2018) reported low observed and expected heterozygosity in Ethiopian, Kenyan, South African and Nigerian sheep populations. They suggested that the low heterozygosity may be due to the effect of small population sizes, inbreeding and minimal or no immigration of new genetic material into the closed populations. These data suggest close relationships and high levels of genetic admixture between African sheep breeds, especially among populations in the same geographic area.
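Observed and expected heterozygosity, the diversity measures referred to above, can be computed per locus as in the following minimal sketch. It assumes diploid genotypes given as allele pairs, uses the uncorrected expected heterozygosity (1 minus the sum of squared allele frequencies), and derives an inbreeding-type coefficient as 1 - Ho/He. The genotypes are hypothetical and are not data from any cited study.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Proportion of individuals carrying two different alleles at the locus."""
    heterozygotes = sum(1 for first, second in genotypes if first != second)
    return heterozygotes / len(genotypes)


def expected_heterozygosity(genotypes):
    """Hardy-Weinberg expectation: 1 - sum of squared allele frequencies."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def inbreeding_coefficient(genotypes):
    """F = 1 - Ho/He; positive values indicate a deficit of heterozygotes."""
    expected = expected_heterozygosity(genotypes)
    if expected == 0:
        raise ValueError("locus is monomorphic; F is undefined")
    return 1.0 - observed_heterozygosity(genotypes) / expected


# Hypothetical SNP genotypes for 10 sheep from a small, closed flock.
flock = [("A", "A")] * 6 + [("A", "G")] * 2 + [("G", "G")] * 2

print(f"Ho = {observed_heterozygosity(flock):.2f}")
print(f"He = {expected_heterozygosity(flock):.2f}")
print(f"F  = {inbreeding_coefficient(flock):.2f}")
```

In real studies these statistics are averaged over many loci, and a sample-size-corrected estimator of He is often preferred for small samples.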

### Chicken

The domestic chicken, with an estimated population of more than 1.9 billion in 2016 (FAOSTATS, 2018), is the most common and widespread domestic animal species, kept mainly for food (meat and eggs) by resource poor farmers in Africa. Large-scale analyses involving microsatellite loci in domestic chickens, commercial lines and chickens sampled from the European region revealed high mean numbers of alleles and a high degree of heterozygosity in Asian and African chickens as well as in Red Jungle fowl (Lyimo et al., 2014). A lower degree of population stratification as well as high within-breed genetic diversity in African chickens are supported by analyses with microsatellite markers (Muchadeyi et al., 2007; Adebambo et al., 2010; Mtileni et al., 2011), mtDNA (Wani et al., 2014; Hassaballah et al., 2015; Eltanany and Hemeda, 2016) and genome-wide SNP chips (Khanyile et al., 2015a,b; Fleming et al., 2016, 2017). Reduced genetic diversity was, however, observed in conservation flocks in South Africa, which represented a limited sample of the gene pool (Muchadeyi et al., 2007; Mtileni et al., 2016). Increasing expansion of the commercial chicken industry and intermixing of commercial hybrids with local strains in rural backyards are eroding the genetic uniqueness of native breeds and their potential to adapt to local conditions (reviewed by Eltanany et al., 2011). Lawal et al. (2018) used whole-genome resequencing data of Red Jungle fowl and Indigenous Village Chicken populations from Ethiopia, Saudi Arabia, and Sri Lanka to identify regions of the genome with functions relating to adaptation to temperature gradient, reproduction and immunity. All these results indicate the presence of genetic variation that can be utilized in genomic breeding.

### Pig

The local African Pig is small in size and is likely the same breed in all African countries known under various names (African Union Inter-African Bureau for Animal Resources [AU-IBAR], 2015a), such as: Kolbroek (South Africa), Somo (Mali), Bakosi (Gabon and Cameroon), West African Dwarf pig (Nigeria), Ashanti Dwarf pig (Ghana), Bush pig (Togo), Mukota pig, or Zimbabwe Mukota pig (Zimbabwe). Despite cultural and religious influences in parts of the continent that limit pork production and consumption, pig farming is generally growing across West, East, Central and Southern Africa (African Union Inter-African Bureau for Animal Resources [AU-IBAR], 2015a) with the highest populations in Eastern and Western Africa (**Table 2**).

Findings based on joint analysis of mitochondrial, microsatellite and Y-chromosome polymorphisms in pigs and wild boars with a worldwide distribution revealed remarkably weak genetic differentiation between pigs and wild boars (Ramirez et al., 2009). This was attributed to sustained gene flow between both populations. More recent findings on pig populations indigenous to southern Africa based on different microsatellite loci similarly revealed a lack of substructure in the pig populations, corroborating the general similarity in phenotypes commonly reported (Halimani et al., 2012). Sampled pigs in Ghana represented distinct populations with a moderate amount (12%) of genetic differentiation (Ayizanga et al., 2016). A study on the estimation of genetic parameters for growth performance and carcass traits in Mukota pigs in South Africa reported the presence of sufficient genetic variation that can support genetic improvement for many growth and carcass traits in the breed (Chimonyo and Dzama, 2007). Using a porcine genome wide SNP chip, Mujibi et al. (2018) observed a significant introgression of genes from international commercial breeds into Busia pigs from Busia county in Kenya. The authors also reported that pigs from Homabay county in Kenya are distinct from the international breeds and thus represent a local indigenous gene resource.

# CONSIDERATIONS AND STRATEGIES FOR IMPLEMENTATION OF ORGANIZED GENOMIC BREEDING IN AFRICA

For successful implementation of structured genomic breeding programs for African livestock populations, several factors deserve consideration as well as collective action and cooperation by all stakeholders (farmers, governments, research professionals, research organizations, universities, breed societies, private businesses, and support organizations) working together to achieve a common goal as illustrated in **Figure 2**.

# Availability of Genetic Material for Breeding

As discussed above, the African continent is home to diverse livestock populations which display rich genomic variation within and between breeds and have acquired special adaptive characteristics that support adaptation to poor quality feed, limited water supply, hot environments and disease (Psifidi et al., 2016; Lawal et al., 2018; Mrode et al., 2018). Some of these characteristics are summarized in **Table 4**. The indigenous breeds have acquired important characteristics for survival in their environments and, in addition to being developed systematically, should be conserved for future survival and exploitation. Therefore, conservation of local breeds (highly utilized and less utilized) must be part of national breeding plans and should not be an exercise undertaken by individual farmers. Systematic breed development strategies, including selection within breeds, controlled crossbreeding and upgrading programs, and development of new breeds to exploit special adaptive and/or production traits, must be carried out under organized systems with specific goals. This will address the practice of indiscriminate crossbreeding between local breeds, and between imported and local breeds, which is eroding the continent's animal genetic diversity, a much needed resource for present and future exploitation.

# Understand Production Systems, Production Potentials of Livestock and Needs of Farmers

There is still a lack of understanding about situating livestock development programs within prevailing low-input production systems, societal preferences and environmental conditions. Most international development programs have been based on 'top-down approaches,' with a single-commodity and technology-focused orientation, little or no participation of farmers or formation of strong farmer-based institutions (Kaasschieter et al., 1992), and no regard for prevailing environmental conditions. The majority of livestock are produced under the smallholder system composed of pastoralists, agro-pastoralists and smallholder farm families, which have different breed preferences, inputs and challenges. Therefore, the prevailing production systems and livestock production potential under the different systems must be characterized and their specific needs identified in relation to local preferences and market needs, with the participation of producers. Data on productivity of local breeds under their prevailing conditions of production are largely unavailable. Such data are necessary as they will form the basis for improvement plans for each system. Indigenous African breeds are generally considered underproductive without giving thought to the low-input and harsh environments in which they are raised. For example, under farmer management, the Butana and Kenana zebu breeds of Sudan produce on average 538.26 and 598.73 kg of milk per lactation, respectively, while under research station conditions they produce 1,400–2,100 kg of milk per lactation (Musa et al., 2005, 2006; Yousif and El-Moula, 2006). This implies that, although Butana and Kenana seem to produce less under low-input systems, they have the ability to produce more given improved conditions of nutrition, health care and production management. Another factor is the bias in judging local breeds based on parameters that have been selected and developed in exotic breeds. An example is the focus on lean specialized pig breeds like Large White at the expense of local, relatively fat breeds that are well adapted to local environmental conditions. As such, policies are put in place that disfavor production of local pigs and encourage their replacement with exotic breeds (Chimonyo and Dzama, 2007).

Furthermore, emphasis has been placed on realizing quick gains through 'shortcut approaches' such as the introduction of exotic genes that were developed over a long period of time in different environments. Exotic stock is subject to genotype by environment interaction, which dramatically lowers its performance in the local environment into which it is introduced. African livestock have acquired the characteristics necessary to produce under their prevailing environmental conditions and on minimal resources. Before exotic genes are introduced in a controlled manner, the productivity of local breeds must first be assessed under optimal conditions (e.g., adequate feeding, housing, and disease management). The main factors limiting local breed productivity could simply be management, limited feed resources and inadequate disease control measures, as exemplified by Butana and Kenana cattle (Musa et al., 2005, 2006; Yousif and El-Moula, 2006). Under optimal management conditions, local breeds could be selected for desired traits in their prevailing environmental conditions. The adoption of most 'shortcut approaches' utilizing exotic genes has generally not resulted in substantial gains or sustainable long-term increases in productivity, nor has it contributed to poverty alleviation. Most donors and policy makers are interested only in immediate, visible short-term gains, which results in reckless crossbreeding of indigenous cattle resources with exotic breeds or their complete replacement by exotic breeds. These 'short-term gains' are usually lost when such programs end, and the offspring of crossbred animals usually underperform under the prevailing conditions or lose the adaptive productive ability of the local breeds.

# Breeding Goals

Setting a clear breeding goal is a prerequisite for planning animal improvement and implementing genomic selection. A definition of breeding goal planning as a procedure that incorporates ethical priorities and weighs market and non-market values has been suggested (Olesen et al., 2000). The decision to develop an animal for specific products has primarily been driven by the common interest of the farmer or society, or by market demand. For example, increased market demand for milk and its products drove dairy breeding objectives in Western countries toward increased milk yield, which unfortunately has resulted in fertility problems and a large environmental footprint (Gill et al., 2010; Wall et al., 2010; Hayes et al., 2013; Knapp et al., 2014). Hence, sustainable animal breeding goals should consider traits of both market and non-market value, farmer-specific needs, and social, ecological and environmental needs, in both the short and the long term, supported by appropriate government policies, education and greater cooperation between stakeholders. The breeding goals must be adapted to fit each production system and environment. In recent times, breeding goals that consider new phenotypes, non-traditional or production-oriented traits, and genetic traits relevant to breeding sustainability have been proposed for cattle, sheep and pigs (Merks et al., 2012; Banga et al., 2014; Miglior et al., 2017; Molotsi et al., 2017). Recently, a survey of 160 farmers in southern Mali identified draft power and savings as the most important production objectives, while the preferred traits were fertility, draft ability and milk yield, in that order (Traoré et al., 2017).

# Feed Resources and Animal Health

Optimal animal productivity depends on adequate nutrition and disease management. Options for quantitative and qualitative improvement of feed resources, matched to the needs of the different production systems, are required for sustainable livestock systems (Duncan et al., 2013; Thornton and Herrero, 2015). For example, under the pastoral system, communal access to rotationally grazed pastures and fodder banks, maintained to ensure the quality of feed resources, will support sustained livestock production. Legislation that institutes the development of watersheds and restricts indiscriminate burning of grazing land and its use for other purposes is necessary. Other vital aspects include the development of improved pastures and fodders, increased grain production, the development of agricultural by-products as feed resources and access to water resources. Although some local breeds have adapted to the disease burdens of their environments, disease is still a major factor limiting livestock productivity in the region (De Garine-Wichatitsky et al., 2013; Okuni, 2013; Vanderburg et al., 2014). Particular attention should be paid to disease control measures, such as access to drugs, vaccines and veterinarians, and to sound management practices developed for each system (Maclachlan and Mayo, 2013; Miguel et al., 2013).

# Data Acquisition

Precise phenotypic data are crucial for genetic improvement. In Western countries, systems have been put in place to support high-throughput phenotyping (e.g., milk yield, milk component yields, feed intake, etc.), enabling the accurate and consistent collection of large amounts of data on animal productivity. The formation of livestock trade databases is also worthy of consideration, since livestock movement contributes to the spread of animal and zoonotic diseases. In Western countries, such databases allow researchers to describe mobility patterns, optimize disease surveillance and control, and predict possible epidemic scenarios (Apolloni et al., 2018). It is therefore necessary to sensitize producers to the importance of collecting data and keeping records on the productivity of their animals, and to establish data storage facilities that enable data sharing within and between countries.

# Infrastructure and Environmental Considerations

Beyond the infrastructure needed for general economic development, infrastructure that promotes livestock production within the continent must be considered, such as basic equipment for sample storage, data collection and data tracing, livestock markets, slaughter facilities, animal housing and pasture development. With advances in genomics and other omics technologies, the livestock sector in several countries has moved into big data research and application (VanderWaal et al., 2017; Morota et al., 2018; White et al., 2018). African livestock infrastructure must be developed to optimize the use of big data. Moreover, farmers need to be sensitized and prepared for the adoption of these technologies. Technologies need to be adapted to farmers' specific needs according to the production system, since farmers differ in their access to farm resources, technological inputs and output markets (Birhanu et al., 2017; Feder and Savastano, 2017). For instance, the adoption of dairy technologies varies widely among smallholders (Staal et al., 2002; Abdulai and Huffman, 2005; Amlaku et al., 2012) and is also strongly affected by their social networks (Amlaku et al., 2012). The environmental impact of livestock production is now a major concern the world over because of its contribution to greenhouse gas emissions and, consequently, climate change. The livestock sector in Africa also poses a challenge to the environment and climate, depending on the management and farming system. For instance, the semi-arid region faces overgrazing of rangelands caused by population pressure and a decline in traditional management systems (Fratkin, 2001; Ngongoni et al., 2006).

# Development of National and Regional Policies and Priorities That Support Effective Production and Utilization of Livestock

The success of sustainable livestock development in any country or region hinges on the development of national and/or regional policies or guiding principles. The decision by an international support organization or a farmer group to import specific germplasm for crossbreeding with local breeds must be backed by national policies and priorities. Supply and demand policies favoring local production and supply chains will stimulate local production. Recognizing the important contribution of livestock production to the livelihood of farm families and to the nutrition and economy of the state or country necessitates a political commitment to stimulate, develop and financially sustain livestock development. The African Union has in place a Livestock Development Strategy (LiDeSa) for Africa (2015–2035), which was developed through an inclusive consultation process involving experts and stakeholders at national, regional, and continental levels (African Union Inter-African Bureau for Animal Resources AU-IBAR [AU-IBAR], 2015b). The strategy recognizes the central role played by livestock in sustaining livelihoods in rural Africa and, with the support of a grant from the Bill and Melinda Gates Foundation, seeks to transform the livestock sector by invigorating its untapped potential. This is a laudable process that, if implemented, could truly transform lives. However, country-level initiatives must follow suit for a transformed livestock sector to emerge on the continent. Today, the South African government is the only African government that plays an active role in the conservation of animal genetic resources (Nyamushamba et al., 2017) and in supporting livestock breeding programs (Van Marle-Köster et al., 2017). Moreover, it is important to take farmers' preferences into account in the development of breeding policies (Wale and Yalew, 2007).

The development of national policies and regional priorities should also address climate change mitigation in the livestock sector, because the impact of climate change varies with location. A program called climate-smart agriculture (CSA) has recently been implemented in the West African region and in sub-Saharan Africa in general (Amole and Ayantunde, 2016). CSA is an approach that provides a conceptual basis for assessing how effectively changes in agricultural practice support food security under changing climatic conditions (Amole and Ayantunde, 2016).

# Creation of Markets and Facilitation of Access to Markets

Appropriate economic incentives are important for livestock genetic improvement. Breeding programs should be market-oriented and governments should provide the right incentives. Several countries have made efforts to extend market access and to encourage foreign trade. Examples include the Ethiopian government's strategy to encourage foreign trade in sheep and goat products, which has created employment opportunities for its citizens (Nwogwugwu et al., 2018), and the emergence of a livestock feed market in Ghana (Konlan et al., 2018).

# Education and Training, and Information Sharing

Education, training and information sharing are vital aspects of sustainable livestock improvement. The training curriculum in higher institutions should be adapted to fully address the needs of the various production systems. Formal training of students and informal training of producers are both vital. Greater cooperation between universities, research organizations (international, national, and regional), producer groups, non-governmental organizations and governments will ensure the flow and sharing of information and knowledge (**Figure 2**). The International Livestock Research Institute (ILRI) in Kenya has trained, and continues to train, students, research professionals and farm groups in various aspects of livestock breeding and production and in molecular biology/genomics techniques. ILRI's work, carried out in consultation with farm families and tailored to their needs, has resulted in initiatives such as the Dairy Genetics East Africa project (DGEA) and African Dairy Genetic Gains (ADGG)<sup>3</sup>. These programs were designed to increase farmer productivity and profitability through the use of crossbred animal types, supported by extension and training systems tailored to farmers' needs. The influence of ILRI, together with other successes, led to a rapid increase in cow milk production between 2011 and 2012 in the East African region (**Figure 1**). A national milk recording scheme has been instituted and is supported by the government of Kenya<sup>4</sup>. The challenges faced by the program include a limited number of breed inspectors, a lack of awareness among many farmers of the importance of livestock registration, delays in the issuance of livestock certificates and poor record keeping by farmers. Suggested solutions include training of more livestock inspectors by breed societies in conjunction with government; raising farmer awareness through sensitization campaigns in the mass media, exhibitions, shows, field days and direct consultations with interested farm groups; decentralization of services; and investment in manpower and infrastructure.
National animal production research institutes and universities in the various countries can emulate some of the practices of ILRI, given that farmers' participation in the development of projects tailored to their needs is vital to the success of such programs.

Regional and continent-wide sharing of information is vital for the sustainability of the livestock sector. The Forum for Agricultural Research in Africa<sup>5</sup>,<sup>6</sup>, a technical arm of the African Union, coordinates and advocates for agricultural research-for-development on the continent. Regular meetings of stakeholders (professionals, farmers, students, and industry) interested in animal production, in forums such as the All African Conference on Animal Agriculture and country and regional conferences on animal production, all play vital roles in the flow of information and technological advances. However, producer-focused meetings that provide informal training to farmers are generally lacking.

# APPLICATION OF MODERN GENOMIC BREEDING TECHNOLOGIES IN AFRICAN LIVESTOCK

Rapid improvement of African livestock productivity can benefit from modern breeding technologies, but many limitations remain. Some breeding programs that have been implemented for genetic improvement of livestock in Africa, and the challenges they faced, are summarized in **Table 5**.

Livestock on the African continent are highly adapted to the prevailing environmental conditions, which are characterized by a heavy disease burden and marginal feed resources, but their productivity is marginal because they are still largely unselected. African countries can benefit from genomic selection because it can be carried out without the pedigree information that is essential to traditional best linear unbiased prediction (BLUP) of estimated breeding values (EBV), and because selection candidates do not necessarily need trait records of their own. The potential to generate genomic estimated breeding values (GEBV) from molecular information makes genomic selection a very attractive alternative for improving livestock in developing countries, where adequate phenotypes and pedigree records are lacking. Genomic breeding has been reported to be more accurate than traditional BLUP because genomic relationships are more accurate than pedigree relationships (Meuwissen et al., 2016). Moreover, an understanding of the fundamental genetic mechanisms influencing traits can be useful for setting priors for (genetic) variances to increase the accuracy of genomic selection. Several successful approaches have been introduced, such as BLUP|GA (BLUP-given genetic architecture; Zhang et al., 2014); BayesRC (an adaptation of BayesR), which incorporates prior biological information by defining classes of variants likely to be enriched for causal mutations (MacLeod et al., 2016); and single-step GBLUP with prior information (Fragomeni et al., 2017). These methods can be particularly useful for genomic selection in Africa when some prior biological knowledge of traits is available from studies in the target populations and in other populations. Using genomic selection, Pitchford et al. (2017) concluded that heterozygosity effects were substantial for reproduction and growth in a tropically adapted composite beef program.
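To illustrate how relationships between animals can be estimated from markers alone when pedigree records are missing, the following minimal Python sketch builds a genomic relationship matrix from SNP genotypes coded as 0/1/2 allele counts. It assumes the widely used VanRaden (2008) formulation, G = ZZ'/(2Σp(1−p)), which is not described in the text above; the function name and the toy genotypes are hypothetical and serve only as an illustration.

```python
from typing import List


def genomic_relationship_matrix(genotypes: List[List[int]]) -> List[List[float]]:
    """Build a genomic relationship matrix (G) from 0/1/2 allele counts.

    Assumption: VanRaden (2008) method 1, G = Z Z' / (2 * sum(p * (1 - p))),
    where Z holds genotypes centred on twice the sample allele frequency.
    Rows are animals, columns are SNPs.
    """
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    # Frequency of the counted allele at each SNP, estimated from this sample.
    freqs = [
        sum(animal[j] for animal in genotypes) / (2 * n_animals)
        for j in range(n_snps)
    ]
    scale = 2 * sum(p * (1 - p) for p in freqs)
    if scale == 0:
        raise ValueError("All SNPs are monomorphic; G is undefined.")
    # Centre each genotype on its expected value, 2p.
    z = [[animal[j] - 2 * freqs[j] for j in range(n_snps)] for animal in genotypes]
    return [
        [sum(z[a][j] * z[b][j] for j in range(n_snps)) / scale for b in range(n_animals)]
        for a in range(n_animals)
    ]


# Hypothetical genotypes: 4 animals x 6 SNPs (illustration only).
toy_genotypes = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 1, 2],
    [2, 1, 0, 1, 2, 0],
    [1, 2, 1, 0, 1, 1],
]

G = genomic_relationship_matrix(toy_genotypes)
for row in G:
    print(["%.2f" % value for value in row])
```

In this toy example the first two animals differ at a single SNP and therefore receive a higher genomic relationship with each other than either has with the third animal. In a genomic BLUP analysis, a matrix of this kind would take the place of the pedigree-based relationship matrix.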

Enthusiasm for the potential application of genomic selection in African countries is immediately dampened by the reality that animals are kept in small populations spread across many smallholder units. Furthermore, male animals, which drive genetic gain, are often sold to generate income for farm families. These constraints can be overcome by forming and practicing communal management and breeding systems.

The lack of phenotypes recorded in accurately defined contemporary groups is one of the constraints on the implementation of genomic selection in Africa and many developing countries (Burrow et al., 2017). Acquiring genomic information for genomic selection is also limited because genotyping remains expensive relative to incomes in many developing countries, which are very low compared with those in developed countries. The few studies on genomic selection in developing countries are characterized by small population sizes, and validations were undertaken with test-day data sets (Neves et al., 2014; Brown et al., 2016; Kariuki et al., 2017; Ducrocq et al., 2018; Mrode et al., 2019).

<sup>3</sup>www.ilri.org

<sup>4</sup>http://www.nafis.go.ke/livestock/livestock-registration/milk-recording/

<sup>5</sup>https://faraafrica.org/

<sup>6</sup>https://dgroups.org/fara-net

### TABLE 5 | Sample breeding programs for genetic improvement of livestock in Africa.

Traditional animal breeding requires pedigree records to support selection decisions, but most smallholder farms in Africa do not keep such records and measures of relationships between animals are merely speculative. Furthermore, the application of genomic selection requires reference populations, which are generally lacking in Africa and many developing countries (Burrow et al., 2017). Mrode et al. (2019) reported small reference populations of between 500 and 3,000 animals (composed mostly of cows) for dairy and beef cattle in developing countries. The use of small reference populations that combine bull and cow data, as is the case in Africa, has implications for the accuracy of genomic prediction, which is lower than that obtained in Western countries because cow records provide limited information on the response variables. It is important to note, however, that including cows in the reference population has resulted in up to a fivefold increase in the size of the reference population in some cases, and in increases of up to 12% in the accuracy of selection compared with using bulls alone (Boison et al., 2017; Mrode et al., 2019). Mrode et al. (2018) reported some success in genomic prediction accuracy by modeling and pooling limited dairy data in East Africa. Brown et al. (2016) specifically reported the practice of genomic selection in a crossbred cattle population using data from the Dairy Genetics East Africa (DGEA) project.
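The link between reference population size and prediction accuracy can be made concrete with a rough calculation. The sketch below assumes a commonly used deterministic approximation, often attributed to Daetwyler and colleagues, r = sqrt(N h² / (N h² + Me)), which is not part of the studies cited above. The heritability (0.3) and the number of independent chromosome segments (Me = 1,000) are hypothetical values chosen only for illustration; the reference sizes of 500 and 3,000 are the bounds of the range reported by Mrode et al. (2019).

```python
import math


def expected_accuracy(n_reference: int, heritability: float, n_segments: float) -> float:
    """Approximate accuracy of genomic prediction.

    Assumption: r = sqrt(N * h2 / (N * h2 + Me)), where N is the size of the
    reference population, h2 the heritability of the recorded phenotype and
    Me the number of independent chromosome segments.
    """
    if n_reference < 0 or not 0 < heritability <= 1 or n_segments <= 0:
        raise ValueError("Inputs out of range.")
    signal = n_reference * heritability
    return math.sqrt(signal / (signal + n_segments))


ASSUMED_H2 = 0.3     # hypothetical heritability, for illustration only
ASSUMED_ME = 1000.0  # hypothetical number of independent segments

for n in (500, 3000):
    r = expected_accuracy(n, ASSUMED_H2, ASSUMED_ME)
    print(f"reference size {n}: expected accuracy ~ {r:.2f}")
```

Under these assumed values the larger reference population gives the higher expected accuracy, which is consistent with the benefit of adding cows to bull-only reference populations. The absolute numbers should not be read as estimates for any real population.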

The cost of genotyping is a major factor limiting the adoption of genomic selection in Africa. To overcome this problem, the use of low-density SNP panels has been suggested, followed by imputation to improve the accuracy of genomic predictions (Meuwissen et al., 2016; Boison et al., 2017). Furthermore, low-cost genome-wide genotyping solutions such as genotyping-by-sequencing can generate large numbers of population-specific SNPs (De Donato et al., 2013; Ibeagha-Awemu et al., 2016; Gurgul et al., 2018) that can support genomic selection in African livestock populations. The commercial SNP panels from Illumina<sup>7</sup> and Affymetrix<sup>8</sup> contain SNPs discovered in breeds and populations of Western origin, and only very few breeds of African origin were included in SNP discovery. This leads to ascertainment bias, which may affect the accuracy of genomic selection when commercially available SNP panels are used to genotype African indigenous livestock. Thus, the development of genotyping solutions specific to African breeds is necessary, and the genotyping-by-sequencing approach can play a major role.
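One simple way to gauge how well a commercial panel suits a local population is to check how many of its SNPs actually segregate in that population. The sketch below computes minor allele frequencies from genotypes coded as 0/1/2 and reports the share of SNPs at or above a threshold. The 0.05 threshold, the function names and the toy data are hypothetical, and this check is only a rough proxy for ascertainment bias, not a method taken from the studies cited above.

```python
from typing import List


def minor_allele_frequencies(genotypes: List[List[int]]) -> List[float]:
    """Return the minor allele frequency of each SNP (rows: animals, columns: SNPs)."""
    n_animals = len(genotypes)
    n_snps = len(genotypes[0])
    mafs = []
    for j in range(n_snps):
        p = sum(animal[j] for animal in genotypes) / (2 * n_animals)
        mafs.append(min(p, 1 - p))
    return mafs


def informative_fraction(genotypes: List[List[int]], maf_threshold: float = 0.05) -> float:
    """Share of panel SNPs whose minor allele frequency reaches the threshold."""
    mafs = minor_allele_frequencies(genotypes)
    return sum(1 for maf in mafs if maf >= maf_threshold) / len(mafs)


# Hypothetical panel genotypes for 4 local animals at 5 SNPs.
# The first two SNPs are fixed in this sample, as can happen when a panel
# was designed from other breeds.
panel_genotypes = [
    [0, 2, 1, 0, 2],
    [0, 2, 0, 0, 1],
    [0, 2, 1, 0, 2],
    [0, 2, 2, 1, 2],
]

print("MAF per SNP:", minor_allele_frequencies(panel_genotypes))
print("Informative fraction:", informative_fraction(panel_genotypes))
```

A low informative fraction in a local breed would suggest that much of the panel carries little information for that population, which is one motivation for population-specific SNP discovery.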

Notable developments in the use of genomic tools include the sequencing of some indigenous African cattle (Kim et al., 2017) and progress in genomic selection for disease resistance (Hanotte et al., 2010; Mwai et al., 2015) and for adaptation to hot arid conditions (Kim et al., 2016). Other important efforts that may improve data quality include the Infectious Diseases of East African Livestock (IDEAL) project, a longitudinal calf cohort study in western Kenya (de Clare Bronsvoort et al., 2013), and strategies for bridging the gap between the developed and developing livestock sectors (Van Marle-Koster and Visser, 2018). Recently, Canovas et al. (2017) discussed the application of new genomic technologies, including transcriptomics, metagenomics, metabolomics, and epigenomics, that can speed up the genetic improvement of cattle. As a matter of priority, Burrow et al. (2017) suggested that research to improve grazing livestock should include cross-country genetic/genomic evaluations, the use of sequence data in genetic evaluations, multi-breed genomic evaluations, selection indexes and genotype × environment interactions. Furthermore, numerous studies in Nellore, an indicine beef cattle breed, suggest that genomic selection is a realistic alternative to traditional selection strategies (Neves et al., 2014). In small ruminants such as sheep and goats, Mrode et al. (2018) observed that innovative genetic selection strategies will be needed to ensure a balance between production and adaptation.

Emerging gene editing technologies such as transcription activator-like effector nucleases (TALEN), zinc finger nucleases (ZFN), and clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9 can make targeted changes in the genome, including introducing alleles of interest into a recipient genome and switching genes of interest on or off, and can therefore play vital roles in the rapid genomic improvement of African livestock traits. These tools offer an opportunity to increase the frequency of desired alleles in a population through gene-edited individuals more rapidly than conventional breeding (Bhat et al., 2017). Genome editing in livestock has been reported for the double muscling gene in cattle, sheep, and pigs (Proudfoot et al., 2015; Qian et al., 2015); introduction of the polled allele in dairy cattle (Tan et al., 2013; Carlson et al., 2016); gene edits that confer resistance to African swine fever virus in pigs (Lillico et al., 2013; Whitworth et al., 2016); and the low-density lipoprotein receptor gene in a pig model of atherosclerosis (Carlson et al., 2012). These examples indicate that attempts at gene editing in livestock have targeted traits controlled by few variants with major effects. However, most livestock traits of economic importance are quantitative, controlled by many genes each contributing a small effect, which suggests potential pitfalls in applying these technologies to such traits. Nevertheless, a simulation study indicated that editing a small number of causal variants of polygenic traits can double the rate of both short-term and long-term genetic gain compared with conventional genomic selection (Jenko et al., 2015).

# CHALLENGES AND WAY FORWARD

As mentioned above, most countries in Africa lack functional breeding programs because farmers or producers and other stakeholders are not involved or engaged. It is therefore important to have long-term plans for breeding programs that can meet present and anticipated future market needs (Zonabend et al., 2013). The major constraints on the implementation of genomic breeding approaches for African livestock populations, and the way forward, are discussed in the section "Considerations and Strategies for Implementation of Organized Genomic Breeding in Africa" and summarized in **Table 6**; the major roles of each stakeholder are summarized in **Figure 2**.

For farmers/producers to play central roles in the success of any breeding program, they need support from different organizations such as (i) government (to put in place enabling policies, infrastructure, funding, incentives, and markets for their products), (ii) universities and research institutions (to guide, develop up-to-date curricula, train, and provide the information needed to set up and implement breeding programs), (iii) international organizations (funding and technological support), and (iv) breed societies (to maintain records and production characteristics for specific breeds, provide farmers with breed-specific information and maintain purebreds). The producers themselves, however, need to be actively involved in breed associations and to form farmer associations so that they can work together to define their priorities (short-, medium- and long-term goals) for implementation in breeding programs. For example, the South African government, through its Technology Innovation Agency (TIA), initiated a "Beef Genomics Program" in 2014, and a similar program for dairy was started in 2016, with the goal of expanding to other species in the future (Burrow et al., 2017). Under this scheme, breed associations were expected to develop their own strategies for the use of genomic information. This type of approach can be replicated throughout Africa and in most developing countries. Unfortunately, most African governments currently do not take a leading role on issues related to livestock development.

<sup>7</sup>www.illumina.com/

<sup>8</sup>www.affymetrix.com/

### TABLE 6 | Major concerns and possible solutions for development of improved livestock breeding programs in Africa.

The preferences of smallholder farmers are governed by their household characteristics and by institutional and socioeconomic factors (Wale and Yalew, 2007), so their involvement in designing breeding programs is a must. In fact, community-based breeding programs (CBBP), which improve livestock genetics by incorporating farmer participation in selection and breeding activities, have been successfully implemented for several breeds in different countries (Mueller et al., 2015). CBBP treat the farmers' views, needs and decisions as the most important values and encourage farmers to participate throughout the life cycle of the program, from inception to implementation. CBBP also allow optimized use of genetic resources and genomic data to support breeding programs suited to specific regions (Kahi et al., 2005; Muniz et al., 2016).

Data collection and storage pose great challenges for African smallholders, and even for commercial producers, because of the nature of the farming systems (**Table 3**). At the country level, national improvement schemes that help farmers register their animals and collect data on herd performance are scarce. National milk recording schemes are operational in South Africa and Kenya. In Kenya, however, the willingness of farmers to register with the milk recording scheme and collect data on the productivity of their animals is low. Infrastructure for the storage of genetic materials is also important; for example, DNA and other biological samples require special procedures and materials for collection. The infrastructure necessary to carry out genetic improvement operations is severely constrained in Africa in general. Moreover, the lack of baseline epidemiological data on the dynamics and impact of infectious cattle diseases in East Africa seriously limits animal improvement decisions (de Clare Bronsvoort et al., 2013). It is evident that the basic prerequisites for sample collection during livestock disease outbreaks are lacking in most African countries. It is also worth noting that the current focus of animal health research on specific major infectious diseases, particularly tick-borne and tsetse-borne diseases, does not adequately address animal health issues, because livestock on the continent are routinely exposed to a wide variety of pathogens. The ability to correctly determine the effect of each pathogen is therefore important for disease control and for the quality of data collection.

Most countries have recognized the importance of a livestock breeding policy for directing the priorities and activities of livestock breeding (Zonabend et al., 2013). However, questions abound regarding how efficiently policies are implemented and how often they are updated to adapt to frequent changes in livestock breeding situations. Governments are required not only to draft policies but also to ensure that they are properly implemented. Governments are also required to create access to markets. However, African countries face many market problems, such as a lack of marketing facilities, inadequate marketing organization and methods, and inadequate government policies and market-facilitating services.

There is a chronic lack of skilled animal breeders on the African continent, which limits the role of research institutions and universities in designing breeding programs. Universities with Animal Breeding and Genetics programs need to update their curricula to reflect the current state of knowledge in the field. Students need to be trained in statistics and in handling the big data generated by advances in biotechnology, so that they can identify the best animals and make them the parents of the next generation. In addition, lack of funding and of promotion of research limits researchers based on the African continent. Moreover, pressure to realize short-term benefits from research projects undermines the sustained gains that can accrue from effective long-term breeding programs. For certain traits, a breeding program needs a long time to realize gains, or its impact accumulates slowly over the years and is hard to visualize; appropriate methods for measuring the success of breeding programs are therefore required.

Non-governmental organizations (NGOs) are important stakeholders that provide consultation services, support grassroots livestock development programs and are vital partners in tailoring and implementing sustainable breeding programs. NGOs such as Heifer Project International<sup>9</sup>, Vétérinaires sans Frontières Germany<sup>10</sup> and Send a Cow<sup>11</sup> have been supporting livestock development projects on the continent. However, greater cooperation between NGOs, international research organizations, national research organizations, universities and farmers will facilitate livestock development programs and the widespread adoption of genomic breeding in Africa.

# CONCLUSION

The African continent is home to diverse populations of livestock breeds that possess extremely valuable genetic material, but these are not used effectively to support economic development or to meet increasing demand. Owing to these rich genetic resources and the availability of advanced breeding technologies, genomic breeding can be used to speed up livestock development in Africa. However, genomic tools (especially genomic selection), which have supported livestock gains in many Western countries, are yet to be implemented in most of Africa; the major constraints are a lack of supportive government policies and funding, nutrition and health challenges, and inadequate infrastructure and human know-how. Thus, national governments need to recognize the contribution of livestock production to economic development and the wellbeing of citizens, and put in place enabling policies, the necessary infrastructure and funding. Farmers must organize, while universities and research institutions should tailor training to the needs of students and farmers. Furthermore, to design effective and sustainable livestock development programs, the current production state of breeds and production systems must be adequately characterized through carefully designed investigations of production, reproduction, robustness and fitness traits, and all stakeholders must work together to achieve common goals. The notable success of community-based breeding programs could be extended by including genomic data, by better integrating other stakeholders and by clearer government policies. Great opportunities for livestock development exist, but all stakeholders must work together to leverage genetic resources for the improvement of livestock breeding in Africa.

<sup>9</sup>https://www.heifer.org/

<sup>10</sup>http://www.vsfg.org/

<sup>11</sup>https://www.sendacow.org/

# AUTHOR CONTRIBUTIONS

EI-A conceptualized the review, followed by equal distribution of the different sections by EI-A, SP, MB, MA, and DD.

# FUNDING

Funding was provided by Agriculture and Agri-Food Canada.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Ibeagha-Awemu, Peters, Bemji, Adeleke and Do. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Performance Evaluation of Highly Admixed Tanzanian Smallholder Dairy Cattle Using SNP Derived Kinship Matrix

Fidalis D. N. Mujibi<sup>1,2</sup>\*, James Rao<sup>3</sup>, Morris Agaba<sup>2</sup>, Devotha Nyambo<sup>2</sup>, Evans K. Cheruiyot<sup>1,4</sup>, Absolomon Kihara<sup>3,5</sup>, Yi Zhang<sup>6</sup> and Raphael Mrode<sup>3,7</sup>

<sup>1</sup> USOMI Limited, Nairobi, Kenya, <sup>2</sup> Nelson Mandela African Institution of Science and Technology, Arusha, Tanzania, <sup>3</sup> International Livestock Research Institute, Nairobi, Kenya, <sup>4</sup> Department of Animal Production, College of Agriculture and Veterinary Sciences, University of Nairobi, Nairobi, Kenya, <sup>5</sup> Badili Innovations Limited, Nairobi, Kenya, <sup>6</sup> College of Animal Science and Technology, China Agricultural University, Beijing, China, <sup>7</sup> Scotland's Rural College, Edinburgh, United Kingdom

### Edited by:

Eveline M. Ibeagha-Awemu, Agriculture and Agri-Food Canada (AAFC), Canada

### Reviewed by:

Michael Schutz, Purdue University, United States

Sunday O. Peters, Berry College, United States

> \*Correspondence: Fidalis D. N. Mujibi denis.mujibi@usomi.com

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 25 April 2018 Accepted: 09 April 2019 Published: 26 April 2019

### Citation:

Mujibi FDN, Rao J, Agaba M, Nyambo D, Cheruiyot EK, Kihara A, Zhang Y and Mrode R (2019) Performance Evaluation of Highly Admixed Tanzanian Smallholder Dairy Cattle Using SNP Derived Kinship Matrix. Front. Genet. 10:375. doi: 10.3389/fgene.2019.00375

The main purpose of this study was to understand the type of dairy cattle that can be optimally used by smallholder farmers in various production environments such that they maximize their yields without increasing the level of inputs. Anecdotal evidence and previous research suggest that the optimal level of taurine inheritance in crossbred animals lies between 50 and 75% when considering total productivity in tropical management clusters. We set out to assess the relationship between breed composition and productivity for various smallholder production systems in Tanzania. We surveyed 654 smallholder dairy households over a 1-year period and grouped them into production clusters. Based on supplementary feeding, milk productivity and sale, as well as household wealth status, four clusters were described: low-feed–low-output subsistence, medium-feed–low-output subsistence, maize germ intensive semi-commercial and feed intensive commercial management clusters. A total of 839 crossbred cows were genotyped at approximately 150,000 single nucleotide polymorphism (SNP) loci and their breed composition determined. Percentage dairyness (proportion of genes from international dairy breeds) was estimated through admixture analysis with Holstein, Friesian, Norwegian Red, Jersey, Guernsey, N'Dama, Gir, and Zebu as references. Four breed types were defined as RED–GUE (Norwegian Red/Friesian–Guernsey; Norwegian Red/Friesian–Jersey), RED–HOL (Norwegian Red/Friesian–Holstein), RED–Zebu (Norwegian Red/Friesian–Zebu), and Zebu–RED (Zebu–Norwegian Red/Friesian), based on the combination of breeds that make up the top 76% of breed composition. A fixed regression model using a genomic kinship matrix was used to analyze milk yield records. The fitted model accounted for year-month-test-date, parity, age, breed type and production cluster as fixed effects, in addition to the random effects of animal and permanent environment. Results suggested that the RED–Zebu breed type with dairyness between 75 and 85% is the most appropriate for a majority of smallholder management clusters. Additionally, for farmers in the feed intensive management group, animals with a Holstein genetic background and at least 75% dairy composition were the best performing. These results indicate that matching breed type to production management group is central to maximizing productivity in smallholder systems. The findings from this study can serve as a basis to inform the development of the dairy sector in Tanzania and beyond.

Keywords: SNP, dairy, performance, cluster, smallholder, admixture, EBV, BLUP

# INTRODUCTION

The use of crossbred animals continues to be the basis for most dairy enterprises in Eastern Africa. However, the indiscriminate crossbreeding practiced in these systems produces highly admixed animals with large variability in productivity (Ojango et al., 2014). Additionally, since the breed composition of the animals is unknown, there is often a mismatch between production environment and animal breed type, which often reduces productivity. This situation cannot sustain the growth and expansion of the local dairy sector in many of the countries in the region. With the increased demand for livestock products and the need to bridge productivity gaps in developing countries, poorly planned crossbreeding of locally adapted breeds with imported exotic breeds has been widely adopted, yielding animals with unknown breed composition (Weerasinghe et al., 2013). The suitability of these crosses to various production environments is largely unknown.

Anecdotal evidence and previous research suggest that the optimal level of taurine inheritance in crossbred animals lies between 50 and 75% when considering total productivity in terms of fertility, survival, growth rate and milk yield (Bee et al., 2006). However, the mismatch between genotype and environment that results from unplanned crossbreeding depresses performance to levels that mimic indigenous cattle production (∼1.6 l/day; Mwacharo and Rege, 2002). Even though it is clear that increasing the exotic percentage of cattle results in more milk, the cumulative benefits relative to farmer socio-economic status, input level and production environment are not clear. This study sought to assess the incremental benefit from the use of crossbred cattle in two sites with varying market orientations and markedly different improved cattle populations.

The study was undertaken in Tanzania, an emerging dairy region where significant crossbreeding efforts are taking place. The country has a small population of improved dairy animals (about 800,000), such that the demand for milk currently outstrips available supply. Most dairies and milk processing facilities are running below capacity. According to FAO data series, the quantity of dairy output (milk and butter) in Tanzania has grown by 4.4% per annum, barely keeping up with the population growth rate of about 4.5% since 1980. This has led to stagnation in per capita milk consumption at 39 kg/year (National Bureau of Statistics [NBS], 2007). The supply scenario points to low productivity, with a modest annual growth in milk productivity of 1.1%, from 160 kg/cow in 1965 to 239 kg/cow in 2010. In view of the above, the government of Tanzania has embarked on developing a national dairy strategic plan with a view to increasing milk production from the current 1.6 billion liters of milk to 8 billion liters. It is estimated that three million head of improved cattle will be required to achieve this target in 12 years, starting from 2014. This will be a tall order given the modest increase of about 400,000 head of improved cattle between 1984 and 2005 (Swai et al., 1992; Kurwijila and Bennett, 2011). Such a massive increase in the herd can only be achieved by increasing crossbreeding, especially through innovative use of estrus synchronization and artificial insemination, followed by improved calf management to enable rapid multiplication and increased survival of the desired cattle. Understanding the implications of breed by environment interactions, as this project seeks to do, will modulate the speed at which the milk production target is achieved.

Smallholder farmers are the backbone of the dairy sector in Tanzania. It is generally agreed that a successful dairy operation should utilize improved breed types given the low productivity of local zebu cattle. This desire for increased production drives farmers into crossbreeding, the general sense being that a purebred exotic animal is not suitable either for a majority of smallholder farmers. However, there is little information or evidence to indicate the ideal grade of cattle for various smallholder production situations (Msanga, 1994). Because there are no planned programs to aid farmers in this grading up process, the resulting animals constitute a mixture of breeds whose composition is unknown; as a consequence, animals that require much more intensive management are managed in the same way as animals of low genetic potential, which naturally make do with minimum care. Since not all breed types are well adapted to extant production environments, milk yields continue to be low. Knowledge of breed composition is therefore critical in matching breeds to the production environment as well as in predicting genetic effects of heterosis (VanRaden and Sanders, 2003).

Pedigree data has been the main source of information for determining breed composition. However, the availability of dense genome-wide single nucleotide polymorphism (SNP) arrays has enabled accurate establishment of kinship and genetic composition of animals in a herd and in their native environments (in situ). The use of genetic markers, especially SNPs, in determining the breed composition of cattle has attracted great interest in recent years, particularly in developing countries, which are mostly characterized by missing or incomplete pedigree records (Rege et al., 2001; Gorbach et al., 2010). Previous studies have demonstrated the utility of SNP markers in providing highly reliable estimates of kinship and relationships between animals (Strucken et al., 2017). Additionally, the application of SNP markers in deciphering the breed composition of crossbred animals is gaining popularity. Knowledge of breed composition will be important for farmers, who can then start planned crossbreeding since they will know the level of exotic 'blood' in their animals. By identifying the exact breed composition of animals and associating this with individual animal productivity, it is envisaged that appropriate recommendations can be made for farmers and others intending to maximize productivity of their enterprises.

The purpose of this study was to determine the differential performance of various dairy genotypes and grade levels under varying resource bases and management clusters in two regions of Tanzania. The results from this study can serve as a basis to inform the development of the dairy sector in Tanzania and Eastern Africa in general.

# MATERIALS AND METHODS

# Ethics Statement

This study was performed following the International Livestock Research Institute (ILRI) Institutional Animal Care and Use Committee (IACUC) guidelines, with approval reference number 2014.35. Animals were handled by experienced animal health professionals to minimize discomfort and injury.

# Sampling Site Selection and Inclusion Criteria

Data used in this study was obtained from a baseline survey of smallholder dairy farmers in the Northern and Southern highlands of Tanzania. The project covered two sites namely: Rungwe district in Southern highlands and Lushoto district in the Northern highlands that were selected through a stakeholder engagement process. Within each of these sites, wards were selected based on the dairy cattle density data obtained from the regional government offices. Villages were then randomly selected within each selected ward (12 wards in Lushoto and 16 wards in Rungwe). From each of the villages, households were purposively recruited depending on whether they met certain inclusion criteria.

# Inclusion Criteria and Sample Size

To qualify for inclusion in the study, target dairy farmers had to be smallholders rearing between 2 and 10 dairy cows. Qualifying households had to have at least two cows, one of which had to be lactating, having calved recently. Additionally, based on farmer knowledge, unrelated animals were recruited to maximize observable breed diversity within the household. Target animals also had to be pregnant heifers, cows in the third trimester of pregnancy, or cows that had calved within 3 months of the recruitment date. This increased the chances that recruited cows would be in milk for a significant portion of the study period, to allow collection of data on milk yield, calving and reproductive performance. This selection process yielded 654 households, which were interviewed by way of a baseline survey regarding general farm and household socioeconomic conditions, animal husbandry and management practices, as well as breeding practices, among others. In total, 1,255 animals were recruited for the study.

# Production Cluster Characterization

In order to classify and characterize smallholder dairy farmers across the two project sites, we undertook cluster analysis. Farms were grouped based on common characteristics using agglomerative hierarchical clustering. The method groups farms such that individual farms in the same clusters are more alike than they are to farms in other clusters. Cluster analysis was preceded by an exploratory factor analysis (EFA) of all the variables that represented the various themes in the baseline survey. Variables related to livestock feeding and management as well as wealth indicators were considered as relevant variables for inclusion in cluster analysis. We also included variables linked to household endowment with livestock, particularly ownership of lactating cows. Sampling adequacy and data suitability for clustering was measured using the Kaiser–Meyer–Olkin (KMO) statistic. Factor extraction was achieved through principal axis factoring (PAF), to characterize interrelationships between respective variables related to smallholder dairy farming systems. Parallel analysis was used to determine the exact number of factors to be retained. Varimax rotation with Kaiser normalization was used to increase the interpretability of the retained factors. Extracted factors were then subjected to an agglomerative hierarchical clustering procedure using the squared Euclidean distance criterion in conjunction with Ward's linkage method. The Duda-Hart index and its associated pseudo-T-squared as well as inspection of the clustering dendrogram were used to decide on the optimal number of clusters to retain. Clustering was done using SPSS software (SPSS Inc., Chicago, IL, United States).
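The clustering step described above can be sketched as follows. This is a minimal illustration, assuming SciPy's Ward linkage on Euclidean distances as a stand-in for the SPSS procedure used in the study; the factor scores are simulated, the four-cluster cut is imposed directly rather than chosen with the Duda-Hart index, and all variable names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Simulated (hypothetical) factor scores: 4 groups of 30 farms on 5 factors.
# In the study these would be the scores on the five retained EFA factors.
centers = np.array([
    [0, 0, 0, 0, 0],
    [6, 0, 0, 0, 0],
    [0, 6, 0, 0, 0],
    [0, 0, 6, 0, 0],
], dtype=float)
scores = np.vstack([c + rng.normal(scale=0.5, size=(30, 5)) for c in centers])

# Agglomerative hierarchical clustering with Ward's minimum-variance linkage.
tree = linkage(scores, method="ward")

# Cut the dendrogram into four clusters (the number is an input here,
# not something this sketch estimates).
labels = fcluster(tree, t=4, criterion="maxclust")
cluster_sizes = np.bincount(labels)[1:]
print(cluster_sizes)
```

In practice the number of clusters would be chosen from a stopping rule and the dendrogram, as the text describes, before the final cut is made.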

# Blood Sampling

Qualified veterinary and animal health personnel undertook blood sampling through jugular venipuncture using approved procedures. Hair samples were collected from the tail switch of the animals. Samples were collected from all animals in the study.

# Genotyping and Quality Control

A total of 839 animals (490 from Rungwe and 349 from Lushoto) were genotyped using the Geneseek Genomic Profiler (GGP) High Density (HD) SNP array consisting of 150,000 SNPs, while genotypes for the reference breeds were derived from sample sets genotyped using the Illumina HD Bovine Chip (777K SNPs). Since pedigree records were not available for these animals, a panel of reference genotypes was included in the analysis to aid breed composition determination: Friesian (28 animals), Holstein (63), Norwegian Red (17), Jersey (36), Guernsey (21), N'Dama (24), East African Zebu (50), and Gir (30). A total of 134,295 SNPs were common across the study and reference datasets. Data quality control was undertaken using PLINK v1.9 (Purcell et al., 2007) and included removal of SNPs with a call rate below 90%, SNPs with a minor allele frequency (MAF) below 5%, and samples with more than 10% missing genotypes. A total of 4,324 SNPs were removed, leaving 129,971 SNPs available for analysis. Similarly, eight samples did not meet the above quality thresholds and were removed from the final dataset. The average genotyping rate in the remaining samples was 0.9964. For the purposes of developing a kinship matrix, the SNP data were further validated, excluding SNPs with a GC score of less than 60% and those on the sex chromosomes and in the mitochondrial genome. Computation of the genomic kinship matrix (**G** matrix) was based on the 112,856 SNPs that remained after validation, using method one of VanRaden (2008).
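The quality-control thresholds and the **G** matrix computation can be illustrated with a small NumPy sketch. It is not the PLINK or study pipeline: the function names are hypothetical, genotypes are simulated, allele frequencies are estimated from the sample itself (an assumption, since the base-population frequencies are not given), and any remaining missing calls are mean-imputed.

```python
import numpy as np

def qc_filter(geno, min_call_rate=0.90, min_maf=0.05, max_sample_missing=0.10):
    """geno: samples x SNPs, coded 0/1/2 allele copies, np.nan = missing."""
    missing = np.isnan(geno)
    call_rate = 1.0 - missing.mean(axis=0)
    p = np.nanmean(geno, axis=0) / 2.0          # observed allele frequency
    maf = np.minimum(p, 1.0 - p)
    keep_snps = (call_rate >= min_call_rate) & (maf >= min_maf)
    keep_samples = missing[:, keep_snps].mean(axis=1) <= max_sample_missing
    return keep_samples, keep_snps

def vanraden_g(geno):
    """VanRaden (2008) method 1: G = ZZ' / (2 * sum p(1 - p))."""
    p = np.nanmean(geno, axis=0) / 2.0
    z = np.nan_to_num(geno - 2.0 * p)           # missing -> 0, i.e. mean imputation
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))

# Simulated genotypes with three deliberate QC failures.
rng = np.random.default_rng(1)
geno = rng.integers(0, 3, size=(20, 50)).astype(float)
geno[:, 0] = 0.0          # monomorphic SNP: fails the MAF filter
geno[:10, 1] = np.nan     # SNP with 50% missing calls: fails the call-rate filter
geno[0, 25:] = np.nan     # sample with ~50% missing calls: fails the sample filter

keep_samples, keep_snps = qc_filter(geno)
clean = geno[keep_samples][:, keep_snps]
G = vanraden_g(clean)
print(clean.shape, G.shape)
```

Because the genotypes are centred on frequencies estimated from the same animals, each row of this **G** sums to zero; with base-population frequencies that would not generally hold.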

## Admixture Analysis and Dairyness

Breed composition of individual animals was estimated using the unsupervised model-based clustering method implemented by the program ADMIXTURE v. 1.3.0 (Alexander et al., 2009). The number of distinct breeds was set to a minimum of 2 and a maximum of 9 to reflect the basic cross (indicine and taurine) and the total number of populations in the analysis (the Tanzanian population plus the eight reference breeds), respectively. Ten-fold cross-validation (CV = 10) was used, with the error profile subsequently used to determine the most appropriate number of distinct clusters (K), as described by Alexander et al. (2009).
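As a rough illustration of how a cross-validation error profile can be summarized, the sketch below collects the `CV error (K=...)` lines that ADMIXTURE prints when run with cross-validation and reports the K with the lowest error. The log text and its error values are invented for illustration only, and `best_k` is a hypothetical helper, not part of ADMIXTURE.

```python
import re

def best_k(log_text):
    """Return (K with the lowest CV error, {K: error}) from ADMIXTURE log text."""
    pairs = re.findall(r"CV error \(K=(\d+)\): ([0-9.]+)", log_text)
    cv = {int(k): float(err) for k, err in pairs}
    return min(cv, key=cv.get), cv

# Invented example values, concatenated from hypothetical per-K log files.
example_log = """
CV error (K=2): 0.6012
CV error (K=3): 0.5874
CV error (K=4): 0.5790
CV error (K=5): 0.5811
"""

k, cv_errors = best_k(example_log)
print(k, cv_errors[k])
```

The lowest CV error is only one criterion; as the Results note, the final choice of K in this study also rested on whether additional clusters revealed new breed patterns.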

# Daily Milk Yield Data

A total of 539 cows had records on milk yield. About 300 animals had been sold, had dried up or belonged to farmers who did not collect milk records at all. The data were obtained from individual animals over a period of 7 months. Each animal was visited approximately every 1.5 months for a test day record to be obtained. The analysis of daily milk yield data was undertaken using 1,328 test day records from 539 cows. The number of test day records ranged from one to six per animal, with a majority of animals (80%) having fewer than four records (**Table 1**). A fixed regression animal model was fitted as shown below (Brown et al., 2016):

$$y_{tij} = \text{Fixed}_{i} + \sum_{k=0}^{3} \phi_{tjk}\,\beta_{km} + u_{j} + pe_{j} + e_{tij}$$

where y<sub>tij</sub> is the test day record of cow j made on day t; Fixed<sub>i</sub> are the ith fixed effects, consisting of year-month of test day, lactation number (eight levels), and age at calving as a covariate nested within lactation number; β<sub>km</sub> are the kth fixed regression coefficients of breed type nested within a herd management group; u<sub>j</sub> and pe<sub>j</sub> are the animal additive genetic and permanent environmental effects, respectively, for animal j; φ<sub>tjk</sub> is the kth Legendre polynomial of order three for the test day record of cow j made on day t; and e<sub>tij</sub> is the random residual. The relationship among animals was taken into account in the analysis by fitting a **G** matrix; thus the variance of **u** was assumed to be var(**u**) = **G**σ<sub>u</sub><sup>2</sup>. The analysis was carried out using ASREML (Gilmour et al., 2009).

TABLE 1 | Distribution of the number of test records available for analysis.
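The Legendre covariates in the model (k = 0 to 3) can be sketched with NumPy. This is an illustration under stated assumptions: days in milk are standardized to the interval [-1, 1] using a hypothetical 5 to 305 day range, and the polynomials are scaled by sqrt((2k + 1)/2), a normalization commonly used in test-day models; the source does not state either choice.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_covariates(dim, dim_min=5, dim_max=305, order=3):
    """Return an n x (order + 1) matrix of normalized Legendre covariates.

    dim_min and dim_max are assumed lactation limits, not values from the study.
    """
    x = 2.0 * (np.asarray(dim, dtype=float) - dim_min) / (dim_max - dim_min) - 1.0
    columns = []
    for k in range(order + 1):
        coef = np.zeros(k + 1)
        coef[k] = 1.0                           # selects the k-th Legendre polynomial
        columns.append(np.sqrt((2 * k + 1) / 2.0) * legendre.legval(x, coef))
    return np.column_stack(columns)

phi = legendre_covariates([5, 155, 305])
print(np.round(phi, 3))
```

Multiplying a row of `phi` by the four regression coefficients for a breed type within a management group gives that group's expected curve value on the corresponding test day.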

# Breed Type Suitability Assessment

The suitability of breed types for each of the four management clusters characterized was first determined by computing the mean of the raw daily milk yield for each breed type in each management group as well as mean milk production corrected for the fixed effects affecting milk yield fitted in the model. Additionally, the ranking of animals based on their EBVs and breed composition for each management system was also used to determine the best breed type in each management system.

# RESULTS

# Cluster Analysis and Farm Typologies

Sampling adequacy analysis yielded a KMO statistic of 0.661, indicating that the data were suitable for EFA (Kaiser, 1970). After eliminating variables exhibiting low variation, 11 variables were entered into the EFA. Factor analysis resulted in five retained factors that accounted for 66% of the total variability (**Table 2**), while cluster analysis yielded a 4-cluster solution (**Table 3**). The p-values of the F-tests in **Table 3** indicate that the clusters differed significantly with respect to the weights assigned to the extracted factors.

Extraction method: principal axis factoring. Rotation method: varimax with Kaiser normalization. Bolded values are the highest for each extracted factor and represent the determinants with the highest loading.
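For readers unfamiliar with the KMO statistic reported above, the following sketch computes the overall measure as the ratio of summed squared correlations to summed squared correlations plus summed squared partial correlations. The data are simulated from a single common factor and the function name is hypothetical; this is the textbook formula, not the SPSS output from the study.

```python
import numpy as np

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure for an n x p data matrix."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    scale = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(scale, scale)     # partial correlations
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = np.sum(corr[off_diag] ** 2)
    q2 = np.sum(partial[off_diag] ** 2)
    return r2 / (r2 + q2)

# Simulated survey: six variables driven by one shared factor plus noise.
rng = np.random.default_rng(0)
common = rng.normal(size=(500, 1))
survey = common + 0.6 * rng.normal(size=(500, 6))
kmo_value = kmo(survey)
print(round(kmo_value, 3))
```

Values closer to 1 indicate that correlations are largely shared among variables, which is the condition under which factor analysis is informative.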

Cluster one contained about 27% of households and consisted of "medium-feed-low-output subsistence oriented dairy farmers," characterized by low productivity and sale of milk as well as low use of maize germ supplement. The largest share of households (33%) was grouped in cluster two, the "feed intensive commercially oriented dairy farmers." Households in this cluster used a diversity of supplements such as maize bran and oil seed by-products and were also characterized by higher milk sales. Cluster three, which accounted for about 24% of the sampled households, consisted of "low-feed low-output subsistence oriented dairy farmers," characterized by low diversity and intensity of supplement use. Cluster four accounted for 17% of the households, which exhibited higher intensity in the use of maize germ but less diversity and intensity of use of other supplements. These "maize germ intensive semi-commercial dairy farmers" also had moderate milk productivity and sale.

Households from Rungwe district in the Southern highlands were grouped in clusters one and two, while households from Lushoto district in the Northern highlands were grouped in clusters three and four. The more intensive and commercially oriented farmers in Rungwe also recorded higher overall milk production as did the more intensive and semi-commercial dairy farmers in Lushoto. The disparate classification of households for the two sites in distinct clusters was largely related to the feeding plane and commercial orientation differences between these two sites.

# Genetic Diversity and Admixture

### Minor Allele Frequencies (MAF)

The distribution of minor allele frequencies in each breed is presented in **Figure 1**. The Tanzanian population had the highest proportion of SNPs with high MAF (>0.3). In contrast, the Gir and N'Dama had the highest proportion of SNPs in the lowest MAF band.

TABLE 3 | Factor loadings for various production system variables used to define management clusters.

P-values compare the difference between clusters with regard to weights of the factors. <sup>∗</sup>Percentage of households. Bolded values are the highest for each extracted factor and represent the determinants with the highest loading.

## Admixture Analysis

Results from ADMIXTURE runs for K = 2 to K = 9 are presented in **Figure 2**. Seven clusters were deemed optimal, given that increasing K to 8 did not reveal any new distinct breed clusters or patterns. Based on the available genotypes, the Friesian and Norwegian Red breeds could not be distinguished and formed one cluster. The breed composition of the Tanzanian cattle was largely influenced by the Friesian and Norwegian Red breeds. Overall, the predicted exotic taurine breed content (dairyness) in the Tanzanian population varied from 7 to 100% and averaged 70%. The subpopulation of cows from Rungwe showed higher levels of taurine admixture (mean 78.3 ± 13%; n = 489) than the Lushoto subpopulation (mean 56.4 ± 16%; n = 346).

# Breed Group and Breed Type Definition

Based on the admixture results, the proportions of genes from Holstein (HOL), Norwegian Red/Friesian (RED), Jersey (JER), Guernsey (GUE), Zebu, N'Dama, and Gir were determined for each of the 539 cows with daily milk records. The percentage dairyness of each animal was first computed as the sum of the gene proportions from HOL, RED, JER, and GUE (the international dairy breeds used as references), as determined by the admixture analysis. This was based on the assumption that these four breeds are primarily dairy animals compared to the Zebu, N'Dama, and Gir. Four classes of cows were then created on the basis of percentage dairyness: animals with >84%, 84–75%, 74–35%, and <34% dairyness, which roughly correspond to a pedigree animal, an F2 cross, an F1 cross and a backcross or indicine animal, respectively. Within each of the four classes, animals were then grouped on the basis of the order of the breed or breeds that accounted for 76% of the genes in each animal (**Table 4**). For instance, among animals with >84% dairyness in **Table 4**, classification as group 1 (RED–GUE) implies that genes from RED alone, from RED and then GUE, or from RED and then Jersey accounted for more than 76% of the genes in the animal, with the highest proportion coming from RED. For animals classified as group 4 (Zebu–RED), genes from the Zebu alone, from the Zebu and then Gir, from the Zebu and then RED, or from the Zebu and then HOL accounted for more than 76% of the genes in the animal, with the Zebu accounting for the highest proportion. Note that the choice of 76% as the proportion contributed by one or more breeds when classifying animals into breed types was arrived at after trying several values so as to get an optimal distribution of genotypes. On the basis of the results in **Table 4**, nine breed types were defined based on the percentage dairyness and the order of breeds accounting for most of the genes in the animal.
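The classification rules described above can be sketched in a few lines. The helper names and the example cow are hypothetical, and the class boundaries are an assumption: the text leaves values between 74 and 75% and between 34 and 35% unassigned, so this sketch cuts at 84, 75 and 35%.

```python
DAIRY_BREEDS = ("HOL", "RED", "JER", "GUE")

def dairyness(props):
    """Percentage of the genome assigned to the four dairy reference breeds."""
    return 100.0 * sum(props.get(breed, 0.0) for breed in DAIRY_BREEDS)

def dairyness_class(pct):
    """Assign one of the four dairyness classes (boundary handling is assumed)."""
    if pct > 84:
        return ">84%"
    if pct >= 75:
        return "84-75%"
    if pct >= 35:
        return "74-35%"
    return "<34%"

def breed_order(props, threshold=0.76):
    """Breeds in descending order of contribution until they exceed the threshold."""
    ranked = sorted(props.items(), key=lambda item: item[1], reverse=True)
    total, top = 0.0, []
    for breed, share in ranked:
        top.append(breed)
        total += share
        if total > threshold:
            break
    return top

# Hypothetical admixture proportions for one cow.
cow = {"RED": 0.55, "Zebu": 0.30, "HOL": 0.10, "Gir": 0.05}
print(dairyness_class(dairyness(cow)), breed_order(cow))
```

For this invented cow, RED and then Zebu together exceed 76% of the genome, which would place it among the RED–Zebu (RZ) animals in the 74–35% dairyness class.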

### Performance and Breed Suitability Assessment

The determination of performance for each breed type and its suitability in the four management clusters was based on the mean values for milk yield computed using the solutions for management clusters nested within breed types from the fixed regression model, as well as the mean breeding values and solutions for permanent environmental effects of each cow in the management system. The distribution of cows based on their dairyness and breed composition is shown in **Table 4**. Given the average dairyness of 70%, the majority of the animals had a breed composition in the 74–35% dairyness range. Most animals were predominantly crosses between Friesian-Norwegian Red breeds and local Zebu cattle.

Generally, the milk yield obtained from the study cows was low, averaging 5.90 l per day. The mean daily milk yield for cows in Lushoto was 4.69 l, while that for cows in Rungwe was 6.61 l. Cows in breed group 4 (Zebu–RED crosses) had the lowest milk yields, ranging between 1.4 and 3.5 l per day (**Table 5**). Given that this group consisted of cows with the highest proportion of Zebu genes and that the East African Zebu is not improved for milk yield, the low milk yield conforms to expectations. Additionally, the largest share of low-dairyness cows (43% of all Zebu–RED crosses) was kept in the low-feed–low-output management system. Farmers practicing low-feed–low-output subsistence dairy farming were also the only ones who kept animals with dairyness <34%, and they had no animals in the >84% dairyness category. The RED–GUE crosses tended to be the best performing, with a narrower range of performance (4.7–6.8 l per day). However, these crosses were very few and were not well represented in all management clusters. The RED–HOL group was second highest, with a yield range of 3.9–6.7 l per day. The third best group was the RED–Zebu, which had the widest range of performance at 2.1–7.2 l per day. This group also had the highest yields for the medium-feed–low-output subsistence-oriented management system. Raw means and means corrected for fixed effects are provided in **Table 3**.

TABLE 4 | Number of cows included in the analysis, grouped based on a combination of breed composition and percent dairyness.

TABLE 5 | The mean daily milk yields (±SD) of various breed groups of varying dairyness in different management clusters.

**Table 6** indicates the breed composition of the top 10 cows in terms of EBVs in each management group. The four management clusters had 130, 203, 105, and 101 cows, respectively, such that the top 10 cows represented the top 8, 5, 10, and 10% of each group, respectively. For all management clusters except the feed-intensive commercially oriented group, cows whose composition was dominated by crosses of Friesian-Norwegian Red and Zebu (RED–Zebu, either as ZR or RZ genotypes) dominated the list of top 10 animals based on EBV ranking (**Table 6**). Conversely, crosses of Friesian-Norwegian Red and Holstein (RHZ, RH, RZH) featured mostly in the feed-intensive commercially oriented and maize-germ-intensive semi-commercial management clusters.
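The percentages quoted for the top 10 cows follow directly from the cluster sizes, as this short check shows. The variable names are illustrative; the cluster sizes are those given in the text.

```python
# Cows per management cluster, as reported in the text.
cluster_sizes = [130, 203, 105, 101]

# Share of each cluster represented by its top 10 cows, rounded to whole percent.
top10_share = [round(100 * 10 / n) for n in cluster_sizes]

print(top10_share)          # rounded percentages per cluster
print(sum(cluster_sizes))   # total cows with milk records
```

The cluster sizes also sum to the 539 cows with milk yield records used in the analysis.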

### Genetic Parameters

Following variance component analysis, the direct additive heritability estimate obtained for milk yield was 0.24 ± 0.13 while repeatability was 0.32 ± 0.04. The heritability estimates fell within the range (0.18–0.51) estimated for taurine cattle (Van Tassell et al., 1999). The genetic parameter estimates were well within values obtained from tropical smallholder systems (Msanga et al., 2000).
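For orientation, the two parameters can be written in terms of the variance components of the fitted model. The definitions below are the standard ones for a repeatability-type animal model and are assumed here; the source does not spell them out.

$$h^2 = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_{pe}^2 + \sigma_e^2}, \qquad r = \frac{\sigma_u^2 + \sigma_{pe}^2}{\sigma_u^2 + \sigma_{pe}^2 + \sigma_e^2}$$

Under these definitions, the reported point estimates would imply that permanent environmental effects account for roughly 0.32 − 0.24 = 0.08 of the phenotypic variance, although the standard errors make this only an approximate reading.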

# DISCUSSION

The purpose of the project was to characterize the smallholder dairy system and identify how various breed types performed under varying management clusters. By identifying the exact breed composition of target cows and associating the observed profile with individual animal productivity, it is envisaged that appropriate recommendations can be made for farmers and others intending to maximize productivity of these systems.

# Management Group Clustering

Central to matching breed types to production environments is the need to characterize the production environments. This is critical because most smallholder dairy farmers have small herd sizes, averaging two to three animals. Additionally, the management practices on these farms are very divergent, making evaluation of performance potentially difficult. A strategy to overcome such heterogeneity in management practices is to find commonalities in practices between households. The resulting clusters would then represent fairly homogeneous groups of households (ostensibly undertaking somewhat similar management practices) within which the performance of extant cohorts of animals can be evaluated. Each cluster would then be considered a management group. This was achieved through a factor analysis of variables collected in the baseline survey, such as farm income, land area owned, and type of feed used, among others, followed by a cluster analysis of the five extracted factors. Given the four management clusters defined, many underlying factors are implied and contribute to the observed differences in the productivity of households therein. The variety and intensity of supplementation characterized in the feed-intensive commercially oriented management system and among the maize germ intensive semi-commercial dairy farmers implied more labor input in the search for and preparation of the materials. Additionally, given that most of these materials are not purchased but sourced from the farmers' own farms, variable sources and types of supplements would reflect a larger land area planted and potentially higher household income obtained from the sale of a diverse crop base.

Classification based on inter-farm differences can potentially enable identification of farms with similar practices and circumstances for which a given recommendation would be broadly appropriate (Byerlee and Collinson, 1980). Similarity among households within a management system is no doubt determined by constraints and opportunities faced by the farmers and these are expected to vary according to agroecological and socioeconomic conditions under which farmers operate. Even within the same agro-ecological conditions, individual households may still differ due to socio-economic conditions and inherent knowledge. There will often therefore be need for targeted solutions that take into account diversity in farm resource endowment and farm practices in spite of similarities in agro-ecological conditions. This fact is demonstrated by farmers in the same geographic regions being classified in disparate management clusters. Membership in each of these four management clusters was driven by feeding practices, productivity and commercial orientation of dairy farm households.

# Admixture and Breed Composition

In order to establish the breed composition of the animals, we collected blood and hair samples from a total of 839 cows from Lushoto and Rungwe in Northern and Southern highlands of Tanzania, respectively. The choice of the genotyping platform used (the Geneseek Genomic Profiler Dairy) was informed by the need to minimize the cost of genotyping, as well as access genotypes that can be pooled with available reference genotypes, which were genotyped by the Illumina 700K SNP array. However, the SNP array that was used to genotype the animals had no power to discriminate between Norwegian Red and Friesian breeds. Additionally, the panel had a significant number of polymorphisms that had very low minor allele frequencies in indicine breeds, indicating that it may lack the power to detect subtle differences between genetic signatures derived from the indicus background. This 'ascertainment' bias compromises the definitive determination of the exact breed composition, especially relating to indicine genetic composition. However, for our purposes, the goal of determining dairyness was largely achieved.

TABLE 6 | The top 10 cows in breeding values for milk yield in each management group with their percentage dairyness and breed composition.

<sup>∗</sup>Liters per day. Breed is the breed group (the combination of breeds that contribute 76% of the breed makeup), with the first letter representing the breed having the highest proportion. D% represents percent dairyness (the cumulative proportion of taurine dairy breed composition in the cow). RZ, RED–Zebu cross; RH, RED–Holstein; RZH, RED–Zebu–Holstein cross; RZJ, RED–Zebu–Jersey; RHZ, RED–Holstein–Zebu; HRZ, Holstein–RED–Zebu; RJ, RED–Jersey; ZR, Zebu–RED; ZRH, Zebu–RED–Holstein; ZHR, Zebu–Holstein–RED; ZRG, Zebu–RED–Guernsey.

Breed groups were defined based on a combination of percentage dairyness and the number of breeds making up 76% dairyness. The dairyness classes represent grade levels with respect to crossbreeding with indigenous breeds. Typically, an animal is assigned to a specific breed if at least 87.5% of its genes come from that breed. In our case, using this as a cutoff resulted in a skewed distribution of animals and genotypes. The best possible distribution was arrived at with a cutoff of 76%. On the basis of this, four breed groups were defined, giving a total of nine breed types when combining dairyness and breed group. It should be noted that, based on the genotyping array used, it was not possible to distinguish between the Norwegian Red and the Friesian breeds. The following discussion will treat these two breeds as equivalent.
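The breed-group coding described above can be illustrated with a minimal sketch. It assumes that a breed group is formed by taking breeds in descending order of contribution until their cumulative share reaches the 76% cutoff, and that dairyness is the summed proportion of taurine dairy breeds. The function names, the set of breeds counted as taurine dairy, and the example cow are hypothetical and are not taken from the study data.

```python
# Hypothetical illustration; breed labels and the example proportions are assumptions.
# "RED" stands for the combined Norwegian Red/Friesian signal, as in Table 6.
TAURINE_DAIRY = {"RED", "Holstein", "Jersey", "Guernsey"}

def dairyness(proportions):
    """Cumulative proportion of taurine dairy breeds, as a percentage."""
    return 100.0 * sum(p for breed, p in proportions.items() if breed in TAURINE_DAIRY)

def breed_group(proportions, cutoff=0.76):
    """Concatenate initials of the largest contributors until they reach the cutoff."""
    ranked = sorted(proportions.items(), key=lambda kv: kv[1], reverse=True)
    code, cumulative = "", 0.0
    for breed, p in ranked:
        code += breed[0].upper()   # first letter = breed with the highest proportion
        cumulative += p
        if cumulative >= cutoff:
            break
    return code

cow = {"RED": 0.55, "Zebu": 0.30, "Holstein": 0.15}
print(breed_group(cow), round(dairyness(cow), 1))  # RZ 70.0
```

Under these assumptions, a cow that is 55% RED, 30% Zebu and 15% Holstein is coded RZ with 70% dairyness, because RED and Zebu together already exceed the 76% cutoff.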

Based on the results from breed composition analysis, it is evident that the range of admixture in Tanzanian dairy cattle is quite wide given the spectrum of taurine introgression observed. For cows in Lushoto, the proportion of taurine genes ranged from less than 20% to greater than 85%. In Rungwe more than 95% of all cows had a taurine gene composition of above 50%. The variety of breeds used in crossbreeding was quite narrow compared to what has been observed in other East African countries (Weerasinghe, 2014). The predominant breed was the Holstein-Friesian, with a bias toward a Friesian signature. There appeared to be limited or no use of the Jersey, Guernsey or Ayrshire breeds. These breeds are often smaller than the Holstein and would be easier to handle in smallholder farming systems given their lower feed requirements. This result is consistent with the dominant importation of black and white genetics as the main breed for dairy farming. However, it was surprising that the expected predominance of Holstein was not reflected in the breed composition results, given that Holstein is the main breed imported into East African dairy systems.

Despite the fact that Lushoto and Rungwe are quite similar with regard to elevation and climate (both being in highland areas), the fodder density, feed availability, and farmer practices were quite different. Additionally, even though we did not collect body weight or heart girth data on the study cows, differences in animal stature were evident. Cows in Lushoto were smaller, were more horned, and had prominent dewlaps compared to those in Rungwe. Based on the breed composition results observed, and the fact that on average, Lushoto animals had about 50% Zebu signature, the differences can be confidently attributed to differential taurine gene introgression. The feed density available in Lushoto and associated management practices can hardly support higher grade exotics for the majority of the farmers, who would prefer lower grade crosses that require less rigorous maintenance. Additionally, the terrain in Lushoto is quite steep in many places, reducing the capacity of the land to hold enough fodder for the animals, while also presenting a soil nutrition challenge. Soils in Lushoto are less fertile compared to Rungwe and hence the feed mix available would be poorer. In Lushoto, most farmers feed crop residues (maize stover, Guatemala grass, and grain products), which are offered seasonally, mostly after harvest. However, farmers in Rungwe have a larger diversity of feeds, including purchased feeds, banana stalks, and Napier grass, among others, as the main feed source.

# Recommendations for Appropriate Breed Type

Usually, milk yields in smallholder farms do not follow the typical lactation curve, mostly due to poor management associated with erratic sub-optimal feeding and other constraints found in tropical production systems. To deal with this problem, and to increase the flexibility of resultant curves, a single trait animal model with Legendre polynomials of order 3 (with fixed curves nested within breed types) was fitted (**Supplementary Figure S1**). Legendre polynomials have been shown to perform well in such situations (Eva Strucken, personal communication). The mean production seen in Tanzania (5.9 l per day) is very similar to what has been recorded in Kenya and Uganda. A similar study carried out over a 2-year period in Kenya and Uganda (and with 39,000 milk yield records) resulted in very similar performance in smallholder systems, averaging 5.39 and 5.62 l, respectively (Unpublished). Smallholder farmers are the backbone of the dairy sector in Tanzania and East Africa. It is generally agreed that a successful dairy operation should utilize improved breed types given the low productivity of local zebu cattle. This desire for increased production drives farmers into crossbreeding, the general sense being that a purebred exotic animal is not suitable either for a majority of smallholder farmers. However, there is little information or evidence to support what should be the ideal breed type for various smallholder production situations. By evaluating the performance of various breed types within diverse management clusters, it is possible to provide general recommendations of the breed type most effective for each circumstance.
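To make the Legendre polynomial regression concrete, the following minimal sketch computes the covariates that such a model would use for one test-day record. It assumes that "order 3" means polynomials up to the cubic term, that days in milk are rescaled to the interval [-1, 1], and that the usual normalization for random regression models is applied. The lactation bounds of 5 and 305 days and the function name are assumptions, not values reported in the study.

```python
import math

def legendre_covariates(dim, dim_min=5, dim_max=305, order=3):
    """Normalized Legendre polynomial values P0..P_order at a given day in milk.

    dim_min and dim_max are assumed lactation bounds, not values from the study.
    """
    # Rescale days in milk to the interval [-1, 1] on which the polynomials are defined
    x = -1.0 + 2.0 * (dim - dim_min) / (dim_max - dim_min)
    p = [1.0, x]
    for n in range(1, order):
        # Bonnet recursion: (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    # Normalization sqrt((2k + 1) / 2) commonly used in test-day models
    return [math.sqrt((2 * k + 1) / 2.0) * p[k] for k in range(order + 1)]

# Covariates for a record taken at 155 days in milk (the midpoint, so x = 0)
print([round(v, 4) for v in legendre_covariates(155)])
```

Each test-day yield would then be regressed on these covariates, with a separate set of fixed regression coefficients for every breed type.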

Given the estimated breeding values obtained in this study and the top 10 ranked animals, it is clear that Holstein genetics are not well suited for the smallholder system of the kind profiled in this study. It is difficult to say whether the alternative is Friesian or Norwegian Red given the ineffectual separation of these two breeds in the study. We expect that there is significant representation of Friesian in the Norwegian Red breed, hence the lack of differentiation with the number of markers on the GGP SNP array. However, based on the breed utilization pattern in the region, the breed in question would mostly be Friesian, since most farmers either prefer or have easy access to the black and white cattle. A similar phenomenon was observed by Weerasinghe (2014), where exclusion of Ayrshire as a reference breed resulted in Ayrshire animals having higher Jersey or Guernsey composition. That animals with substantial Holstein background performed worse than smaller bodied alternatives is not surprising. Anecdotal evidence and common sense would dictate that in the face of limiting feed resources, sub-optimal management practices and extant disease pressure in smallholder systems, cows that are smaller framed would be preferred, not least because of the lower feeding requirements. However, as farmers chase larger milk yields, preference has fast shifted to Holsteins and their promise of huge milk production. One of the most illuminating outcomes in this analysis was the fact that some of the Zebu–RED cows, those of the low dairyness class, were ranked amongst the best performers in some management clusters. These animals, with dairyness less than 60%, typify the benefits that may be derived through regular performance recording and evaluation.
It would be interesting to identify the genetic signature of such animals, because they would best exemplify the model cow for smallholder systems – resistant to diseases, hardy enough to withstand poor feeds and ravages of the tropical smallholder system, but still competitive in terms of milk yield. However, because farmers do not routinely collect performance records, nor is there a consistent mechanism for performance evaluation, any hidden gem in the national herd is soon lost in pursuit of higher yields through inappropriate upgrading.

The results obtained in this study seem to suggest that the RED–Zebu with exotic genes between 75 and 85% are the most appropriate genotype for these systems followed by the RED–GUE. For farmers in the feed-intensive-commercially oriented dairy management group, the RED–HOL or RED–GUE crosses with at least 75% exotic genes were the best performing cows. Farmers in the low-feed–low-output subsistence oriented dairy farming would be best served with animals with breed composition of no more than 65% RED genes. This means that dairy farmers who are able to provide the feeding plane and management inputs for the Holstein, can still be well served by that breed type. However, this group does not represent the vast majority of smallholder farmers.

Collecting data from smallholder dairy systems is an enormously expensive and taxing exercise. Typically, routine collection of test day milk yield records does not happen and such data is the preserve of research institutions. There is no incentive for collecting such data for smallholder farmers because genetic evaluation programs are lacking. Where these systems exist, they are only run for large scale commercial farmers with large herd sizes. Given the extremely small number of animals kept by smallholder farmers (most farmers keep two dairy cows), the cohort sizes are too small for meaningful genetic evaluation to be undertaken. Additionally, smallholder farmers do not raise their own animals for replacement, being content to buy replacement stock from established farms when needed. These limitations contributed greatly to the low data volumes experienced in this study. With the limited data available, we were able to demonstrate that combining genomic data with lactation and other production records can be a powerful way of identifying appropriate genotypes for farmers given their extant management system. The results obtained in this study can serve as a basis to inform the development of the dairy sector in Tanzania. This is particularly important because the Tanzanian government has resolved to increase the number of improved dairy cattle to three million head and milk production from 1.6 billion to 6 billion liters annually in the next 10 years. Knowledge of what breed combinations are best suited for which production systems is critical and will determine the success of this ambitious goal.

The recommendations of breed types most suitable for the management clusters described in this study reflect only the sample set which was surveyed and highly related systems, and cannot be generalized across the diversity of smallholder farming enterprises. These are variable and are immensely influenced by socio-economic parameters, market orientation, available feed resources as well as other agro-ecological factors. Additionally, data for this study was collected over a 7-month period, and not a full lactation for each animal. The study duration was short and sample size limited. These results would gain tremendously from increasing the number of lactations, the number of test day records as well as larger sample sizes to solidify the recommendations proffered herein. However, such a study would be very costly. In practice, milk yield recording is not an entrenched practice in smallholder dairy systems. Such data collection would primarily be driven by hired enumerators, making the cost very high. Owing to limited funding and competing needs for available resources, data can only be collected for limited durations of time.

The recommendations made in this study are based solely on performance in terms of daily milk yield and do not account for other important issues such as cost of health treatment, reproductive management or feed provision. An economic analysis that accounts for all these additional variables will be useful in defining the most profitable genotype for each system.

# CONCLUSION

The use of SNP data and genomic relationships for the animals under study enabled performance evaluation of milk yield data in smallholder dairy farms without the need for pedigree records. The breeding value estimates so obtained were instrumental in determining that the RED–Zebu breed type with exotic genes between 75 and 85% was the most appropriate genotype for the majority of the management clusters except the high input clusters. Given that the majority of smallholder farmers operate in circumstances where the intensity of input (especially feed) provision is quite limited, the recommended breed type would be the most applicable on a wide scale. These results indicate that matching breed type to production management group is central to sustainable intensification and maximizing productivity. The observations made in this study will serve as a basis to inform the development of the dairy sector in Tanzania and Eastern Africa at large.

# ETHICS STATEMENT

This study was performed following the International Livestock Research Institute (ILRI) Institutional Animal Care and Use Committee (IACUC) guidelines, with approval reference number 2014.35. Animals were handled by experienced animal health professionals to minimize discomfort and injury.

# AUTHOR CONTRIBUTIONS

FM conceived the project, designed the study, and obtained funding. AK, DN, and MA were involved in collection of field data. JR, EC, FM, and RM analyzed the data and contributed to drafting individual segments for the analyses. FM consolidated inputs and drafted the manuscript. JR, YZ, MA, and RM made suggestions and corrections. All authors read and approved the final manuscript.

# FUNDING

This study was made possible with funding obtained through AgriTT Research Challenge Fund from the DFID, United Kingdom.

# ACKNOWLEDGMENTS

Genotypes for the reference breeds were thankfully obtained from Olivier Hanotte (East African Shorthorn Zebu), Tad Sonstegard (Norwegian Red, Holstein, Guernsey, Jersey, N'Dama, Gir), and Edinburgh Genetic Evaluation Services (EGENES), Scotland's Rural College, Edinburgh (Friesian).

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00375/full#supplementary-material



**Conflict of Interest Statement:** FM and EC were employed by USOMI Limited, a private company during manuscript preparation. AK was employed by Badili Innovations during the manuscript preparation. All the work relating to this research was done prior to their employment by the respective companies. All other authors declare no competing interests.

Copyright © 2019 Mujibi, Rao, Agaba, Nyambo, Cheruiyot, Kihara, Zhang and Mrode. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Natural Selection Footprints Among African Chicken Breeds and Village Ecotypes

Ahmed R. Elbeltagy<sup>1,2</sup>\*, Francesca Bertolini<sup>1</sup>, Damarius S. Fleming<sup>1,3</sup>, Angelica Van Goor<sup>1,4</sup>, Chris M. Ashwell<sup>5</sup>, Carl J. Schmidt<sup>6</sup>, Donald R. Kugonza<sup>7</sup>, Susan J. Lamont<sup>1</sup> and Max F. Rothschild<sup>1</sup>

<sup>1</sup> Department of Animal Science, Iowa State University, Ames, IA, United States, <sup>2</sup> Department of Animal Biotechnology, Animal Production Research Institute, Giza, Egypt, <sup>3</sup> Virus and Prion Diseases of Livestock Research Unit, National Animal Disease Center, Agricultural Research Service, United States Department of Agriculture, Ames, IA, United States, <sup>4</sup> Institute of Food Production and Sustainability, National Institute of Food and Agriculture, United States Department of Agriculture, Washington, DC, United States, <sup>5</sup> Department of Poultry Science, North Carolina State University, Raleigh, NC, United States, <sup>6</sup> Department of Animal and Food Sciences, University of Delaware, Newark, DE, United States, <sup>7</sup> Department of Agricultural Production, Makerere University, Kampala, Uganda

### Edited by:

Peter Dovc, University of Ljubljana, Slovenia

### Reviewed by:

Edgar Farai Dzomba, University of KwaZulu-Natal, South Africa Luca Fontanesi, University of Bologna, Italy

\*Correspondence:

Ahmed R. Elbeltagy ahmed\_elbeltagi@yahoo.com

### Specialty section:

This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics

Received: 19 May 2018 Accepted: 09 April 2019 Published: 08 May 2019

### Citation:

Elbeltagy AR, Bertolini F, Fleming DS, Van Goor A, Ashwell CM, Schmidt CJ, Kugonza DR, Lamont SJ and Rothschild MF (2019) Natural Selection Footprints Among African Chicken Breeds and Village Ecotypes. Front. Genet. 10:376. doi: 10.3389/fgene.2019.00376

Natural selection is likely a major factor in shaping genomic variation of the African indigenous rural chicken, driving the development of genetic footprints. Selection footprints are expected to be associated with adaptation to locally prevailing environmental stressors, which may include factors as diverse as high altitude, disease resistance, poor nutrition, and oxidative and heat stresses. To determine the existence of a selection footprint, 268 birds were randomly sampled from three indigenous ecotypes from East Africa (Rwanda and Uganda) and North Africa (Baladi), and two registered Egyptian breeds (Dandarawi and Fayoumi). Samples were genotyped using the chicken Affymetrix 600K Axiom® Array. A total of 494,332 SNPs were utilized in the downstream analysis after implementing quality control measures. The intra-population runs of homozygosity (ROH) that occurred in >50% of individuals of an ecotype or in >75% of a breed were studied. To identify inter-population differentiation due to genetic structure, FST was calculated for North- vs. East-African populations and Baladi and Fayoumi vs. Dandarawi for overlapping windows (500 kb with a step-size of 250 kb). The ROH and FST mapping detected several selective sweeps on different autosomes. Results reflected selection footprints of the environmental stresses, breed behavior, and management. Intra-population ROH of the Egyptian chickens showed selection footprints bearing genes for adaptation to heat, solar radiation, ion transport and immunity. The high-altitude-adapted East-African populations' ROH showed a selection signature with genes for angiogenesis, oxygen-heme binding and transport. The neuroglobin gene (GO:0019825 and GO:0015671) was detected on a Chromosome 5 ROH of Rwanda–Uganda ecotypes. The sodium-dependent noradrenaline transporter, SLC6A2, on a Chromosome 11 ROH in the Fayoumi breed may reflect its active behavior. Inter-population FST among Egyptian populations reflected genetic mechanisms for the Fayoumi resistance to Newcastle Disease Virus (NDV), while FST between Egyptian and Rwanda–Uganda populations indicated the Secreted frizzled related protein 2, SFRP2, on Chromosome 4, that contributes to melanogenic activity and most likely enhances the Dandarawi chicken adaptation to the high intensity of solar radiation in Southern Egypt. These results enhance our understanding of the role of natural selection forces in shaping genomic structure for adaptation to the stressful African conditions.

Keywords: selection signatures, environmental stresses, African chicken, FST, runs of homozygosity

# INTRODUCTION

Chicken domestication began in Asia as a combination of several local domestication events between 6,000 and 8,000 years ago (Miao et al., 2013; Mwacharo et al., 2013). Meanwhile, intensive human-directed selection for economic traits and the development of breeds is much more recent. A study based on mitochondrial D-loop sequences (Osman et al., 2016) suggested that African chickens can be separated into two clades: the first includes North-African (e.g., Egypt), Central African, European, and West and Central Asian chickens, while the second clade includes East-African (e.g., Uganda and Rwanda) and the Pacific chickens. The authors suggested that the first clade group likely originated from South-Asia and migrated to West-Asia, then arrived in Africa through Egypt, while the second clade migrated from the Pacific to East-Africa through the Indian Ocean. Present Egyptian chicken populations, as an example of the North-Africa chickens, include pure native breeds, such as Fayoumi and Dandarawi, and admixed fowl ecotypes which originated from unplanned crossings among native populations and are identified by their geographic distribution (ecotypes), such as the Baladi (synonym of local) and its naked neck type (Hosny, 2006). The Fayoumi is a medium-sized breed (average 2 kg for male and 1.6 kg for female) characterized by early maturation (150 days), aggressive behavior, flying ability and resistance to several pathogens, including resistance to Rous Sarcoma (Prince, 1958), Marek's disease virus (Lamont et al., 1996) and E. tenella infection (coccidiosis) (Pinard-van der Laan et al., 2009; Bacciu et al., 2014). The Dandarawi is an auto-sexing bird and the smallest Egyptian breed (average 1.4 kg for male and 1.2 kg for female). This breed originated in Southern Egypt (Qena Governorate), which is characterized by a hot (>40°C), dry climate with intensive solar radiation. In Uganda and Rwanda, representing East Africa, where chicken breeding programs are absent, there are different admixed chickens (ecotypes) that vary in phenotypic characteristics and performance (Fleming et al., 2016).

According to the Köppen climate classification (Peel et al., 2007), Egypt is located in the Warm desert climate zone, while Uganda and Rwanda are in the Tropical savanna zone. The main environmental differences between Egypt and both Eastern Africa countries are altitude, precipitation, and temperature. According to the World Meteorological Organization (WMO), World Weather Information Service<sup>1</sup>, the 30-year averages for the major meteorological parameters for the capital of each country are as follows: Egypt has the hottest and driest weather with larger diurnal variation. Average temperatures ranged between 18.9 and 34.7°C, with an average annual precipitation rate of 2.47 ml. In Rwanda, average temperatures ranged from 25.9 to 28.2°C with an average annual precipitation rate of 79.24 ml, while Ugandan average temperatures ranged between 26.9 and 29.3°C with a precipitation rate of 105.24 ml. Altitude averages are, respectively, 75, 1,497 and 1,155 m in Egypt, Rwanda, and Uganda. For climatic variation among sampling locations of indigenous Egyptian chicken populations, Khalil et al. (2011) classified Egypt into six Agro-climatic zones according to the evapotranspiration (ETo), which considers major weather parameters, i.e., solar radiation, air temperature and humidity, and wind speed. According to the ETo mapping, Qalyubia (source of Baladi), Fayoum (source of Fayoumi) and Qena (source of Dandarawi) governorates belong to different ETo zones. The solar Atlas of Egypt (Khalil et al., 2010) indicated that average annual solar radiation ranges from 2,000 (North) to 3,200 (South) kWh/m<sup>2</sup>/year, and accordingly, Egypt was classified into 12 belts (zones). The Nile delta (including Qalyubia Governorate, source of Baladi ecotype) is located in a solar radiation belt that receives between 5.5 and 6.6 kWh/m<sup>2</sup>/day, while Fayoum (Mid-Egypt) receives 7.0–7.3 kWh/m<sup>2</sup>/day and Qena (Southern Egypt and source of Dandarawi) receives 8.3–8.5 kWh/m<sup>2</sup>/day. For solar radiation estimates in Rwanda, Batalla and Parellada (2015) reported a much lower variation than in Egypt, ranging between 4.98 kWh/m<sup>2</sup>/day in Kayonza district and 5.28 kWh/m<sup>2</sup>/day in Bugesera district, while annual ETo (mm/day) varied between 4.49 in Kayonza and 4.9 in Bugesera districts. In Uganda, average solar radiation ranged between 17.2 MJ/m<sup>2</sup> (4.78 kWh/m<sup>2</sup>/day) in Kabale and 21.5 MJ/m<sup>2</sup> (5.97 kWh/m<sup>2</sup>/day) in Soroti (Djaman et al., 2017). Under such a wide spectrum of environmental variability in Egypt, which does not exist in Rwanda and Uganda, and in the absence of structured breeding plans, we speculate that the rural chicken populations in the study are under different selection pressures driven by environmental stressors.
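The Ugandan solar radiation figures are quoted in two energy units. A short check, using only the definition 1 kWh = 3.6 MJ, shows how the megajoule values map onto the kilowatt-hour values given in parentheses. The function name is illustrative.

```python
MJ_PER_KWH = 3.6  # 1 kWh = 3.6 MJ by definition

def mj_to_kwh(mj_per_m2):
    """Convert an energy density from MJ/m2 to kWh/m2."""
    return mj_per_m2 / MJ_PER_KWH

for site, mj in [("Kabale", 17.2), ("Soroti", 21.5)]:
    print(site, round(mj_to_kwh(mj), 2))
# Kabale 4.78
# Soroti 5.97
```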

The current study aims to identify genomic footprints of natural selection in some North- vs. East-African chicken breeds and ecotypes raised in and adapted to different local environments. The analytical approach combined high-density genotype-based, intra-population runs of homozygosity (ROH) and the allele-frequency-based inter-population genetic differentiation (FST). ROH exist when identical haplotypes are inherited from each parent. ROH analysis can indicate population history and trait architecture (Ceballos et al., 2018). The length of ROH reflects individual demographic history and level of inbreeding. Meanwhile, the homozygosity burden can be used to detect the genetic architecture of complex traits (Ceballos et al., 2018). It was also reported that ROH are universally common in genomes, even among outbred human individuals. In cattle, a large proportion of ROH are likely the result of the accumulation of elite alleles from long-term selective breeding programs (Zhang et al., 2015). Therefore, ROH was selected for studying population architecture and investigating selection signatures resulting from natural selective forces in the indigenous African chicken breeds that are usually outbred and have been exposed to local natural selection forces for uncountable generations. FST is one of the most widely used measures for assessing genetic differentiation. It plays a major role in ecological and evolutionary genetic studies. Since the emergence of next generation sequencing data, it has been shown that a large number of genetic markers can compensate for small sample sizes when estimating FST (Willing et al., 2012). Given the variation in sample size among the chicken populations studied, FST was selected for assessing genetic variation and detecting inter-population selection signatures.

<sup>1</sup>http://worldweather.wmo.int/en/home.html

# MATERIALS AND METHODS

# Sample Collection, Genotyping, and Quality Control

A total of 268 blood samples were collected on FTA cards from birds of East Africa (EA; Rwanda and Uganda), and North Africa (NA; Egypt). Samples were collected by local veterinarians following the approved country standards of animal care practices. A total of 172 samples were collected in EA: 100 Rwandan and 72 Ugandan ecotypes. Rwandan samples were collected from the Huye (n = 25), Kicukiro, Kirehe, Musanze, Nyagatare, and Rubavu (n = 15 for each) districts. Ugandan samples were collected from three districts: Kamuli, Masaka, and Luweero (n = 24 for each). For more details on Ugandan and Rwandan samples see Fleming et al. (2016). A total of 96 samples were collected from Egypt: 31 Egyptian Native Naked Neck Baladi (hereafter referred to as Baladi) from three villages in Qalyubia Governorate (30°24′36″ N, 31°12′36″ E, 19 m) in the Delta; 31 Fayoumi from four villages in Mid-Egypt (Fayoum Governorate, 29°21′48″ N, 30°44′45″ E, 14 m); and 34 Dandarawi from four villages in Southern Egypt (Qena Governorate, 26°8′34.8″ N, 32°43′40.8″ E, 76 m). Chicken blood samples from Egypt, Rwanda, and Uganda were collected in accordance with the local veterinary guidelines in each country. All samples were collected with the consent of the owners of the chickens.

Genotyping of all samples was conducted at GeneSeek (Lincoln, NE, United States) using the Affymetrix Axiom® 600K Array (Kranis et al., 2013). A total of 494,332 SNPs and 266 birds were retained for the downstream analysis after applying QC thresholds of MAF >0.05 and call rate >0.97 to all samples using PLINK 1.9 (Chang et al., 2015). The raw data supporting the conclusions of this manuscript will be made available by the authors, on request, without undue reservation, to any qualified researcher.
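The two QC thresholds can be illustrated with a minimal sketch. The study applied them with PLINK 1.9. The pure-Python function below is only an illustration, and it assumes that the call-rate threshold is evaluated per SNP, which the text does not specify. Genotypes are coded as counts of one allele (0, 1 or 2), with `None` for a missing call. The function name is hypothetical.

```python
def snp_passes_qc(genotypes, min_maf=0.05, min_call_rate=0.97):
    """Check one SNP against the MAF and call-rate thresholds.

    genotypes: list of 0/1/2 allele counts across birds, None for a missing call.
    Applying the call rate per SNP is an assumption of this sketch.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    call_rate = len(called) / len(genotypes)
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    maf = min(freq, 1 - freq)               # minor allele frequency
    return maf > min_maf and call_rate > min_call_rate

print(snp_passes_qc([0, 1, 2, 1] * 25))   # True: MAF 0.5, no missing calls
print(snp_passes_qc([0] * 99 + [1]))      # False: MAF 0.005
```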

### Population Stratification and Structure

PLINK 1.9 (Chang et al., 2015) was used for constructing a multidimensional scaling (MDS) plot of the first two components, based on a 266 × 266 matrix of genome-wide Identity-By-State (IBS) scores calculated from pairwise comparisons of the genetic distances for all individuals. Ancestral model-based clustering, with no prior knowledge of breed origins, was performed using ADMIXTURE 1.2.2 (Alexander et al., 2009) to investigate individual admixture proportions for 1 < k < 10, where k is the number of expected subpopulations, and the best k was determined based on the cross-validation error for different numbers of ancestral genetic backgrounds.
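As an illustration of the input to the MDS step, the sketch below builds a pairwise distance matrix from IBS sharing. It assumes genotypes coded as 0/1/2 allele counts with `None` for missing calls, and defines the distance as one minus the mean proportion of alleles shared identical-by-state. The function names and toy genotypes are hypothetical, and the MDS projection itself, which PLINK performed in the study, is not reproduced here.

```python
def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state over jointly called SNPs."""
    shared, n = 0.0, 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue  # skip SNPs missing in either bird
        shared += (2 - abs(a - b)) / 2.0
        n += 1
    return shared / n if n else float("nan")

def ibs_distance_matrix(genotypes):
    """Symmetric matrix of 1 - IBS for a list of per-bird genotype vectors."""
    k = len(genotypes)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            d = 1.0 - ibs_similarity(genotypes[i], genotypes[j])
            dist[i][j] = dist[j][i] = d
    return dist

birds = [[0, 1, 2, 2], [0, 1, 2, 2], [2, 1, 0, 0]]  # toy data, three birds
print(ibs_distance_matrix(birds)[0])  # [0.0, 0.0, 0.75]
```

With 266 birds the same procedure gives the 266 × 266 matrix on which the first two MDS components are computed.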

### Runs of Homozygosity

Runs of homozygosity analyses were carried out for both individual populations and combined EA and NA breeds/ecotypes using PLINK 1.9 to examine overlapping genomic regions that harbored alleles driven to fixation within each population or group of populations using a SNP-based sliding window approach. ROH requirements were defined as follows: ≥300 SNPs per run; a minimum SNP density of one SNP per 50 kb; a maximum gap of 10 kb between consecutive homozygous SNPs; up to three heterozygous calls within a run, to account for genotyping errors and/or hitch-hiking events; and an allelic match threshold of 0.95 identity and >20 SNPs. Overlapping ROH were considered to be those that overlapped across all populations, regardless of their length, and consensus ROH were those that reached a consensus in either >50% of the individuals of an ecotype or >75% of a breed, except for the Rwanda and Uganda ecotypes, where a 40% consensus threshold was accepted. A gene ontology (GO) enrichment analysis was conducted for the list of genes located in the identified ROH consensus regions.
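The consensus step can be sketched as a sweep over ROH boundaries that keeps the segments covered in more than a given fraction of birds. This is only an illustration of the idea under stated assumptions and is not PLINK's consensus procedure: the ROH are given as base-pair intervals on a single chromosome, intervals within a bird do not overlap, and the function name and toy coordinates are hypothetical.

```python
def consensus_roh(roh_by_bird, n_birds, min_fraction):
    """Return (start, end) segments covered by ROH in more than min_fraction of birds.

    roh_by_bird: one list of (start_bp, end_bp) ROH intervals per bird, all on the
    same chromosome; intervals within a bird are assumed not to overlap.
    """
    events = []
    for intervals in roh_by_bird:
        for start, end in intervals:
            events.append((start, 1))   # a run opens
            events.append((end, -1))    # a run closes
    # At equal positions, process openings before closings so abutting runs do not split
    events.sort(key=lambda e: (e[0], -e[1]))
    segments, depth, seg_start = [], 0, None
    for pos, delta in events:
        was_consensus = depth / n_birds > min_fraction
        depth += delta
        is_consensus = depth / n_birds > min_fraction
        if is_consensus and not was_consensus:
            seg_start = pos
        elif was_consensus and not is_consensus and pos > seg_start:
            segments.append((seg_start, pos))
    return segments

toy = [[(100, 500)], [(200, 600)], [(300, 700)], []]  # four birds, one without ROH
print(consensus_roh(toy, n_birds=4, min_fraction=0.50))  # [(300, 500)]
print(consensus_roh(toy, n_birds=4, min_fraction=0.40))  # [(200, 600)]
```

The two calls mirror the thresholds in the text: the 50% rule used for ecotypes, and the relaxed 40% rule accepted for the Rwanda and Uganda ecotypes, which widens the consensus region.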

## Fixation Index, FST, for Inter-Breed Genetic Differentiation

To identify regions under selection that are differentiated among breeds or ecotypes, an overlapping sliding window-based FST analysis was performed according to Karlsson et al. (2007). Pairwise comparisons were performed for North-African (Baladi, Dandarawi, and Fayoumi) vs. East-African (Rwanda and Uganda) populations, and for all pairwise population combinations, using overlapping windows along each chromosome. Each FST window spanned 500 kb with a step size of 250 kb. Only windows with ≥20 SNPs were considered. Candidate genomic regions under selection were defined by a cutoff of FST >0.30, which exceeds the value of 0.25 defined as very great genetic differentiation by Hartl and Clark (1997). GO enrichment analysis was also conducted on the genes located in the identified FST windows.
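The windowing scheme (500 kb windows, 250 kb step, ≥20 SNPs, cutoff 0.30) can be sketched in Python as follows. The per-SNP estimator used here is a simple heterozygosity-based FST with equal population weights, chosen only for illustration; it is an assumption and not necessarily the estimator of Karlsson et al. (2007). Function names and the toy allele frequencies are hypothetical.

```python
WINDOW_BP = 500_000   # window size stated in the text
STEP_BP = 250_000     # step size stated in the text
MIN_SNPS = 20         # minimum SNPs per window stated in the text
FST_CUTOFF = 0.30     # candidate-region cutoff stated in the text


def snp_fst(p1, p2):
    """FST = (HT - HS) / HT for one biallelic SNP and two populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0  # SNP fixed for the same allele in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def windowed_fst(snps, window_bp=WINDOW_BP, step_bp=STEP_BP, min_snps=MIN_SNPS):
    """snps: list of (position_bp, p1, p2) on one chromosome.

    Returns (window_start, window_end, n_snps, mean_fst) per retained window.
    """
    if not snps:
        return []
    windows = []
    last_pos = max(pos for pos, _, _ in snps)
    start = 0
    while start <= last_pos:
        end = start + window_bp
        values = [snp_fst(p1, p2) for pos, p1, p2 in snps if start <= pos < end]
        if len(values) >= min_snps:
            windows.append((start, end, len(values), sum(values) / len(values)))
        start += step_bp
    return windows


# Toy chromosome: 40 SNPs every 10 kb, strongly differentiated.
toy_snps = [(10_000 * i, 0.9, 0.1) for i in range(40)]
toy_windows = windowed_fst(toy_snps)
candidates = [w for w in toy_windows if w[3] > FST_CUTOFF]
```

The second overlapping window of the toy chromosome holds only 15 SNPs and is therefore dropped by the ≥20 SNP rule.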

### Annotation and Enrichment

Genes within the regions of high interest for both ROH and FST analyses were identified using bedtools v2.26.0 with the Gallus\_gallus-5.0 (GCA\_000002315.3) genome annotation<sup>2</sup>. GO terms for molecular function and biological process were determined for the identified genes by PANTHER using the Gallus gallus reference genome<sup>3</sup>, and enriched genes were identified using Enrichr (Chen et al., 2013). GO terms were considered statistically significant at adjusted P < 0.05. Results were filtered using REVIGO<sup>4</sup> (Supek et al., 2011) to remove redundancy and best classify significant GO terms per biological function.

<sup>2</sup>http://useast.ensembl.org/Gallus\_gallus/Info/Index?db=core <sup>3</sup>http://www.pantherdb.org/
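The annotation step amounts to intersecting candidate regions with gene coordinates. A minimal Python sketch of that interval overlap is shown below; it is not bedtools itself, and the function name, the half-open BED-style coordinates and the placeholder gene names and positions are assumptions.

```python
def genes_in_regions(regions, genes):
    """Names of genes overlapping any candidate region by at least 1 bp.

    regions: list of (chrom, start, end); genes: list of (chrom, start, end, name).
    Coordinates are treated as half-open intervals [start, end).
    """
    hits = set()
    for g_chrom, g_start, g_end, name in genes:
        for r_chrom, r_start, r_end in regions:
            if g_chrom == r_chrom and g_start < r_end and r_start < g_end:
                hits.add(name)
                break
    return sorted(hits)


# Placeholder coordinates, not taken from the Gallus_gallus-5.0 annotation.
toy_regions = [("8", 1000, 2000)]
toy_genes = [
    ("8", 1500, 2500, "geneA"),   # overlaps the region
    ("8", 2000, 3000, "geneB"),   # starts exactly where the region ends
    ("11", 1500, 1800, "geneC"),  # different chromosome
]
annotated = genes_in_regions(toy_regions, toy_genes)
```

Only `geneA` is reported, because half-open intervals that merely touch do not overlap.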

# RESULTS

# Population Stratification

The multi-dimensional scaling analysis (**Figure 1**) showed clear stratification and distinctive separation among the five populations studied. The first dimension (C1) separated the Egyptian (North-African) populations from both the Rwanda and Uganda (East-African) populations. The second dimension (C2) separated the Dandarawi (the smallest-sized breed, tolerant to the extreme heat and solar radiation of Southern Egypt) from both the Baladi (Nile Delta) and Fayoumi (Mid-Egypt). Baladi and Fayoumi (prevalent in the similar environments of the Nile Delta and Mid-Egypt) are genetically closer to each other than to the Southern-Egypt Dandarawi breed. The MDS also showed overlap between the Rwandan and Ugandan populations, as was also reported by Fleming et al. (2016). For the admixture analysis, the best K (K = 5) was determined based on the cross-validation error for different numbers of ancestral genetic backgrounds. Admixture analysis (**Figure 2**) showed that Dandarawi and Fayoumi were the only populations with minimal admixture. Baladi, Rwanda and Uganda are all ecotypes composed of an admixture of genetic backgrounds. Both Rwanda and Uganda chickens showed a composition of one common main genetic background (ancestral genotypes) and four other minor backgrounds. Each of the ecotypes (Baladi, Rwanda, and Uganda) shares one of its minor genetic backgrounds with each of the Dandarawi and Fayoumi.

# Runs of Homozygosity (ROH) Mapping

Total individual ROH, regardless of consensus conditions, were classified according to length into three classes (**Supplementary Table S1**): short (300 kb–<1 Mb), medium (1–<1.5 Mb), and long (>1.5 Mb). The number and length of individual ROH differed widely among the populations in the study, owing to the nature of each population (e.g., breed or ecotype), the number of samples and the genetic structure. Breeds (Fayoumi and Dandarawi) showed a higher average number of ROH than ecotypes. Egyptian Dandarawi showed the highest average number of ROH (180.8) and the highest percentage of medium (7.56%) and long (3.80%) ROH (**Supplementary Table S1**), indicating recent ancestral relationships and probably the highest inbreeding. Among the ecotypes, the Egyptian Baladi showed the lowest average number of ROH and the lowest number of long and medium-length ROH. This likely reflects ancestral relationships, low levels of inbreeding, a wider population gene pool, and geographical distribution, in addition to genetic admixture.
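The three length classes can be expressed as a small Python helper. The function names are hypothetical, and assigning a run of exactly 1.5 Mb to the long class is an assumption, since the text leaves that boundary undefined.

```python
def roh_length_class(length_bp):
    """Classify one ROH by length; runs shorter than 300 kb are unclassified."""
    if length_bp < 300_000:
        return None
    if length_bp < 1_000_000:
        return "short"
    if length_bp < 1_500_000:
        return "medium"
    return "long"  # assumption: exactly 1.5 Mb is counted as long


def class_percentages(lengths_bp):
    """Percentage of classified ROH falling in each length class."""
    counts = {"short": 0, "medium": 0, "long": 0}
    for length in lengths_bp:
        label = roh_length_class(length)
        if label is not None:
            counts[label] += 1
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical ROH lengths for one bird.
toy_lengths = [400_000, 500_000, 1_200_000, 2_000_000]
toy_shares = class_percentages(toy_lengths)
```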

# Intra-Population Footprints of Divergent Selection (Consensus ROH)

A total of 153 within-population consensus ROH were detected, with 41, 49, 35, and 28 in Baladi, Dandarawi, Fayoumi, and Rwanda–Uganda populations, respectively. Consensus ROH were found on Chromosomes 3, 5, and 8 in Rwanda–Uganda; 2, 3, 4, 8, and 11 in Fayoumi; 1, 4, and 8 in Dandarawi; and 2, 3, 8, and 11 in Baladi (**Supplementary Figure S1**). The number of genes enriched and annotated within the overlapping consensus ROH was 62, 33, 72, and 29 for Baladi, Dandarawi, Fayoumi, and Rwanda–Uganda populations, respectively. These genes contribute to adaptation/tolerance performance through their involvement in enzymatic (alpha amylase) and hormonal [corticosteroid and norepinephrine (NE)] activities; metabolism (lipid metabolism); reduction of oxidative stress (e.g., glutathione-S-transferase); tolerance to solar radiation (melanogenesis); ion binding and transport (sodium, potassium, and zinc); immunity and defense response (e.g., phagocytosis); oxygen-heme binding and transport; angiogenesis; apoptosis; tissue morphogenesis (e.g., bone trabecula formation); and tolerance of acute heat stress (heat shock protein transcription factor).

African chicken populations are Baladi ecotype (N = 31), Dandarawi breed (N = 33), Fayoumi breed (N = 30), Rwanda ecotype (N = 100), and Uganda ecotype (N = 72).

<sup>4</sup>http://revigo.irb.hr/

# Signature of Selection Detected by ROH Mapping

A total of 196 genes located in the 153 consensus ROH regions were used for detecting over-enriched GO terms. Enriched GO terms indicated biological processes and molecular functions promoting different mechanisms for adaptation to various cellular and environmental stressors (**Table 1**).
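For readers unfamiliar with over-representation testing, the following Python sketch shows one common formulation: a one-sided hypergeometric test per GO term followed by Benjamini-Hochberg adjustment. This is an assumption for illustration; the exact statistics and multiple-testing correction applied by PANTHER and Enrichr in this study are not stated in the text, and all counts below are toy values.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled from `population` genes,
    of which `annotated` carry the GO term."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(pvalues):
    """Adjusted P values, returned in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy case: 10 background genes, 3 carry the term, 4 genes in the list, 2 hits.
p_term = hypergeom_upper_tail(2, population=10, annotated=3, drawn=4)

# Toy P values for four hypothetical GO terms.
toy_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [p < 0.05 for p in toy_adjusted]
```

In the toy example only the first term remains below the adjusted P < 0.05 threshold quoted in the Methods.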

## (a) Selection Signatures Common in East-African (Rwanda–Uganda) and North-African (Fayoumi and Dandarawi) Populations

Genes annotated within ROH and enriched GO terms reflected a common signature of selection for energy generation and transport, and for ion binding, in both the East-African (Rwanda–Uganda) and North-African (Fayoumi and Dandarawi) chicken populations studied. The molecular function of alpha-amylase activity (GO:0004556) was enriched, and the AMY2A (alpha amylase 2) gene (located on Chromosome 8) was annotated in the three African populations (**Table 1**). AMY2A is involved in the biological process of carbohydrate and glycogen metabolism, indicating selection forces for metabolism, energy availability and response to thermal stress. The molecular function of calcium binding (GO:0005509) was commonly enriched in the same three populations. The annotated SLC25A24 (solute carrier family 25 member 24, calcium-regulated mitochondrial ATP-Mg/Pi carrier), on Chromosome 8, in both Rwanda–Uganda and Dandarawi (**Table 1**), is involved in the molecular function of calcium ion binding and energy (ATP) transmembrane transport. The physiological process of cellular response to oxidative stress (GO:0034599) was also commonly enriched in the same populations, indicating a common signature of selection for responses to oxidative stress.

## (b) Selection Signatures in the East-African Populations

Given the environmental conditions of the two East-African countries studied (Rwanda and Uganda), the major stresses on the local chicken populations were oxidative stress, which is a common denominator for other stresses; high altitude accompanied by lower oxygen availability; and lack of vaccination and poor health care. GO terms for the molecular functions of oxygen binding (GO:0019825) and heme binding (GO:0020037), and for the physiological processes of angiogenesis (GO:0001525) and oxygen transport (GO:0015671) (**Table 1**), reflected adaptation to lower oxygen availability due to high altitude. Annotated genes resulting from the ROH mapping included two associated genes on Chromosome 5: vasohibin-1 (VASH1) and neuroglobin (NGB). The VASH1 gene is involved in the biological processes of angiogenesis (GO:0001525), response to wounding (GO:0009611) and regulation of lymphangiogenesis (GO:1901491). The neuroglobin (NGB) gene is associated with the molecular function of oxygen binding to heme (GO:0019825) and with oxygen transport (GO:0015671), which contribute to adaptation to high altitude and low oxygen availability.

TABLE 1 | A subset<sup>1</sup> of gene ontology (GO) enrichment of consensus ROH analysis, and annotated genes in (a) East- and North-African populations, (b) East-African populations, and (c) North-African populations.

<sup>1</sup>The subset of GO terms that affect the adaptation profile to cellular or environmental stressors and were shown to be statistically significant.

For adaptation to oxidative stress, the annotated glutathione-S-transferase zeta 1 (GSTZ1) increases glutathione-S-transferase activity (GO:0004364) and is involved in the glutathione metabolic process (GO:0006749), and therefore decreases lipid oxidation products (Blackburn et al., 2006) as a response to oxidative stress. Glutathione-S-transferase is also involved in hepatic GST-mediated detoxification of feed-borne mycotoxins.

## (c) Selection Signatures in the North-African Populations

## **In both Dandarawi and Fayoumi**

Two GO terms associated with chloride transport were enriched: chloride channel activity (GO:0005254) and chloride transmembrane transport (GO:1902476). The chloride channel CLIC like 1 (CLCC1) gene, on Chromosome 8, was annotated in both GO terms in Dandarawi and Fayoumi (**Table 1**). CLCC1 is expressed in different organelles, including the endoplasmic reticulum (ER), Golgi apparatus, and nucleus, in testis, spleen, liver, kidney, heart, brain, and lungs (Nagasawa et al., 2001), and is involved in the biological processes of cation–anion (chloride) transport. The loss of CLCC1 leads to disruption of chloride anion homeostasis in the ER and therefore to disruption of protein-folding capacity and ER stress (Jia et al., 2015).

## **In both Baladi and Fayoumi (originated from Delta and Mid-Egypt regions)**

Gene ontology terms for the biological processes of NE transport (GO:0015874) and dopamine uptake involved in synaptic transmission (GO:0051583), and for the molecular function of oxidoreductase activity (GO:0016491), were commonly enriched (**Table 1**). For both NE transport and dopamine uptake, the sodium-dependent noradrenaline transporter, solute carrier family 6 member 2 (SLC6A2), on Chromosome 11, was annotated (**Table 1**). SLC6A2 is involved in NE transport and is associated with the pathophysiology of attention-deficit/hyperactivity disorder (ADHD) in children (Sengupta et al., 2012). Dopamine uptake involved in synaptic transmission refers to the directed movement of dopamine into a presynaptic neuron or glial cell; dopamine is a catecholamine neurotransmitter and a metabolic precursor of noradrenaline and adrenaline. The dopamine level in plasma was found to be highly correlated with adaptation to cold and heat stresses (Felver-Gant et al., 2012). SLC6A2 may therefore contribute to the high physical activity and adaptation to heat stress in both Egyptian populations. In addition, the hydroxysteroid 11-beta dehydrogenase (HSD11B2), on Chromosome 11, annotated for oxidoreductase activity (GO:0016491), is a microsomal enzyme complex that oxidizes the glucocorticoid cortisol to the inactive metabolite cortisone. This activity limits the impact of cortisol and would support the immunity and defense response of the Fayoumi and Baladi populations.

## **In Fayoumi**

Gene ontology terms for the physiological processes of anion transport (GO:0006820) and anion transmembrane transport (GO:0098656) were enriched. Common putative annotated genes for those GO terms were the Na<sup>+</sup>-Cl<sup>−</sup> cotransporter solute carrier family 12 member 3 (SLC12A3) and the K<sup>+</sup>-Cl<sup>−</sup> cotransporter (SLC12A4), on Chromosome 11. SLC12A3 is a cotransporter in the kidney that is involved in sodium ion transport and chloride transmembrane transport. It reabsorbs sodium and chloride ions from the tubular fluid into the distal convoluted tubule cells of the nephron. SLC12A4 exhibits chloride symporter activity, playing key roles in electrolyte movement across epithelia and in intracellular chloride homeostasis of neurons and muscle cells (Payne, 2012). The annotated ion-transport related genes reflected a signature of selection for homeostasis that promotes adaptation in the Egyptian Fayoumi breed.

Three glucocorticoid-associated GO terms were enriched: the molecular function of 11-beta-hydroxysteroid dehydrogenase activity (GO:0003845) and the physiological processes of glucocorticoid biosynthesis (GO:0006704) and response to glucocorticoid (GO:0051384). HSD11B2, on Chromosome 11, was the gene commonly annotated in the three GO terms (**Table 1**). HSD11B2, as previously mentioned, is a microsomal enzyme complex that oxidizes the glucocorticoid cortisol to the inactive metabolite cortisone, which limits the impact of cortisol.

The enriched physiological process GO term of bone trabecula formation (GO:0060346) is involved in Fayoumi bone and ligament morphogenesis (**Table 1**). The matrix metallopeptidase 2 (MMP2) gene (Chromosome 11), annotated in this GO term, contributes to the biological process of tissue morphogenesis, e.g., collagen catabolism and bone trabecula formation. MMP2 may therefore contribute to the distinctive morphogenesis characteristics of Fayoumi. Both GO terms of growth factor activity (GO:0008083) and regulation of apoptotic process (GO:0043065) were enriched in Fayoumi, and OSGIN1 (oxidative stress-induced growth inhibitor 1), on Chromosome 11 (**Table 1**), was annotated for both terms. OSGIN1 encodes an oxidative stress response protein that regulates cell death and apoptosis by inducing cytochrome c release from mitochondria (Ott et al., 2002). OSGIN1 inhibits growth in several tissues, e.g., ovary, kidney and liver, in response to different stressors. The homozygous genotype of OSGIN1 could function in the Fayoumi stress response, including suppression of proliferation and induction of apoptosis under the stressful Egyptian conditions.

## **In Dandarawi**

Natural selection forces in the extremely stressful environment of Southern Egypt include severe hot weather, high-intensity solar radiation, lack of vaccination, and poor health care services. The effects of these selective forces were reflected in the enriched GO terms for the molecular function of melatonin receptor activity (GO:0008502) and for response to radiation (GO:0009314). Expression of the annotated melatonin receptor type 1C (Mel1c) (**Table 1**) was reported to be associated with light intensity (Li et al., 2013; Park et al., 2014). The high solar intensity of Qena (8.3–8.5 kWh/m<sup>2</sup>/day; Khalil et al., 2010), the source of Dandarawi, could be the selection force that fixed the Mel1c homozygosity. On Chromosome 4, the secreted frizzled-related protein 2 (SFRP2) was annotated (GO:0009314; response to radiation). SFRP2 is involved in chicken embryogenesis: development of the neural system (brain tissue), muscles (myogenesis), and the developing eyes, particularly the pigmented layer of the retina and photoreceptors (Lin et al., 2007). SFRP2 stimulates melanogenesis through microphthalmia-associated transcription factor and/or tyrosinase upregulation via β-catenin signaling.

## **In Baladi**

The Baladi is the only naked neck population (ecotype) in this study. The physiological process of protein homotrimerization (GO:0070207) enriched in this ecotype reflected the homotrimerization of heat shock protein factor. Heat shock factor proteins 1, 2, 3, and 4 were annotated on Baladi Chromosome 11 (**Table 1**), reflecting the population's adaptation to heat. Xie et al. (2014) reported that HSF4 exhibits tissue-specific expression, with preferential expression in heart, brain, skeletal muscle, and pancreas, and with two alternatively spliced isoforms, HSF4a and HSF4b. HSF4a acts as an inhibitor, while HSF4b acts as an activator, of tissue-specific heat shock gene expression.

# Fixation Index, FST, for Inter-Populations Genetic Differentiation

Population stratification analyses (**Figure 1**) and ROH results (**Table 1**) indicated three genetically differentiated chicken groups that were considered for fixation index (FST) analysis: (1) Dandarawi, (2) Baladi and Fayoumi (1 and 2 represent North-African populations), and (3) Rwanda and Uganda (East-African). Two comparisons were conducted; East-African vs. North-African and Baladi and Fayoumi vs. Dandarawi populations.

East-African vs. North-African FST indicated one selection sweep on Chromosome 4 (20.2–20.3 Mb), **Figure 3A**. The FST regions identified on Chromosome 4 contained several genes playing roles in cell cycle, differentiation and proliferation, i.e., SFRP2, FGA, FGB and FGG (fibrinogen alpha, beta, and gamma chains) and PLRG1 (pleiotropic regulator 1). GO enrichment analysis indicated the GO term for the physiological process of cell differentiation (GO:0030154). Within this enriched GO term, annotated genes were found to contribute to the development of the muscular and neural systems [myogenin (MYOG), SFRP2, neuropilin 1 (NRP1), and nerve growth factor (NGF)]. Myogenin (MYOG) acts as a transcriptional activator that promotes transcription of muscle-specific target genes and plays a role in muscle differentiation. MYOG induces myogenesis (differentiation of fibroblasts into myoblasts) in a variety of cells and tissues. SFRP2, as previously indicated, is involved in chicken embryogenesis: development of the neural system (brain tissue), muscles (myogenesis), and the developing eyes, particularly the pigmented layer of the retina and photoreceptors (Lin et al., 2007). NRP1 is involved in the development of the cardiovascular system, angiogenesis, the formation of certain neuronal circuits and organogenesis outside the nervous system. NGF is a neurotrophic factor and neuropeptide primarily involved in the regulation of growth, maintenance, proliferation, and survival of certain neurons. In addition, the annotated MAPK9 (mitogen-activated protein kinase 9) is involved in a wide variety of cellular processes such as proliferation, differentiation, transcription regulation and development. It targets specific transcription factors and mediates immediate-early gene expression in response to various cell stimuli, and is involved in UV radiation-induced apoptosis.

Baladi and Fayoumi (Delta and Mid-Egypt) vs. Dandarawi (Southern Egypt) FST (**Figure 3B**) revealed two windows on Chromosome 11 (19.2–20.2 Mb). The functions of the annotated genes could explain some of the genetic variation among the Egyptian breeds, in particular the genetic uniqueness driven by extreme environmental stress and breeding practices in Southern Egypt (Dandarawi) and a distinctive immunity profile in Fayoumi. Enriched GO terms revealed biological processes and molecular functions associated with immunity, i.e., autophagy (GO:0006914) and positive regulation of natural killer cell mediated cytotoxicity (GO:0045954); regulation of skeletal muscle fiber development (GO:0048742); adaptation to oxidative stress, i.e., superoxide metabolic process (GO:0006801), cellular response to oxygen radical (GO:0071450) and nitric oxide biosynthetic process (GO:0006809); tolerance to irradiation and high-intensity solar radiation, i.e., endosome to melanosome transport (GO:0035646) and melanosome organization (GO:0032438); and cell cycling and aging, i.e., regulation of telomere maintenance (GO:0032204) and negative regulation of telomere maintenance (GO:0032205).

The annotated putative genes also reflected similar mechanisms of adaptation. Among the immunity-relevant genes, the annotated GABA type A receptor associated protein like 2 (GABARAPL2), **Table 2**, is a member of the Atg8 (autophagy-related protein 8) family that contributes to the formation of autophagosomes. This indicated genetic variation in the autophagy process between the two genetic groups. Among the genes associated with cell cycle and bird aging, the TERF2IP (TERF2 interacting protein, or telomeric repeat binding factor 2 interacting protein, RAP1) gene (GO:0032204 and GO:0032205) encodes a protein that is part of a complex involved in the biological processes of telomere length and protection (telomere maintenance, telomere maintenance via telomere lengthening and regulation of telomere maintenance).

The AP1G1 (adaptor related protein complex 1 gamma 1 subunit) gene, which was annotated in the two GO terms GO:0035646 and GO:0032438, plays a major role in Dandarawi feather pigmentation (melanosome organization and transport). AP1G1 could reflect the sex-linked variation in feather coloring (Dandarawi males and females have different colors) and Dandarawi tolerance to intensive solar radiation in Southern Egypt. AP1G1 was also annotated in GO:0045954 (positive regulation of natural killer cell mediated cytotoxicity), **Table 2**. The ZFHX3 (zinc finger homeobox 3) gene affects the regulation of myoblast differentiation and fiber development (GO:0048742; regulation of skeletal muscle fiber development).

<sup>1</sup>The subset of GO terms that affect the adaptation profile to cellular or environmental stressors. <sup>2</sup>The North-African population group is composed of Baladi, Fayoumi, and Dandarawi samples, while the East-African population group is composed of Rwanda and Uganda samples.

For adaptation to oxidative stress, the annotated SOD1, superoxide dismutase [Cu–Zn], contributes to the biological processes of cellular response to oxygen radical (GO:0071450) and superoxide metabolic process (GO:0006801), **Table 2**. The role of SOD1, as an anti-oxidizing enzyme, is to convert harmful superoxide radicals into less reactive oxygen species (ROS) and hydrogen peroxide, H<sub>2</sub>O<sub>2</sub> (Bosco, 2015). In addition, the annotated NQO1 [NAD(P)H dehydrogenase, quinone 1] is a major quinone reductase that is highly inducible and plays multiple roles in cellular adaptation to stress. Reported roles of NQO1 include quinone detoxification, functioning as a component of the plasma membrane redox system that generates antioxidant forms of ubiquinone and vitamin E, and functioning as a superoxide reductase (Ross and Siegel, 2017).

Concerning adaptation to thermal stress, the annotated NOS1 (nitric-oxide synthase 1) contributes to the molecular function of nitric-oxide synthase activity (GO:0004517) and to the nitric oxide biosynthetic process (GO:0006809). Yadav et al. (2016) demonstrated significantly higher expression of different types of NOS (P < 0.05) in goats during summer and winter peaks than during the moderate season. The authors therefore suggested the possible involvement of the NOS family genes in ameliorating thermal stress and maintaining cellular integrity and homeostasis.

# DISCUSSION

# Population Stratification

In African developing countries that lack genetic improvement schemes applied to indigenous chicken genetic resources, the major forces driving genetic diversity are natural biotic/abiotic stresses, including flock management. In this study, the MDS reflected genetic divergence between the smallest Egyptian breed, Dandarawi, adapted to Southern Egypt's extreme heat and solar radiation conditions, and the other two populations, which belong to the less stressful environment of the Delta and Mid-Egypt. Both MDS and admixture analyses confirmed the genetic similarity between the Rwandan and Ugandan ecotypes that has been reported by Fleming et al. (2016). The admixture analysis confirmed that Baladi and Fayoumi share a major ancestral background. Gene flow from Fayoumi to indigenous Baladi ecotypes likely occurred as a result of indiscriminate breeding practices in the villages. East-African ecotypes (Rwanda and Uganda) share a portion of their genetic backgrounds with both the Dandarawi and Fayoumi breeds. Considering that (1) no genetic exchange was reported between the Egyptian and East-African populations, and (2) North-African and East-African chickens belong to different clades (origins), according to the mitochondrial D-loop sequence study (Osman et al., 2016), such common genetic backgrounds could be due to the ancestral part of the genome. This may strengthen the hypothesis that ancient chickens were first introduced to Egypt from Asia through the cinnamon trade and then transported to other parts of the African continent, including Rwanda and Uganda (MacDonald, 1992; Blench and MacDonald, 2000; Mwacharo et al., 2013), a hypothesis that needs more investigation.

# Runs of Homozygosity (ROH) Mapping

A total of 153 within-population consensus ROH were detected: 41, 49, 35, and 28 in Baladi, Dandarawi, Fayoumi, and Rwanda–Uganda populations, respectively. The chromosomal distribution of the within-population consensus ROH varied among the five populations studied, i.e., the highest ROH signals were found on Chromosomes 2, 3, 8 and 11 in Baladi; 2, 3, 4, 8, and 11 in Fayoumi; 1, 4, and 8 in Dandarawi; and 3, 5, and 8 in Rwanda–Uganda populations. Chromosome 8 bore signatures of selection in all studied populations. Fleming et al. (2016) studied ROH in Rwanda and Uganda populations and reported a different chromosomal distribution of overlapping and consensus ROH. This is due to the use of different ROH analysis parameters and different overlapping and consensus conditions. Fleming et al. (2016) considered overlapping ROH to be those that overlapped across all populations and contained 10 or more individuals, and inter-breed consensus ROH to be those common to every bird, irrespective of the length of the ROH.
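Because the outcome depends on the consensus condition, the rule used in this study (a region shared by >50% of the individuals of an ecotype, >75% of a breed, or 40% for the Rwanda and Uganda ecotypes) can be sketched in Python as a sweep over ROH boundaries. The function name, the half-open interval convention and the strict inequality applied to every threshold are assumptions for illustration.

```python
def consensus_segments(roh_by_bird, n_birds, min_fraction):
    """Regions on one chromosome covered by ROH in more than
    `min_fraction` of the birds.

    roh_by_bird: dict bird_id -> list of (start_bp, end_bp), half-open,
    non-overlapping within a bird. Returns merged (start_bp, end_bp) tuples.
    """
    events = []
    for intervals in roh_by_bird.values():
        for start, end in intervals:
            events.append((start, 1))
            events.append((end, -1))
    events.sort()  # at equal positions, run ends (-1) sort before run starts

    segments = []
    depth, prev = 0, None
    for pos, delta in events:
        if prev is not None and pos > prev and depth / n_birds > min_fraction:
            if segments and segments[-1][1] == prev:
                segments[-1] = (segments[-1][0], pos)  # extend adjacent segment
            else:
                segments.append((prev, pos))
        depth += delta
        prev = pos
    return segments


# Four hypothetical birds; the fourth carries no ROH on this chromosome.
toy_roh = {
    "bird1": [(100, 500)],
    "bird2": [(200, 600)],
    "bird3": [(300, 700)],
    "bird4": [],
}
strict = consensus_segments(toy_roh, n_birds=4, min_fraction=0.50)
relaxed = consensus_segments(toy_roh, n_birds=4, min_fraction=0.40)
```

Relaxing the threshold from 50% to 40% widens the toy consensus region from 300–500 to 200–600, which illustrates how different consensus conditions can change the reported chromosomal distribution.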

# Signature of Selection Detected by ROH Mapping

### (a) Selection Signatures Common in East- and North-African Populations

Under African village conditions, with a lack of standardized rations, chicken feeding is mainly based on scavenging (free range), household waste and some grain supplementation. Therefore, carbohydrate metabolism, energy generation and transport are important traits for adaptation. The enriched GO term (GO:0004556; alpha-amylase activity) and the annotated AMY2A (alpha amylase 2) gene in the Rwanda–Uganda, Fayoumi and Dandarawi populations suggested signatures of selection for carbohydrate and glycogen metabolism, and for response to thermal stress and unbalanced feeding. On the same chromosome, SLC25A24 (solute carrier family 25 member 24, calcium-regulated mitochondrial ATP-Mg/Pi carrier) was annotated in both Rwanda–Uganda and Dandarawi populations. SLC25A24 is involved in calcium ion binding (GO:0005509) and cellular response to oxidative stress (GO:0034599) (Ehmke et al., 2017; Harborne et al., 2017). SLC25A24 imports adenine nucleotides from the cytosol into the mitochondrial matrix and exports phosphate to the cytosol. This process controls the size of the adenine nucleotide pool of the mitochondrial matrix in response to cellular energetic demands (Harborne et al., 2017) and supports the adenine-dependent mitochondrial activities, including gluconeogenesis, mitochondrial biogenesis and mitochondrial DNA maintenance. Regulation of energy transport by SLC25A24 is crucial for adaptation to stressful conditions in African villages. Fleming et al. (2016) reported over-enrichment of the molecular function of calcium ion binding (GO:0005509), related to environmentally induced oxidative stress, in East-African (Rwanda and Uganda) ecotypes. In the same study, the authors also reported enrichment of GO:0034599 (cellular response to oxidative stress) in the Rwanda ecotype.

## (b) Selection Signature in the East-African Populations

In the absence of structured selection schemes for productive performance, stressful conditions are the major selection forces on indigenous East-African chicken populations. The stresses in the East-African countries include high altitude and lower oxygen availability, and oxidative stress, in addition to lack of vaccination and poor veterinary services. Altitude averages are 1,497 m and 1,155 m in Rwanda and Uganda, respectively, in comparison with 75 m in Egypt. High altitude is accompanied by lower partial oxygen pressure and less effective oxygen availability. Enriched GO terms indicated the biological processes of angiogenesis (GO:0001525) and oxygen transport (GO:0015671), and the molecular functions of heme binding (GO:0020037) and oxygen binding (GO:0019825). Annotated genes within detected ROH in the Rwanda–Uganda populations reflected the effects of high altitude and management forces, e.g., feeding quality, on shaping genetic divergence. The vasohibin-1 (VASH1) gene is involved in angiogenesis, regulation of endothelial cell proliferation in response to wounding (GO:0009611), and regulation of lymphangiogenesis (GO:1901491) (Heishi et al., 2010; Miyashita et al., 2012; Affara et al., 2013; Sato, 2013). Fleming et al. (2017) also reported strong selection toward angiogenesis, and Fleming et al. (2016) reported enrichment of wound healing (GO:0042060) in the Rwanda and Uganda populations. Neuroglobin (NGB) is a neuron-specific globin shown to protect against hypoxia, ischemia and oxidative stress, and is associated with oxygen transport and oxygen-heme binding (Mammen et al., 2002; Milton et al., 2006; Hümmler et al., 2012). This reflected tolerance of Rwanda–Uganda chickens to high altitude and their capacity for wound healing. Because oxidative stress results from various stressors, including heat, pathogen invasion, and high solar radiation, it is a common denominator of stress responses in African chickens.
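To give a sense of scale for the altitude effect, the following Python sketch estimates the partial pressure of oxygen at the three average altitudes quoted above. It uses the standard-atmosphere barometric formula for the troposphere and a dry-air oxygen fraction of 20.95%; both are assumptions introduced here for illustration and were not part of the study's analysis.

```python
SEA_LEVEL_PRESSURE_KPA = 101.325  # standard atmosphere at sea level
OXYGEN_FRACTION = 0.2095          # dry-air oxygen fraction (assumed constant)


def oxygen_partial_pressure_kpa(altitude_m):
    """Approximate O2 partial pressure from the standard-atmosphere formula."""
    pressure = SEA_LEVEL_PRESSURE_KPA * (1 - 2.25577e-5 * altitude_m) ** 5.25588
    return OXYGEN_FRACTION * pressure


# Average altitudes quoted in the text.
egypt = oxygen_partial_pressure_kpa(75)
uganda = oxygen_partial_pressure_kpa(1155)
rwanda = oxygen_partial_pressure_kpa(1497)

# Relative oxygen availability compared with the Egyptian average altitude.
uganda_ratio = uganda / egypt
rwanda_ratio = rwanda / egypt
```

Under these assumptions the oxygen partial pressure at the Rwandan and Ugandan average altitudes is on the order of 10–20% lower than at the Egyptian average altitude.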
The GO term for the molecular function of glutathione transferase activity (GO:0004364) was enriched. GSTZ1 (glutathione-S-transferase zeta 1), annotated in Rwanda–Uganda, enables the molecular functions of glutathione transferase and peroxidase activities in response to oxidative stress. Fleming et al. (2016) reported signatures of selection related to genes and signaling pathways involved in the reduction of ROS through utilization of calcium ions, lipids, and kinases, as the mobilization of Ca<sup>2+</sup> is a part of the trade-off in Ca<sup>2+</sup> usage between the calcium needed for eggshell formation and that stored in the ER. Maize contamination with the feed-borne mycotoxin aflatoxin B1 (AFB1) is a common problem in the humid East-African environment. Nishimwe et al. (2017) reported that most of the animal feed containing maize in Rwanda has >100 µg/kg of AFB1. AFB1 has a high hepatotoxic effect on different poultry species. Domesticated turkeys (Meleagris gallopavo) were found to be very susceptible to AFB1 because they lack functional hepatic GST-mediated detoxification of AFBO (electrophilic exo-AFB1-8,9-epoxide), while the wild turkey (Meleagris silvestris) was resistant owing to its hepatic capacity for GST-mediated AFBO detoxification (Reed et al., 2018).

Annotated GSTZ1 could be reflecting natural selection for both reduction of oxidative stress and resistance to aflatoxins in Rwanda–Uganda ecotypes.

## (c) Selection Signature in the North-African (Egyptian) Chicken Populations

A crucial factor in stress tolerance is the dynamic relationship between cations and anions to maintain body fluid and cell homeostasis (Mongin, 1980). Calcium, potassium, and sodium are major cations, while chloride is a major anion, in chickens. Low chloride levels can affect the acid–base balance and increase blood pH. Several GO terms associated with cation/anion binding and transport were enriched in Egyptian populations: chloride channel activity (GO:0005254) and chloride transmembrane transport (GO:1902476) in Dandarawi and Fayoumi; and anion transport (GO:0006820) and anion transmembrane transport (GO:0098656) in Fayoumi. Fleming et al. (2016) reported the enrichment of the GO term calcium ion transmembrane transport (GO:0070588) in consensus ROH in Rwanda and Uganda populations, which may indicate a common contribution of anion/cation balance to the adaptation profile of African chicken populations. Several genes associated with cation/anion binding and transport were annotated in the consensus ROH of the Egyptian breeds: the chloride channel CLIC like 1 (CLCC1), annotated in Fayoumi and Dandarawi; and the Na<sup>+</sup>-Cl<sup>−</sup> co-transporter solute carrier family 12 member 3 (SLC12A3) and the K<sup>+</sup>-Cl<sup>−</sup> co-transporter solute carrier family 12 member 4 (SLC12A4), annotated in Fayoumi. CLCC1 was reported by Nagasawa et al. (2001) to be expressed in different organelles, including the ER, and in the kidney. Jia et al. (2015) showed that loss of CLCC1 and disruption of chloride homeostasis in the ER disrupted the protein-folding capacity of the ER and resulted in ER stress, misfolded protein accumulation, and neurodegeneration. SLC12A3 is a cotransporter in the kidney that reabsorbs sodium and chloride ions from the tubular fluid into the distal convoluted tubule cells of the nephron. SLC12A4 plays key roles in electrolyte movement across epithelia and in intracellular chloride homeostasis of neurons and muscle cells (Payne, 2012). It was also reported to contribute to the osmotic fragility of erythrocytes (Hanzawa et al., 2002). The annotated ion-transport related genes reflected a signature of selection for homeostasis and metabolism that promoted stress tolerance in the Egyptian chicken populations.

In both Fayoumi and Baladi (sourced from Mid-Egypt and the Delta, and showing a common ancestral background), results indicated that adrenaline and noradrenaline play roles in their adaptation profiles, as the biological processes of norepinephrine (NE) transport (GO:0015874) and dopamine uptake (GO:0051583) were both enriched. Dopamine is a neurotransmitter and a precursor of adrenaline. Stress activates the hypothalamus–pituitary–adrenal (HPA) axis, which increases the release of glucocorticoids from the adrenal glands that, in concert with other neuromodulators, e.g., noradrenaline, promote cognitive adaptation to stressful conditions (Krugers et al., 2012). The sodium-dependent noradrenaline transporter (solute carrier family 6 member 2, SLC6A2) was annotated in both Fayoumi and Baladi, while 11β-hydroxysteroid dehydrogenase type 2 (HSD11B2) was annotated in Fayoumi. SLC6A2 is involved in NE transport and availability, while HSD11B2 oxidizes the glucocorticoid cortisol to the inactive metabolite cortisone, preventing illicit activation of the mineralocorticoid receptor. HSD11B2 is expressed in aldosterone-sensitive neurons and is responsible for promoting appetite for sodium (feeding behavior), independently of thirst or hunger (Jarvie and Palmiter, 2017). Inhibition of HSD11B2 causes mineralocorticoid excess and hypertension due to inappropriate glucocorticoid activation of renal mineralocorticoid receptors (Chapman et al., 2013). HSD11B2 oxidation of glucocorticoids would support the immunity and defense response of Fayoumi (Prince, 1958; Lamont et al., 1996; Pinard-van der Laan et al., 2009; Bacciu et al., 2014).

Fayoumi is characterized by its ability to fly, which is expected to be reflected in its genome structure and selection footprints. Two GO terms for the biological processes of bone trabecula formation (GO:0060346) and regulation of bone remodeling (GO:0046850) were enriched in Fayoumi. MMP2 (matrix metallopeptidase 2), annotated in Fayoumi, contributes to the biological processes of tissue morphogenesis, collagen catabolism, and bone trabecula formation (spongy bone that contains the red bone marrow). The annotated TNFRSF11B is a member of the TNF-receptor superfamily and is responsible for the production of an osteoblast-secreted decoy receptor that functions as a negative regulator of bone resorption. Both annotated MMP2 and TNFRSF11B may be related to the distinctive morphogenesis, mineral density, and ability to fly of Fayoumi (Geleta et al., 2013).

Oxidative stress increases levels of lipid peroxidation and elevates hydrogen peroxide levels in the cytosol and mitochondria (Chandrashekar and Muralidhara, 2010). To offset oxidative stress, cells respond with elevated glutathione levels, increased activities of glutathione-dependent enzymes, and increased membrane permeability and intracellular Ca<sup>2+</sup> levels. Two genes contributing to oxidative stress reduction were annotated in Fayoumi: OSGIN1 and HSD11B2 (hydroxysteroid 11-beta dehydrogenase 2). OSGIN1 encodes an oxidative stress response protein that regulates cell death and apoptosis by inducing cytochrome c release from mitochondria (Ott et al., 2002). OSGIN1 expression is regulated by p53, is induced by DNA damage, and inhibits growth in several tissues. The homozygous genotype of OSGIN1 could play a role in the Fayoumi response to oxidative stress, through its anti-proliferative function and the induction of apoptosis, at the cost of growth performance under stressful village conditions.

The severely stressful hot, dry, and high-solar-intensity conditions of Southern Egypt left their signature on the Dandarawi genome. Two GO terms, the molecular function of melatonin receptor activity (GO:0008502) and the biological process of response to radiation (GO:0009314), were enriched. Expression of melatonin receptor type 1C (Mel1c; ortholog of mammalian GPR50) is activated by monochromatic (green) light in several organs and subsequently activates several immune- and development-related processes within these organs. For instance, Mel1c activates B-lymphocyte proliferation in the broiler bursa (Li et al., 2013), T-lymphocyte proliferation in the broiler thymus (Chen et al., 2016), development of the newly hatched chick's liver via an anti-oxidation pathway (Wang et al., 2014), and secretion of insulin-like growth factor 1 in the chick embryo liver (Li et al., 2016). The high solar intensity of Qena, the source of Dandarawi (8.3–8.5 kWh/m<sup>2</sup>/day; Khalil et al., 2010), could be the selection force that fixed Mel1c homozygosity in the Dandarawi breed to promote its adaptation and immunity characteristics. As previously indicated, the annotated SFRP2 stimulates melanogenesis through microphthalmia-associated transcription factor and/or tyrosinase upregulation via β-catenin signaling.

The Baladi ecotype has unique heat tolerance due to the naked neck phenotype, compared with the other populations in this study. The enriched GO term for the biological process of protein homotrimerization (GO:0070207) and the annotated heat shock transcription factors 1, 2, 3, and 4 (HSF1, 2, 3, and 4) revealed the adaptation of this population to the Egyptian heat conditions. The heat shock proteins are chaperone proteins that effectively protect several proteins and cell organelles from the negative effects of stressors, mainly heat. Among the heat shock transcription factors, HSF4 exhibits tissue-specific expression with two alternatively spliced isoforms, HSF4a and HSF4b. HSF4a acts as an inhibitor, while HSF4b acts as an activator, of tissue-specific heat shock gene expression (Xie et al., 2014).

# Fixation Index, FST, for Inter-Populations Genetic Differentiation

To study the genomic differentiation resulting from natural environmental stresses, two comparisons were proposed. The first is genomic variation between North- and East-African chicken populations (hot arid desert vs. tropical savanna, according to Peel et al., 2007). The second is variation between Baladi and Fayoumi vs. Dandarawi, considering the results of population stratification and the similarity in ROH mapping between Baladi and Fayoumi. This allowed investigating both inter-population genomic variation and the possible signatures of selection due to climatic variation between the Delta/Mid-Egypt and Southern Egypt regions.

For the genomic differentiation resulting from the selection forces of the distinct climates of the North- and East-African countries studied, the GO term for the biological process of cell differentiation (GO:0030154) was enriched. Multiple genes contributing to the development of the muscular and neural systems were annotated (SFRP2, MAPK9, MYOG, NRP1, and NGF). The annotated SFRP2, as previously indicated, stimulates melanogenesis through microphthalmia-associated transcription factor and/or tyrosinase upregulation via β-catenin signaling (Kim et al., 2016). MAPK9 (mitogen-activated protein kinase 9) is involved in a wide variety of cellular processes such as proliferation, differentiation, transcription regulation, and development. It targets specific transcription factors, mediates immediate-early gene expression in response to various cell stimuli, and is involved in UV radiation-induced apoptosis. The annotated SFRP2 and MAPK9 reflected selection footprints for the high intensity of solar radiation in the Southern Egypt governorate of Qena (source of Dandarawi), which receives 8.3–8.5 kWh/m<sup>2</sup>/day of solar radiation (Khalil et al., 2010). Scavenging Dandarawi are strongly affected by this higher intensity of solar radiation, and the melanogenic activity of SFRP2 very likely contributed to their adaptation. Phenotypic variations among North-African (Baladi, Dandarawi, and Fayoumi) and East-African (Rwanda and Uganda) populations were reflected in the annotated myogenin gene. Myogenin (MYOG) induces myogenesis (fibroblasts to differentiate into myoblasts) in a variety of cells and tissues, through its action as a transcriptional activator that promotes transcription of muscle-specific target genes. Both NRP1 and NGF play a role in neuron development. NRP1 is involved in the development of the cardiovascular system, angiogenesis, the formation of certain neuronal circuits, and organogenesis outside the nervous system. NGF is a neurotrophic factor and neuropeptide primarily involved in the regulation of growth, maintenance, proliferation, and survival of certain neurons. Alteration in the incubation conditions of developing chicks might change the developmental trajectories of some physiological regulation systems and may affect the quality of the young chick during the first few days post-hatching (Tzschentke and Plagemann, 2006). Tong et al. (2013) reported that incubation conditions, embryonic physiological parameters, and other environmental factors are important for proper differentiation and actual hatching times. Environmental variation between North- and East-African sampling locations (Egypt vs. Rwanda and Uganda) and its effects on embryonic development and cell differentiation in chicks could be the selection forces for the annotated genes.

FST analyses comparing the Delta/Mid-Egypt populations (Baladi and Fayoumi) with the Southern Egypt population (Dandarawi) revealed genetic variation resulting from the different environmental stresses and breeding practices in Southern Egypt. Atg8 contributes to the formation of autophagosomes (Kabeya et al., 2000). The annotated GABA type A receptor associated protein like 2 (GABARAPL2) (GO:0006914; biological process of autophagy) is a member of the Atg8 family. Sun et al. (2014) reported that Newcastle Disease Virus (NDV) triggers autophagy, resulting in enhanced virus replication in chicken cells and tissues. Results of Hassan et al. (2004), based on the challenge of four Egyptian chicken breeds with NDV, indicated that Dandarawi, along with the Gimmizah synthetic breed, was highly susceptible (100% mortality for both breeds) to NDV infection. Deist et al. (2017) reported that the Fayoumi showed a significantly lower viral load than the Leghorns at 6 days post-infection, indicating the Fayoumi's potential for clearing the virus and possibly overcoming infection more efficiently than the Leghorns. The FST results could reflect variation between Fayoumi and Dandarawi in autophagy, which in turn would indicate variation in their resistance to NDV.

Under rural poultry production in Southern Egypt, no regular culling (for genetic improvement) is practiced and birds are prone to an extended production life, associated with an extended number of cell cycles, which could promote a signature of selection for telomere length and stability. FAO (2009), in a study characterizing the domestic chicken and duck production systems in Egypt, indicated that 100% of the interviewed households (209 households) in Sohag (the governorate neighboring Qena) reported longevity as a major criterion for selecting the birds that they buy. The enriched GO terms for the biological processes of negative regulation of telomere maintenance (GO:0032205) and regulation of telomere maintenance (GO:0032204) may reflect the variation in bird longevity between Egyptian populations under different production systems. The TERF2IP gene, annotated in the two GO terms, encodes a protein that is part of a complex involved in telomere length and protection (O'Connor et al., 2004; Martinez et al., 2010; Chen et al., 2011). It is likely that the annotated TERF2IP was a signature of selection for "longevity" in the Southern Egypt Dandarawi breed.

Climatic variation among the sampling locations of indigenous Egyptian chicken populations has been reported. Egypt was classified into 12 zones according to the solar Atlas of Egypt (Khalil et al., 2010). The Nile delta (source of Baladi) receives 5.5–6.6 kWh/m<sup>2</sup>/day; Fayoum (Mid-Egypt and source of the Fayoumi population) receives 7.0–7.3 kWh/m<sup>2</sup>/day; and Qena (Southern Egypt and source of Dandarawi) receives 8.3–8.5 kWh/m<sup>2</sup>/day. In the absence of structured breeding plans, we speculated that the Egyptian rural chicken populations in the study are under different selection pressures driven by variations in solar radiation. The enriched GO terms for the biological processes of endosome to melanosome transport (GO:0035646) and melanosome organization (GO:0032438) could emphasize the variation between Dandarawi and both Baladi and Fayoumi in their tolerance to solar radiation stress. The AP1G1 (adaptor related protein complex 1 gamma 1 subunit) gene, annotated in both GO terms, plays a major role in feather pigmentation (melanosome organization and transport). AP1G1 could reflect both the sex-linked variation in feather coloring (Dandarawi males and females have different colors) and tolerance to intensive solar radiation in Dandarawi under the stressful Southern Egypt environment. Fleming et al. (2016) reported the enrichment of the GO terms for response to radiation (GO:0009314) and DNA repair (GO:0006281) in Rwanda and Uganda populations, explaining this as possibly a result of the birds living at the equator.

# CONCLUSION

In conclusion, the results of this study indicated that environmental stresses played major roles in shaping the genomic variation of African chicken populations. In Egypt, Baladi and Fayoumi were genetically closer to each other than to the Southern Egypt Dandarawi population, while Rwanda and Uganda chickens showed a clear overlap in their genomic structure, being under very similar environmental conditions. Although no genetic exchange was reported between Egyptian populations (Fayoumi and Dandarawi) and East-African ecotypes (Rwanda and Uganda), the existence of some common genetic background between the two groups of populations could be due to the ancestral part of the genome, according to the hypothesis that ancient chickens were first introduced to Egypt from Asia through the cinnamon trade and then transported to other parts of the African continent, including Rwanda and Uganda.

Intra-population ROH and inter-population FST mapping revealed selection footprints of possible environmental stresses, breed characteristics, and management. ROH of all native African populations showed selection footprints for energy transport, calcium ion binding, and reduction of oxidative stress. North-African (Egyptian) populations, under a hot desert environment, showed likely selection footprints for adaptation to heat, solar radiation, ion transport, and immunity. East-African populations, under tropical savanna and higher altitude conditions, showed signatures of selection for oxygen-heme binding and transport and for reduction of oxidative stress. Behavioral and phenotypic characteristics were reflected by ROH mapping in the study. Genes associated with the availability and transport of corticosteroids and NE could reflect the active behavior of the Fayoumi breed. FST mapping and its annotated genes emphasized the genetic variations likely generated by natural selective forces. Egyptian Fayoumi showed distinctive genetic mechanisms for their resistance to endemic diseases, e.g., NDV. Management issues of chicken flocks, including extended bird longevity in Southern Egypt households, were also reflected in genes associated with telomere maintenance. These results enhance our understanding of the role of natural selection forces in shaping genomic variation and of the genes contributing to adaptation under stressful African conditions.

# ETHICS STATEMENT

The study presented in this manuscript involved blood sample collection from rural chickens in Egyptian, Rwandan, and Ugandan villages. Samples were collected by local village veterinarians following the approved country standards for minimizing any bird discomfort.

# AUTHOR CONTRIBUTIONS

AE, FB, and MR conceptualized and designed the work. DF, AVG, and DK collected the samples and data. AE, FB, SL, and MR analyzed and interpreted the data. AE drafted the manuscript. FB, DF, AVG, CA, CS, DK, SL, and MR critically revised the manuscript. All authors have approved the final version of the manuscript to be published.

# FUNDING

Multiple funding and supporting entities contributed to the current study, including:

(1) Support from the Fulbright Foundation, which funded the stay of the corresponding author (Visiting Scholar fellowship).

(2) Financial support for sample collection, lab analyses, and other operational costs was provided by ISU Ensminger Fund, State of Iowa and Hatch funding, Agricultural Research Service, Research Participation Program, administered by the Oak Ridge Institute for Science and Education (ORISE). The study was also supported by Agriculture and Food Research Initiative Competitive Grant 2011-67003-30228 from the United States Department of Agriculture, National Institute of Food and Agriculture (NIFA).

# ACKNOWLEDGMENTS

Appreciation is due to the indigenous chicken holders in Egypt, Rwanda, and Uganda for providing biological samples and information used in this study. The authors are also thankful for the support from the Fulbright Foundation, ISU Ensminger Fund, State of Iowa and Hatch funding. Support provided by the Agricultural Research Service, Research Participation Program, administered by the Oak Ridge Institute for Science and Education (ORISE) is highly appreciated. ORISE is managed by Oak Ridge Associated Universities under DOE Contract No. DE-AC05-06OR23100.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00376/full#supplementary-material



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2019 Elbeltagy, Bertolini, Fleming, Van Goor, Ashwell, Schmidt, Kugonza, Lamont and Rothschild. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Genome Analysis Reveals Genetic Admixture and Signature of Selection for Productivity and Environmental Traits in Iraqi Cattle

### *Akil Alshawi1,2\*, Abdulameer Essa3, Sahar Al-Bayatti3 and Olivier Hanotte1,4*

*1 Division of Cells, Organisms and Molecular Genetics, School of Life Sciences, Faculty of Medicine and Health Sciences, University Park Campus, University of Nottingham, Nottingham, United Kingdom, 2 Department of Internal and Preventive Veterinary Medicine, College of Veterinary Medicine, University of Baghdad, Iraqi Ministry of Higher Education and Scientific Research, Baghdad, Iraq, 3 Animal Genetics Resources Department, Directorate of Animal Resources, the Ministry of Iraqi Agriculture, Baghdad, Iraq, 4 LiveGene, International Livestock Research Institute (ILRI), Addis Ababa, Ethiopia*

### *Edited by:*

*Johann Sölkner, University of Natural Resources and Life Sciences Vienna, Austria*

### *Reviewed by:*

*Zhe Zhang, South China Agricultural University, China Edgar Farai Dzomba, University of KwaZulu-Natal, South Africa*

### *\*Correspondence:*

*Akil Alshawi akil.alshawi@nottingham.ac.uk; dr.akilalshawi1@gmail.com*

### *Specialty section:*

*This article was submitted to Livestock Genomics, a section of the journal Frontiers in Genetics*

*Received: 22 October 2018 Accepted: 11 June 2019 Published: 16 July 2019*

### *Citation:*

*Alshawi A, Essa A, Al-Bayatti S and Hanotte O (2019) Genome Analysis Reveals Genetic Admixture and Signature of Selection for Productivity and Environmental Traits in Iraqi Cattle. Front. Genet. 10:609. doi: 10.3389/fgene.2019.00609*

The Near East cattle are adapted to different agro-ecological zones including desert areas, mountain habitats, and humid regions along the Tigris and Euphrates river system. The region was one of the earliest and most significant areas of cattle husbandry. Currently, four main breeds of Iraqi cattle are recognized. Among these, the Jenoubi is found in the southern, more humid part of Iraq, while the Rustaqi is found in the middle and drier region of the country. Despite their importance, Iraqi cattle have up to now been poorly characterized at the genome level. Here, we report at a genome-wide level the diversity and signatures of positive selection in these two breeds. Thirty-five unrelated Jenoubi cattle, sampled in the Maysan and Basra regions, and 60 Rustaqi cattle, from around Baghdad and Babylon, were genotyped using the Illumina Bovine HD BeadChip (700K). Genetic population structure and diversity level were studied using principal component analysis (PCA), expected heterozygosity (*He*), observed heterozygosity (*Ho*), and admixture. Signatures of selection were studied using extended haplotype homozygosity (EHH) (*iHS* and *Rsb*) and inter-population Wright's *Fst*. The results of PCA and admixture analysis, which included European taurine, Asian indicine, and African indicine and taurine reference breeds, indicate that the two breeds are zebu × taurine crossbreeds, with more zebu background in Jenoubi than in Rustaqi cattle. The Rustaqi has the greatest mean heterozygosity (*He =* 0.37) among all breeds. *iHS* and *Rsb* signatures of selection analyses identify 68 candidate genes under positive selection in the two Iraqi breeds, while *Fst* analysis identifies 220 candidate genes including genes related to the innate and acquired immune responses, different environmental selection pressures (e.g., tick resistance and heat stress), and genes of commercial interest (e.g., marbling score).

Keywords: *Bos taurus*, *Bos indicus*, genetic structure, diversity, positive selection, immune responses, adaptive genes

# INTRODUCTION

Archeological and genetic studies support two centers of cattle domestication, the Fertile Crescent and the Northern part of the Indian subcontinent including the Indus Valley (Loftus et al., 1994; Bradley et al., 1998; Troy et al., 2001; Helmer et al., 2005; Bradley and Magee, 2006; Zeder, 2008; Chen et al., 2010; Magee et al., 2014). The earliest archeological evidence of humpless cattle was found in the Fertile Crescent, dated to around 10,000 bc. The first evidence of domestic humped cattle is from the Indus Valley region around 8,000 bc (Felius et al., 2015). From these two heartlands of domestication, two main cattle types, *Bos taurus* (humpless taurine) and *Bos indicus* (humped zebu), dispersed across the world, with taurine cattle reaching Africa, Europe, and East Asia and indicine cattle migrating to Africa, South Asia, and South-East Asia (Hanotte et al., 2002; Diamond and Bellwood, 2003; Freeman et al., 2006a; Gifford-Gonzalez and Hanotte, 2011; Magee et al., 2014). The domestication process of animals was essentially a form of symbiosis with humans enabling the dissemination of domesticated cattle throughout the world (Diamond and Bellwood, 2003).

Cattle husbandry was part of the ancient civilizations of Mesopotamia, modern-day Iraq, from an early time, with the earliest available evidence of domestic cattle in this region dating to around 6000 bc. There is much archeological evidence of the antiquity and importance of cattle husbandry in central Mesopotamia, including cylinder seals (Read, 2015). These animals were of the humpless taurine type. Further south, closer to the Indus Valley center of cattle domestication, archeological evidence of domestic cattle in Mesopotamia is much scarcer. It includes artistic depictions from the royal tombs of Ur (South of Iraq) including domestic animals (**Supplementary Figure S1A**, **B**), with most of the ancient agricultural settlements along the Tigris and Euphrates now buried under flooded plains (Read, 2015).

Despite the high cattle number worldwide, it is estimated that 17% of cattle breeds are facing extinction, following changing environmental and production conditions (Rischkowsky and Pilling, 2007; Taberlet et al., 2011; FAO, 2015; Felius et al., 2015). Endangered breeds of cattle are mostly found in developing countries (Joost et al., 2015). For instance, 32% of the recognized indigenous African breeds are at risk of extinction, and another 22% are already considered extinct (Mwai et al., 2015).

Several studies have explored the genome diversity and adaptation of cattle breeds using either high-density SNP chips, such as the Bovine high-density SNP BeadChip with 777,962 SNPs (e.g., Xu et al., 2014; Bahbahani et al., 2017), or full genome sequence analysis (e.g., Choi et al., 2015; Kim et al., 2017). However, the cattle genomes of the Fertile Crescent (including Iraq) have not yet been characterized. The available studies are a microsatellite study (Ateş et al., 2014) and mitochondrial DNA analyses for a few breeds (Freeman et al., 2006a,b; Edwards et al., 2007; Ateş et al., 2014). These studies show evidence of zebu introgression within the Near East taurine, in particular within the Iraqi and Anatolian breeds.

Today, four cattle breeds are officially recognized in Iraq (Al-Murrani et al., 2003) (**Supplementary Table S1**): Karradi and Sharabi in the northern part of the country, Rustaqi in the central part, and, further south, Jenoubi. Phenotypically, Rustaqi may be classified as taurine and Jenoubi as zebu. Living in different agro-ecologies, they are expected to be adapted to different environmental challenges, including external and internal parasites or infectious diseases, heat, and humidity. In addition, some commercial breeds have been introduced over the years, including Jersey, Hereford, Ayrshire, and Holstein-Friesian cattle (Al-Murrani et al., 2003), with crossbreeding between Sharabi and Friesian cattle documented (Dabdoub, 2005; Maaroof, 2011; Nasser et al., 2013; Nassar et al., 2014). Until now, none of these Iraqi native breeds have been documented at a genome-wide level, despite the uniqueness of the country in the region of taurine cattle domestication and its historic importance as a major center of civilization. We report for the first time, at the autosomal genome-wide level, the genome diversity and candidate signatures of positive selection (e.g., tick resistance genes) in two Iraqi cattle breeds (Jenoubi and Rustaqi), providing new insights into the past and present breeding dynamics and evolutionary forces that shaped the genome of the cattle population in the region.

# MATERIALS AND METHODS

# Population Samples

We collected blood samples spotted on FTA paper (Whatman Technology®) from two indigenous Iraqi breeds, Rustaqi and Jenoubi (**Figure 1**). For the Rustaqi breed, 60 blood samples were collected: 20 samples from the Baghdad region, in central Iraq, and 40 samples from the Babylon area (80 km south of Baghdad). For the Jenoubi breed, 35 blood samples were collected from the southern regions of Iraq, including an area close to Basra (*n* = 9) (560 km south of Baghdad) and Maysan (*n* = 26) (400 km south of Baghdad). The middle region of Iraq (Baghdad and Babylon) is characterized by a hot and arid climate, while the climate in the southern part of Iraq (Basra and Maysan regions), close to the marshes, is hot and more humid (https://www.accuweather.com/en/iq/national/satellite). Samples were shipped to a private company (Deoxi Biotecnologia, http://www.deoxi.com.br/) for genotyping using the Illumina Bovine HD Genotyping BeadChip (700K) (http://www.illumina.com). The geographic locations and global positioning system (GPS) coordinates of the Iraqi cattle samples can be found in **Supplementary Table S2A**, **B**. Aerial distances in kilometers between sampling sites were calculated using the geographic information system (GIS) software ArcGIS® (Esri, www.esri.com).

# SNP Genotyping

Ninety-five samples of Iraqi breeds (Rustaqi, *n* = 60; Jenoubi, *n* = 35) were genotyped with the Illumina Bovine HD Genotyping BeadChip (http://www.illumina.com), which includes 777,962 SNPs mapped to the UMD 3.1 assembly; see the Data Accessibility section for more details (EVA: PRJEB32975; Datadryad.org:dryad.t35r32q). High-density SNP data for reference cattle breeds, Holstein-Friesian (European taurine, *n* = 30), Jersey (European taurine, *n* = 32), Nellore (Asian zebu, *n* = 35), Gir (Asian zebu, *n* = 30), Sahiwal (Asian zebu, *n* = 13), EASZ (East African Shorthorn zebu, *n* = 30), Sheko (East African zebu × taurine cross, *n* = 18), Ankole (East African, *n* = 25), Adamawa Gudali (West African zebu, *n* = 25), Muturu (African taurine, *n* = 12), Red Bororo (West African zebu, *n* = 22), and N'Dama Guinea (African taurine, *n* = 24), were obtained from Bahbahani et al. (2017).

# Data Analysis

Autosomal SNP datasets were prepared using R (https://www.r-project.org) and PLINK 1.9 (Purcell et al., 2007) software. SNPs with a minor allele frequency (MAF) of less than 5% (--maf 0.05) and SNPs with more than 5% missing genotype data (--geno 0.05) were excluded from the analysis. These filters removed 63,451 variants and 34,387 markers, respectively, leaving 680,124 SNPs for analysis. In addition, one sample each from the Rustaqi, EASZ, and Red Bororo breeds was removed due to a low call rate (<95%).
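The two SNP filters described above can be illustrated with a minimal Python sketch. It is not the PLINK implementation: the genotype coding (0, 1, or 2 copies of one allele, `None` for a missing call), the function names, and the toy SNPs are all assumptions made for illustration. Only the two thresholds and the SNP counts come from the text.

```python
from typing import List, Optional, Tuple

# Assumed coding: 0/1/2 copies of the counted allele, None for a missing call.
Genotype = Optional[int]


def snp_stats(calls: List[Genotype]) -> Tuple[float, float]:
    """Return (minor allele frequency, missing-call rate) for one SNP."""
    observed = [g for g in calls if g is not None]
    missing_rate = 1 - len(observed) / len(calls)
    if not observed:
        return 0.0, missing_rate
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq), missing_rate


def passes_qc(calls: List[Genotype], maf_min: float = 0.05,
              geno_max: float = 0.05) -> bool:
    """Keep a SNP only if MAF >= maf_min and missing rate <= geno_max."""
    maf, missing_rate = snp_stats(calls)
    return maf >= maf_min and missing_rate <= geno_max


# Hypothetical toy SNPs typed on 20 animals each.
snps = {
    "snp_common": [0, 1, 2, 1] * 5,                         # MAF 0.5, no missing calls
    "snp_rare": [0] * 19 + [1],                             # MAF 0.025, fails the MAF filter
    "snp_missing": [0, 1, 2, 1] * 4 + [1, 1, None, None],   # 10% missing, fails the call-rate filter
}
kept = [name for name, calls in snps.items() if passes_qc(calls)]

# Bookkeeping check of the counts reported in the text.
remaining = 777_962 - 63_451 - 34_387
print(kept, remaining)
```

The last line confirms that the reported numbers are internally consistent: removing 63,451 and 34,387 SNPs from the 777,962 on the chip leaves 680,124.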

# Estimation of the Level of Genetic Diversity

To assess the level of genetic diversity, the mean expected heterozygosity (*He*) and observed heterozygosity (*Ho*) were computed using PLINK 1.9 (Purcell et al., 2007). Genetic diversity was estimated for each population of Iraqi indigenous cattle and for several reference breeds. These are Iranian breeds (Sarabi, Kurdi, Taleshi, Pars, Sistani, and Najdi), East Asian native cattle [from Vietnam, Myanmar, Bangladesh, Bhutan, Korea (Hanwoo), Japan (Polled), and Mongolia (MON)], and African breeds [N'Dama Guinea and Sheko (Ethiopian breed)]. Additionally, two breeds that represent the main lineages of cattle were used: Holstein-Friesian (*Bos taurus*) and Nellore (*Bos indicus*) (Uzzaman et al., 2014; Karimi et al., 2016; Yonesaka et al., 2016). *He* was assessed under the assumption of Hardy–Weinberg equilibrium, and *Ho* was averaged over loci (Nei, 1978).
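As a rough illustration of the two diversity measures, the sketch below computes the mean *He* and *Ho* over loci from biallelic genotypes. It assumes genotypes coded as 0, 1, or 2 copies of one allele with no missing calls, and it uses the simple Hardy–Weinberg expectation 2p(1 − p) per locus without any small-sample correction. The function name and toy data are hypothetical, and this is not the PLINK code path.

```python
def heterozygosity(loci):
    """loci: list of loci, each a list of 0/1/2 allele counts (one per animal).

    Returns (mean He, mean Ho) averaged over loci.
    """
    he_sum = 0.0
    ho_sum = 0.0
    for locus in loci:
        n = len(locus)
        p = sum(locus) / (2 * n)                        # frequency of the counted allele
        he_sum += 2 * p * (1 - p)                       # expected under Hardy-Weinberg
        ho_sum += sum(1 for g in locus if g == 1) / n   # observed share of heterozygotes
    return he_sum / len(loci), ho_sum / len(loci)


# Hypothetical toy data: two loci typed on four animals.
toy_loci = [
    [0, 1, 1, 2],   # p = 0.5, He = 0.5, Ho = 0.5
    [1, 1, 1, 1],   # p = 0.5, He = 0.5, Ho = 1.0
]
mean_he, mean_ho = heterozygosity(toy_loci)
print(mean_he, mean_ho)
```

In the second toy locus every animal is heterozygous, so *Ho* exceeds *He*; an excess of this kind is one pattern that recent crossbreeding can produce.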

# Principal Component Analysis (PCA)

The PLINK 1.9 software was used for principal component analysis (PCA) (Purcell et al., 2007). The autosomal data of the two indigenous Iraqi breeds (Rustaqi and Jenoubi) and several reference breeds [European taurine (Holstein-Friesian, Jersey), Asian zebu (Nellore, Sahiwal, and Gir), African zebu and their crossbreeds (East African Shorthorn zebu (EASZ), Sheko, Ankole, Adamawa Gudali, and Red Bororo), and African taurine (Muturu and N'Dama Guinea)] were used. The PCA results were plotted using the Genesis software (version 0.2.6b) (https://github.com/shaze/genesis).
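To show what a PCA on SNP genotypes does in principle, the following sketch computes the first principal component of a small genotype matrix with plain power iteration. It is a teaching sketch under stated assumptions, not the PLINK algorithm: genotypes are coded 0/1/2, SNP columns are only mean-centred (no variance scaling), the function name is hypothetical, and the six toy animals are invented so that two groups differ strongly in allele frequency.

```python
import math


def first_pc(genotypes, iterations=200):
    """genotypes: one row per animal, one column per SNP (0/1/2 allele counts).

    Returns each animal's score on the first principal component, up to
    scale and sign.
    """
    n, m = len(genotypes), len(genotypes[0])
    # Centre every SNP column on its mean allele count.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    x = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Animal-by-animal similarity matrix (X times X transposed).
    k = [[sum(a * b for a, b in zip(x[i], x[l])) for l in range(n)]
         for i in range(n)]
    # Power iteration converges to the leading eigenvector of k.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(k[i][l] * v[l] for l in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Hypothetical toy animals: three carrying mostly one allele, three mostly the other.
group_a = [[0, 0, 1, 0, 0, 1], [0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
group_b = [[2, 2, 1, 2, 2, 2], [2, 1, 2, 2, 2, 1], [2, 2, 2, 1, 2, 2]]
pc1 = first_pc(group_a + group_b)
print([round(score, 3) for score in pc1])
```

Animals from the two toy groups receive scores of opposite sign on the first component, which is the kind of separation a taurine vs. zebu axis would show in a PCA plot.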

# Admixture

To estimate the ancestry and the genetic structure of the Iraqi cattle population, we used Admixture version 1.3.0 (Alexander et al., 2015) with the following breeds: Rustaqi, Jenoubi, Holstein-Friesian, Jersey, Nellore, Sahiwal, Gir, Brahman, EASZ, Sheko, Ankole, Adamawa Gudali, Red Bororo, and N'Dama Guinea. The analysis was conducted at the genome-wide autosomal level, first with the Iraqi breeds and four reference breeds (Holstein-Friesian, N'Dama Guinea, Sheko, and Nellore) and then with the entire set of breeds. We ran the analysis for K = 2 to K = 10 ancestral populations and identified the optimal number of ancestral populations as the K with the lowest cross-validation error. We plotted the admixture results using the Genesis software (version 0.2.6b) (http://www.bioinf.wits.ac.za/software/genesis/).
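The choice of K by lowest cross-validation error can be sketched as below. The sketch assumes that each run's log contains a line of the form `CV error (K=3): 0.53450`, and the error values are made-up numbers for illustration only, not results from this study. The helper name `best_k` is hypothetical.

```python
import re


def best_k(log_lines):
    """Collect cross-validation errors per K and return (best K, all errors)."""
    cv_errors = {}
    for line in log_lines:
        match = re.search(r"CV error \(K=(\d+)\): ([0-9.]+)", line)
        if match:
            cv_errors[int(match.group(1))] = float(match.group(2))
    # The preferred K is the one with the smallest cross-validation error.
    return min(cv_errors, key=cv_errors.get), cv_errors


# Hypothetical log lines for K = 2 to K = 10 (invented values).
toy_log = [
    "CV error (K=2): 0.56100",
    "CV error (K=3): 0.53450",
    "CV error (K=4): 0.53100",
    "CV error (K=5): 0.53320",
    "CV error (K=6): 0.53610",
    "CV error (K=7): 0.53980",
    "CV error (K=8): 0.54400",
    "CV error (K=9): 0.54950",
    "CV error (K=10): 0.55500",
]
chosen_k, errors = best_k(toy_log)
print(chosen_k)
```

With these invented values the minimum falls at K = 4. In practice the error curve is often flat around its minimum, so neighboring values of K are usually inspected as well.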

### Positive Candidate Signature of Selection

To construct haplotype files for the signature of selection analysis, haplotypes of the Iraqi breeds (Rustaqi and Jenoubi) and of the reference cattle breeds were reconstructed by phasing the genotyped SNPs using the SHAPEIT software (v2.8 37) (O'Connell et al., 2014).

# *Rehh* (*iHS* and *Rsb*) Analysis

Signatures of selection were identified with extended haplotype homozygosity (EHH) statistics, using the Rehh package in R. Two analyses were performed: (i) a within-population analysis using the integrated haplotype score (*iHS*) (Voight et al., 2006); and (ii) a between-population analysis using the relative integrated EHH of a site between populations (*Rsb*) (Sabeti et al., 2002). The *iHS* test was applied to Rustaqi and Jenoubi. The *Rsb* test was conducted (i) between Jenoubi and Rustaqi and (ii) between each Iraqi breed (Jenoubi and Rustaqi) and three reference breeds (Holstein-Friesian, N'Dama Guinea, and Nellore). The standardized *Rsb* and *iHS* values were normally distributed, so a *Z*-test was applied to identify statistically significant SNPs under selection. One-sided upper-tail *P*-values were derived as 1 − Φ(*Rsb*), where Φ is the Gaussian cumulative distribution function. For Iraqi breeds, we set thresholds of −log10 *P*-value = 4 and −4 for the *iHS* test and of −log10 *P*-value = 5 and −5 for the *Rsb* test to define candidate regions. All annotated genes within these regions were considered candidate genes. We then examined the genes detected by both *iHS* and *Rsb* in the Iraqi cattle (Jenoubi and Rustaqi), as well as the *Rsb* results of Iraqi cattle against the three reference breeds. An online Venn diagram tool (http://bioinfogp.cnb.csic.es/tools/venny/index.html) was used to check the overlap of candidate genes (Oliveros, 2015).
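The conversion from a standardized score to the −log10 *P*-value scale described above can be sketched as follows. This is a minimal illustration of the stated one-sided formula 1 − Φ(*z*), not the Rehh code; the function names are hypothetical.

```python
import math


def upper_tail_p(z: float) -> float:
    """One-sided upper-tail P-value, 1 - Phi(z), for a standard normal score."""
    # erfc keeps precision in the far tail, where 1 - Phi(z) would round off.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def neg_log10_p(z: float) -> float:
    """Score on the -log10(P) scale used for the significance thresholds."""
    return -math.log10(upper_tail_p(z))


def is_candidate(z: float, threshold: float) -> bool:
    """True when -log10(P) reaches the chosen threshold (e.g., 5 for Rsb)."""
    return neg_log10_p(z) >= threshold


for score in (3.0, 4.0, 4.5):
    print(score, round(neg_log10_p(score), 2), is_candidate(score, 5.0))
```

Under this formula, a threshold of −log10 *P* = 5 corresponds to a standardized score of about 4.26, and a threshold of 4 to a score of about 3.72.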

# Fst Analysis

Inter-population Wright's *Fst* analyses were conducted between the two Iraqi cattle breeds. *Fst* summarizes the genetic differentiation among populations by estimating the variance in allele frequencies between populations relative to the total variance across these populations (Wright, 1951; Holsinger and Weir, 2009). The calculation was performed on sliding windows of 60 SNPs, overlapping by 30 SNPs. Windowed *Fst* values above 0.2 were arbitrarily considered significant.
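The windowing scheme can be illustrated with a short Python sketch. Several choices here are assumptions rather than details given in the text: the two populations are weighted equally, each window is summarized by the mean of per-SNP values, SNPs are assumed to be ordered along a single chromosome, and all names are hypothetical.

```python
from typing import List, Tuple


def wright_fst(p1: float, p2: float) -> float:
    """Per-SNP Fst for two equally weighted populations.

    Variance of the allele frequency between populations divided by the
    total variance p_bar * (1 - p_bar).
    """
    p_bar = (p1 + p2) / 2.0
    total = p_bar * (1.0 - p_bar)
    if total == 0.0:
        return 0.0  # monomorphic in both populations; treated as 0 here
    return (p1 - p2) ** 2 / (4.0 * total)


def window_means(values: List[float], size: int = 60,
                 overlap: int = 30) -> List[Tuple[int, float]]:
    """Return (start index, mean value) for each full sliding window."""
    step = size - overlap  # 60 SNPs overlapping by 30 -> advance by 30
    return [(start, sum(values[start:start + size]) / size)
            for start in range(0, len(values) - size + 1, step)]


# Hypothetical per-SNP values: a weakly and a strongly differentiated block.
per_snp = [0.1] * 60 + [0.5] * 60
windows = window_means(per_snp)
flagged = [start for start, mean in windows if mean > 0.2]
print(windows, flagged)
```

Trailing SNPs that do not fill a complete window are ignored in this sketch.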

# Gene Function and Gene Pathway Identification Within Candidate-Selected Regions

Genes were identified from the Ensembl *Genes 91* database of *Bos taurus* genes (UMD 3.1) using the BioMart tool (http://www.ensembl.org/biomart). PANTHER 11.0 (http://www.pantherdb.org/) and Enrichr (http://amp.pharm.mssm.edu/Enrichr/) were used to explore protein families, molecular functions, biological processes, cellular components, and pathways (Chen et al., 2013; Kuleshov et al., 2016; Mi et al., 2017). Moreover, the list of bovine quantitative trait locus (QTL) regions was downloaded from http://www.animalgenome.org/cgi-bin/QTLdb/index (Hu et al., 2016). Up-to-date annotation for some specific genes was sourced from GeneCards (http://www.genecards.org/) and from the most recent literature, searched through Google Scholar (https://scholar.google.com). To determine overrepresented ontology terms for candidate genes following the *Fst* analysis, we used DAVID version 6.7 (https://david.ncifcrf.gov/), which detects enriched functional terms (Huang da et al., 2009a; Huang da et al., 2009b), with an enrichment score of 1.3, equivalent to a Fisher exact test *P* = 0.05, as the significance threshold.

# RESULTS

# Genomic Diversity

The highest values of mean *He* and *Ho* were found in Rustaqi and Sheko breeds, while the lowest values were obtained in N'Dama and Nellore (**Table 1** and **Supplementary Figure S2**). In particular, mean *He* and *Ho* were 0.37 and 0.36 for Rustaqi, respectively, while they were 0.32 (*He* and *Ho*) in Jenoubi. Our findings indicate that Iraqi breeds possess significant diversity compared to Asian (e.g., Korean Hanwoo), African (e.g., N'Dama), and European breeds (e.g., Holstein-Friesian) (**Supplementary Tables S3**, **S4**, and **S5**).

# Genetic Structure

The genetic structure across and within breeds was first assessed using PCA. We conducted two analyses, first between Iraqi breeds and the reference breeds, and then within Iraqi breeds only. For all the breeds analyzed (**Figure 2A**), PC1 accounts for 28.26% of the total variation. It separates the Asian reference zebu population (Nellore) from the taurine breed (Holstein-Friesian); between the two, Rustaqi animals are closer to the taurine Holstein-Friesian, and Jenoubi animals are closer to the indicine Nellore. PC2, which accounts for 15.63% of the total variation, separates the African zebu breeds (Ankole, Sheko, Adamawa Gudali, and EASZ) and the African taurine (N'Dama and Muturu) from the other cattle populations (non-African breeds). The second PCA, implemented for the Iraqi breeds only (Rustaqi and Jenoubi), reveals substructuring within each breed (**Figure 2B**). The first component, which accounts for 5.34% of the total variation, separates the Jenoubi animals sampled in the Maysan region from the Rustaqi animals sampled around Baghdad, while Jenoubi animals from the Basra region and Rustaqi animals sampled around the Babylon region are positioned between these two populations. The second component, which explains 2.78% of the total variation, separates Rustaqi animals from Al-Qasim town (Babylon region) from the other animals.

TABLE 1 | Number of animals, mean of expected heterozygosity (*He*), and observed heterozygosity (*Ho*).

# Admixture

Admixture analysis was based on 680,124 SNPs after QC. We performed two admixture analyses: first among Iraqi cattle and four main reference breeds (European taurine Holstein-Friesian, Asian zebu Nellore, African crossbreed Sheko, and African taurine N'Dama Guinea), and then also including Jersey, Sahiwal, Gir, East African Shorthorn zebu (EASZ), Ankole, Adamawa Gudali, and Red Bororo. In the first analysis, the selected breeds are representative of the main lineages of cattle (European taurine, African taurine, zebu, and Asian zebu). The optimal number of clusters was *K* = 4, which has the lowest cross-validation error. As shown in **Figure 3**, at *K* = 3, Rustaqi and Jenoubi share ancestry with European and African taurine as well as with Asian indicine cattle, the latter being higher in Jenoubi than in Rustaqi. At *K* = 4, shared ancestry with Holstein-Friesian is observed in Rustaqi but much less so in Jenoubi. Shared ancestries with the Nellore and African cattle are present in both Iraqi breeds, but these are low. At *K* = 5 and *K* = 6, both Rustaqi and Jenoubi show an admixed background, although less so for the former than for the latter. For the second analysis, the optimal number of clusters was *K* = 7 (**Supplementary Figure S3**). These results support the first admixture analysis, with more zebu ancestry in Jenoubi and more taurine ancestry in Rustaqi.

# Genetic Signature of Positive Selection in Iraqi Breeds

Footprints of selection in the Iraqi breeds (Jenoubi and Rustaqi) were analyzed using the *iHS* and *Rsb* statistics, based on extended haplotype homozygosity (EHH). Results are presented in **Figures 4** and **5**. In Jenoubi, we observe the strongest evidence of selection on BTA1, with an *iHS* score of −5.40, and on BTA26, with an *iHS* score of −5.0. Several genes are present within these regions: *NCAM2*, *TMPRSS15*, and *CHODL* on BTA1, and *PRKG1* on BTA26. In Rustaqi, we observe the strongest evidence of selection on BTA1, with an *iHS* score of −5.60, and on BTA18, with an *iHS* score of −5.03. Genes present within these significant regions include *PPM1L* and *IGSF5* on BTA1, and *PLCG2*, *CDH13*, *NOVEL*, *OSGIN1*, *TLDC1*, *CRISPLD2*, *IRF8*, *JPH3*, *KLHDC4*, *SLC7A5*, *CA5A*, *BANP*, *GALNS*, *CBFA2T3*, *ABCC12*, *ZNF423*, and *LPCAT2* on BTA18. The Jenoubi breed has 13 derived and six ancestral candidate-selected regions, compared with 11 and 16, respectively, for Rustaqi (**Supplementary Figures S4** and **S5**).

*Rsb* results between Rustaqi and Jenoubi show a total of 209 SNPs in Jenoubi and 236 in Rustaqi above the significance threshold. The *Rsb* plots show strong signals of positive selection on BTA1, BTA6, BTA7, BTA8, BTA10, BTA17, BTA22, and BTA26 in Jenoubi, and on BTA1, BTA5, BTA13, BTA18, and BTA26 in Rustaqi (**Figure 6**).

# Candidate Genes at Genomic Regions

Candidate regions under positive selection in Jenoubi include 24 annotated genes (14 genes following *iHS* analysis, 17 genes following *Rsb* analysis, and 7 genes in genome regions identified by both tests), while candidate regions under positive selection in Rustaqi include 45 annotated genes (43 genes following *iHS* analysis, 3 genes following *Rsb* analysis, and 1 gene in a genome region identified by both tests); see **Supplementary Table S6**.

In total, 68 annotated genes are present in the two Iraqi breeds within the candidate regions defined by the *iHS* and *Rsb* analyses (see **Figure 7** for a Venn diagram showing the number of unique and shared genes found within candidate signature of selection regions). BTA18 has the largest number of genes (18 genes), followed by BTA6 with 10 genes and BTA26 with 8 genes.
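The gene counts above follow from simple inclusion-exclusion, as the short check below shows. The helper name is hypothetical, and the single gene shared between the two breeds is an inference from the reported totals (24 + 45 − 68 = 1), not a count stated separately in the text.

```python
def union_size(n_a: int, n_b: int, n_both: int) -> int:
    """Size of the union of two gene lists, given the size of their overlap."""
    return n_a + n_b - n_both


# Counts reported for each breed: iHS genes, Rsb genes, genes found by both.
jenoubi = union_size(14, 17, 7)
rustaqi = union_size(43, 3, 1)

# Assumption: exactly one gene is shared between the two breed-level lists.
total = union_size(jenoubi, rustaqi, 1)
print(jenoubi, rustaqi, total)  # 24 45 68
```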

Overall, the gene function and annotation analysis reveals 19 genes related to the immune response: 5 genes for Jenoubi, 15 genes for Rustaqi, and 1 gene overlapping between the two breeds. This gene is *PRKG1*, associated with tick resistance in cattle (Mapholi et al., 2016; Vajana et al., 2018). These immune response genes are involved in both the acquired and innate immune responses. Examples include *IRF8* (Rustaqi; BTA18) and *ABCC2* (Rustaqi; BTA26), linked to the acquired immune response to protozoan and bacterial infections (Giagu, 2016) and gastrointestinal nematodes (Li et al., 2011), respectively; *PARM1*, an innate immune response gene (Jenoubi; BTA6) associated with anti-apoptotic activity, especially during the fertility stage (Cochran et al., 2013); and *ATG7* (innate response gene, Jenoubi; BTA22), linked to the autophagy process (Aboelenain et al., 2015).

The remaining genes are related to other environmental adaptation or production characteristics. For instance, *SLC24A3* (Rustaqi; BTA13) is related to fertility traits (Moran et al., 2017), and *NCAM2* (Jenoubi; BTA1) is linked to fat, protein, and milk yield (Venturini et al., 2014). **Supplementary Table S7** summarizes the log (*P*-value) of the most significant SNPs within the different significant regions in both breeds. In Jenoubi, the most significant SNPs (*n* = 58 SNPs, maximum SNP log *P*-value = 13.01; BTA26) are within the *PCDH15* gene region, involved in the maintenance of the integrity of the intestinal membrane. These are followed by *NCAM2* (*n* = 20 SNPs, maximum SNP log *P*-value = 12.95; BTA1) and *TMPRSS15* (*n* = 58 SNPs, maximum SNP log *P*-value = 12.93; BTA1), which are linked to fat, protein, and milk yield.

In Rustaqi, the most significant SNPs (*n* = 25 SNPs, maximum SNP log *P*-value = 8.01; BTA18) are within the *DNMBP* gene region. This gene has been shown to contribute to milk-fat composition (Buitenhuis et al., 2014). Finally, the highest number of significant SNPs found in both Rustaqi and Jenoubi is at *PRKG1* (*n* = 25 SNPs, maximum SNP log *P*-value = 7.93; BTA26).

For the comparisons of the Jenoubi with Holstein-Friesian, Nellore, and N'Dama, we used a threshold for the *Rsb* of >3.5, 4, and 4, respectively (**Figures 8A–C**). We identified 161, 100, and 272 candidate regions for the Jenoubi *Rsb* comparison with Holstein-Friesian, Nellore, and N'Dama, respectively. The most significant SNP values are 5.3 on BTA26 (Jenoubi versus Holstein-Friesian), 7.8 on BTA5 (Jenoubi versus Nellore), and 7.6 on BTA1 (Jenoubi versus N'Dama).

A total of 38 annotated candidate genes are present in the significant Jenoubi versus Holstein-Friesian *Rsb* regions. Among these, four candidate genes overlap with the previous Iraqi cattle analysis (*iHS* and *Rsb*): *ATG7* (autophagy control), *PRKG1* (tick resistance), *PCDH15* (maintenance of intestine membrane), and *TMEM132B* (control of brain physiology). The remaining 34 genes were not identified in our previous analysis of Iraqi breeds. Thirteen genes are found in the Jenoubi versus N'Dama Guinea comparison, including eight genes overlapping with the Iraqi breeds analysis (*iHS* and *Rsb*) (e.g., *NCAM2*, *TMPRSS15*, *PCDH15*, *PRKG1*, and *FOCAD*). Only one gene, *PCDH15*, is present in the Jenoubi versus Nellore *Rsb* analysis (**Supplementary Table S8**).

The threshold for the Rustaqi *Rsb* analysis was >3.5, 3, and 3.5 for the Holstein-Friesian, Nellore, and N'Dama, respectively. Fifty-seven, 18, and 98 candidate regions were identified from the Rustaqi breed versus Holstein-Friesian, Nellore, and N'Dama analyses, respectively (**Figures 9A–C**). Additionally, the strongest SNP values were 5.3 on BTA20 (Rustaqi versus Holstein-Friesian), 6.8 on BTA5 (Rustaqi versus Nellore), and 7.2 on BTA6 (Rustaqi versus N'Dama). While 12 genes are present in the comparison Rustaqi versus Holstein-Friesian, none overlapped with previous genes identified in the Iraqi analysis (*iHS* and *Rsb*). *Rsb* analysis of Rustaqi against Nellore uncovered five genes, again with no shared genes with our previous Iraqi analysis (*iHS* and *Rsb*). *Rsb* Rustaqi versus N'Dama Guinea found 26 genes, with one gene (*CD96*) previously identified in Iraqi breeds (**Supplementary Table S9**).

# Gene Ontology Analysis

The PANTHER analysis of the biological processes for Jenoubi (*iHS* genes) reveals the following significant categories: biological regulation, molecular function, cellular components, protein class, and pathways (**Supplementary Figure S6**; **Supplementary Table S10A–D**). The Enrichr tool, in turn, reveals the following biological processes: positive regulation of apoptotic process, sensory perception of light stimulus, and regulation of GTPase activity. Molecular function analysis shows three enriched gene clusters [apoptotic process (GO:0043065), transmembrane–ephrin receptor activity (GO:0005005), and calcium channel regulator activity (GO:0005246)] (**Supplementary Figure S7A** and **B**; **Supplementary Table S11A** and **B**).

The PANTHER analysis for Rustaqi (*iHS* genes) identifies the same categories (biological regulation, molecular function, cellular components, protein class, and pathway) (**Supplementary Figure S8**). However, the Enrichr tool analysis reveals two enriched clusters: regulation of positive chemotaxis (GO:0050926) and glomerular epithelial cell development (GO:0072310). The molecular function analysis indicates one enriched cluster [O-acetyltransferase activity (GO:0016413)] (**Supplementary Figure S9A** and **B**).

For the *Rsb* analysis, PANTHER indicates eight clusters in the biological process category for Jenoubi: biological regulation (GO:0065007) (5 genes), biological adhesion (GO:0022610) (3 genes), cellular process (GO:0009987) (12 genes), localization (GO:0051179) (6 genes), metabolic process (GO:0008152) (5 genes), cellular component organization or biogenesis (GO:0071840) (2 genes), multicellular organismal process (GO:0032501) (1 gene), and response to stimulus (GO:0050896) (1 gene). Molecular function shows binding (GO:0005488) (6 genes), catalytic activity (GO:0003824) (3 genes), and transporter activity (GO:0005215) (3 genes). The Enrichr tool reveals the following enriched terms: vitamin D metabolic process and growth factor activity. These ontologies and others (cellular components, protein class, and pathway) are further shown in **Supplementary Figures S10** and **S11**.

The PANTHER analysis for Rustaqi (*Rsb* analysis) indicates two biological processes supported by two genes, cellular process (GO:0009987) and metabolic process (GO:0008152), and the molecular function binding (GO:0005488), supported by one gene. The Enrichr analysis identifies three enriched terms: small GTPase-mediated signal transduction (GO:0007264), cell–matrix adhesion (GO:0007160), and positive regulation of hydrolase activity (GO:0051345). The molecular function analysis recognizes one enriched term, Rho guanyl-nucleotide exchange factor activity (GO:0005089).

# *Fst* Candidate Gene Regions

The overall genome-wide differentiation between the Iraqi breeds is *Fst* = 0.28 (**Figure 10**). The *Fst* analysis reveals regions with candidate genes differentiated between Jenoubi and Rustaqi (**Supplementary Table S12**). DAVID analysis of the *Fst* results shows 16 annotation clusters. Only one of them, the metal thiolate and mineral absorption cluster representing 51 genes, exceeds the threshold of 1.3 (*P* = 0.05), with an enrichment score of 4.89. The next cluster (metal binding), with an enrichment score of 1.29, falls just below the significance threshold.
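The enrichment scores above can be read on the *P*-value scale with the sketch below. It assumes, consistent with the stated equivalence of a score of 1.3 and *P* = 0.05, that the score is a −log10 transform of a (cluster-averaged) *P*-value; the function names are hypothetical.

```python
import math


def score_to_p(score: float) -> float:
    """P-value equivalent of an enrichment score, assuming score = -log10(P)."""
    return 10.0 ** (-score)


def p_to_score(p: float) -> float:
    """Enrichment score equivalent of a P-value under the same assumption."""
    return -math.log10(p)


print(round(p_to_score(0.05), 3))  # 1.301, i.e., the 1.3 threshold
for score in (4.89, 1.29):
    print(score, f"{score_to_p(score):.2e}", score_to_p(score) < 0.05)
```

On this scale, the score of 4.89 corresponds to a *P*-value near 1.3 × 10⁻⁵, while 1.29 corresponds to a value marginally above 0.05.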

# DISCUSSION

In this study, we report for the first time at a genome-wide level the genetic structure, diversity, and candidate signatures of positive selection in two Iraqi cattle breeds, Jenoubi and Rustaqi. At the crossroads of the zebu and taurine centers of domestication, Iraqi cattle may be expected to show high diversity of both taurine and zebu origin. This is confirmed in our study by the presence of both indicine and taurine ancestry in the two breeds, although in different proportions. Jenoubi cattle are classified as zebu on the basis of their humped phenotype. Our principal component and admixture analyses support this classification, but they also reveal a small proportion of taurine ancestry in their genetic background. Two factors may have contributed to shaping the genetic make-up of Jenoubi. The taurine background within this breed may correspond to ancient admixture events from the putative Near East taurine domestication centers, and/or it may be the consequence of recent exotic taurine introgression. At *K* = 3, the Jenoubi taurine background is shared with N'Dama and European taurine. However, at *K* = 4, the optimal *K*, the ancestry shared with European taurine largely disappears. The native habitat of the Jenoubi breed (south-eastern Iraq) is far from the areas where exotic breeds (e.g., Holstein-Friesian) have been introduced in Iraq in the recent past (Al-Bayatti et al., 2016), supporting ancient taurine introgression rather than recent introgression from exotic European taurine.

In contrast, Rustaqi carries a significant proportion of European ancestry, as revealed by our admixture analysis, so recent gene flow from exotic cattle likely occurred in this breed. The geographic origin of this breed is central Iraq, close to the capital Baghdad, and crossbreeding with exotic taurine might have been driven by pressure to increase milk production in response to consumer demand from the city. Admixture analysis also indicates a low zebu background in Rustaqi. As with the taurine introgression in the Jenoubi breed, it may be of ancient origin and the consequence of past trading networks, not only between central and southern Iraq but also further north and south, linking the ancient civilizations of the Fertile Crescent and the Indus Valley (Magee et al., 2014).

Interestingly, genetic studies in Anatolian cattle (Anatolian Black, South Anatolian Red, Anatolian Southern Yellow, and Turkish Grey) have also revealed taurine × zebu admixture (Ateş et al., 2014). Also, Karimi et al. (2016) reported that indigenous Iranian cattle from the western part of the country, near the Iraqi border, have more taurine genetic background than southwest Iranian cattle on the south-east border with Iraq, which are more indicine in their genetic background. Our study, together with previous ones, illustrates the pattern and gradient of zebu and taurine genetic admixture in the region. In terms of genome diversity, the Rustaqi *He* is higher than that of all the Iranian cattle populations examined by Karimi et al. (2016). Similarly, comparison of Jenoubi and Rustaqi with African taurine (N'Dama Guinea), Asian zebu (Nellore), and European Holstein-Friesian shows that Iraqi breeds possess higher genome diversity. This is expected for a crossbred population compared with non-admixed taurine and zebu populations. We also observe higher genome diversity in Jenoubi and Rustaqi than in Sheko, an African zebu × taurine admixed breed (Bahbahani et al., 2017). This may be explained by the closer proximity of Iraqi cattle to the centers of cattle domestication, and therefore to the centers of diversity of domestic cattle, compared with the African breeds.

Our signature of selection results in both Iraqi breeds suggest that environmental challenges, including disease pressures, have shaped the genomes of the Rustaqi and Jenoubi breeds, but not in an identical way: the *iHS* and *Rsb* analyses reveal important differences between the two breeds.

The *iHS* results in Rustaqi indicate more candidate-selected regions with genes involved in innate and acquired immunity than the results obtained in Jenoubi. Interestingly, among the 14 immune response-related genes unique to Rustaqi, we find *OSGIN1* and *CBFA2T3*, previously shown to be part of the cattle immune response to mammary gland inflammation (Gilbert et al., 2012; Wang et al., 2012; Osińska et al., 2014). Furthermore, *IRF8* plays an important role against bacterial (e.g., *Salmonella*) and protozoan infections (Gautier et al., 2009; Porto-Neto et al., 2013; Giagu, 2016). The results suggest that the importance of the Rustaqi breed in milk production may have shaped, at least partly, the candidate signatures of selection observed here.

Nevertheless, the *iHS* results in Jenoubi breed have revealed five important immunity genes. *TNFAIP8* and *FOCAD* candidate genes are known to play a role in immune homeostasis and tumor suppression (Hadisaputri et al., 2012; Iwata, 2016). Another crucial gene is *ATG7*, an autophagy gene that contributes to the regulation of the cell death process through elimination of unwanted or dead cells (Aboelenain et al., 2015).

Among the genes identified in both breeds, *PRKG1*, which has been reported previously in two other studies (Mapholi et al., 2016; Vajana et al., 2018), is associated with the tick resistance–tolerance phenotype, a major issue in the pastoral areas of the middle and southern regions of Iraq (Al-Ramahi and Kshash, 2011; Mohammad, 2015). Our study adds further support to the importance of this gene in relation to disease resistance traits. Similarly, *ABCC2*, identified in the Rustaqi breed, has been related to resistance–tolerance to gastrointestinal nematode infection.

We also identified several regions including genes that may be linked to environmental agro-climatic adaptation in both breeds following *iHS* analysis. Rustaqi animals are raised in a relatively dry and hot environment, and accordingly, adaptation to heat stress may be expected. Here, we do find *UCN3* involved in the genetic control of heat tolerance and oxidative stress, including in Holstein-Friesian cattle (Zheng et al., 2014). Also, in Jenoubi, we identified within candidate-selected regions two genes related to nutrition, *SLC4A4*, which plays a crucial role in the rumen development (Connor et al., 2013), and *EPHA5*, contributing to the improvement of the feed conversion rate from rumination (Santana et al., 2016). This suggests that the breed, largely free grazing, may be particularly adapted to the local availability of feeds.

Although overlap regions are few between Rustaqi and Jenoubi breeds, the same gene pathways may have been under selection pressures in both breeds. For example, *NCAM2*, *TMPRSS15*, and *SLC4A4* within candidate regions in Jenoubi breed have functions related to milk quality and production, with the latter also found under a candidate signature of selection region in Holstein cattle (Li et al., 2010; Venturini et al., 2014). The same regions are not significant in Rustaqi, but here, other genes linked to milk quality and production are found in other significant regions, such as *LPCAT2* and *DNMBP*, two genes linked to protein and fat content in milk (Ogorevc et al., 2009; Buitenhuis et al., 2014; Venturini et al., 2014). Interestingly, we note also the presence in Jenoubi of *PCDH15*, a gene involved in meat quality within a candidate-selected region (Ryu and Lee, 2014).

The outputs of the Enrichr analysis for Jenoubi (*iHS* analysis) indicate that the most enriched cluster in the biological process category is the gene ontology term apoptotic process. Programmed cell death (apoptosis) is part of the adaptive immune response of an organism. In particular, positive regulation of the apoptotic process has been shown to play a role in the immune response of blood cells to trypanosome infection in cattle (Hill et al., 2005), in meat quality through elimination of dead cells, and in maintaining the rumination process of cattle by conserving rumen cell activity (Herrera-Mendez et al., 2006; Connor et al., 2013; Shabtay, 2015). In the Canchim Brazilian beef cattle, the apoptotic process was also among the most enriched clusters (Urbinati et al., 2016). Furthermore, Taye et al. (2017) have mentioned that apoptosis as a response to external stress may be involved in thermotolerance in cattle. Another enriched cluster is sensory perception of light stimulus, which relates to vision, one of the cognitive functions of an animal. Such adaptation may be of particular relevance for outdoor grazing animals (Kim et al., 2017). The GTPase activity cluster found in the Jenoubi breed, which plays a significant role in the inflammatory reaction following nematode infection (Huang et al., 2007; Kim et al., 2015) and in milk and fertility traits (Kasarapu et al., 2017), has also been found in Yiling yellow Chinese cattle (Ling et al., 2017).

Enrichr results for Rustaqi (*iHS* analysis) indicated biological process related to gene upregulation (clusters regulation of positive chemotaxis and glomerular epithelial cell development) (Pokharel et al., 2018), as well as genes playing a crucial role in defense mechanism against bacterial infection, and growth function processes (Flori et al., 2009; Gautier et al., 2009; Porto-Neto et al., 2013; Giagu, 2016).

PANTHER analysis of the Jenoubi *Rsb* results reveals several clusters linked to biological processes. For instance, genes in the metabolic process cluster are associated with milk production, metabolism of water-soluble vitamins, and regulation of the actin cytoskeleton (Raven et al., 2016). Another important cluster is biological regulation, including genes involved in rumen and muscle development (Feng et al., 2007; Li et al., 2010). Enrichr analysis identified one enriched category in the biological process cluster (vitamin D metabolic process) and one enriched category in the molecular function cluster (growth factor activity). Both may be linked to the health of the animals. For example, vitamin D contributes to protection from autoimmune diseases, and vitamin D deficiency is linked to pathologies such as osteoporosis and skin or coat diseases (Adorini and Penna, 2008; Bikle, 2014).

Regarding the *Rsb* results for the Rustaqi breed, the most important ontology term from the Enrichr analysis is the cell–matrix adhesion cluster, which regulates tissue construction and cell activity (Lodish et al., 2000). For example, this cluster includes *MYRFL-201*, which protects the myelinated central nervous system, and *DNMBP*, responsible for milk quality traits such as fat composition (Buitenhuis et al., 2014; Koenning, 2015).

Interestingly, among the regions differentiated between the two breeds (*Fst* analysis), the DAVID tool identifies a significant metal-thiolate cluster, a function important for metabolic detoxification (e.g., after zinc and copper ingestion) (Richards, 1989). This suggests that the two breeds are exposed to feeds of different toxicity and may have responded to such selection pressures accordingly.

In conclusion, we report here for the first time at a genome-wide level the genetic structure, diversity, and candidate signatures of positive selection in two Iraqi cattle breeds. Our results support the phenotypic classification of Jenoubi cattle as zebu and Rustaqi cattle as taurine, but with introgression from the other cattle subspecies in each of them. In addition, the results show a significant level of genetic diversity in indigenous Iraqi cattle, in line with their history. The genome-wide analysis reveals genes that play an important role in immunity and other environmental adaptive traits, including responses to parasitic and bacterial disease challenges and heat tolerance. This study illustrates the uniqueness of these two indigenous breeds, and the information obtained is expected to help disease control, conservation, management, and utilization of the indigenous Iraqi cattle genetic resources.

# ETHICS STATEMENT

The animals used in this study are owned by farmers. Prior to sampling, the objectives of the study were explained to them in their local languages so that they could make an informed decision regarding giving consent to sample their animals. Government veterinary, animal welfare, and health regulations were observed during sampling of the populations analyzed here. The procedures involving animal sample collection also followed the recommendation of directive 2010/63/EU. Collection of blood samples was permitted by the Iraqi Ministry of Agriculture.

# AUTHOR CONTRIBUTIONS

AA and OH conceived and designed the experiment. AA and OH performed the experiment. AA, AE, and SA-B collected samples. AA analyzed the data. AA and OH wrote the manuscript. All authors have agreed on the contents of the manuscript.

# FUNDING

We would like to extend our sincere gratitude to the Iraqi Ministry of Higher Education and Scientific Research (MOHESR, Grant 1362), Iraqi cultural attaché, for sponsoring this work.

# ACKNOWLEDGMENTS

We gratefully acknowledge the Animal Genetics Resources Department, Directorate of Animal Resources – Baghdad, the Directorate of Veterinary Medicine/Baghdad, and the Iraqi Ministry of Agriculture for supporting the fieldwork of this research, as well as the CGIAR Research Program (CRP) on Livestock.

# SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00609/full#supplementary-material

# REFERENCES

hybrid fitness in East African Shorthorn Zebu. *Front. Genet.* 8, 68. doi: 10.3389/fgene.2017.00068

reactome pathways, and data analysis tool enhancements. *Nucleic Acids Res.* 45 (D1), D183–D189. doi: 10.1093/nar/gkw1138


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2019 Alshawi, Essa, Al-Bayatti and Hanotte. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
