Edited by:
Reviewed by:
*Correspondence:
This article was submitted to Bioinformatics and Computational Biology, a section of the journal Frontiers in Genetics.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Mapping of chromosomal regions harboring genetic polymorphisms that regulate complex traits is usually followed by a search for the causative mutations underlying the observed effects. This is often a challenging task even after fine mapping, as millions of base pairs including many genes will typically need to be investigated. Thus to trace the causative mutation(s) there is a great need for efficient bioinformatic strategies. Here, we searched for genes and mutations regulating growth in the Virginia chicken lines – an experimental population comprising two lines that have been divergently selected for body weight at 56 days for more than 50 generations. Several quantitative trait loci (QTL) have been mapped in an F2 intercross between the lines, and the regions have subsequently been replicated and fine mapped using an Advanced Intercross Line. We have further analyzed the QTL regions where the largest genetic divergence between the High-Weight selected (HWS) and Low-Weight selected (LWS) lines was observed. Such regions, covering about 37% of the actual QTL regions, were identified by comparing the allele frequencies of the HWS and LWS lines using both individual 60K SNP chip genotyping of birds and analysis of read proportions from genome resequencing of DNA pools. Based on a combination of criteria including significance of the QTL, allele frequency difference of identified mutations between the selected lines, gene information on relevance for growth, and the predicted functional effects of identified mutations we propose here a subset of candidate mutations of highest priority for further evaluation in functional studies. The candidate mutations were identified within the
Economically important production traits in domestic animals are generally complex, i.e., determined by factors that may include both genetic and environmental regulators. This is also true for many diseases in humans and animals. Thus, while it is often highly desirable to understand the regulation of specific complex traits, the task can be extremely challenging. For example, regions identified by quantitative trait loci (QTL) analysis will even after fine mapping of the QTL typically indicate regions including millions of base pairs and hundreds of genes that need to be explored to find causative mutation(s).
In this study our aim was to develop a bioinformatics strategy to mine already identified QTL regions to identify candidate genes for growth trait in chicken. The QTLs have been identified for body weight at 56 days of age in the Virginia chicken lines – an experimental population comprising two lines that have been divergently selected for body weight at 56 days for more than 50 generations at Virginia Tech (
Here, we present a bioinformatic strategy that in a structured and objective way helps to prioritize candidate genes for further study in mapped QTL regions by integrating information from multiple sources. First, the region to be evaluated further is narrowed down by, at each SNP-location in the evaluated region, calculating a combined score for the potential that each part of the region harbors a mutation underlying the phenotype. This is done by combining the statistical support from significance of the QTL effect at the particular marker, which is a measurement of the effect of the alternative alleles on the studied phenotype, with two measures of the genetic divergence between the founder lines (i.e., allele-frequency differences) at the particular location, which is an indicator of the direct or indirect selective pressure on the region due to an association with the phenotypes of importance when generating the divergent founder lines. Then, all the polymorphisms in the prioritized region are evaluated in more detail to select the most likely genes affecting the analyzed trait and bioinformatically predict the potential functional effects of each identified polymorphism. The details of the procedure, and its application to our particular chicken dataset, are described with a flowchart in
We studied seven fine-mapped QTL on chicken chromosomes 1–5, 7, and 20,with previously observed effects on body-weight at selection age in a QTL mapping pedigree founded with HWS and LWS chickens from generation 41 (
Genome-wide 60K SNP chip genotypes of 20 individuals from each of the HWS and LWS lines, generation 41 (
Genome resequencing was performed in two separate runs using DNA pools from the HWS and LWS lines. The data from the two experiments were combined to maximize the sensitivity in the SNP detection.
For earlier studies DNA from two pools of genomic DNA, one from each of the HWS and LWS lines, were used to generate resequencing data with 5× average depth coverage for each line. The reads were aligned to the Red Jungle Fowl’s (RJF) reference genome assembly (WUGSC 2.1/galGal3, May 2006;
For the current and future studies the second round of resequencing was performed using two new pools of DNA samples. The individuals selected for each pool were guided by data from earlier performed 60K SNP-chip genome-wide genotyping. From each line, the eight individuals with the most non-representative genotype pattern in the QTL regions were selected to increase the possibilities for detection of variation within lines and thereby allow improved precision in the fine mapping of regions with high degree of between-line fixation. The ABI SOLiD resequencing was carried out by the Uppsala Genome Center using mate-pair libraries and 50 bases per read with ~7× depth coverage in each line. We aligned the reads to the RJF reference genome assembly (WUGSC 2.1/galGal3, May 2006) using the MOSAIK software (
We applied the flanking-SNP-value (FSV) method (
The allele frequency differences based on the individual SNP genotyping, the genetic divergence estimates (FSV) from the population-pool genome resequencing were plotted across the QTL regions together with the QTL support-curve from the QTL fine-mapping (
The CDS was plotted to provide an objective statistic to prioritize regions for further analysis and evaluations of candidate genes and mutations. In most cases the regions were selected above the QTL significance and with high CDS.
Genes were identified in the prioritized regions within the QTL using the Ensembl database (version 67;
All SNPs detected by resequencing in selected candidate regions were analyzed with variant effect predictor (VEP) from Ensembl (
The DAVID annotated gene list was then filtered to identify the most likely candidate genes for growth in each QTL region. This was done by selecting the genes that had been associated with any of the following growth-related keywords: growth, development, morphogenesis, formation, proliferation, differentiation, regeneration, mineralization, elongation, biosynthetic, biogenesis, and organization. This set of terms was selected arbitrarily from ontology literature. The whole annotated gene list description was also reviewed to ensure no obvious candidates for growth were omitted.
In an earlier study,
Fine-mapped growth QTL regions with significance according to
GGA |
QTL |
Region name | Start (Mbp |
End (Mbp |
Size (Mbp) |
---|---|---|---|---|---|
1 | Growth1 | C1G1 | 169.6 | 181.0 | 11.4 |
2 | Growth2 | C2G2 | 47.9 | 65.4 | 17.5 |
3 | Growth4 | C3G4 | 24.0 | 68.0 | 43.9 |
4 | Growth6 | C4G6 | 1.3 | 13.5 | 12.1 |
5 | Growth8 | C5G8 | 33.6 | 39.0 | 5.3 |
7 | Growth9 | C7G9 | 10.9 | 35.4 | 24.5 |
20 | Growth12 | C20G12 | 7.1 | 13.8 | 6.7 |
Total | 121.4 |
GGA:
QTL names as in
Coordinates based on the Chicken (
Using the prioritizations strategy described above, 44.7 Mbp of these original QTL were selected using the combined information from the QTL analysis and estimates of differences in allele frequencies between the lines inferred from SNP chip genotyping and FSV computation (
Candidate regions selected based on QTL data and allele frequency differences between the lines inferred from SNP chip genotyping and FSV computation from resequencing. The selected percentages of the QTL regions significant with model B, are given (
Region name | Start Mbp |
End Mbp | Size (Mbp) | QTL support |
Ensembl genes |
---|---|---|---|---|---|
C1G1 | 169.6 | 175.0 | 5.4 | 5.4 | 97 |
C2G2 | 59.7 | 65.4 | 5.7 | 2.1 | 52 |
C3G4 | 24.1 | 35.8 | 11.7 | 10.3 | 142 |
C4G6 | 10.6 | 12.9 | 2.3 | 0.0 | 62 |
C5G8 | 34.2 | 36.8. | 2.6 | 0.0 | 20 |
C5G8 | 38.2 | 39.0 | 0.8 | 0.0 | 16 |
C7G9 | 20.4 | 35.4 | 15.0 | 4.3 | 209 |
C20G12 | 8.3 | 9.5 | 1.2 | 1.2 | 38 |
Total | 44.7 | 23.3 | 636 |
Coordinates based on the Chicken (
Size of the selected regions significant with QTL model B (
Number of Ensembl genes in the initial list in the selected regions
In
The variant effect predictor summary of SNPs in selected candidate segments of the QTL regions (according to
Location within gene | Region |
|||||||
---|---|---|---|---|---|---|---|---|
C1G1 | C2G2 | C3G4 | C4G6 | C5G8 | C7G9 | C20G12 | Total | |
3Prime UTR | 200 | 93 | 200 | 153 | 73 | 348 | 75 | 1142 |
3Prime UTR, Splice site | 1 | 1 | 2 | |||||
5Prime UTR | 22 | 9 | 44 | 20 | 3 | 50 | 28 | 176 |
5Prime UTR, Splice site | 1 | 4 | 5 | |||||
Coding unknown | 1 | 1 | 2 | |||||
Downstream | 6118 | 2636 | 5318 | 2373 | 1395 | 7930 | 1384 | 27154 |
Essential splice site | 2 | 3 | 6 | 1 | 1 | 4 | 3 | 20 |
Non-synonymous coding | 215 | 82 | 255 | 92 | 60 | 470 | 80 | 1254 |
Non-synonymous coding, Splice site | 6 | 4 | 8 | 5 | 3 | 17 | 1 | 44 |
splice site, Intronic | 78 | 37 | 133 | 33 | 24 | 191 | 31 | 527 |
Stop gained | 5 | 7 | 3 | 2 | 10 | 27 | ||
Stop gained, Non-synonymous coding | 1 | 1 | ||||||
Synonymous coding | 350 | 208 | 543 | 165 | 99 | 1113 | 159 | 2637 |
Synonymous coding, splice site | 9 | 9 | 12 | 5 | 6 | 20 | 12 | 73 |
Upstream | 5506 | 2626 | 5755 | 2570 | 1200 | 8312 | 1479 | 27488 |
Within mature miRNA | 1 | 1 | ||||||
Within non-coding gene | 4 | 2 | 1 | 3 | 12 | 4 | 26 | |
Within non-coding gene, splice site | 1 | 1 | ||||||
Total | 12516 | 5708 | 12284 | 5422 | 2870 | 18484 | 3256 | 60540 |
Candidate mutations identified in the evaluated QTL regions.
Region | SNP (bp) |
Gene | SNP location |
No of AA |
Qual |
AFD |
PC Score |
EC Score |
PE Score |
---|---|---|---|---|---|---|---|---|---|
C1G1 | 174634021 | Asparagine-linked glycosylation 11 homolog ( |
CpG island, upstream | 7; 10 | 72 | 0.97 | N/A | N/A | N/A |
C2G2 | 63823523 | Endothelin 1( |
CpG island, upstream | 3; 13 | 53 | 0.95 | N/A | N/A | N/A |
C3G4 | 33678270 | Cysteine rich transmembrane BMP regulator 1 ( |
Protein code, NS K/I | 10; 19 | 182 | 0.97 | 0.67 | 0.63 | 0.42 |
C4G6 | 12044024 | Similar to receptor tyrosine kinase ( |
CpG island, upstream | 4; 8 | 82 | 0.97 | N/A | N/A | N/A |
C4G6 | 12902414 | Fibroblast growth factor 16 ( |
CpG island, downstream | 8; 16 | 175 | 0.95 | N/A | N/A | N/A |
C5G8 | 38316301 | Sorting nexin 6 ( |
CpG island, upstream | 8; 14 | 142 | 0.97 | N/A | N/A | N/A |
C7G9 | 21686625 | Growth factor receptor-bound protein 14 ( |
CpG island, downstream | 3; 12 | 52 | 0.97 | N/A | N/A | N/A |
C7G9 | 22711910 | Glucagon ( |
CpG island, downstream | 3; 9 | 46 | 0.87 | N/A | N/A | N/A |
C7G9 | 24802616 | Insulin-like growth factor binding protein 2 ( |
Protein code, synonymous, CpG island | 4; 8 | 69 | 0.95 | N/A | N/A | N/A |
C20G12 | 8715398 | Baculoviral IAP repeat-containing 7 ( |
Protein code, NS I/V | 5; 8 | 65 | 0.97 | 0.29 | 0.14 | 0.04 |
Coordinates based on the Chicken (
Location of the SNP in gene and also amino acid substitution in case of non-synonymous (NS) SNP;
Total number of reads in both lines representing the alternate allele (AA) versus the total depth coverage across the SNP position;
The Phred scaled probability that a REF/ALT polymorphism exists at this site given sequencing data. Because the Phred scale is -10 * log(1 -
Allele frequency difference between the chicken lines as estimated using the GigaBayes software given that a total of 19 individuals from each line were included in the pools;
Physico-chemical score of amino acid substitution calculated using PASE (
Evolutionary conservation score of amino acid substitution calculated using PASE (
Combined score of PC and EC of amino acid substitution calculated using PASE (
In this study we have developed and applied a bioinformatic strategy to search for candidate mutations affecting body weight at 56 days in several QTL regions that were previously identified and fine-mapped in an intercross between two divergently selected chicken lines. Given the 40 generations of divergent selection for body weight it is reasonable to assume that many of the underlying functional mutations will display a relatively large allele frequency difference, or complete fixation, between the lines. This assumption is supported by earlier work with the lines that many regions across the genome have been driven to fixation for alternative alleles in the lines and that most selection has been on standing genetic variation present in the common base-population at the onset of selection (
To narrow down the target regions and identify the most plausible mutations, we used several independent sources of information. First, measurements of the genetic divergence between the founder lines of the intercross were used as indicators of regions that have been under strongest selection. Both individual SNP chip genotyping and genome resequencing of pools of individuals were used to provide stability and high-resolution in the estimates of the allele frequency difference between the lines.
The potential functional impact of genes and SNPs located within the target regions was bioinformatically evaluated to identify a set of candidate mutations to be further tested and evaluated in functional studies. In regions where there exist several possible candidate genes, our use of a combined and objective selection criteria helped to localize the most promising candidate genes and mutations. The genes and mutations listed in
Strong candidate genes and mutations were also found in QTL regions on chromosome 3 (C3G4) and 4 (C4G6). In the C3G4 region, the gene encoding the cysteine rich transmembrane BMP regulator 1 (
In the C4G6 region, candidate CpG island mutations were identified within the fibroblast growth factor 16 (
In the chromosome 1 QTL region (C1G1) we also found a candidate mutation, possibly regulatory, in the asparagine-linked glycosylation 11 homolog (
The chromosome 2 QTL region (C2G2) showed CpG island mutations at the endothelin 1 (
In the regions on chromosome 5 (C5G8) and 20 (C20G12) the genes found in the analysis were less obvious candidates. However, such genes may still have key roles in processes with complex and indirect effects on growth-related traits. Keeping this in mind, we consider mutations identified in the sorting nexin 6 (
In conclusion, the described combination of data from QTL mapping, next-generation sequencing, SNP chip genotyping and bioinformatic analysis has provided a list of plausible candidate genes and mutations that will facilitate further verification and experimental evaluation. The support for this list from different types of data and analysis enhances the probability that the selected genes and mutations underlying the QTL effects are an unbiased selection of genes and that the contributing gene(s) are included in the set. Further studies based on this list may therefore reveal mutations which underlie the observed QTL effects and can increase our understanding of growth regulation as well as be more emphasized in animal breeding programs with genomic selection.
Muhammad Ahsan and Xidan Li carried out the region-targeted computation and analysis using the different sources of data and took part in the planning of the study. Marcin Kierczak and Andreas E. Lundberg performed the assembly of the SOLID resequencing datasets. Stefan Marklund initiated and planned the study. Paul B. Siegel and Örjan Carlborg contributed with comments and advice. Muhammad Ahsan and Stefan Marklund drafted the manuscript and all co-authors contributed to the final version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We would like to thank the USDA Chicken GWMAS Consortium, Cobb Vantress, and Hendrix Genetics for access to the developed 60K SNP Illumina iSelect chicken array, DNA landmarks for 60K array genotyping and the Uppsala Genome Center for ABI SOLID sequencing.
This work was financially supported by a EURYI award to Örjan Carlborg and a Future research leader grant to Örjan Carlborg from the Swedish Foundation for Strategic Research. The contribution of Muhammad Ahsan was supported by his scholarship from the Higher Education Commission of Pakistan (HEC).