This article was submitted to Crop Science and Horticulture, a section of the journal Frontiers in Plant Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Advances in sequencing and genotyping methods have enabled cost-effective production of high-throughput single nucleotide polymorphism (SNP) markers, making them the choice for linkage mapping. As a result, many laboratories have developed high-throughput SNP assays and built high-density genetic maps. However, the number of markers may, by orders of magnitude, exceed the resolution of recombination for a given population size so that only a minority of markers can accurately be ordered. Another issue attached to the so-called ‘large p, small n’ problem is that high-density genetic maps inevitably result in many markers clustering at the same position (co-segregating markers). While there are a number of related papers, none have addressed the impact of co-segregating markers on genetic maps. In the present study, we investigated the effects of co-segregating markers on high-density genetic map length and marker order using empirical data from two populations of wheat, Mohawk × Cocorit (durum wheat) and Norstar × Cappelle Desprez (bread wheat). The maps of both populations consisted of 85% co-segregating markers. Our study clearly showed that an excess of co-segregating markers can lead to map expansion, but has little effect on marker order. To estimate the inflation factor (IF), we generated a total of 24,473 linkage maps (8,203 maps for Mohawk × Cocorit and 16,270 maps for Norstar × Cappelle Desprez). Using seven machine learning algorithms, we were able to predict with an accuracy of 0.7 the map expansion due to the proportion of co-segregating markers. For example, in Mohawk × Cocorit, with 10 and 80% co-segregating markers, the length of the map was inflated by 4.5 and 16.6%, respectively. Similarly, the map of Norstar × Cappelle Desprez expanded by 3.8 and 11.7% with 10 and 80% co-segregating markers.
With the increasing number of markers on SNP-chips, the proportion of co-segregating markers in high-density maps will continue to increase, making map expansion unavoidable. Therefore, we suggest developers improve linkage mapping algorithms for efficient analysis of high-throughput data. This study outlines a practical strategy to estimate the IF due to the proportion of co-segregating markers and a method to scale the length of the map accordingly.
Genetic maps, also known as linkage maps, are constructed for several purposes (see
Allow identifying genomic regions that control the expression of qualitative and quantitative trait loci (QTL) (
Help in marker-assisted selection by facilitating the introgression of desirable QTL.
Allow phylogenetic analyses between different species for evaluating similarity between genes (
Help in the identification of chromosomal rearrangements (
Help in anchoring physical maps (
Facilitate
Where high-density maps are required, constitute the first step toward positional or map-based cloning of genes responsible for economically important traits (
Genetic maps indicate the position and relative genetic distances between markers along chromosomes, which is analogous to signs or landmarks along a highway where the genes are “houses” (
Advances in sequencing and genotyping technologies have enabled the massive production of single nucleotide polymorphism (SNP) markers in a cost-effective way, making SNP markers the choice for linkage mapping. As a result, many laboratories have developed high-throughput SNP assays with continuously increasing marker numbers. For wheat, there are the 9K (
Indeed, a high number of markers is needed to build high-density genetic maps that are suitable for positional or map-based cloning of genes. However, the disproportion between the high number of markers and the relatively small population size, the so-called ‘large p, small n’ problem, markedly impacts the resolution of recombination so that only a minority of markers can actually be ordered (
Machine learning (ML) is the study of data-driven, computational methods for making inferences and predictions (
In animal and crop breeding, ML algorithms have been widely used in the framework of genomic selection (GS), e.g., (
The objective of our study is to investigate the effects of co-segregating markers on high-density genetic map length and marker order using empirical data from durum and bread wheat. Ultimately, we aim to predict the inflation factor (IF) of the linkage maps, using ML algorithms.
Two doubled haploid mapping populations described elsewhere were used in this study: the durum wheat Mohawk × Cocorit (
As described in earlier publications (
Our approach consisted of two phases with the following steps:
For each population, all curated SNP data was used to build linkage maps using the MSTMap software (
For each LG, a skeleton map was built by keeping only one of the most informative (highest polymorphism information content, lowest number of missing data) markers randomly selected per cluster (group of markers located at the same position).
Then, using an in-house Ruby script, we built as many maps (hereafter referred to as sequential maps) as there were co-segregating markers on each LG (see step 1) by adding one marker at a time (one after another), selected randomly from the list of co-segregating markers.
Because LGs had different sizes and the number of co-segregating markers varied among them, we computed the proportions of co-segregating markers relative to the total number of markers on each LG.
Eight levels of proportion, ranging from 10 to 80%, were sampled for all LGs having ≥80% of co-segregating markers.
Each proportion level had 50 replicates. For example, for LG 1A we randomly selected 10% of co-segregating markers 50 times to build 50 ‘sequential’ maps. Then, we repeated the same process for 20, 30, 40, 50, 60, 70, and 80% of co-segregating markers. However, LGs 2A, 4A and 5A in Mohawk × Cocorit and 1D, 4D and 7D in Norstar × Cappelle Desprez had less than 80% of co-segregating markers and only six proportion levels (10–60%) with 20 replicates were used.
The lengths and marker orders of these sequential maps were compared with those of the skeleton map.
Finally, for each sequential map the IF was estimated as:
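$\mathrm{IF} = \left(\frac{L_{\mathrm{seq}} - L_{\mathrm{skel}}}{L_{\mathrm{skel}}}\right) \times 100$, where $L_{\mathrm{seq}}$ is the length of the sequential map and $L_{\mathrm{skel}}$ is the length of the skeleton map of the same LG.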
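The sampling design and the IF calculation described in these steps can be sketched as follows. This is a minimal illustration and not the in-house Ruby script: the function and marker names are hypothetical, map construction itself (done with MSTMap) is not reproduced, each proportion level is assumed to be a fraction of the list of co-segregating markers, and the IF is assumed to be the percentage increase in length of a sequential map relative to its skeleton map. The example lengths are those of the full and skeleton maps of chromosome 1A in Mohawk × Cocorit, with the full map treated as the sequential map that contains all co-segregating markers.

```python
import random

def sample_cosegregating(markers, proportion, rng):
    """Randomly pick a given proportion (0-1) of the co-segregating markers."""
    k = round(len(markers) * proportion)
    return rng.sample(markers, k)

def inflation_factor(sequential_length_cm, skeleton_length_cm):
    """Assumed definition: percentage expansion relative to the skeleton map."""
    return (sequential_length_cm - skeleton_length_cm) / skeleton_length_cm * 100.0

rng = random.Random(1)                          # fixed seed for reproducibility
coseg = [f"snp_{i}" for i in range(290)]        # hypothetical names; 290 markers as on 1A
levels = [p / 100 for p in range(10, 90, 10)]   # eight levels, 10% to 80%
replicates = 50

# One random marker subset per replicate and proportion level;
# each subset would then be added to the skeleton markers and re-mapped.
design = {p: [sample_cosegregating(coseg, p, rng) for _ in range(replicates)]
          for p in levels}

if_1a = inflation_factor(154.4, 137.3)
print(round(if_1a, 2))  # 12.45
```

Under these assumptions, the full map of chromosome 1A is about 12.5% longer than its skeleton map.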
Seven ML algorithms implemented in the Caret R package (
Linear regression model (LR): LR was developed in the field of statistics, but has been borrowed by ML. The LR algorithm is a family of model-based learning approaches that assume a linear relationship between the input variables (x) and the single output variable (y). The LR equation can be built and trained using different techniques, the most common of which is Ordinary Least Squares (OLS). OLS estimates the unknown parameters of an LR model by minimizing the sum of the squared differences between the observed responses (values of the variable being predicted) in the given dataset and those predicted by a linear function of a set of explanatory variables.
Generalized linear model (GLM): The GLM provides flexible generalization of ordinary linear regression for response variables with error distribution models other than a Gaussian (normal) distribution. GLM unifies various other statistical models, including binomial, gamma, Poisson and logistic regression. Each serves a different purpose, and depending on distribution and link function, GLM can be used for prediction or classification.
Polynomial regression with degree 2 (POLY2) and 3 (POLY3): Polynomial regression is a form of linear regression in which the relationship between the input variables (x) and the output variable (y) is modeled as a polynomial. Although polynomial regression fits a non-linear model to the data, it is considered as a special case of multiple linear regression since it is linear in the regression coefficients. We only tried quadratic (POLY2) and cubic (POLY3) models to avoid overfitting.
K-nearest neighbors (KNN): The KNN algorithm is an instance-based learning method in which new data are classified based on stored, labeled instances. The rationale behind the KNN algorithm is learning by analogy. The distance between the stored data and the new instance is calculated using similarity measures such as the Euclidean distance, cosine similarity or the Manhattan distance. The similarity value is used to perform predictive modeling for classification or regression. In both cases, the input consists of the k closest training examples in the feature space. For classification, the output is a class membership, while for regression it is the property value for the object, computed as the average of the values of its k nearest neighbors.
Support vector machine (SVM) (
Classification and regression trees (CART) (
Random forest (RF) (
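To make the instance-based idea concrete, KNN regression with a single input variable can be sketched in a few lines. This is only an illustration in plain Python, not the Caret implementation used in the study; the function name and the toy data are hypothetical.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Predict a value as the mean response of the k nearest training points."""
    # With one input variable, the Euclidean distance is the absolute difference.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    neighbours = ranked[:k]
    return sum(y for _, y in neighbours) / len(neighbours)

# Hypothetical toy data: proportion of co-segregating markers (%) and map inflation (%).
proportions = [10, 20, 30, 40, 50, 60]
inflation = [4.0, 6.0, 8.0, 10.0, 12.0, 14.0]

prediction = knn_predict(proportions, inflation, query=35, k=2)
print(prediction)  # 9.0, the mean of the responses at 30% and 40%
```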
To evaluate the map expansion, only maps generated using different proportion levels (10–80%) of co-segregating markers were used: 4,800 maps for Mohawk × Cocorit and 7,580 maps for Norstar × Cappelle Desprez. Two types of partition designs were used to build the prediction models. In the first partition design, the whole set of sequential maps for each population was split into training and test sets containing 80 and 20% of the maps, respectively. The second partition design was a 10-fold cross-validation scheme with 5 replicates (
The models were fitted using the training sample, and the fitted models were used to predict outcomes in the test set. The goodness-of-fit of the models was evaluated using the root mean square error (RMSE). The prediction accuracy was estimated as the Pearson correlation between the predicted and the observed map length in the test set.
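The first partition design and the two evaluation metrics can be sketched as below. This is a plain-Python illustration rather than the Caret workflow used in the study; the function names, the seed, and the toy observed and predicted values are all hypothetical.

```python
import math
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle the records and hold out a fraction of them as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

train, test = train_test_split(list(range(10)))  # 80/20 split of ten toy records

observed = [4.0, 7.0, 9.0, 12.0]     # hypothetical test-set values
predicted = [5.0, 6.0, 10.0, 11.0]   # hypothetical model predictions
error = rmse(observed, predicted)
accuracy = pearson(observed, predicted)
```

In this toy case every prediction is off by one unit, so the RMSE is 1.0, while the correlation remains high because the ranking of the values is largely preserved.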
A total of 24,473 linkage maps were built for this study: 8,203 maps for the Mohawk × Cocorit population and 16,270 maps for the Norstar × Cappelle Desprez population.
The features of Mohawk × Cocorit and Norstar × Cappelle Desprez maps that were built in step 1 of phase I are presented in
Features of the Mohawk × Cocorit linkage map.
Chromosomes | Full map: markers | Full map: size (cM) | Skeleton map: markers | Skeleton map: size (cM) | Co-segregating markers: number | Co-segregating markers: proportion (%) |
---|---|---|---|---|---|---|
1A | 348 | 154.4 | 58 | 137.3 | 290 | 83 |
1B | 277 | 205.9 | 46 | 197.2 | 231 | 83 |
2A | 90 | 183.6 | 28 | 148.7 | 62 | 69 |
2B | 334 | 161.1 | 51 | 145.0 | 283 | 85 |
3A | 269 | 90.3 | 27 | 82.2 | 242 | 90 |
3B | 323 | 231.8 | 55 | 180.3 | 268 | 83 |
4A | 76 | 141.7 | 24 | 128.9 | 52 | 68 |
4B | 340 | 156.9 | 40 | 119.9 | 300 | 88 |
5A | 91 | 71.7 | 36 | 63.7 | 55 | 60 |
5B | 323 | 207.8 | 36 | 178.4 | 287 | 89 |
6A | 300 | 200.5 | 63 | 165.7 | 237 | 79 |
6B | 529 | 215.7 | 53 | 188.9 | 486 | 92 |
7A | 330 | 247.2 | 64 | 192.6 | 276 | 84 |
7B | 369 | 152.5 | 49 | 123.2 | 320 | 87 |
Genome A | 1504 | 1089.4 | 300 | 919.1 | 1214 | 81 |
Genome B | 2495 | 1331.7 | 330 | 1132.9 | 2175 | 87 |
Total | 3999 | 2421.1 | 630 | 2052.0 | 3389 | 85 |
Features of the Norstar × Cappelle Desprez linkage map.
Chromosomes | Full map: markers | Full map: size (cM) | Skeleton map: markers | Skeleton map: size (cM) | Co-segregating markers: number | Co-segregating markers: proportion (%) |
---|---|---|---|---|---|---|
1A | 909 | 107.1 | 85 | 90.6 | 824 | 91 |
1B | 673 | 235.9 | 122 | 217.9 | 551 | 82 |
1D | 102 | 110.8 | 31 | 103.0 | 71 | 70 |
2A | 483 | 228.5 | 95 | 212.9 | 388 | 80 |
2B | 864 | 230.7 | 96 | 216.2 | 768 | 89 |
2D | 498 | 198.8 | 42 | 185.9 | 456 | 92 |
3A | 593 | 246.2 | 94 | 219.3 | 499 | 84 |
3B | 681 | 253.0 | 129 | 226.1 | 552 | 81 |
3D | 76 | 18.7 | 6 | 17.8 | 70 | 92 |
4A | 398 | 188.6 | 70 | 179.0 | 328 | 82 |
4B | 424 | 130.1 | 68 | 127.9 | 356 | 84 |
4D | 29 | 15.3 | 8 | 14.4 | 21 | 72 |
5A | 636 | 281.4 | 117 | 278.9 | 519 | 82 |
5B | 1049 | 225.9 | 126 | 211.6 | 923 | 88 |
5D | 107 | 27.9 | 13 | 19.8 | 94 | 88 |
6A | 437 | 170.6 | 66 | 156.7 | 371 | 85 |
6B | 937 | 185.0 | 101 | 170.3 | 836 | 89 |
6D | 103 | 47.5 | 15 | 15.8 | 91 | 88 |
7A | 641 | 216.9 | 114 | 200.7 | 527 | 82 |
7B | 471 | 144.6 | 70 | 133.6 | 401 | 85 |
7D | 43 | 72.1 | 20 | 71.6 | 23 | 53 |
Genome A | 4097 | 1439.3 | 641 | 1338.1 | 3456 | 84 |
Genome B | 5099 | 1405.2 | 712 | 1303.6 | 4387 | 86 |
Genome D | 958 | 491.1 | 135 | 428.3 | 826 | 86 |
Total | 10154 | 3335.6 | 1488 | 3070.0 | 8669 | 85 |
For Norstar × Cappelle Desprez, 10,154 markers spanning 3335.6 cM were mapped on the 21 chromosomes of the bread wheat genome. The genome-wide proportion of co-segregating markers was 85% (8,669/10,154), ranging from 53 (chromosome 7D) to 92% (chromosomes 2D and 3D). Genome A displayed 84% of co-segregating markers while genomes B and D showed 86% of co-segregating markers.
Marker order analysis revealed a very high collinearity between sequential maps and the skeleton map for all chromosomes in both Mohawk × Cocorit and Norstar × Cappelle Desprez (
Spearman correlation coefficient of marker order between sequential maps and skeleton map in Mohawk × Cocorit and Norstar × Cappelle Desprez.
Chromosomes | Mohawk × Cocorit | Norstar × Cappelle Desprez |
---|---|---|
1A | 0.99 | 0.99 |
1B | 0.97 | 0.99 |
1D | | 0.99 |
2A | 0.94 | 0.99 |
2B | 0.99 | 0.99 |
2D | | 0.98 |
3A | 0.99 | 0.99 |
3B | 0.95 | 0.99 |
3D | | 0.97 |
4A | 0.97 | 0.99 |
4B | 0.96 | 0.99 |
4D | | 0.98 |
5A | 0.96 | 0.99 |
5B | 0.99 | 0.99 |
5D | | 0.99 |
6A | 0.97 | 0.99 |
6B | 0.98 | 0.99 |
6D | | 0.97 |
7A | 0.99 | 0.99 |
7B | 0.99 | 0.99 |
7D | | 0.99 |
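The comparison of marker order between a sequential map and its skeleton map can be illustrated with a small Spearman rank correlation. This sketch assumes that the correlation is computed on the markers shared by both maps and that no two of them are tied in position; the function and marker names are hypothetical, and the study's own computation may have differed in these details.

```python
def spearman_order_correlation(reference_order, test_order):
    """Spearman's rho between the orders of shared markers in two maps (no ties)."""
    test_index = {marker: i for i, marker in enumerate(test_order)}
    shared = [m for m in reference_order if m in test_index]
    n = len(shared)
    # Rank of each shared marker in the test map, counting shared markers only.
    by_test = sorted(shared, key=test_index.__getitem__)
    test_rank = {marker: i for i, marker in enumerate(by_test)}
    d_squared = sum((i - test_rank[m]) ** 2 for i, m in enumerate(shared))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

skeleton = ["m1", "m2", "m3", "m4", "m5", "m6"]
# Sequential map: two added co-segregating markers (x1, x2) and one local swap (m3/m4).
sequential = ["m1", "x1", "m2", "m4", "m3", "x2", "m5", "m6"]

rho = spearman_order_correlation(skeleton, sequential)
print(round(rho, 3))  # 0.943
```

A single swap of adjacent markers among six already lowers the coefficient to about 0.94, so values of 0.94 to 0.99 on maps with far more markers indicate only minor local rearrangements.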
The length of the sequential maps expanded in proportion to the co-segregating markers for both Mohawk × Cocorit (
Genome-wide pattern of map length inflation factor in the Mohawk × Cocorit population.
Genome-wide pattern of map length inflation factor in the Norstar × Cappelle Desprez population.
The overall variation in IF was similar among genomes in Mohawk × Cocorit (
Boxplot of map length inflation factor per genome in the Mohawk × Cocorit population.
Boxplot of map length inflation factor per genome in the Norstar × Cappelle Desprez population.
Pattern of inflation factor for chromosomes and the proportions of co-segregating markers in the Mohawk × Cocorit population.
Pattern of inflation factor for chromosomes and the proportions of co-segregating markers in the Norstar × Cappelle Desprez population.
The prediction accuracies of the models are shown in
Prediction accuracy of different models in the Mohawk × Cocorit and Norstar × Cappelle Desprez populations.
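Populations | Models | RMSE | Accuracy |
---|---|---|---|
Mohawk × Cocorit | LR | 4.631 | 0.654 |
Mohawk × Cocorit | GLM | 4.631 | 0.654 |
Mohawk × Cocorit | KNN | 4.577 | 0.664 |
Mohawk × Cocorit | POLY2 | 4.584 | 0.664 |
Mohawk × Cocorit | POLY3 | 4.578 | 0.664 |
Mohawk × Cocorit | SVM | 4.632 | 0.661 |
Mohawk × Cocorit | CART | 4.694 | 0.638 |
Mohawk × Cocorit | RF | 4.577 | 0.664 |
Norstar × Cappelle Desprez | LR | 2.234 | 0.737 |
Norstar × Cappelle Desprez | GLM | 2.234 | 0.737 |
Norstar × Cappelle Desprez | KNN | 2.225 | 0.742 |
Norstar × Cappelle Desprez | POLY2 | 2.234 | 0.737 |
Norstar × Cappelle Desprez | POLY3 | 2.229 | 0.739 |
Norstar × Cappelle Desprez | SVM | 2.227 | 0.743 |
Norstar × Cappelle Desprez | CART | 2.389 | 0.667 |
Norstar × Cappelle Desprez | RF | 2.225 | 0.742 |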
Map inflation factor (mean ± standard deviation) relative to the proportion of co-segregating markers in the Mohawk × Cocorit and Norstar × Cappelle Desprez populations.
Co-segregating markers (%) | Mohawk × Cocorit: number of maps | Mohawk × Cocorit: inflation factor (%) | Norstar × Cappelle Desprez: number of maps | Norstar × Cappelle Desprez: inflation factor (%) |
---|---|---|---|---|
10 | 700 | 4.48 (±3.63) | 990 | 3.77 (±1.99) |
20 | 700 | 6.85 (±3.94) | 990 | 5.43 (±2.02) |
30 | 700 | 9.34 (±3.79) | 990 | 6.71 (±2.02) |
40 | 700 | 11.11 (±4.08) | 990 | 7.39 (±1.95) |
50 | 700 | 13.78 (±4.95) | 990 | 8.24 (±2.34) |
60 | 700 | 14.86 (±5.06) | 970 | 9.35 (±2.46) |
70 | 650 | 16.59 (±5.52) | 970 | 10.85 (±2.65) |
80 | 650 | 16.62 (±5.58) | 950 | 11.70 (±2.28) |
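As a rough check of the trend in this table, a straight line can be fitted by ordinary least squares to the mean IF at each proportion level. This is only a sketch on the eight tabulated means per population; the models in the study were trained on the individual sequential maps, so the coefficients below are illustrative and not the published models. The function name is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

proportion = [10, 20, 30, 40, 50, 60, 70, 80]  # co-segregating markers (%)
mohawk_if = [4.48, 6.85, 9.34, 11.11, 13.78, 14.86, 16.59, 16.62]   # mean IF (%)
norstar_if = [3.77, 5.43, 6.71, 7.39, 8.24, 9.35, 10.85, 11.70]     # mean IF (%)

mohawk_slope, mohawk_intercept = fit_line(proportion, mohawk_if)
norstar_slope, norstar_intercept = fit_line(proportion, norstar_if)
```

On these means, the fitted slopes are roughly 0.18 (Mohawk × Cocorit) and 0.11 (Norstar × Cappelle Desprez) percentage points of inflation per additional percent of co-segregating markers.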
All of the linkage maps were constructed using MSTMap software (
No single software harbors all the desirable features (e.g., ultra-fast, accurate in marker order, no map inflation, scalable) that one could expect for assembling a high-quality, high-density map in a relatively short time. Therefore, different combinations of software have been used to build high-density genetic maps (e.g.,
A strong collinearity in marker order (
“Map expansion is the phenomenon that genetic maps including a large number of genes are longer than the corresponding actual genetic distance between the genes involved” (
Many sources of map expansion have been reported, including genotyping errors and missing values (
Nonetheless, only the correction of genotyping errors and a reduction in missing values have led to substantial improvement of algorithms for the construction of high-density linkage maps (
The effect of co-segregating markers on linkage maps has received less attention. However, our study clearly showed that an excess of co-segregating markers leads to map expansion. The more co-segregating markers, the larger the map expansion. Using ML approaches, we were able to predict with an accuracy of 0.7 the map expansion relative to the proportion of co-segregating markers. Although we used both linear and non-linear methods, all of the ML algorithms gave similar results supporting evidence of a linear relationship between map expansion and the number of co-segregating markers. The proportion of co-segregating markers ranged from 60 to 92% in Mohawk × Cocorit (
To deal with map expansion, a common practice is to remove the double recombinants. However, the method of removing erroneous double recombinants could lead to irrelevant distances among markers (
We estimated the IF of each LG with respect to the length of its skeleton map. Because only a few markers can reliably be ordered in a context of high-density linkage mapping where the number of markers exceeds by far the size of the population (
Machine learning algorithms are becoming more accepted in crop breeding and are presented as a worthwhile surrogate to traditional statistical methods (
Because of the continued increase in the size of high throughput SNP-chips, the disparity between the high number of markers and the relatively small population size is more likely to result in poor resolution maps (
Our study clearly showed that an excess of co-segregating markers can lead to map expansion with little effect on marker order. Using various ML algorithms, we were able to predict map expansion relative to the proportion of co-segregating markers with an accuracy of 0.7. Because co-segregating markers are inevitable in high-density linkage maps, it becomes necessary to improve linkage mapping algorithms for efficient analysis of high-throughput data. In the meantime, a practical strategy could be to estimate the IF related to the proportion of co-segregating markers and then scale the length of the map accordingly.
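One way the scaling step could look is sketched below. It assumes that the IF is the percentage expansion of a map relative to its skeleton map, so that dividing the observed length by (1 + IF/100) removes the expected inflation. The function name and the 150 cM linkage group are hypothetical; the 16.6% value is the mean IF observed at 80% co-segregating markers in Mohawk × Cocorit.

```python
def scale_map_length(observed_length_cm, inflation_factor_pct):
    """Rescale an inflated map length, assuming IF is a percentage expansion."""
    return observed_length_cm / (1.0 + inflation_factor_pct / 100.0)

# Hypothetical linkage group of 150 cM with 80% co-segregating markers.
scaled = scale_map_length(150.0, 16.6)
```

Under these assumptions, the hypothetical 150 cM linkage group would be scaled to roughly 129 cM.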
AN’D set up the experimental design, analyzed the data, and wrote the initial manuscript. JH edited the manuscript. DF and KA created the mapping populations and edited the manuscript. CP provided all resources including funding, designed the experiment and edited the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
We are thankful for the technical support of the Wheat and Genetics Molecular Lab and CIMMYT.