AUTHOR=Faux Pierre, Geurts Pierre, Druet Tom TITLE=A Random Forests Framework for Modeling Haplotypes as Mosaics of Reference Haplotypes JOURNAL=Frontiers in Genetics VOLUME=Volume 10 - 2019 YEAR=2019 URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2019.00562 DOI=10.3389/fgene.2019.00562 ISSN=1664-8021 ABSTRACT=Many genomic data analyses such as phasing, genotype imputation or local ancestry inference share a common core task: matching pairs of haplotypes at any position along the chromosome, thereby inferring a target haplotype as a mosaic of reference haplotypes. For that purpose, these analyses combine information provided by linkage disequilibrium, linkage and/or genealogy through a set of heuristic rules or, most often, by a hidden Markov model. Here, we develop a machine learning framework to address the issue of local haplotype matching. In our approach, a supervised classifier using extra-trees (a particular type of random forests) learns how to identify the best local matches between haplotypes using a collection of observed examples. For each example, various features related to the different sources of information are observed, such as the length of a segment shared between haplotypes or estimates of relationship between individuals, gametes and haplotypes. The machine learning framework was fed with thirty relevant features for local haplotype matching. Repeated cross-validations allowed us to rank these features by their importance for local haplotype matching. The distance to the edge of a segment shared by both haplotypes being matched was found to be the most important feature. Then, using whole-genome sequence data for comparisons, we found that the machine learning framework was more efficient than a hidden Markov model at inferring a target haplotype as a mosaic of reference haplotypes. 
To further evaluate its efficiency, the machine learning framework was applied to the imputation of whole-genome sequence from 50k genotypes, where it yielded average reliabilities similar to or slightly better than those of Impute2. Through this exploratory study, we lay the foundations of a new framework to automatically learn local haplotype matching and we show that machine learning is a promising approach for this purpose. The use of this new technique also yields useful lessons about which features are relevant for haplotype matching. Finally, we discuss potential improvements for routine implementation.
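
To make the shared-segment features mentioned in the abstract more concrete, the following Python sketch computes, for a target and a reference haplotype at a given marker, the length of the identical run that contains the marker and the distance from the marker to the nearest edge of that run. Everything in it is an assumption made for illustration: haplotypes are coded as lists of 0/1 alleles, distances are counted in markers rather than base pairs, and the names `SharedSegment`, `shared_segment` and `naive_mosaic` are hypothetical. The `naive_mosaic` helper is a plain heuristic (pick the reference with the largest edge distance) that stands in for a trained classifier; it is not the extra-trees method described in the article.

```python
from typing import List, NamedTuple, Optional, Sequence


class SharedSegment(NamedTuple):
    start: int          # first marker index of the identical run (inclusive)
    end: int            # last marker index of the identical run (inclusive)
    length: int         # number of markers in the run
    edge_distance: int  # markers between the query position and the nearest run edge


def shared_segment(
    target: Sequence[int], reference: Sequence[int], pos: int
) -> Optional[SharedSegment]:
    """Return the identical run around `pos`, or None if the alleles differ at `pos`."""
    if len(target) != len(reference):
        raise ValueError("haplotypes must cover the same markers")
    if not 0 <= pos < len(target):
        raise IndexError("position outside the haplotype")
    if target[pos] != reference[pos]:
        return None

    # Extend the run to the left and to the right while alleles stay identical.
    start = pos
    while start > 0 and target[start - 1] == reference[start - 1]:
        start -= 1
    end = pos
    while end < len(target) - 1 and target[end + 1] == reference[end + 1]:
        end += 1

    return SharedSegment(start, end, end - start + 1, min(pos - start, end - pos))


def naive_mosaic(
    target: Sequence[int], references: Sequence[Sequence[int]]
) -> List[Optional[int]]:
    """Assumed heuristic: at each marker, keep the reference with the largest
    edge distance (ties broken by run length, then by lowest index)."""
    mosaic: List[Optional[int]] = []
    for pos in range(len(target)):
        best_idx, best_key = None, None
        for idx, ref in enumerate(references):
            seg = shared_segment(target, ref, pos)
            if seg is None:
                continue
            key = (seg.edge_distance, seg.length)
            if best_key is None or key > best_key:
                best_idx, best_key = idx, key
        mosaic.append(best_idx)  # None when no reference matches at this marker
    return mosaic


# Toy data (made up): the target matches ref_a on markers 0-3 and ref_b on markers 2-7.
target = [0, 1, 1, 0, 1, 0, 0, 1]
ref_a = [0, 1, 1, 0, 0, 1, 1, 0]
ref_b = [1, 0, 1, 0, 1, 0, 0, 1]
print(shared_segment(target, ref_b, 4))
print(naive_mosaic(target, [ref_a, ref_b]))
```

In a learning setting, values such as `length` and `edge_distance` would be two columns among many in the feature table given to a classifier, alongside the relationship estimates the abstract mentions; the heuristic above only illustrates why a position deep inside a long shared run is more convincing evidence of a match than one near its edge.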