Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads

DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information using the overlaps between the reads, the haplotypes of an individual can be obtained. Using parent-offspring relationships in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can be used to correct for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to short read lengths, low genomic variation and shallow coverage). We developed a novel method, PopPoly, to estimate polyploid haplotypes in an F1-population from short sequence data by taking into consideration the transmission of the haplotypes from the parents to the offspring. In addition, this information is employed to improve genotype dosage estimation and to call missing genotypes in the population. Through simulations, we compare PopPoly to other haplotyping methods and show its better performance. We evaluate PopPoly by applying it to a tetraploid potato cross at nine genomic regions involved in tuber formation.

Inspired by the approach of Berger et al. (2014), we start at the first SNP position in the target region (s = 1), and extend the maternal and paternal genotypes of this SNP, G 1 m = H 1 m and G 1 f = H 1 f , respectively, to two-SNP phasings, H 2 m and H 2 f . We consider every possible phasing between H 1 m and H 1 f and SNP position s = 2 in the region, and obtain the joint conditional probability of each extension pair, (H s m , H s f ), at s = 2 given the sequence reads of the population and the parental genotypes, (G s m , G s f ), as well as the offspring genotypes G s c i for i = 1, . . . , n (with n representing the number of offspring). Keeping only those parental extensions whose conditional probability exceeds or equals a pre-set branching threshold, ρ ∈ (0, 1], we eliminate further the extensions whose probability is less than κP max , where κ ∈ [0, 1] is a pre-set pruning threshold and P max is the maximum probability assigned to the candidate parental extensions. The surviving extensions at s = 2 are used in the next step as base phasings to obtain the extensions at s = 3 in a similar manner, and this procedure is iterated until the last SNP s = l has been added to the parental extensions. As it is not straightforward to directly calculate the conditional extension probabilities (Motazedi et al., 2018), we calculate instead the probability of the sequence reads conditional on each possible phasing and convert these probabilities to the desired extension probabilities using Bayes' formula: where R set denotes the set of all of the reads in the population and set stands for the set of basecalling error vectors, j , associated with each r j ∈ R set (1 j |R set |). P (R set |H s m , H s f , set ) denotes the conditional probability of observing the reads given a pair of maternal and paternal extensions at s, (H s m , H s f ), and the base-calling error probabilities given by set . To calculate P (R set |H s m , H s f , set ), we assume conditional independence of each read, r j ∈ R set , from the other reads in R set given set , and use the fact that each read is either directly obtained from one of the parental samples or belongs to an offspring c i (i = 1, ..., n), in which latter case the read may have originated from either parent with equal probability. Under these assumptions, P (R set |H s m , H s f , set ) is determined according to: where the function δ(r j ) returns the origin of read r j : mother (m), father (f ), or one of the n offspring (c 1 , ...., c n ).
Assuming independence of the sequencing errors at the SNP positions within each read, P (r j |H s m ) and P (r j |H s f ) in Equation 2 can be calculated according to Motazedi et al. (2018): where j assigns a base-calling error probability to every SNP position in r j , and h stands for each of the k t homologues in the phasing extension H s p (p ∈ {m, f }). In Equation 3, we use the superscript τ in r τ j and τ j to represent the called base at SNP position τ and its associated error probability, respectively. Likewise, h τ denotes the allele assigned to homologue h at SNP position τ . We use r τ j = "-" and h τ = "-" to show that SNP position τ has not been called in r j or is missing in h.
In obtaining P (r j |h, j ) in Equation 3, we assume that an erroneously called base can with equal chance be any of the three wrong bases. Therefore, the probability of observing a specific wrong allele is 1 3 τ j . Also, the probability of no error is actually the probability that no error occurs (1− τ j ), conditional on having observed either the reference or the alternative allele (1 − 2 3 τ j ). Therefore, it is Equations 2 and 3 establish the procedure to calculate the likelihood in Bayes' formula in Equation 1. In order to solve Equation 1, one also needs to specify the prior, ). While several ways can be thought of to specify this prior, we obtain it as follows. As the parental extensions (H s m , H s f ) are confined to those compatible with G s m and G s f , we set this prior to zero for every incompatible extension. For the compatible extensions, we look into the possible transmissions of the extended haplotypes (ignoring phenomena like aneuploidy (Karp et al., 1982), preferential chromosome pairing (Bourke et al., 2017), recombination and double reduction (Bourke et al., 2015)) to the offspring and for each offspring, c i , we count the number of transmissions that agree with its genotype at s, G s c i . Dividing this number by the total number of possible transmissions, km . . , n, we obtain the average likelihood of an observed offspring genotype according to: where, for p ∈ {m, f }, Π p s−1 and Π p s are the number of possible permutations of the alleles at s − 1 and s, respectively, u p is the number of distinct homologues, i.e. haplotypes, in H s p regarding only positions s − 1 and s, and ω sp i for i ∈ {1, ..., u p } denotes the number of times an identical haplotype (regarding only positions s − 1 and s) is present in H s p . Although it is possible to normalise the priors obtained this way over all of the possible extensions (to obtain a proper prior mass function), one does not need to do so as the discrete posteriors are normalised anyway at the end.

Estimation of missing and erroneous genotypes
The SNP-by-SNP extension of the parental haplotypes using the sequencing reads of an F1population is explained in Appendix A, assuming the SNPs have been accurately called for all of the population members. However, in practice every haplotyping algorithm has to handle missing and wrongly estimated SNP genotypes caused by sequencing and variant calling errors.
In presence of wrongly estimated genotypes (wrong dosages), it can occur that all of the offspring genotypes are incompatible with the parental extensions at some SNP position s. At these positions, the extension should either be skipped, as the prior weight of all candidate phasings will be zero, or the genotypes must be estimated anew. The extension at s will also be impossible if one or both of the parental genotypes are missing at s. To include these SNP positions in the extension, it is necessary to impute the missing genotypes.
In order to estimate the population genotypes at the missing or incompatible positions, we assume that the parents come from an infinite-size population at Hardy-Weinberg equilibrium. Limiting the attention to bi-allelic SNPs, the reference and alternative allele frequencies of the parents at position s can be estimated from the observed reads under the above assumption. Assuming a fixed sequencing error rate for all of the reads and nucleotide positions, 0 ER < 0.5, the frequency of the alternative allele can be obtained assuming a binomial model for the observed count of the alternative allele according to: where ξ is the total sequencing coverage of the population at s and ψ is the proportion of the alternative allele among the observed alleles. As this observed frequency, ψ, depends on the latent true frequency,p, through ψ = (1− ER)p+ ER(1 −p), it is straightforward to show thatp can be obtained as shown in Equation 6, with a standard error equal to 1 (1−2 ER) · ψ(1−ψ) ξ .
In case a specific base-calling error rate s j is assigned at each position s to each read r j , e.g. by using the integer-rounded Phred (quality) scores reported by the sequencer (Edgar and Flyvbjerg, 2015), one can assume a Gaussian distribution for the probability of observing the alternative allele at s in each read, f s P (r j )|p,σ 2 = 1 √ 2πσ 2 e − (P (r j )−p) 2 2σ 2 , and obtainp at each s according to: σ 2 = (P (r j ) −p) 2 ξ − 1 P (r j ) = (1 − s j )r s j + s j (1 − r s j ) In case more than one set of parental haplotypes has the maximum probability (Appendix A), we infer the offspring haplotypes for each of them as explained above and finally choose the family whose total MEC score (summed over all offspring) is the smallest.