A maximum-likelihood method to estimate haplotype frequencies and prevalence alongside multiplicity of infection from SNP data

The introduction of genomic methods facilitated standardized molecular disease surveillance. For instance, SNP barcodes in Plasmodium vivax and Plasmodium falciparum malaria allows the characterization of haplotypes, their frequencies and prevalence to reveal temporal and spatial transmission patterns. A confounding factor is the presence of multiple genetically distinct pathogen variants within the same infection, known as multiplicity of infection (MOI). Disregarding ambiguous information, as usually done in ad-hoc approaches, leads to less confident and biased estimates. We introduce a statistical framework to obtain maximum-likelihood estimates (MLE) of haplotype frequencies and prevalence alongside MOI from malaria SNP data, i.e., multiple biallelic marker loci. The number of model parameters increases geometrically with the number of genetic markers considered and no closed-form solution exists for the MLE. Therefore, the MLE needs to be derived numerically. We use the Expectation-Maximization (EM) algorithm to derive the maximum-likelihood estimates, an efficient and easy-to-implement algorithm that yields a numerically stable solution. We also derive expressions for haplotype prevalence based on either all or just the unambiguous genetic information and compare both approaches. The latter corresponds to a biased ad-hoc estimate of prevalence. We assess the performance of our estimator by systematic numerical simulations assuming realistic sample sizes and various scenarios of transmission intensity. For reasonable sample sizes, and number of loci, the method has little bias. As an example, we apply the method to a dataset from Cameroon on sulfadoxine-pyrimethamine resistance in P. falciparum malaria. The method is not confined to malaria and can be applied to any infectious disease with similar transmission behavior. An easy-to-use implementation of the method as an R-script is provided.


Data structure
The method is applicable to biallelic data, e.g., SNP data.Data has to be entered in a standard format.Considering n biallelic markers (SNPs), an observation (sample) corresponds to a vector with n elements.The kth element indicates whether only the wildtype allele (=0), only the mutant allele (=1), or both wildype and mutant (=2) are observed at the kth biallelic maker.A data set of sample size N is represented by an N × n matrix with entries 0, 1, or 2. The entry in the lth row and kth column indicates which alleles are observed in sample l at marker k.
A dataset containing N = 100 samples characterized by n = 5 loci has the following form: The column containing the row labels (sample IDs) and the row containing the column labels (marker IDs) are optional.
Data can be entered directly in R as an array with N rows and n columns or imported from typical formats, e.g., '.xlsx', '.xls','.csv', '.txt'.

Importing data
Assume the data is available as an '.xlsx' file.An example data set is provided in the supplement ('exampleData.xlsx').
Assume the file 'exampleData.xlsx' is stored in the directory "/home/janedoe/Documents/".The file can be imported with the R package 'xlsx'.If the package is not yet installed run the code:

Tsoungui Obama and Schneider
How to use # I n s t a l l t h e n e c e s s a r y p a c k a g e s i n s t a l l .p a c k a g e s ( ' x l s x ' ) After successfully installing the package 'xlsx' data can be imported using the following code: # Load l i b r a r i e s l i b r a r y ( x l s x )

ESTIMATING FREQUENCIES, PREVALENCE, AND MOI
The function 'mle(<DATA>,. ..)' will derive the maximum likelihood estimate (MLE).The function outputs a 3-element list.The first element is the MLE for the MOI parameter λ, the second element is the matrix that contains the MLE for the haplotype frequencies.Only the frequencies larger than 0 are returned.The column labels indicate the corresponding haplotypes (0=wildtype allele, 1=mutant allele).The third element is a matrix that contains the haplotypes with strictly positive frequency estimates.

Tsoungui Obama and Schneider
How to use If the first column of the dataset does not contain the sample IDs, the option 'id=FALSE' has to be specified.If one does not want to estimate the Poisson parameter λ but use a plug-in estimate instead, the option 'plugin = λ' needs to be specified.For instance if one wishes to use λ = 0.7 as plug in estimate for the Poisson parameter, the following code needs to be run: > mle (DATA, i d =TRUE , p l u g i n = 0 .7 ) $ lambda 0 .7 $p 00 01 10 11 [ 1 , ] 0 . 2 2 0 9 3 3 0 .3 0 6 7 9 0 5 0 . 2 1 5 1 6 9 6 0 . 2 5 7 1 0 6 9 Biased corrected estimates can be obtained by setting the option 'BC = TRUE'.The default is a bootstrap bias correction (method="bootstrap") based on = 10 000 boostrap replicates (Bbias = 10 000).Alternatively a jackknife bias correction can be obtained by setting the option method="jackknife".To obtain the bias corrected estimates using the bootstrap method, one should run the following code: > mle ( Data , i d =TRUE , p l u g i n =NULL, BC=TRUE , method = ' b o o t s t r a p ' ) $ lambda 0 .5 4 8 3 3 5 4 $p 00 01 10 11 [ 1 , ] 0 . 2 2 1 1 0 4 0 .3 0 6 0 2 6 8 0 . 2 1 6 0 3 7 1 0 . 2 5 6 8 3 2 2 $ h a p l o t y p e s Moreover, bootstrap confidence interval are outputted alongside the estimates if the option 'CI = TRUE' is specified.The default are B = 10 000 bootstrap repeats.To obtain the estimates of MOI and haplotype frequencies with their corresponding 95% confidence interval, one should run the following code:

Frontiers
Tsoungui Obama and Schneider

How to use
The MLEs outputted by the function 'mle(<DATA>,. ..)' can be used as input for the functions 'estunobsprev(<MLE>)' and 'estcondprev(<MLE>)' to estimate unobserved and conditional prevalence, respectively.Importantly, only the estimates without confidence intervals are used to estimate prevalence.This can be done with the following code: If a plug-in estimate was used for λ in the function 'mle', and the output is used to calculate the unobserved or conditional prevalence, the plug in estimate will be used automatically.
Relative prevalence is estimated from 'DATA' by the function 'relprev(<DATA>,. ..)'.The code to obtain the estimates of prevalence is: ## R e l a t i v e p r e v a l e n c e r e l p r e v <− e s t r e l p r e v (DATA, i d =TRUE) If 'DATA' is provided by 'example.xlsx',the estimates for relative are: > e s t r e l p r e v (DATA, i d =TRUE) 00 10 01 11 [ 1 , ] 0 . 2 2 2 2 2 2 2 0 . 2 1 2 9 6 3 0 .3 0 5 5 5 5 6 0 . 2 5 9 2 5 9 3 If the first column of the dataset does not contain the sample IDs, the option 'id=FALSE' has to be specified.

#
F i n d t h e MLEs e s t <− mle (DATA, i d =TRUE , p l u g i n =NULL, CI=FALSE ) # E s t i m a t e p r e v a l e n c e ## U n o b s e r v a b l e p r e v a l e n c e u n o b s p r e v <− e s t u n o b s p r e v ( e s t ) ## C o n d i t i o n a l p r e v a l e n c e c o n d p r e v <− e s t c o n d p r e v ( e s t )If 'est' is the output of 'mle(<DATA>,. ..)' assuming that 'DATA' is the data provided by the file 'example.xlsx',the estimates of the unobserved and conditional prevalence are:> e s t u n o b s p r e v ( e s t Java installation.In case of problems, it is recommendable to import the data from a '.txt' file or '.csv' file using the functions 'read.csv'or 'read.table',respectively.