Statistical Analysis on Detecting Recombination Sites in DNA-β Satellites Associated with Old World Geminiviruses

Although exchange of genetic information by recombination plays an important role in the evolution of viruses, it is not clear how it generates diversity. Understanding recombination events helps with the study of the evolution of new virus strains or new viruses. Geminiviruses are plant viruses which have ambisense single-stranded circular DNA genomes and are one of the most economically important plant viruses in agricultural production. Small circular single-stranded DNA satellites, termed DNA-β, have recently been found to be associated with some geminivirus infections. In this paper we analyze several DNA-β sequences of geminiviruses for recombination events using phylogenetic and statistical analysis and we find that one strain from ToLCMaB has a recombination pattern and is a recombinant molecule between two strains from two species, PaLCuB-[IN:Chi:05] (major parent) and ToLCB-[IN:CP:04] (minor parent). We propose that this recombination event contributed to the evolution of the strain of ToLCMaB in South India. The Hidden Markov Chain (HMM) method developed by Webb et al. (2009) estimating phylogenetic tree through out the whole alignment provide us a recombination history of these DNA-β strains. It is the first time that this statistic method has been used on DNA-β recombination study and give a clear recombination history of DNA-β recombination.

DNA-β has relatively a large range of its selection on different species of the helper virus DNA-A (Mansoor et al., 2003a), it is proposed to co-evolve with the DNA-A component (Briddon et al., 2008).
Recombination plays an important role in geminivirus (Lefeuvre et al., 2009) and DNA-β evolution (Amin et al., 2006;Lefeuvre et al., 2007). A fragment of DNA-β genome infecting tomato was reported to migrate to cotton via recombination with other adaptive DNA-β molecules (Amin et al., 2006), indicating the role of a recombination event in evolution of DNA-β molecules.
Because of the important role of recombination in DNA-β evolution, analysis on recombination events of DNA-β becomes specially important for understanding this viral evolution and disease epidemic as well as development of potential control strategies.
In this paper, we apply a statistical phylogenetic analysis using a Bayesian stochastic method to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity developed by Webb et al. (2009) to estimate potential recombination spots of DNA-β. It is the first time that this statistic method has been used on DNA-β recombination study and give a clear recombination history of DNA-β recombination. In order to confirm our results, we also apply a statistical phylogenetic method developed by Martin et al. (2005b) to the same data sets. We find that the results with the method in Webb et al. (2009) and with the method in Martin et al. (2005b) are very similar to each other. One strain of Tomato leaf curl Maharashtra betasatellite (ToLCMaB) has a recombination pattern and is possibly recombinant molecule between two strains from two distinct species, Papaya leaf curl betasatellite (PaLCuB) and Tomato leaf curl betasatellite (ToLCB),

IntroductIon
Geminiviruses are emerging as one of the most economically important plant viruses in agricultural production (Abou-Jawdah et al., 2006;Briddon et al., 2008;Zhou et al., 2008). Begomovirus is the largest genus of the family of Geminiviridae and is phylogenetically and geographically divided into two groups; the Old World viruses and the New World viruses. The new world begomovirus consists of two viral genomes, DNA-A and DNA-B, while most of the Old World begomovirus just has one partite DNA-A (Briddon et al., 2008). About a decade ago, a satellite molecule called DNA-β was found to associate with some of the old world geminivirus (Saunders et al., 2000;Briddon et al., 2001).
DNA-β has a genome approximately 1.3-1.5 kb long, and depends on the helper virus DNA-A for its replication, movement, and transmission (Saunders et al., 2000;Briddon et al., 2001;Cui et al., 2004). It is grouped into sub-viral agents by the International Committee on Taxonomy of Viruses (ICTV). The most typical plant symptoms caused by geminivirus are due to an association of DNA-β with DNA-A, whereas DNA-A alone does not lead to severe damage to crops (Cui et al., 2004;Briddon et al., 2008). C1 gene encoded by DNA-β were found to suppress host defense systems (Cui et al., 2005) and modulate host development (Yang et al., 2008), and was believed to be one of the determining factors for geminivirus-induced disease symptom development (Briddon et al., 2008).
DNA-β has not been found in the New World (North American and South American continents) and is believed to be associated with Old World begomoviruses after the geographical divergence of "Old" and "New" continents (Mansoor et al., 2003b). Although site. The observed state space is {A, C, G, T, −}. Under the evolutionary model, the evolution of homologous DNA/RNA sequences (or protein-coding sequences where the state space is of size 61) can be described by continuous time Markov chains on a phylogenetic tree. A continuous time Markov chain is characterized by a substitution rate matrix, and the phylogenetic tree summarizes the relationships between the species in terms of edge lengths (times since divergence) and common ancestors. The DNA sequences are only observed in the leaves, and information on the phylogenetic tree, substitution events (time and type) and edge lengths is missing. The transition matrix P(t) for a continuous time Markov process can be written as exp(Qt), where Q is a parametrized substitution rate matrix which determines the Markov process. In this method the evolutionary model was set as Hasegawa-Kishino-Yano (HKY) model (Hasegawa et al., 1985).
The rate matrix Q under HKY model is written as the following: Let Σ = {A, C, G, T} and let π a , a ∈ Σ, Σ a π a = 1, denote the stationary distribution of the Markov chain. This distribution can be estimated from the nucleotide frequencies in a single sequence. HKY model has substitution rate matrix: (1) where the diagonal elements are such that each row sums to 0 and the two unknown parameters are α and β. The software from Webb et al. (2009) estimates the posterior distribution using Monte Carlo Markov Chain (MCMC) method under the HMM and then it outputs each tree topology with its posterior probability along each site (see Webb et al., 2009 for details).
We have used HKY model for phylogenetic analysis on our data sets in this paper, since the HMM software in Webb et al. (2009) uses HKY model. Also note that we have used the generalized time reversible (GTR) + gamma + invariant model, which is within the 95% confidence interval computed via Akaike's information criteria (AIC) in the software jModelTest (Guindon and Gascuel, 2003;Posada, 2008), to reconstructing a ML tree and the ML tree under the GTR + gamma + invariant model has the same tree topology as the ML tree under HKY model in Figure 5 as well as the consensus tree under HKY model in

data set
A proposed taxonomy of DNA-β using 78% nucleotide sequence identity as demarcation threshold was accepted and widely used for distinguishing species from strains of DNA-β (Briddon et al., 2008). This resulted in about 51 distinct species of DNA-β associated with begomoviruses.
Tomato leaf curl disease (ToLCD) is caused by begomoviruses associated with betasatellites. A recent report showed that different species of DNA-β associated with ToLCD in India are geographically isolated and distributed (Sivalingam et al., 2010). The DNA-β molecules in southern and central India are more closely related to each other than those in northern India.
To observe potential recombination events among these geographically related DNA-β species, we chose four strains from four distinct species of DNA-β associated with ToLCD in India In the same report as well as another report (Mazhar et al., 2009), species of ToLCBDB and ToLCB are closely related in phylogenetic tree, while PaLCuB and ToLCMaB are sisters (neighbors).
Another ToLCD associated DNA-β from Indonesia (taxon-4) was chosen as an out group. Other five species of non-ToLCD related DNA-β from eastern Asia and southeastern Asia (taxa-5, 6, 7, 8, and 9) were also chosen for the out group. See Table 1 for details.

MaterIals and Methods
First, a data set of 10 DNA-β genome sequences in fasta format was aligned using clustalw-multialign software with the following parameters: (Gap opening penalty 10.0, gap extension penalty 0.2, gap separation penalty range 8, DNA weight matrix: IUB) (Thompson et al., 1994).
To analyze recombination for DNA-β from geminiviruses, we used the software package from Webb et al. (2009). In this method they applied a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity. Under the HMM, the hidden states are all possible unrooted tree topologies with the number of leaves n fixed along each of the proportion of permutations of the polymorphic sites for which that pair of sequences has some fragment with the observed score or larger (Sawyer, 1989). The software BootScan takes two phases: "Scanning phase" and "Detection phase." In "Scanning phase" first they discard non-informative sites from the input data sets and in each window of userdefined width move among the given aligned sequences. It makes bootstrap samples and compute rooted UPGMAs by definition rooted or mid-pointed neighbor-joining (NJ) trees. In "Detection phase" every combination of triplets is individually examined for bootstrap evidence that one of the sequences may be alternatively more closely related to each of the other two sequences at different positions along its length. The probability that the pattern of sites within a potential recombinant region could have occurred by a chance distribution of mutations is approximated using a Bonferroni corrected version of the binomial distribution.
The software MaxChi considers only polymorphic sites: For a given position of the moving window on the input sequence alignment and for a given pair of sequences, a chi-square statistic is computed to compare two proportions: the proportion of sites at which the sequences agree in the left half-window and the proportion of sites at which the sequences agree in the right half-window. Discordance between these two proportions may reflect a recombination event in the history of the two sequences. The maximum chi-square over all sequence pairs is recorded as a summary of the evidence for recombination at the window center. Significance of observed chi-square statistics is assessed by a Monte Carlo permutation test.
The software Chimaera is also a modification of Maynard Smith's maximum χ 2 method (Wiuf et al., 2001) with only variable sites. The statistic is the maximum χ 2 in the original alignment. The p-value equals the number of times the original statistic is smaller than the statistic from permuted alignments divided by the number of permutations. For all calculations, a sliding window was used, with the width of the window set to the number of polymorphic sites divided by 1.5. This window moves in steps of one nucleotide at a time.
The software SIScan uses a similar idea as algorithms implemented in MaxChi and Chimaera, but instead of using contingency tables they use Gaussian distribution and use Z-score to compute the p-value.
The software 3Seq is similar to RDP: 3Seq discards non-informative sites from the input data sets and then for every triplet of taxa {A, B, C}, from the data set, it chooses the sister A and B: two parent sequences that may have recombined, with one or two breakpoints, to form the third sequence (the child sequence). Excess similarity of the child sequence to a candidate recombinant of the parents is a sign of recombination; they take the maximum value of this excess similarity as the test statistic. Then they rapidly calculate the distribution of the excess similarity and using this method they estimate the p-value.

results
The most consensus trees found with the 1-1505 and 1000-1505 nt alignment were the same as the most dominant tree found with the HMM software (the pink tree in Figure 1).
Then we estimated the maximum likelihood (ML) tree from the whole alignment (including position 1 through position 1505). Next we infer phylogenetic tree using maximum likelihoods method, The generated alignment file in phylip format was put in to the HMM software (Webb et al., 2009) using the command "java -jar ST-HMM.jar" with the following parameter (iterations: 50000, burn-in: 25000, rates: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 100.0, lambda: 5, kappa: 2.0, tuningpar 0.4). Command "java -jar STHMMPosterior.jar" was used to summarize the posterior distribution, and trees with posterior probability above 0.05 were selected using the command "java -jar TreeSummary.jar". The region 1-1000 nucleotide (nt) was found to have a clear pattern of recombination, while the region 1000-1505 nt seems to have a massive pattern of tree probability.
In order to apply phylogenetic analysis to the sequences of 1-1505 and 1000-1505 nt of the 10 viral sequences after aligning with the clustalw-multialign software into nexus format, we estimated the posterior distribution under the generalized time reversible (GTR) + Γ model and HKY model, and we estimated the maximum likelihood estimators. First we applied a software MrBayes (Bronquist and Huelsenbeck, 2003) to analyze the split of different taxa on the most consensus tree under the GTR + Γ and HKY models. 647300 generations were sampled for 1-1505 nt alignment, while 3600000 generations were sampled for 1000-1505 nt alignment. The first 25% of the data was burn-in. We ran four Markov chains for each model. We followed the recommendation of MrBayes which suggests running the chains until the standard deviation of the chains' split frequencies is less than 0.01.
The software RDP takes basically three steps: First they discard non-informative sites from the input data sets and then for every triplet of taxa {A, B, C…}, from the data set, choose the sister A and B. Second, they use a window of user-defined width moved among the aligned sub-sequences one nucleotide at a time and take an average percentage identifying each of the three possible sequence pairs among {A, B, C} at the each position. Third, the probability that the nucleotide arrangement in the identified region that results in A, B appearing more closely related to C may have occurred by chance is computed using a binomial distribution.
The software GeneConv is based on an earlier statistical approach for detecting gene conversion (Sawyer, 1989). They use the term fragment for an aligned or homologous pair of segments in the input alignment. In the process, the highestscoring fragments in the given alignment are listed and assigned p-values based on the assumption of a random distribution of polymorphic sites. They assign scores as follows: First, all sites that are monomorphic in the alignment are discarded so that only polymorphic sites are considered. Secondly, for a given pair of sequences, matching bases are scored as +1 and mismatches as −m, where m depends on the pair of sequences. Fragments are assigned p-values similar to the BLAST procedure (Altschul et al., 1990;Karlin and Altschul, 1993). This p-value is an approximation  Figure 2. This is an unrooted tree. This is the most likely tree topology from position 1 to 140 and position 300 to 1000. The software from Webb et al. (2009) and RDP3 Martin et al. (2005b) indicate a potential recombination event among taxa 0, 1, 2, and 3 in the red rectangle. Also the ML tree estimated by the software PHYML has the same tree topology under HKY model as well as the consensus tree estimated by the software MrBayes under HKY and GTR + Γ. topology. The y-axis represents the probability for each tree topology and the x-axis represents position number. The tree written in pink is in Figure 1 and the tree written in the dark blue dominating from position 140 to 300 is in Figure 3. using PHYML v3.0 software (Guindon and Gascuel, 2003), with all settings default, namely the evolutionary model is HKY model, the tree topology search operation method is Nearest Neighbor Interchange (NNI), and the starting tree was computed using BIONJ (Gascuel, 1997). To analyze the splits of different taxa on the ML tree we applied bootstrapping on the columns of each alignment with the bootstrap sample size 1000. The ML tree found with the 1-1505 nt alignment was the same as the most dominant tree found with the HMM software (the pink tree in Figure 1).
From position 1 to position 141 and from position 312 to position 1000, the tree topology in Figure 1 has almost probability 1.0 (see Figure 2). Note that the estimated ML tree and the estimated consensus tree reconstructed with the whole sequences from an estimated posterior distribution have the same tree topology. However, from position 141 to position 311 in the alignment, the tree topology in Figure 3 has almost probability 1.0 (see Figure 2). The Robinson-Foulds (RF) distance (Robinson and Foulds, 1981) between the tree topology in Figure 3 and tree topology in Figure 1 is 6. Note that the largest possible RF distance for trees with n taxa is 2n − 6 which is 14 in our case (the normalized RF distance between these tree topologies is 0.43). Thus we do not think this happened because of the low support of a split but this seems to indicate strongly that around position 142 and position 311 there are possible recombination sites.
In order to compute the support for each split we have also computed the consensus tree using the software MrBayes (Figure 4) and the ML tree using PHYML (Figure 5). For the consensus tree we used the posterior distribution and for the ML tree we use the bootstrap with the sample size 1000 to compute the support for each split. They have the same tree topology as the tree in Figure 1 and the support for each split in the ML tree and the consensus tree has very high probability. Especially, the probability of each split on the consensus tree estimated with the whole sequences under HKY is 1.0 (100%). (Even though one of the splits on the ML tree reconstructed with the whole sequences under HKY has about 90% of its support all other splits have strong support; Figure 5.).
The mutation rates along each site are also estimated by the software from Webb et al. (2009) and it seems that the mutation rates are between 0.1 and 0.3 (Figure 6).  (Figure 8). However, although the recombination event lead to the possible emergence of a new strain in a different epidemic location in India, it still has a stronger relationship within its parents geographically and phlegmatically than other strains which are epidemic in other Asian countries.
DNA-β was known to be capable to adapt to a new helper virus from distinct geographic location by modifying its genome (Nawaz-ul Rehman et al., 2009) . This is an unrooted tree. The number in each split represents the probability of the split estimated by bootstrapping with the bootstrap sample size 1000. Note that the tree topology of the ML tree is the same as the tree topology of the consensus tree in Figure 4 and the tree topology in Figure 1.  . This is an unrooted tree. The number in each split represent the probability of the split. The consensus tree estimated under the GTR + Γ also has the same tree topology but it has smaller probabilities of some splits. Note that the tree topology of the consensus tree is the same as the tree topology of the ML tree in Figure 5 and the tree topology in Figure 1.   We used RDP (Martin and Rybicki, 2000), GENECONV (Padidam et al., 1999), BootScan (Martin et al., 2005a), MaxChi (Smith, 1992), Chimaera (Posada and Crandall, 2001), and 3Seq (Boni et al., 2007

dIscussIon
The advantage of our study is that estimating of phylogenetic tree through out the alignment by HMM method provide a clear history of DNA-β recombination. It is the first time that researches on DNA-β recombination use such statistic method and give this clear recombination history.
Our study also provides a way to understand DNA virus evolution through recombination events. From our results, it is likely that the specie of ToLCMaB is a result of recombination from two different species, namely ToLCB and PaLCuB. Such recombination event contributed to the occurrence of new DNA-β species as well as the evolution of DNA-β. By providing the recombination history together with geographic information, we could link the phylogeny information to the geographic information of DNA-β strains, thus help us understand evolution and epidemic of the virus. acknowledgMents Ruriko Yoshida is supported by NIH R01 grant 5R01GM086888. We thank David Haws for computations.