Phylogenomic analysis and molecular identification of true fruit flies

The family Tephritidae in the order Diptera, known as true fruit flies, are agriculturally important insect pests. However, the phylogenetic relationships of true fruit flies, remain controversial. Moreover, rapid identification of important invasive true fruit flies is essential for plant quarantine but is still challenging. To this end, we sequenced the genome of 16 true fruit fly species at coverage of 47–228×. Together with the previously reported genomes of nine species, we reconstructed phylogenetic trees of the Tephritidae using benchmarking universal single-copy ortholog (BUSCO), ultraconserved element (UCE) and anchored hybrid enrichment (AHE) gene sets, respectively. The resulting trees of 50% taxon-occupancy dataset for each marker type were generally congruent at 88% nodes for both concatenation and coalescent analyses. At the subfamily level, both Dacinae and Trypetinae are monophyletic. At the species level, Bactrocera dorsalis is more closely related to Bactrocera latifrons than Bactrocera tryoni. This is inconsistent with previous conclusions based on mitochondrial genes but consistent with recent studies based on nuclear data. By analyzing these genome data, we screened ten pairs of species-specific primers for molecular identification of ten invasive fruit flies, which PCR validated. In summary, our work provides draft genome data of 16 true fruit fly species, addressing the long-standing taxonomic controversies and providing species-specific primers for molecular identification of invasive fruit flies.


Introduction
The family Tephritidae in the order Diptera, commonly known as true fruit flies, includes over 4,300 species distributed in about 500 genera worldwide (White, 2006).Some species within this family are major agricultural pests globally, threatening various fruits, and causing significant economic losses (Smith et al., 2002).These economically damaging species mainly belong to five genera, Anastrepha, Bactrocera, Ceratitis, Dacus, and Rhagoletis (Smith et al., 2002).
Though extensive efforts have been devoted to clarifying the phylogeny of fruit flies, the relationships between some groups remain controversial.For example, based on morphological characteristics, the tribe Dacini (Dacus + Bactrocera) and Ceratitis genus belonged to the subfamily Dacinae, and the genera Anastrepha and Rhagoletis were in the subfamily Trypetinae (Korneyev, 1999).However, molecular evidence does not support the monophyly of Trypetinae (Han and McPheron, 1997;Han and Ro, 2009).Moreover, recent studies have suggested that both the Dacinae and the Trypetinae are non-monophyletic (Krosch et al., 2012;San Jose et al., 2018;Zhang et al., 2019;Song et al., 2019;David et al., 2021;Yong et al., 2021).In the subgenus Bactrocera, mitochondrial data shows that Bactrocera dorsalis is more closely related to Bactrocera tryoni than to Bactrocera latifrons (da Costa et al., 2019;Yong et al., 2016;Zhang et al., 2010;Zhang et al., 2018), while nuclear data supports a closer relationship between B. dorsalis and B. latifrons (Dupuis et al., 2018;San Jose et al., 2018;Valerio et al., 2022).These inconsistences were primarily due to incomplete lineage sorting or introgression (Zhang et al., 2021;Congrains et al., 2023;San Jose et al., 2023).Moreover, most previous molecular studies are mainly based on a few nuclear genes or mitochondrial genome data.
Phylogenetic analyses with a limited number of loci may lead to disputed conclusions (Munro et al., 2011;Young and Gillung, 2020), expanded sets of molecular markers have been used to infer the evolutionary relationships of species across distant taxa (Young and Gillung, 2020).For example, phylogenetic analysis of genome-scale data has tested controversial phylogenetic relationships for a wide range of organisms, such as bacteria (Gomila et al., 2015), fish (Hughes et al., 2018), spiders (Lozano-Fernandez et al., 2019) and asterids (Zhang et al., 2020).Transcriptomes are important genome-scale data widely used for phylogenetic analyses, including Lepidoptera (Bazinet et al., 2017), spiders (Garrison et al., 2016), insects (Misof et al., 2014), Ostracoda (Oakley et al., 2013) and chalcid (Zhang et al., 2020).However, as transcriptomes contain only expressed genes and transcriptome sequencing typically require a large quantity of high-quality RNA (Lemmon and Lemmon, 2013;McCormack et al., 2013), its utility is restricted.In contrast, whole-genome assemblies (Zhang et al., 2019) can obtain near-complete gene sets from a wide range of tissue types.Moreover, it is feasible to sequence the whole genome from low-quality samples such as preserved museum specimens or those intercepted by customs (Huynh et al., 2023).Genome-scale data have been used to infer phylogenies across distant taxa including lice (Boyd et al., 2017), butterflies (Allio et al., 2020), wasps (Cooper et al., 2020), springtails (Sun et al., 2020), and scale insects (Liu et al., 2022).These studies suggest whole genome assemblies are information-rich for phylogenomic analyses.
Rapid invasive species identification is important for customs departments to develop effective quarantine measures.Presently, the most widely used method for identifying fruit flies relies on the morphological features of adult insects (Huang et al., 2017).However, if intercepted pests are in the stage of larvae or pupae, they need to be reared to adults for accurate identification.This is time-consuming or even impossible to obtain adults because of emergence failure.However, due to the high sequence similarity between true fruit flies, reliable molecular identification primers are still unavailable (Liang et al., 2011;Frey et al., 2013;Manger et al., 2018;Kunprom and Pramual, 2019).
In this study, we used a phylogenomic approach to uncover the phylogeny of Tephritidae to address unclear phylogenetic relationships of true fruit flies.First, genome data of 16 true fruit flies were obtained via Illumina sequencing.Second, we extracted the BUSCO, AHE and UCE from the genome data and built different matrices data to infer the tephritid phylogeny.Moreover, with these genome data, we designed speciesspecific primers for molecular identification of true fruit flies.Our results provide new insights into the phylogenetic relationship of true fruit flies at the genome level and technical support for quarantine identification of invasive true fruit flies at custom ports.

High-throughput sequencing
We collected samples of 16 true fruit fly species for sequencing, across seven genera: Anastrepha, Bactrocera, Ceratitis, Dacus, Zeugodacus, Carpomya and Rhagoletis (Table 1).DNA extraction was performed on a single specimen per species, using the Blood and Cell Culture DNA Midi Kit (Qiagen, United States).The quality of genomic DNA was controlled by the following criterion: the concentration of DNA is greater than 30 ng/μL; the OD260/ 280 range from 1.8 to 2.0; the DNA has no RNA contamination.A 350-bp insert Illumina TruSeq fragment and a 400-bp insert library were constructed from qualified genomic DNA using a TruSeq Nano DNA HT Sample Preparation Kit, and then sequenced on the Illumina Hiseq X-ten and NovaSeq 6000 platforms (see Table 1 for details), respectively.All sequencing data for the 16 fruit flies are available in the National Genomics Data Center GSA database (https://ngdc.cncb.ac.cn/gsa/).The GSA number is CRA016637.Genomes were sequenced on the Illumina Hiseq Xten platform marked with an asterisk, and Illumina NovaSeq6000 platform marked with two asterisks.
Frontiers in Genetics frontiersin.org

Extracting BUSCO, UCE and AHE
For the BUSCO marker, we retrieved the single copy orthologs from the results of each genome assembly by using BUSCO v3.0.2 (Waterhouse et al., 2018) software to scan for the Insecta BUSCO set (1,658 loci).
For UCE and AHE loci, we employed the PHYLUCE v1.6.3 package manual (Faircloth, 2016) to extract UCE and AHE from each genome assembly, using the Diptera-wide UCE2.7kv1 probe set containing 31,328 baits targeting 2,711 loci (Faircloth, 2017) and the Diptera AHE probe set containing 217,702 sites targeting 559 loci (Young et al., 2016), respectively.In the PHYLUCE, the script "phyluce_probe_run_multiple_lastzs_ sqlite" was used to align the probe sequence to the assembly genomes.The script "phyluce_probe_slice_sequence_from_ genomes" was used to extract the Fasta sequence from the assembly genomes.Then the script "phyluce_assembly_match_ contigs_to_probes" was used to match contigs from probes and remove duplicate contigs.Finally, the UCE and AHE loci were extracted using the scripts "phyluce_assembly_get_match_counts" and "phyluce_assembly_get_fastas_from_match_counts."The flanking region of 400 bp on both sides for each UCE and AHE locus was retained.

Phylogenetic analyses
To infer the phylogenetic relationships of the fruit flies, in addition to the 16 species sequenced in this study, nine previously sequenced fruit fly species from NCBI, including Zeugodacus cucurbitae, B. dorsalis, B. latifrons, Bactrocera minax, Bactrocera oleae, B. tryoni, Rhagoletis zephyria, Rhagoletis pomonella, Ceratitis capitata, and two outgroup species (Drosophila melanogaster and Drosophila novamexicana) were analyzed.The accession numbers for each species are listed in Table 1.In total, our taxon sampling was 27 taxa including 25 ingroup species and two outgroup species for phylogenetic analyses.
Phylogenomic analyses were conducted using concatenation method, generating supermatrix and coalescent-based species trees for UCE, BUSCO, and AHE matrices.We executed maximum likelihood (ML) of concatenation analysis using partitioning schemes with PartitionFinder v2.1.1 (Lanfear et al., 2017) for the best trees to conduct 20 ML tree searches (10 random and 10 parsimony-based starting trees) and 1,000 bootstrap replicates using RAxML-NG v1.0.1 (Kozlov et al., 2019).For species tree estimation based on the coalescent method, gene trees were first estimated using RAxML-NG v1.0.1 on individual gene alignments with the GTR + G4 substitution model for nucleotides and amino acids with the LG + G4 substitution model by running 500 bootstrap replicates.Species trees were then estimated from gene trees using ASTRAL-III v5.6.1 (Zhang et al., 2018a), using local posterior probabilities to assess node support.

Calculating Robinson-Foulds (RF) distances
We calculated the pairwise Robinson-Foulds (RF) distances (Robinson and Foulds, 1981) between the topologies of gene trees from BUSCO, UCE, and AHE datasets at 50% species occupancy and their species tree topology using the function multiRF in the phytools R package (Revell, 2012).The discordance between all the gene trees and species tree was visualized using a multidimensional scaling method (Duchêne et al., 2018;Roycroft et al., 2020).The pairwise RF distances were plotted in two dimensions using the function cmdscale in R and visualized using the ggplot2 package (Villanueva and Chen, 2019).

Calculating phylogenetic informativeness
To assess the ability of different marker types to infer relationships at specific time points (Townsend, 2007), the phylogenetic informativeness (PI) of BUSCO, UCE, and AHE nucleotide datasets at 50% species occupancy as measured using TAPIR (Faircloth et al., 2012a), optimized for parallelized calculation across extensive genomic datasets.Before calculating PI, a time-tree was used as an input in this program.We constructed time-calibrated phylogenetic trees for each dataset using our consensus phylogeny topology.The total PI for each dataset and the PI per locus per dataset were calculated, respectively.

Molecular identification of fruit flies
Considering the availability of the specimens collected, only 13 fruit fly species were used to screen for species-specific primers for molecular identification (Table 2).They were identified by morphological characteristics (Plant Health Australia, 2011;Huang et al., 2017), before the test.Coding sequence (CDS) obtained by Ab initio gene prediction of these 13 fruit fly species were blasted against the genome assemblies of all the 25 fruit fly species (Table 1) to predict species-specific sequences by using the following steps: 1) The CDS of each fruit fly was fragmented into 200 bp short sequences with a step length of 50 bp.2) Then short fragments were searched against genome assemblies for high sequence similarity matches using Bowtie2 v2.5.1 (Langmead and Salzberg, 2012).3) Fragments with no blast hit in other fruit flies were species-specific.4) The species-specific fragments were used for designing specific primers and were then verified through PCR.Genomic DNA from each species was extracted using the Blood & Cell Culture DNA Midi Kit (Qiagen).The total PCR reaction volume was 20 μL, including 10 μL of PremixTaq (Takara), 1 μL each of primer (10 μM), 1 μL of DNA template, and 7 μL of ddH2O.The PCR reaction consisted of an initial denaturation step at 95 °C for 3 min, followed by 35 cycles of denaturation at 94 °C for 30 s, annealing at the prime specific temperature (Supplementary Table S4) for 20 s, extension at 72 °C for 45 s, and a final extension step at 72 °C for 5 min.PCR products were examined via 1.2% agarose gel electrophoresis.

Genome assemblies of 16 fruit flies
Each true fruit fly species yielded approximately 30-50 Gb of raw reads, with a sequencing coverage ranging from 47× to 228×.The assembled genome sizes ranged from 302 Mb (Drosophila ciliates) to 1,290 Mb (Rhagoletis cerasi), with contig N50 lengths spanning from 3.72 (Ceratitis rosa) to 96.93 kb (Bactrocera tsuneonis), and the number of contigs ranging from 493,516 to 18,487 (Table 1).
The three molecular markers showed different data matrix patterns.Only a few loci were obtained with 100% presentation.33 BUSCO loci and one AHE locus were present in all 27 species tested, while s no UCE identified locus was present in all species (Table 3).With the decreasing taxon occupancy, the number of loci in the data matrix for each molecular marker type increased.For BUSCO, an average of 31.52%parsimony informative sites (PIS) of amino acid alignments, and 43.58% PIS of nucleotide alignments Assessment of genomic completeness of 16 fruit fly species in the Tephritidae family by benchmarking universal single-copy ortholog (BUSCO) analysis using the insect orthoDBv9 dataset containing 1,658 BUSCO genes.C, complete; S, complete single copy; D, complete duplicated; F, fragmented; M, missing.The results showed that 80.3%-99% (1,332-1,650) of BUSCO genes were complete, only 1%-4.9% (2-85) of BUSCO genes were fragmented, and 0.6%-14.8%(6-245) of BUSCO genes were missing, therefore validating these genome assemblies for further analysis.

Recovering phylogenetic relationships of fruit flies
To construct phylogenetic trees for the Tephritidae, we conducted phylogenetic analyses using the different matrices of the three markers (BUSCO, AHE, and UCE).It has been reported that increasing taxon occupancy leads to a reduced loci number (Allio et al., 2020).Increasing the number of loci rather than taxon-occupancy tends to increase phylogenetic tree topological convergence and node support values for each type of marker (Supplementary Figures S1-S8).At 50% taxon-occupancy, the phylogenetic tree topological and node support values for the three types of molecular markers tended to be convergent, based on both concatenation and coalescent methods (Supplementary Figures S1-S8).
The phylogenetic trees inferred using these molecular markers at 50% taxon-occupancy data were generally congruent at most nodes based on both two methods.The family Tephritidae was shown as being comprised of two main clades, Dacinae and Trypetinae with high bootstrap values at the backbone nodes (Figure 2; Supplementary Figures S9-S11).The subfamily Dacinae includes four genera-Bactrocera, Zeugodacus, Dacus, and Ceratitis, and the subfamily Trypetinae contains three genera-Anastrepha, Rhagoletis, and Carpomya.
In the Dacinae clade, the genus Zeugodacus is sister to the genus Dacus, forming a monophyletic group.Within the Bactrocera subclade, the subgenus Bactrocera forms a monophyletic cluster, separating from the subgenera Daculus and Tetradacus.In the Trypetinae subfamily clade, the genus Carpomya lays close to the paraphyletic cluster Rhagoletis, forming a separate subclade from Anastrepha.However, all three molecular markers were inconsistent for the B. dorsalis species complex with ML and ASTRAL analyses (Figure 2; Supplementary Figures S9-S11).The UCE dataset, however, showed a congruent topology based on two different methods (Supplementary Figure S11).Bactrocera philippinensis was distant from the other two Bactrocera complex species, and Bactrocera thailandica was the most closely related to B. dorsalis (Supplementary Figure S11).The same result was obtained from the BUSCO dataset based on ASTRAL analysis (Figure 2; Supplementary Figure S9) and the AHE dataset based on ML analysis (Supplementary Figure S10).

Evaluation of the phylogenetic performance of molecular markers
To compare the phylogenetic performance of the genomic markers, we used the 50% taxon-occupancy dataset for each marker type.We measured the phylogenetic informativeness (PI) of the three marker types to assess their ability to resolve  evolutionary relationships at given time points.The BUSCO dataset showed surpassingly higher total PI than the UCE dataset, both of which were higher than the AHE dataset across all time scales (Figure 3A).For the PI per locus, the three types of markers displayed nearly identical PI over the past 15 Ma.However, the PI value of the BUSCO dataset then rose rapidly and showed higher than both the UCE and AHE datasets from 15 to 150 Ma.During this period, the AHE dataset showed slightly higher PI values than the UCE dataset (Figure 3B).In summary, the PI of our results indicated that the BUSCO dataset contained more robust phylogenetic signals than both UCE and AHE.
We also calculated Robinson-Foulds (RF) distances between gene trees topologies from each dataset and species tree topology.Across all marker types, an abundant degree of discordance was observed between the gene tree and species tree (Figures 4A-C).The distribution was scattered and none of the gene trees completely matched the topology of the species trees for each type of marker (Figures 4A-C).The gene trees from the BUSCO dataset, with higher average bootstraps (Figure 4D), most of which were more concentrated, showed less RF distance to the species tree compared to the gene trees from UCE and AHE (Figures 4A-C).In contrast, the gene trees from the AHE dataset were the most scattered among themselves and the species tree, showing the largest degree of difference between gene trees and the species tree (Figure 4C).Therefore, the BUSCO dataset containing more PI exhibited less gene tree heterogeneity and gene tree-species tree heterogeneity and possesses a superior potential to resolve the relationship of the studied true fruit flies.

The divergence time of the Tephritidae family
To estimate the divergence time of the Tephritidae family, we used the datasets from UCE50, BUSCO50 (NT), and AHE50.Mean posterior time estimates of all these molecular markers yielded similar results (Figure 5A).However, on the shallower nodes, such as the generic nodes, it seemed that times estimated based on BUSCO tended to be slightly older and feature wider confidence intervals than those based on AHE and UCE, with the youngest age estimates occurring for UCE.In contrast, on the deeper nodes, the estimates from the three markers differed only marginally.This result indicated that time estimates of highly conserved loci were slightly older in the clades that underwent recent rapid radiations.

Molecular identification of fruit flies using species-specific primers
The number of species-specific sequences predicted was 4-1,927 among the 13 fruit fly species (Supplementary Table S3).Based on these specific sequences, ten pairs of specific primers, corresponding to ten species, were verified through PCR amplification.The annealing temperature for these primers ranged from 53 °C to 60 °C, and the product sizes spanned from 101 bp to 184 bp (Supplementary Table S4).A single specific band was found in a total of seven species including B. dorsalis, B. latifrons, B. oleae, C. capitata, Z. cucurbitae, Zeugodacus tau and Anastrepha ludens, while no amplified fragments were found in other species (Figure 6).Though a single target band was found in Dacus punctatifrons, Bactrocera correcta and Anastrepha suspensa, false negative fragments were also amplified in non-target species, inconsistent with the expected band size (Supplementary Figure S12).Therefore, combined with the amplified fragment size and sequence information, these three species can still be reliably identified.In summary, a total of ten pairs of speciesspecific primers were screened, which could effectively distinguish ten species from 13 fruit fly species.

Discussion
The rapidly decreasing sequencing costs have facilitated the fast accumulation of genome data of a wide range of organisms.In contrast to the transcriptome, it is feasible to obtain abundant gene resources from sub-optimal samples such as specimens with 100-year-old history stored in museums (Huynh et al., 2023).The specimens of fruit flies used in this study were intercepted by customs, and the DNA of these samples was usually severely degraded.Although the genome assemblies obtained for these true fruit flies were at the contig level, the BUSCO assessment results showed that the genome completeness of most of them was above 90% (Figure 1), suggesting that these genome assemblies, although fragmented, have a high gene space and are suitable for subsequent phylogenomic analysis.

Genome-scale data for phylogenetic analysis
To construct a phylogenetic tree with high confidence, we employed different types of molecular markers, namely, BUSCO, UCE and AHE, with varying gene completeness datasets.Extraction proportions for AHE and UCE genes from the genome assemblies of 25 fruit flies ranged from approximately 20%-30% and 30%-50%, respectively, while the extraction proportion for BUSCO genes was above 70% (Supplementary Table S1).However, it should be noted that many genes were absent at 100% species-occupancy for all three types of markers.For instance, UCE was lacking, and only one AHE gene was observed at 100% species occupancy (Table 3).These results suggested a species bias in UCE and AHE, which may be due to the loss of some conserved loci during the genomic evolution of true fruit flies, causing the target loci in the universal Diptera probe set not to be conserved in this rapidly diversifying group (Cohen et al., 2021).Another possible reason is that the evolutionary distances between the studied species and those used for creating probes are too far to find more conserved AHE and UCE loci (Branstetter et al., 2017).For example, the Diptera AHE probe set used in this study was initially designed for flower flies in Syrphidae (Young et al., 2016).One species used for this probe kit was D. melanogaster, so it was unexpected that its extraction proportion here was 97% (Supplementary Table S1).However, other species proportions were substantially lower (Supplementary Table S1).Thus, it is necessary to develop a lineage-specific probe set.

Phylogeny of the fruit flies
Although many studies have addressed the phylogenetic relationship of the Tephritidae family over the past few decades, some controversies remain.Deep level phylogenetic analysis using a limited number of mitochondrial genes, reconfirmed the monophyly of the Dacinae but did not support the non-monophyletic relationship of the Trypetinae, and showed that the tribe Carpomyni (Rhagoletis + Carpomya) clustered together with Dacinae rather than Anastrepha (Han and McPheron, 1997;Han and Ro, 2009).However, recent studies based on several genes or mitogenome showed that both the Dacinae and Trypetinae are not monophyletic.For instance, the genus Ceratitis was closer to the genus Anastrepha than to the Dacini tribe (San Jose et al., 2018;David et al., 2021;Yong et al., 2021).In contrast, other studies showed that the Anastrepha was closer to the Dacini tribe, forming a distinct cluster from the Ceratitis (Krosch et al., 2012;Zhang et al., 2019b;Song et al., 2019).However, our results based on genomic data, showed that Ceratitis clustered together with the Dacini tribe and Anastrepha clustered together with the Carpomyni tribe, supporting the monophyly of the Dacinae and Trypetinae which aligns with morphological evidence (Korneyev, 1999) (Figure 2; Supplementary Figures S9-S11).
Our results showed that the genus Zeugodacus was sister to the genus Dacus rather than Bactrocera.Morphological evidence regarded Zeugodacus as a subgenus of Bactrocera (Wang, 1996).However, Krosch et al. (2012) proposed Zeugodacus be elevated to the genus level.This was confirmed by subsequent studies using more genes or the mitochondrial genome (Virgilio et al., 2015;San Jose et al., 2018;Zhang et al., 2019b;Yong et al., 2021).Here, we confirmed previous proposals to raise Zeugodacus to genus level using whole genome data.
At the shallower levels within the subgenus Bactrocera, B. dorsalis has generally been regarded as more closely related to B. tryoni than to B. latifrons which was basal to the subgenus Bactrocera based on mitochondrial data (da Costa et al., 2019;Yong et al., 2016;Zhang et al., 2010;Zhang Y. et al., 2018).In contrast, our results showed that B. dorsalis was more closely related to B. latifrons than to B. tryoni, consistent with recent studies based mainly on nuclear data (Dupuis et al., 2018;San Jose et al., 2018;Valerio et al., 2022).Aside from the results obtained by Valerio et al. (2022), those two studies did not conclude the relationships between these three species due to incongruent results from different analysis methods (Dupuis et al., 2018;San Jose et al., 2018).Our results, based on various types of genomic scale datasets with both ML and ASTRAL analyses, supported the closer relationship between B. dorsalis and B. latifrons.For the B. dorsalis species complex, Bactrocera phillipinensis and Bactrocera invadens were previously considered junior synonyms of B. dorsalis (Schutze et al., 2015;Schutze et al., 2017).But Drew and Romig proposed the withdrawal of this result (Drew and Romig, 2016).Further evidence based on the male aedeagus showed that B. phillipinensis and B. invadens differed from B. dorsalis (Drew et al., 2022), confirming this withdrawal.Our results show that B. dorsalis is more closely related to B. thailandica than B. phillipinensis and B. invadens (Figure 2; Supplementary Figures S9-S11).

Molecular identification of fruit flies
Molecular identification is not limited to the insect stage and specimen integrity, which is a simple and accurate method.DNA barcoding based on the mitochondrial cytochrome oxidase I gene (COI) (Hebert et al., 2003) has been widely used in species identification (Hajibabaei et al., 2006;Nopparat et al., 2010).However, many problems have emerged, such as the close genetic distance between species, sequence similarity, and the interference of mitochondrial pseudogenes, resulting in an inability of COI to accurately distinguish between the species of fruit flies (Liang et al., 2011;Blacket et al., 2012;Manger et al., 2018).Compared to DNA barcoding, which uses a single gene for identifying species, large numbers of potential candidate diagnostic loci were quickly obtained from whole genome assemblies in this study (Supplementary Table S3).These alternative diagnostic loci may circumvent the abovementioned issues and provide a greater range of tools for species identification.Furthermore, the species-specific identification method used here has the advantages of speed and cost-effectiveness, unlike the tree-based COI diagnostics methods.In recent years, species-specific simple repeat sequences from the genome were successfully used for the molecular identification of four fruit fly species including C. capitata, Z. cucurbitae, B. dorsalis and B. tryoni (Ding et al., 2018).However, simple repeat sequences are usually located in non-coding regions, and there are large differences in repeat sequences between individuals of the same species (Miah et al., 2013), making it challenging to ensure their stability in amplification.The CDS used to screen species-specific sequences in this study, encode protein products and are relatively stable in PCR amplification.The PCR products are also relatively easy and stable to amplify due to their size, ranging from 100 to 200 bp (Supplementary Table S4), conducive to the repetition in molecular identification.False negative amplification has also been observed in other nontarget species.Moreover, although the species-specific markers such as Bcor7, Ztau2, and Dpun6, had no hits in the NT database using BlastN, they showed some extent of similarity with bacterial proteins using BlastX.Whether these markers are reliable for effective molecular identification needs further verification.Therefore, more samples are required to verify the selected specific primers in the future.We successfully screened ten pairs of specific primers corresponding to ten species based on a broad survey of whole genome assemblies.Our results provided technical support for the quarantine inspection of invasive fruit flies while enriching the gene resources for identifying fruit flies and presenting new ideas for molecular diagnostic marker screening.

FIGURE 2
FIGURE 2 Species trees for fruit flies in the Tephritidae family estimated based on the BUSCO nucleotide matrix of 50% taxon-occupancy amino acid dataset.Concatenation-based RAxML species phylogenetic tree (left) and coalescent-based ASTRAL species phylogenetic tree (right) were inferred by analysis of 1,636 BUSCO loci.Branch support values denote bootstrap support and local posterior probability, respectively.Only support values smaller than 100% or 1 are shown.

FIGURE 3
FIGURE 3 Phylogenetic informativeness (PI) over time for the 50% taxon-occupancy datasets of each molecular marker type.(A) total PI, (B) PI per locus.Dot means the average PI across all loci of each dataset type for each time point.

FIGURE 4
FIGURE 4 Multidimensional scaling of the pairwise Robinson-Foulds (RF) distance of all gene trees and species trees from BUSCO50 (NT) (A), UCE50 (B), and AHE50 (C) datasets.Each dot represents the topology of each gene tree.Distance of pairwise dots represents the RF distance between gene trees.The red dot represents a species tree inferred from the BUSCO50 (NT) dataset using the coalescent method.Average bootstrap values of individual gene trees from each dataset [BUSCO50 (NT), UCE50, and AHE50] are shown in (D).

FIGURE 5
FIGURE 5 Divergence time of fruit flies in the Tephritidae family estimated by molecular clock analysis.(A) Divergence tree estimation of BUSCO50 (NT) dataset performed using MCMCtree.(B) Clade age at the genus level and higher level estimated using AHE50, BUSCO50 (NT), and UCE50.Points indicate mean posterior time estimates and lines mean 95% confidence intervals.Red stars: prior calibration.Colors highlight the genus or tribe in the Tephritidae family.

TABLE 1
Taxon sampling and genomic information of 25 Tephritidae fruit flies and two Drosophila species as outgroups.

TABLE 2
Information on fruit flies for molecular identification.

TABLE 3
Summary statistics of various datasets.