Decision Theory-Based COI-SNP Tagging Approach for 126 Scombriformes Species Tagging

The mitochondrial gene cytochrome c oxidase I (COI) is commonly used for DNA barcoding in animals. However, most of the COI barcode nucleotides are conserved and sequences longer than about 650 base pairs increase the computational burden for species identification. To solve this problem, we propose a decision theory-based COI SNP tagging (DCST) approach that focuses on the discrimination of species using single nucleotide polymorphisms (SNPs) as the variable nucleotides of the sequences of a group of species. Using the example of 126 teleost mackerel fish species (order: Scombriformes), we identified 281 SNPs by alignment and trimming of their COI sequences. After decision rule making, 49 SNPs in 126 fish species were determined using the scoring system of the DCST approach. These COI-SNP barcodes were finally transformed into one-dimensional barcode images. Our proposed DCST approach simplifies the computational complexity and identifies the most effective and fewest SNPs to resolve or discriminate species for species tagging.


INTRODUCTION
The original concept of DNA barcoding was proposed to identify and discriminate a given species by a unique DNA sequence (Hebert et al., 2003). Such a DNA sequence aims at tagging species like a barcode. It is designed to identify a species from known DNA barcode sequences in a database. The commonly used DNA barcode of animal species is the mitochondrial gene cytochrome c oxidase I (COI) with a length of about 650 base pairs (bp). Meanwhile, COI sequences are also used for evolutionary and ecological studies (Hebert et al., 2003; DasGupta et al., 2005; Meier et al., 2006; Austerlitz et al., 2009; Kress et al., 2015; Park et al., 2018).
However, most nucleotides of the COI gene are conserved among different species, except for a minor proportion representing single nucleotide polymorphisms (SNPs). Several disease studies have used specific SNPs to predict the predisposition for disease and the effects of therapeutic approaches. This concept has rarely been used for tagging species or improving the information content of DNA barcode sequences. The major benefit of using SNPs is the reduction of computational burden by removing the more abundant, non-informative, identical homologous nucleotides.
As an example, the tagging of fish species is not yet optimized with respect to informative DNA barcoding. Some fish species have very similar morphology and are difficult to distinguish, especially for marketing, conservation, and forensic purposes. Seafood mislabeling or fraud is a common societal and legal problem in fish trading (Sarmiento-Camacho and Valdez-Moreno, 2018) and the seafood economy (Vandamme et al., 2016; Willette et al., 2017). Currently, DNA barcoding is a reliable system for species identification and authentication, and it is necessary to apply barcoding to many fish species (Liu et al., 2013; Vandamme et al., 2016; Willette et al., 2017; Sarmiento-Camacho and Valdez-Moreno, 2018). However, the COI sequences (∼650 bp) are largely uninformative and too long for an optimized application for the above purposes.
In the present study, we follow the original concept of DNA barcoding to develop a decision theory-based COI SNP tagging (DCST) approach where only the variable nucleotides (SNPs) of a given COI barcode sequence are applied for the tagging of fish species. The Fish Barcode of Life Initiative (FISH-BOL) (Ward et al., 2009) provides a public database for DNA barcode sequences with images and geospatial information for almost 10,000 fish species (Becker et al., 2011).
We use the idea of decision theory (Quinlan, 1986; Berger, 2013; Fernandez Slezak et al., 2018) to determine which sites (nucleotides) of DNA sequences are selected to discriminate between species. These are used to generate the unique DNA tags for classification. Using the DCST approach, SNPs are extracted from COI sequences to generate an SNP-based COI pattern. Finally, the SNP-COI pattern is transformed into a one-dimensional sequence barcode.
The major aim of our proposed DCST approach is to provide an effective identification tool by generating an SNP-COI barcode. Here we apply this to the example of 126 scombriform fishes.

Sampling and Data Pre-processing
We retrieved the COI sequences from 126 species of the bony fish (Teleostei) order Scombriformes that include representatives of the following families: Ariommatidae, Arripidae, Bramidae, Caristiidae, Centrolophidae, Chiasmodontidae, Gempylidae, Icosteidae, Nomeidae, Pomatomidae, Scombrinae, Scombrolabracidae, Scombropidae, Stromateidae, Tetragonuridae, and Trichiuridae. The sequence data, ranging from 648 to 685 base pairs (bp) in length, were obtained from GenBank. Details of the family name, species name, sequence length, and accession number are shown in Table 1. COI sequences (n = 126) from these scombriform fishes were aligned using the ClustalW tool in MEGA 7 software (Kumar et al., 2016). Subsequently, the protruding 5′ and 3′ ends were trimmed so that all COI sequences had the same length.

Decision-Based COI SNP Tagging (DCST)
Decision theory (Berger, 2013) improves a decision-maker's choice among a set of alternatives that need to be considered. Most of decision theory is normative, prescriptive, or descriptive, and it aims at a decision that is completely rational, accurate, and easy to understand. Possible alternatives and outcomes are considered as follows: step (1) clearly define the given problem, step (2) organize all the possible alternatives, step (3) be aware of all possible outcomes, step (4) consider the benefits of each alternative and outcome, step (5) create a mathematical decision theory rule model, and step (6) make a decision by evaluating the models.
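Based on this understanding of decision making, we propose an approach for DNA barcoding that generates shorter DNA barcodes, which we call the decision theory-based COI-SNP tagging (DCST) approach. Given an N×M matrix of sequence data, S is described as:

$$S = \begin{bmatrix} s_{1,1} & s_{1,2} & \cdots & s_{1,M} \\ s_{2,1} & s_{2,2} & \cdots & s_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ s_{N,1} & s_{N,2} & \cdots & s_{N,M} \end{bmatrix} \qquad (1)$$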
where N is the number of sequences (one from each species) and M is the nucleotide length. There are four nucleotide types, A, T, G, and C, in the matrix S. The nucleotide frequency distribution F is then obtained at each position p ∈ [1, M]. The frequency distribution matrix F is represented by:

$$F = \begin{bmatrix} f_{A1} & f_{A2} & \cdots & f_{AM} \\ f_{T1} & f_{T2} & \cdots & f_{TM} \\ f_{G1} & f_{G2} & \cdots & f_{GM} \\ f_{C1} & f_{C2} & \cdots & f_{CM} \end{bmatrix} \qquad (2)$$

where each frequency is the number of sequences carrying nucleotide x at position p:

$$f_{xp} = \sum_{i=1}^{N} \mathbf{1}(s_{i,p} = x), \quad x \in \{A, T, G, C\} \qquad (3)$$

The decision rules are created to distinguish species and divide them at each step into two subgroups based on the score of each position of the sequences. The scores of all positions are represented by:

$$\mathrm{SCORE} = [\mathrm{score}_1, \mathrm{score}_2, \ldots, \mathrm{score}_M] \qquad (4)$$

where the estimated value at position p, namely score_p, is calculated as:

$$\mathrm{score}_p = (\mathrm{mid}_p - \mathrm{diff}_p) \times \mathrm{weight}_p \qquad (5)$$

where mid_p indicates the middle integer, i.e., the integer value of half of the number of sequences (species) in each subgroup:

$$\mathrm{mid}_p = \left\lfloor \frac{\text{number of sequences in the node}}{2} \right\rfloor \qquad (6)$$

and diff_p is a parameter which balances the data for generating approximately equally sized subgroups. Therefore, biallelic loci with almost equal frequency for each allele get the highest scores and are selected to divide the data into two subgroups. The mid_p value is used to distribute all sequence data into two subgroups. Because SNPs at different positions may have the same score through diff_p (formula 7), our proposed methodology selects the first appearing SNP, proceeding from the lowest to the highest nucleotide position. For example, if there are four sequences in a given subgroup, the best case is that two are assigned to the left subgroup and the other two to the right subgroup. Accordingly, diff_p is calculated as (min denotes the minimum value):

$$\mathrm{diff}_p = \min\left(|\mathrm{mid}_p - f_{Ap}|,\ |\mathrm{mid}_p - f_{Tp}|,\ |\mathrm{mid}_p - f_{Gp}|,\ |\mathrm{mid}_p - f_{Cp}|\right) \qquad (7)$$

Moreover, two different nucleotide types make it easier to sort the sequences into two subgroups for tree construction. Three or four nucleotide types are more complex and require more tree lineages. Accordingly, the weighting system (formula 8) of the DCST method emphasizes positions with two nucleotide types and assigns them the highest weight.
Non-polymorphic loci are not considered in this method, and hence they are given a score of 0. The weight_p is defined by formula 8, which assigns a weight of 1 to positions with two nucleotide types and a lower weight (e.g., 0.66 for three nucleotide types) to positions with more types. The species are separated into two subgroups according to the score_p estimated for each position. The remaining subgroups at different levels are separated in the same way until all the species are assigned a unique tag. The above steps are summarized as pseudocode in Figure 1.
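The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation from Figure 1: the function names `best_split`, `select_positions`, and `dcst_tags` are hypothetical, the score is assumed to take the form (mid_p − diff_p) × weight_p, the weight for four nucleotide types (0.33) is an assumed value, and splitting a node into "sequences carrying the selected nucleotide" versus "all others" is one plausible reading of the two-subgroup rule. Positions are 0-based in the code.

```python
from collections import Counter

# Weights by number of nucleotide types at a position. The values 1 and 0.66
# are stated in the text; 0.33 for four types is an ASSUMPTION of this sketch.
WEIGHTS = {1: 0.0, 2: 1.0, 3: 0.66, 4: 0.33}


def best_split(seqs):
    """Return (score, position, nucleotide) for the best-scoring site, or None."""
    mid = len(seqs) // 2                       # middle integer of the node
    best = None
    for p in range(len(seqs[0])):
        freq = Counter(s[p] for s in seqs)     # nucleotide counts at position p
        weight = WEIGHTS.get(len(freq), 0.0)
        if weight == 0.0:
            continue                           # non-polymorphic site: score 0
        # Nucleotide whose count is closest to mid (ties broken alphabetically).
        nt = min(sorted(freq), key=lambda x: abs(mid - freq[x]))
        diff = abs(mid - freq[nt])
        score = (mid - diff) * weight          # ASSUMED form of the score
        # Strict '>' keeps the lowest position when scores are tied.
        if best is None or score > best[0]:
            best = (score, p, nt)
    return best


def select_positions(seqs, chosen=None):
    """Split recursively until every subgroup holds a single sequence."""
    chosen = set() if chosen is None else chosen
    if len(seqs) <= 1:
        return chosen
    split = best_split(seqs)
    if split is None:
        raise ValueError("identical sequences cannot be separated")
    _, p, nt = split
    chosen.add(p)
    select_positions([s for s in seqs if s[p] == nt], chosen)
    select_positions([s for s in seqs if s[p] != nt], chosen)
    return chosen


def dcst_tags(named_seqs):
    """Return the selected positions and the tag of each named sequence."""
    positions = sorted(select_positions(list(named_seqs.values())))
    tags = {name: "".join(seq[p] for p in positions)
            for name, seq in named_seqs.items()}
    return positions, tags


# Toy data (hypothetical, not taken from the study).
toy = {"S1": "CAGT", "S2": "CAGA", "S3": "TAGT", "S4": "TCGA"}
positions, tags = dcst_tags(toy)
print(positions, tags)
```

On the toy data, the first split uses position 0 (two C and two T, a perfectly balanced biallelic site), and each two-sequence subgroup is then resolved by one further position, so every sequence ends with a distinct three-nucleotide tag.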
The flowchart of the DCST approach is shown in Figure 2. For example, the "data" contain 8 sequences (species) with a length of 13 nucleotides. The frequency distribution F is counted from "data" (see formulas 2 and 3) and the SCORE (score_p) values are calculated (see formulas 4 to 8). The first group has 8 sequences (species) at positions p_1 and p_8; therefore, mid_1 and mid_8 are 8/2 = 4 (formula 6), and diff_1 and diff_8 are calculated with formula 7. There are two nucleotide types at p_1 (C and T) and three types at p_8 (A, C, and G); hence weight_1 is 1 and weight_8 is 0.66 (formula 8). The scores are then calculated with formula 5. This way we obtain the scores of positions p_1∼p_8, shown in Figure 2, and position p_1 has the maximum score in the first group. All sequences are divided into subgroups with "up" and "down" sides as branches related to nucleotides (e.g., C and T). Each subgroup then follows the same procedure until the end (i.e., the 7th group). This way the positions p_1, p_2, p_3, p_4, and p_7 are found. In this example, the positions p_3 and p_4 are each chosen twice, i.e., in the 2nd/6th groups and the 3rd/5th groups, respectively. Therefore, much shorter informative barcode sequences become available using DCST. Unique tags are generated when each species is separated. Here, we use standard Code 128 one-dimensional barcodes to display each tag; the images are generated with a one-dimensional barcode image creator package called python-barcode 0.8.1. Code 128 is a one-dimensional barcode standard that encodes alphanumeric or numeric-only data.
FIGURE 2 | Flowchart of the DCST approach. This example shows how the DCST approach operates. S1-S8 indicate eight sequences from eight species. At each level, the sequences are subgrouped according to the scoring system of the DCST approach, i.e., they are divided into two parts at the nucleotide position with the highest score. Sometimes, the same position may be chosen several times depending on the scores.

Retrieval of COI Sequences
In this study, we retrieved 126 COI sequences of the fish order Scombriformes from GenBank. The 126 original COI sequences are shown in Figure 3 (the full original data set is available at http://shorturl.at/ayEJ2).

Alignment of COI Sequences
After performing multiple sequence alignment using the ClustalW method in MEGA 7 software (Kumar et al., 2016), the resulting 126 aligned COI sequences are shown in Figure 4 (the full aligned data set is available at http://shorturl.at/tBMVW).
FIGURE 3 | Original COI sequences (n = 126) of the fish order Scombriformes (Teleostei). This excerpt shows the 1st to 10th and 117th to 126th species and the 1st to 50th and 640th to 668th positions. The full original sequences for all species are available from http://shorturl.at/tBMVW.
FIGURE 4 | 126 aligned COI sequences of the fish order Scombriformes (Teleostei). This excerpt shows the 1st to 10th and 117th to 126th species and the 1st to 50th and 640th to 668th positions. The full original sequences for all species are available from http://140.127.112.213/DNA_barcode/download/Scombriformes_COI_aligned.tar.

Trimming of COI Sequences
Positions 1 to 35 and 673 to 696 of the 126 aligned COI sequences, i.e., the protruding 5′ and 3′ ends, were trimmed, as shown in Figure 5 (the fully trimmed data set is available at http://shorturl.at/tTU04). Counting from the trimmed sequences, 281 SNPs were identified.
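The trimming and SNP-counting step can be illustrated with a short Python sketch. It assumes the alignment is already available as equal-length strings; the helper names `trim_alignment` and `snp_positions` and the toy sequences are hypothetical and are not taken from the study. For the data set described here, the retained columns would be those between the trimmed ends (start = 36, end = 672).

```python
def trim_alignment(aligned, start, end):
    """Keep the 1-based alignment columns start..end (inclusive)."""
    return [seq[start - 1:end] for seq in aligned]


def snp_positions(aligned):
    """Return the 1-based columns where more than one character occurs.

    Gap characters would count as a distinct type, so this sketch assumes
    the alignment has been trimmed to gap-free columns.
    """
    return [i + 1 for i, column in enumerate(zip(*aligned))
            if len(set(column)) > 1]


# Toy alignment with protruding ends marked by '-'.
aligned = ["--ACGTA-", "TTACGAAC", "-TACCTA-"]
trimmed = trim_alignment(aligned, 3, 7)
snps = snp_positions(trimmed)
print(trimmed, snps)
```

Only the variable columns returned by `snp_positions` need to be passed on to the decision process; the conserved columns carry no information for discriminating the sequences.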

Decision Process of COI Sequences
The decision process was created according to the decision rule, and each unique tag was generated from the selected positions (shown in Figure 6). Figure 6 shows the selected nucleotide position at each node, and all tags were collected and arranged from each node. Consequently, the original COI sequences of 636 bp length were reduced to specific COI-SNP tags of only 49 bp length. Accordingly, our proposed DCST approach can effectively obtain shorter tags from COI sequences.

Species-Tag Barcode Generation of COI Sequences
One-dimensional barcodes were generated from these unique tags (shown in Figure 7; the full tags of one-dimensional barcodes for 126 scombriform species are available at http://shorturl.at/szJL1). These one-dimensional barcode images of tags allow information retrieval with a barcode scanner for technical and scientific applications.

DISCUSSION
The original concept of "DNA barcoding" was to identify and discriminate between species by different genetic tags or markers. After a long search for the most informative gene sequence, the mitochondrial COI gene was found to be the most informative in animals at the species level. Besides taxonomic identification, it has recently been commonly used in evolutionary and ecological studies (Hebert et al., 2003; DasGupta et al., 2005; Meier et al., 2006; Austerlitz et al., 2009; Kress et al., 2015).
FIGURE 5 | Trimmed COI sequences (n = 126) of the fish order Scombriformes (Teleostei). This excerpt shows the 1st to 10th and 117th to 126th species and the 1st to 50th and 580th to 636th positions. The reference sequence is listed at the top of the figure.
Several applications of machine learning have been developed in DNA barcoding taxonomy. For example, the BPSI2.0 interface program (Zhang and Savolainen, 2009), developed by Zhang and collaborators, is based on a back-propagation neural network for species identification. Weitschek et al. (2013) proposed a machine learning approach for species classification called BLOG 2.0 (Barcoding with LOGic), which is based on character-based DNA barcode sequences. Supervised machine learning methods were later applied to DNA barcodes for species classification (Weitschek et al., 2014); that study collected eight datasets of DNA barcode sequences and used four classifiers for classification analysis. These approaches have in common that the classification model is built from a training data set and is then evaluated on testing data to assess the model performance.
However, our proposed DCST is different from these classification models (Zhang and Savolainen, 2009; Weitschek et al., 2013, 2014), for which a large training data set of sequences is necessary to validate the model before it can be applied to the test data. DCST arranges a short DNA barcode into a shorter DNA tag, which comes closer to the barcoding idea originally developed by Hebert et al. (2003). We propose here a DCST approach that generates an evolutionary COI-based identification system that provides even shorter sequences for species tagging.
As for the decision rule of DCST, we discuss two extreme cases caused by different designs. In case one, we search the positions sequentially and split at the first position p where a different nucleotide is met. This case shows a disordered outcome and an indefinite rule, leading to uncertainty or imbalance in the number of sequences in the branches of the tree (Figure S1). In case two, we search for one of the nucleotides of maximum divergence at each position; the result shows a skewed outcome leading to an imbalanced tree (Figure S2). Although those two cases can generate unique DNA tags, they cannot segregate the sequence data into approximately equally sized subgroups. In contrast, in the field of algorithms and data structures, a balanced tree is a simple way to achieve higher efficiency than imbalanced trees (Fleischer, 1996). In the present study, we used a balanced tree-based simple decision theory to arrange the species by COI barcoding systematically. Accordingly, the balanced tree algorithm DCST is theoretically more effective than the imbalanced tree methods (Figures S1, S2). Like the decision tree, the time complexity of DCST is O(N×M×D), where N is the number of samples, M is the nucleotide length, and D is the depth of the tree (number of levels). Using 49 SNPs, the computational time for DCST to generate specific SNP species tags is 0.14693 ± 0.0016 s (mean ± SD; n = 30 runs), executed on an Intel Core i7-8750H 2.20 GHz personal computer with 16 GB RAM. The length of the sequences ranges from 648 bp to 685 bp, which gives approximately 4^650 possible ATGC combinations that would allow over 10 million species with unique DNA tags. Our proposed DCST method can, therefore, efficiently obtain shorter DNA barcodes for species tagging. The obtained DNA tags can reduce data storage significantly compared to the full-length COI sequence.
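As a rough check on the orders of magnitude mentioned above, the following Python snippet computes two quantities. It assumes a sequence length of 650 nucleotides and an idealized, perfectly balanced binary tree; the depth obtained this way is a lower bound for any binary tree with 126 leaves and is not a statement about the actual tree in Figure 6.

```python
import math

n_species = 126   # number of species to separate
seq_len = 650     # assumed sequence length in nucleotides

# Minimum depth of a binary tree that separates n_species leaves.
balanced_depth = math.ceil(math.log2(n_species))

# Number of decimal digits (order of magnitude) of 4**seq_len.
log10_combinations = seq_len * math.log10(4)

print(balanced_depth, round(log10_combinations, 1))
```

The number of possible sequences, 4^650, has an exponent of several hundred in base 10, so it exceeds ten million by a very wide margin, while a balanced binary tree needs only a handful of levels to separate 126 species.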
It is possible that multiple candidates for diff_p (formula 7) have the same score. For example, if there are 3 C, 3 T, and 2 A nucleotides in a node (3 + 3 + 2 = 8 sequences), the difference is 1 or 2, i.e., diff_p is the minimum of |mid_p − f_Cp| = |8/2 − 3| = 1, |mid_p − f_Tp| = |8/2 − 3| = 1, and |mid_p − f_Ap| = |8/2 − 2| = 2. In this case, both C and T have the same score for selection and may be candidates for SNP barcoding. Both of them are theoretically suitable for the subsequent step of our proposed DCST method, although different SNP barcode patterns may be generated. For convenience, the SNP is selected proceeding from the lowest to the highest nucleotide position in the DCST method. Once the SNP is selected, the procedure stops and goes to the next subgrouping process.
A limitation of the DCST approach for tagging species is that it can only discriminate known species with known barcode sequences. However, DCST can still be applied to any other barcode sequence, such as the nuclear ribosomal internal transcribed spacer (ITS) (Seifert, 2009; Schoch et al., 2012) for fungi and ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco) and maturase K (matK) (Dong et al., 2014) for plants. Moreover, the DCST approach can be applied to sequence data retrieved by Next Generation Sequencing (NGS). NGS offers high-throughput nucleotide sequencing for DNA/RNA molecules (Metzker, 2010). Recently, NGS has been applied to metagenomics (Roumpeka et al., 2017). NGS-profiling metagenomics may identify all species existing in a given environment. Using our proposed DCST approach, species-specific sequences may be processed to generate species-specific SNP barcodes for tagging species in metagenomics. Suitable SNPs from different positions are selected for species tagging in our proposed DCST system. However, the DCST system does not consider the distances between the selected SNPs. Therefore, the DCST system does not calculate evolutionary distances and is unsuitable for phylogenetic analysis. The tree generated in Figure 6 merely demonstrates that the species in the collected data set have very close relationships with very similar sequences.
FIGURE 6 | Tree-like structure outcome. This figure shows the selected position numbers and nucleotide information for tagging SNPs in 126 scombriform fishes. On the left side, the position number within parentheses refers to the position in the reference sequence (Ariomma bondi; KT883659.1). For example, CT(243) indicates that the nucleotide at the 243rd position is selected as a node to separate two subgroups. The right side shows the shorter tags derived from the COI sequences for each species. On the right side, the 1st nucleotide of the driftfish A. bondi has the 8th position in the original sequence KT883659.1 of A. bondi.
FIGURE 7 | DNA tag barcode of B. dussumieri. As an example, a DNA tag barcode is generated for the purpose of fast and precise identification in the teleost goby Boleophthalmus dussumieri.
A practical laboratory application of this DCST system is to provide a platform for SNP arrays, which allow fast and specific SNP genotyping. Here, the SNPs belonging to COI-SNP based species tags can be genotyped individually and simultaneously. This allows species identification by comparison with the DCST-generated COI-SNP based species tags. For example, Arrayed Primer Extension (APEX) is an array-based detection method that can analyze thousands of SNPs in a candidate region (Pullat and Metspalu, 2008). After processing with an array scanner, the SNP pattern is generated and the species may be recognized immediately by checking the species-specific SNP pattern. In contrast, single-gene PCR followed by sequencing needs a DNA sequencing machine and a bioinformatics BLAST search. Although both the full sequence of a single locus and an array assay of DCST-generated SNPs can identify a species, the DCST-generated SNP barcode is more suitable for species-tag barcode generation because only a few SNPs (∼49 bp) are needed rather than the full length of the COI sequence (∼650 bp). In other words, 49 SNPs take only 49 line codes, whereas the full length needs 650 line codes. Moreover, SNPs may be spread across different genes for advanced species tagging in the future. In that case, full-length sequencing of different genes cannot be performed in the same reaction, whereas array detection can.
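The comparison of a genotyped SNP pattern with stored species tags can be sketched as a simple lookup in Python. The function name `identify`, the 0-based position list, and the species names and tags below are hypothetical placeholders for illustration only; they are not data from this study.

```python
def identify(query_seq, positions, tags):
    """Return the names whose tag equals the query at the tagged positions."""
    query_tag = "".join(query_seq[p] for p in positions)
    return [name for name, tag in tags.items() if tag == query_tag]


# Hypothetical tag table: 0-based SNP positions and one tag per species.
positions = [0, 1, 3]
tags = {"species_A": "CAT", "species_B": "CAA", "species_C": "TAT"}

match = identify("CAGA", positions, tags)      # nucleotides C, A, A at the SNPs
no_match = identify("GGGG", positions, tags)   # pattern not in the table
print(match, no_match)
```

An empty result reflects the limitation noted in the discussion: a pattern that does not correspond to any known tag cannot be assigned to a species.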

CONCLUSION
The full-length COI sequence provides commonly accepted information for phylogenetic and evolutionary studies. However, the full-length sequence contains mostly non-variable nucleotides and only a few SNPs. The DCST approach, proposed here for the first time, ignores the non-variable nucleotides by means of a scoring system and provides a format for arranging the SNP pattern for the identification of different fish species. In this way, we provide a decision-based COI SNP tagging (DCST) approach in which the COI nucleotide sequence (∼650 bp) is effectively reduced to a shorter COI-SNP barcode (49 bp) for the most informative discrimination of 126 scombriform fish species.

AUTHOR CONTRIBUTIONS
L-YC and H-WC conceived and designed the research and wrote the paper. C-HY instructed K-CW for algorithm processing. K-CW also contributed to sequence retrieval. C-HY and H-WC revised the paper. All authors read and approved the final manuscript.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00259/full#supplementary-material
Figure S1 | Sequential searching for SNPs is designed to subgroup the COI sequences at each level. In this case (case I), sequential searching is designed to find the diallelic type of SNP at each homologous position and perform subgrouping based on the alternative nucleotides at this SNP. However, in contrast to our proposed DCST method, this case does not consider the nucleotide distribution. For example, we found that the nucleotide at the first position (nt 1) was a SNP and the sequences were separated into two subgroups based on this SNP (T/C) at level 1, i.e., S1, S2, and S4 (T) are allocated to the top side and S3, S5, S6, and S7 (C) are allocated to the bottom side. On the top side of level 2, the second nucleotide (nt 2) is not a SNP and is skipped. Then, the third nucleotide (nt 3) is a SNP and the sequences were separated into two subgroups based on this SNP (C/T) at level 2, i.e., S2 (C) is allocated to the top side and S1 and S4 (T) are allocated to the bottom side. Subgrouping for the other levels follows the same rule as mentioned above.
Figure S2 | Unique searching for SNPs is designed to subgroup the COI sequences at each level. In this case (case II), unique searching is designed to find a SNP at which a single sequence carries a unique nucleotide; that sequence forms one subgroup and the other sequences are processed in the next unique search. For example, the first nucleotide (nt 1) does not show one unique nucleotide, i.e., there are 3 T and 5 C. Subsequently, the unique searching goes to the second nucleotide. We found that the second position (nt 2) of S3 (T) is unique compared to the others (C) at level 1, i.e., S3 (T) is allocated to the top side and the others (C) are allocated to the bottom side. Subgrouping for the other levels follows the same rule as mentioned above.