Decision Theory-Based COI-SNP Tagging Approach for 126 Scombriformes Species Tagging

Yang, Cheng-Hong; Wu, Kuo-Chuan; Chuang, Li-Yeh; Chang, Hsueh-Wei

doi:10.3389/fgene.2019.00259

ORIGINAL RESEARCH article

Front. Genet., 03 April 2019

Sec. Computational Genomics

Volume 10 - 2019 | https://doi.org/10.3389/fgene.2019.00259

Decision Theory-Based COI-SNP Tagging Approach for 126 Scombriformes Species Tagging

Cheng-Hong Yang^1,2

Kuo-Chuan Wu^1,3

Li-Yeh Chuang⁴^*

Hsueh-Wei Chang^5,6,7^*

¹Department of Electronic Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
²Biomedical Engineering, Kaohsiung Medical University, Kaohsiung, Taiwan
³Department of Computer Science and Information Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
⁴Department of Chemical Engineering and Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan
⁵Institute of Medical Science and Technology, National Sun Yat-sen University, Kaohsiung, Taiwan
⁶Department of Medical Research, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
⁷Department of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Kaohsiung, Taiwan

The mitochondrial gene cytochrome c oxidase I (COI) is commonly used for DNA barcoding in animals. However, most of the COI barcode nucleotides are conserved and sequences longer than about 650 base pairs increase the computational burden for species identification. To solve this problem, we propose a decision theory-based COI SNP tagging (DCST) approach that focuses on the discrimination of species using single nucleotide polymorphisms (SNPs) as the variable nucleotides of the sequences of a group of species. Using the example of 126 teleost mackerel fish species (order: Scombriformes), we identified 281 SNPs by alignment and trimming of their COI sequences. After decision rule making, 49 SNPs in 126 fish species were determined using the scoring system of the DCST approach. These COI-SNP barcodes were finally transformed into one-dimensional barcode images. Our proposed DCST approach simplifies the computational complexity and identifies the most effective and fewest SNPs to resolve or discriminate species for species tagging.

Introduction

The original concept of DNA barcoding was proposed to identify and discriminate a given species by a unique DNA sequence (Hebert et al., 2003). Such a DNA sequence aims at tagging species like a barcode. It is designed to identify a species from known DNA barcode sequences in a database. The commonly used DNA barcode of animal species is the mitochondrial gene cytochrome c oxidase I (COI) with a length of about 650 base pairs (bps). Meanwhile, COI sequences are also used for evolutionary and ecological studies (Hebert et al., 2003; DasGupta et al., 2005; Meier et al., 2006; Austerlitz et al., 2009; Kress et al., 2015; Park et al., 2018).

However, most nucleotides of the COI gene are conserved among different species except a minor proportion representing single nucleotide polymorphisms (SNPs). Several disease studies have used specific SNP to predict the predisposition for disease and the effects of therapeutic approaches. This concept has rarely been used for tagging species or improving the information content of DNA barcode sequences. The major benefit of using SNPs is the reduction of computational burden by removing the more abundant, non-informative, identical homologous nucleotides.

As an example, the tagging of fish species is not optimized as yet with respect to informative DNA barcoding. Some fish species have very similar morphology and it is difficult to distinguish those similar species, especially for marketing, conservation, and forensic purposes. Seafood mislabeling or fraud is a common societal and legal problem in fish trading (Sarmiento-Camacho and Valdez-Moreno, 2018) and the seafood economy (Vandamme et al., 2016; Willette et al., 2017). Currently, DNA barcoding is a reliable system for species identification and authentication and it is necessary to apply barcoding to many fish species (Liu et al., 2013; Vandamme et al., 2016; Willette et al., 2017; Sarmiento-Camacho and Valdez-Moreno, 2018). However, the COI sequences (~650 bp) are largely uninformative and too long for an optimized application for the above purposes.

In the present study, we follow the original concept of DNA barcoding to develop a decision theory-based COI SNP tagging (DCST) approach where only the variable nucleotides (SNPs) of a given COI barcode sequence is applied for the tagging of fish species. The Fish Barcode of Life Initiative (FISH-BOL) (Ward et al., 2009) provides a public database for DNA barcode sequences with images, and geospatial information for almost 10,000 fish species (Becker et al., 2011).

We use the idea of decision theory (Quinlan, 1986; Berger, 2013; Fernandez Slezak et al., 2018) to determine which sites (nucleotides) of DNA sequences are selected to discriminate between species. These are used to generate the unique DNA tags for classification. Using the DCST approach, SNPs are extracted from COI sequences to generate a SNP-based COI pattern. Finally, the SNP-COI pattern is transformed into a one-dimensional sequence barcode.

The major aim of our proposed DCST approach is to provide an effective identification tool by generating an SNP-COI barcode. Here we apply this to the example of 126 scombriform fishes.

Materials and Methods

Sampling and Data Pre-processing

We retrieved the COI sequences from 126 species of the bony fish (Teleostei) order Scombriformes that include representatives of the following families: Ariommatidae, Arripidae, Bramidae, Caristiidae, Centrolophidae, Chiasmodontidae, Gempylidae, Icosteidae, Nomeidae, Pomatomidae, Scombrinae, Scombrolabracidae, Scombropidae, Stromateidae, Tetragonuridae, and Trichiuridae. The sequence data, ranging from 648 to 685 base pairs (bp) in lengths, were obtained from GenBank. Details of the family name, species name, sequence length, and accession number are shown in Table 1. COI sequences (n = 126) from these scombriform fishes were aligned using the ClustalW tool in MEGA 7 software (Kumar et al., 2016). Subsequently, the 5′ and 3′ protruding sequences were trimmed to gain the same length of COI sequences.

TABLE 1

Table 1. 126 COI sequences of the fish order Scombriformes from GenBank.

Decision-Based COI SNP Tagging (DCST)

Decision theory (Berger, 2013) improves a decision-maker's choice among a set of alternatives that need to be considered. Most of decision theory is normative, prescriptive and descriptive that provides a decision that is completely rational, has perfect accuracy and easy understanding. Possible alternatives and outcomes are considered as follows: Step (1) clearly define the given problem, step (2) organize all the possible alternatives, step (3) be aware of all possible outcomes, step (4) consider the benefits of each alternative and outcome, step (5) create a mathematical decision theory rule model, and step (6) make a decision by evaluating the models.

Based on such understood decision making, we propose here an approach for DNA barcoding that generates shorter DNA barcodes. We here call a decision theory-based COI-SNP tagging (DCST) approach. Given an N×M matrix of sequence data, S is described as:

\begin{array}{l} S = [\begin{matrix} \begin{matrix} \begin{matrix} \begin{matrix} s_{1, 1} \\ s_{2, 1} \end{matrix} \\ \begin{matrix} ⋮ \\ s_{N, 1} \end{matrix} \end{matrix} \begin{matrix} \begin{matrix} s_{1, 2} \\ s_{2, 2} \end{matrix} \\ \begin{matrix} ⋮ \\ s_{N, 2} \end{matrix} \end{matrix} \begin{matrix} \begin{matrix} s_{1, 3} \\ s_{2, 3} \end{matrix} \\ \begin{matrix} ⋮ \\ s_{N, 3} \end{matrix} \end{matrix} \end{matrix} \begin{matrix} \begin{matrix} \begin{matrix} \dots \\ \dots \end{matrix} \\ \begin{matrix} ⋱ \\ \dots \end{matrix} \end{matrix} \begin{matrix} \begin{matrix} s_{1, M} \\ s_{2, M} \end{matrix} \\ \begin{matrix} ⋮ \\ s_{N, M} \end{matrix} \end{matrix} \end{matrix} \end{matrix}] & (1) \end{array}

where N is the number of sequences from each species and M is the nucleotide length. There are four nucleotide types A, T, G, and C in the matrix S. Then the nucleotide frequency of distribution F is obtained in each position pε [1, M]. The frequency distribution matrix F is represented by:

\begin{array}{l} F = [\begin{matrix} f_{A 1} f_{A 2} f_{A 3} & \dots & f_{A M} \\ f_{C 1} f_{C 2} f_{C 3} & \dots & f_{C M} \\ f_{G 1} f_{G 2} f_{G 3} & \dots & f_{G M} \\ f_{T 1} f_{T 2} f_{T 3} & \dots & f_{T M} \end{matrix}] & (2) \end{array}

where each frequency is calculated as follows:

\begin{array}{l} f_{i p, i \in {A, C, G, T}} = \sum_{k}^{N} (x_{k, p} | i) & (3) \end{array}

The decision rules are created to distinguish species and divide them with each step into two subgroups based on the score of each position of sequences. The calculation of score in each position is represented by:

\begin{array}{l} SCORE = [\begin{matrix} s c o r e_{1} & s c o r e_{2} & {score}_{3} & \dots & s c o r e_{M} \end{matrix}] & (4) \end{array}

where the estimated value at the position p, namely score_p is calculated as:

\begin{array}{l} s c o r e_{p} = \frac{m i d_{p} - d i f f_{p}}{m i d_{p}} + w e i g h t_{p} & (5) \end{array}

where mid_p indicates the middle integer, i.e., the integer value of half of the number of sequence data (species number) in each subgroup,

\begin{array}{l} m i d_{p} = ⌊ \frac{number of data set in node}{2} ⌋ & (6) \end{array}

and diff_p is a parameter which balances the data for generating approximately equally sized subgroups. Therefore, biallelic loci with almost equal frequency for each allele get the highest scores and are selected to divide the data into 2 subgroups. The mid_p value is used to distribute all sequence data into two subgroups. For the equation for diffp (formula 7), our proposed methodology selects the first appearing SNP starting from the lowest to the highest order of nucleotide position although SNPs at different positions may have the same score. For example, there are four sequences in a given subgroup and the best case is that two data are assigned into the left subgroup and others are assigned to right subgroup. Accordingly, diff_p is calculated as (min denotes the minimum value):

\begin{array}{l} d i f f_{p} = min_{i \in {A, C, G, T}} {| m i d_{p} - f_{i p} |} & (7) \end{array}

Moreover, two different nucleotide types make it easier to sort the sequences into two subgroups for tree construction. Three or four nucleotide types are complex and require more tree lineages. Accordingly, the logic of the weighting system (formula 8) of the DCST method emphasizes the two nucleotide types and assigns the highest score among them. Non-polymorphic loci are not considered in this method, and hence they are given a score of 0. The weight_p is defined by:

\begin{array}{l} w e i g h t_{p} = {\begin{matrix} \begin{matrix} 0, if the number of identified nucleotide type is 1 \\ 1, if the number of identified nucleotide types is 2 \end{matrix} \\ \begin{matrix} 0.66, if the number of identified nucleotide types is 3 \\ 0.33, if the number of identified nucleotide types is 4 \end{matrix} \end{matrix} & (8) \end{array}

The species can be separated into two subgroups according to the score estimation for each score_p. The remaining subgroups at different levels are separated in the same way, and all the species are assigned a unique tag. The above step generates a pseudocode (Figure 1).

FIGURE 1

Figure 1. Pseudocode of the DCST approach.

The flowchart of the DCST approach is shown in Figure 2. For example, the “data” contain 8 sequences (species) with the length for 13 nucleotides. The frequency distribution F is counted from “data” (see formula 2 and 3) and the SCORE (score_p) are calculated (see formula 4~8). The positions p₁ and p₈ at the first group has 8 sequences (species), therefore, the mid₁ and mid₈ are $⌊ \frac{8}{2} ⌋ = 4$ (formula 6) and the diff ₁ and diff ₈ are calculated as follows (formula 7):

\begin{array}{l} d i f f_{1} = \min {\begin{matrix} \begin{matrix} f_{A 1} = | 4 - 0 | = 4 \\ f_{C 1} = | 4 - 4 | = 0 \end{matrix} \\ \begin{matrix} f_{G 1} = | 4 - 0 | = 4 \\ f_{T 1} = | 4 - 4 | = 0 \end{matrix} \end{matrix} = 0 \end{array}

and

\begin{array}{l} d i f f_{8} = \min {\begin{matrix} \begin{matrix} f_{A 8} = | 4 - 6 | = 2 \\ f_{C 8} = | 4 - 1 | = 3 \end{matrix} \\ \begin{matrix} f_{G 8} = | 4 - 1 | = 3 \\ f_{T 8} = | 4 - 0 | = 4 \end{matrix} \end{matrix} = 2 \end{array}

where there are two types in p₁ (C and T) and three types in p₈, (A, C, and G) hence weight₁ is 1 and weight₈ is 0.66 (formula 8). The scores are calculated as follows (formula 5):

\begin{array}{l} s c o r e_{1} = \frac{4 - 0}{4} + 1 = 2.0 \end{array}

and

\begin{array}{l} s c o r e_{8} = \frac{4 - 2}{4} + 0.66 ≅ 1.2 \end{array}

FIGURE 2

Figure 2. Flowchart of the DCST approach. This is an example to show how DCST approach operates. S1–S8 indicates eight sequences from eight species. In each level, the sequences are subgrouped according to the score system of DCSF approach, i.e., the nucleotides with the highest score are divided into two parts. Sometimes, the nucleotides at the same position may be chosen several times depending on the score performance.

This way we can get all scores of positions p₁~p₈, shown in Figure 2, and the maximum score in position p₁ is calculated in the first group. All sequences are divided into subgroups with “up” and “down” sides as branches related to nucleotides (e.g., C and T). Then, the sub-group follows the same procedure as mentioned above until the end (i.e., 7th group). This way the positions p₁, p₂, p₃, p₄, and p₇ are found. In this example, the positions, p₃ and p₄, are chosen twice, i.e., 2nd group/6th group and 3rd group/5th group. Therefore, much shorter informative barcode sequences become available using DCST.

Unique tags are generated when each species gets separated. Here, we use the code 128 (standard) of one dimensional barcodes to display each tag which is generated from a one dimension barcode image creator package called python-barcode 0.8.1. The standard code 128 in a one dimension barcode is an alphanumerical or numerical-only tool to generate barcode images.

Results

Retrieval of COI Sequences

In this study, we retrieved 126 COI sequences of the fish order Scombriformes from GenBank. The 126 original COI sequences are shown in Figure 3 (the full original data set is available at http://shorturl.at/ayEJ2).

FIGURE 3

Figure 3. Original COI sequences (n = 126) of the fish order Scombriformes (Teleostei). This is an example of a group of species and sequences that shows 1st to 10th, 117th to 126th species and 1st to 50th, 640th to 668th position, respectively. The full original sequences for all species are available from http://shorturl.at/tBMVW.

Alignment of COI Sequences

After performing multiple sequence alignments using the clustalW method in MEGA 7 software (Kumar et al., 2016), the resulting 126 aligned COI sequences are shown in Figure 4 (the full aligned data set is available at http://shorturl.at/tBMVW).

FIGURE 4

Figure 4. 126 aligned COI sequences of the fish order Scombriformes (Teleostei). This is an example of a group of species and sequences that shows 1st to 10th, 117th to 126th species and 1st to 50th, 640th to 668th position, respectively. The full original sequences for all species are available from http://140.127.112.213/DNA_barcode/download/Scombriformes_COI_aligned.tar.

Trimming of COI Sequences

The position 1 to 35 and 673 to 696 of 126 aligned COI sequences are trimmed (i.e., protruding the 5′ and 3′ ends of sequence) that is shown as Figure 5 (the fully trimmed data set is available at http://shorturl.at/tTU04). Counting from the trimmed sequences, 281 SNPs were identified.

FIGURE 5

Figure 5. Trimmed COI sequences (n = 126) of the fish order Scombriformes (Teleostei). This is an ellipsis of part of species and sequences that shows 1st to 10th, 117th to 126th species and 1st to 50th, 580th to 636th position, respectively. The reference sequence listed at the top one of figure is derived from the accession number KT883659.1 for A. bondi. The 1st position of A. bondi at the top of this figure is the 8th position of KT883659.1 for A. bondi. The full original sequences for all species are available from http://140.127.112.213/DNA_barcode/download/Scombriformes_COI_trimmed.tar.

Decision Process of COI Sequences

The decision process was created according the decision rule, and each unique tag was generated from each selected position (shown in Figure 6). Figure 6 shows ith position of nucleotides in each node, and all tags were collected and arranged from each node. Consequently, the original data of COI sequences with 636 bp length were curtailed into specific COI-SNP of only 49 bp length. Accordingly, our proposed DCST approach can effectively obtain shorter tags from COI sequences.

FIGURE 6

Figure 6. Tree-like structure outcome. This figure shows the selected position number and information of nucleotides for tagging SNP in 126 scombriform fishes. On the left side, the number of position within parentheses refers to the position of the reference sequence (Ariomma bondi; KT883659.1). For example, CT(243) indicates the nucleotide at the 243th position being selected as a node to separate two subgroups. It also shows the shorter tags from DNA COI sequences for each species on the right side. On the right side, the 1st nucleotide of the driftfish A. bondi has the 8th position in the original sequence KT883659.1 of A. bondi.

Species-Tag Barcode Generation of COI Sequences

One-dimensional barcodes were generated from these unique tags (shown as Figure 7, the full tags of one dimensional barcodes for 126 scombriform species are available at http://shorturl.at/szJL1). These one-dimension barcode images of tags allow information retrieval with a barcode scanner for technical and scientific applications.

FIGURE 7

Figure 7. DNA tag barcode of B. dussumieri. As an example, a DNA tag barcode is generated for the purpose of fast and precise identification in the teleost goby Boleophthalmus dussumieri.

Discussion

The original concept of “DNA barcoding” was thought to identify and discriminate between species by different genetic tags or markers. After a longer search for a most informative gene sequence, the mitochondrial COI gene was found to be most informative in animals at the species level. Besides for taxonomic identification purposes, it is commonly used recently in evolutionary and ecological studies (Hebert et al., 2003; DasGupta et al., 2005; Meier et al., 2006; Austerlitz et al., 2009; Kress et al., 2015).

Several applications of machine learning were developed in DNA barcoding taxonomy. For example, the BPSI2.0 interface program (Zhang and Savolainen, 2009) was developed by Zhang and collaborators which is based on back-propagation neural network for species identification. Weitschek et al. (2013) proposed a machine learning approach for species classification, called BLOG 2.0 (Barcoding with LOGic) which is based on character-based DNA barcode sequences. The supervised machine learning methods were later applied to DNA barcodes for species classification (Weitschek et al., 2014). They collected eight datasets of DNA barcode sequences and used four classifiers for classification analysis. The above approaches have in common, that the classification model builds up through a training data set, then it verifies testing data to assess the model performance.

However, our proposed DCST is different from the classification model “(Zhang and Savolainen, 2009; Weitschek et al., 2013, 2014) for which a for a large training data set of sequences is necessary to validate the model before it can be applied to the test data.” DCST arranges a short DNA barcode into a shorter DNA tag, which comes closer to the barcoding idea originally developed by Hebert et al. (2003). We propose here a DCST approach that generates an evolutionary COI-based identification system that provides even shorter sequences for the species tagging.

As for the decision rule of DCST, we will discuss two extreme cases caused by different designs. In case one, we search each position sequentially when a different nucleotide in p^th position is met the first time. This case shows a disordered outcome and indefinite rule leading to uncertainty or imbalance in the number of sequences in the branches of the trees (Figure S1). In case two, we search one of the nucleotides of maximum divergence in each position, its result shows a skewed outcome leading to imbalance tree (Figure S2). Although those two cases can generate unique DNA tags, they cannot segregate the sequence data for generating approximately equally sized subgroups. In contrast, the advantage of the balanced tree in algorithms and data structures area is the simple way to increase efficiency than other types of imbalance trees (Fleischer, 1996). In the present study, we used a balanced tree-based simple decision theory to arrange the species by COI barcoding systematically. Accordingly, the balanced tree algorithm DCST is theoretically more effective than the imbalanced tree methods (Figures S1, S2). Like the decision tree, the computational complexity time of DCST is O(N × M × D), where N is number of samples, M is the length of nucleotides, and D is the depth of tree (number of levels). Using 49 SNPs, the computational time for DCST to generate specific SNP species tags is 0.14693 ± 0.0016 s (mean ± SD; n = 30 runs) executed on an Intel Core i7-8750H 2.20GHz personal computer with 16 GB RAM. The length of sequences range from 648 bp to 685 bp which have approximately 4⁶⁵⁰ possible ATGC-combinations that would allow over 10 million species with unique DNA tags. Our proposed DCST method can, therefore, efficiently obtain shorter DNA barcode for species tagging. The obtained DNA tags can reduce data storage significantly compared to the full length COI sequence.

It is possible that multiple positions for diff_p (formula 7) may have the same score. For example, if there are 3 C, 3 T, and 2 A nucleotides in a node, the score is 1 or 2 where 3 C, 3 T, and 2 A = 8, i.e., diff_p = min for mid_C–f_Cp = | $⌊ \frac{8}{2} ⌋ - 3$ | = 1, mid_T–f_Tp = | $⌊ \frac{8}{2} ⌋ - 3$ | = 1, and mid_A–f_Ap = | $⌊ \frac{8}{2} ⌋ - 3$ | = 2. In this case, both C and T have the same score for selection and may be the candidates used for SNP barcoding. Both of them are theoretically suitable for the subsequent step of our proposed DCST method although different SNP barcode patterns may be generated. For convenience, the SNP is selected starting from the lowest to highest order of nucleotide position in the DCST method. Once the SNP is selected, then the procedure stops and goes to the next subgrouping process.

A limitation of the DCST approach for tagging species is that it is only used to discriminate the known species with known barcode sequences. However, DCST can still be applied to any other barcode sequence such as nuclear ribosomal internal transcribed spacer (ITS) (Seifert, 2009; Schoch et al., 2012) for fungi and ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco) and maturase K (matK) (Dong et al., 2014) for plants. Moreover, the DCST approach can be applied to the sequence data retrieved by Next Generation Sequencing (NGS). NGS offers high-throughput nucleotide sequencing for DNA/RNA molecules (Metzker, 2010). Recently, NGS has been applied to metagenomics (Roumpeka et al., 2017). NGS-profiling metagenomics may identify all species existing in a given environment. Using our proposed DCST approach, species-specific sequences may be processed to generate species-specific SNP barcodes for tagging species in metagenomics. Suitable SNPs from different positions are selected for species tagging in our proposed DCST system. However, the DCST system does not consider the distances between the selected SNPs. Therefore, the DCST system fails to calculate the evolutionary distance and is unsuitable for phylogenetic analysis. The tree generated in Figure 6 was just to demonstrate that the species in the collected data set have very close relationships with very similar sequences.

The practical application of this DCST system in a laboratory situation is to provide a platform for SNP arrays which allows fast and specific SNP genotyping. Here, SNPs belonging to COI-SNP based species-tags can be genotyped individually and simultaneously. These allow species identification by comparison with DCST-generated COI-SNP based species-tags. For example, Arrayed Primer Extension (APEX) is an array-based detection and can analyze thousands of SNPs in candidate region (Pullat and Metspalu, 2008). After processing to array scanner, the SNP pattern is generated and the species may be recognized immediately by checking the species-specific SNP pattern. In contrast, single gene PCR followed by sequencing needs a DNA sequencing machine and perform bioinformatics BLAST searching. Although both full sequence of a single locus and array assay of DCST-generated SNP can identify a species, DCST-generated SNP barcode is more suitable for species-tag barcode generation because few SNPs (~49 bp) are needed rather than full length of COI sequences (~650 bp). In other words, 49 SNPs only take 49 line codes but full length needs 650 line codes. Moreover, SNPs may spread out in different genes for the advanced species tagging in future. In this case, full length sequencing of different genes cannot be performed in the same reaction, however, array detection is allowed.

Conclusion

The COI sequence with full length provides commonly accepted information for phylogenetic and evolutionary studies. However, the full length sequence contains mostly non-variable nucleotides and only a few SNPs. Our for the first time proposed DCST approach ignores the non-variable nucleotides by a scoring system and provides a format for the arrangement of SNP pattern for the identification of different fish species. This way we provide a decision-based COI SNP tagging (DCST) approach where the COI nucleotide sequence (~650 bp) is effectively reduced to a shorter COI-SNP barcode (49 bp) for the most informative discrimination of 126 scombriform fish species.

Author Contributions

L-YC and H-WC conceived and designed the research and wrote the paper. C-HY instructed K-CW for algorithm processing. K-CW also contributed to sequence retrieval. C-HY and H-WC revised the paper. All authors read and approved the final manuscript.

Funding

This work was supported by funds of the Ministry of Science and Technology, Taiwan (MOST 106-2221-E-214-043, MOST 106-2221-E-151-009-MY2, MOST 105-2221-E-151-053-MY2, and MOST 107-2320-B-037-016).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2019.00259/full#supplementary-material

Figure S1. Sequential searching for SNP is designed to subgroup the COI sequences at each level. In this case (case I), sequential searching is designed to find the diallelic type of SNP at each homologous position and perform subgrouping based on alternative nucleotides at this SNP. However, this case does not consider the nucleotide distribution compared to our proposed DCST method. For example, we found the nucleotide at the first position (nt 1) was a SNP and these sequences were separated into two subgroups based on this SNP (T/C) at 1-level, i.e., S₁, S₂, S₄ (T) are allocated to the top side and S₃, S₅, S₆, S₇ (C) are allocated to the bottom side. In the top side of 2-level, the second nucleotide (nt 2) is not a SNP and is skipped. Then, the third nucleotide (nt 3) is a SNP and these sequences were separated into two subgroups based on this SNP (C/T) at 2-level, i.e., S₂ (C) are allocated to the top side and S₁ and S₄ (C) are allocated to the bottom side. Subgrouping for the other levels follows the same rule as mentioned above.

Figure S2. Unique searching for SNP is designed to subgroup the COI sequences at each level. In this case (case II), unique searching is designed to find the SNP with only unique nucleotide for one unique subgroup and the other sequences are processed for next unique searching. For example, the first nucleotide (nt 1) does not show one unique nucleotide, i.e., 3 T and 5 C. Subsequently, the unique searching goes to the second nucleotide. We found the second position (nt 2) of S3 (T) is unique compared to others (C) at the 1-level, i.e., S3 (T) is allocated to the top side and others (C) are allocated to the bottom side. Subgrouping for the other levels follows the same rule as mentioned above.

References

Austerlitz, F., David, O., Schaeffer, B., Bleakley, K., Olteanu, M., Leblois, R., et al. (2009). DNA barcode analysis: a comparison of phylogenetic and statistical classification methods. BMC Bioinform. 10(Suppl. 14):S10. doi: 10.1186/1471-2105-10-S14-S10

PubMed Abstract | CrossRef Full Text | Google Scholar

Becker, S., Hanner, R., and Steinke, D. (2011). Five years of FISH-BOL: brief status report. Mitochond. DNA 22(Suppl. 1), 3–9. doi: 10.3109/19401736.2010.535528

PubMed Abstract | CrossRef Full Text | Google Scholar

Berger, J. O. (2013). Statistical Decision Theory and Bayesian Analysis, 2nd Edn (Berlin; Heidelberg: Springer).

Google Scholar

DasGupta, B., Konwar, K. M., Mandoiu, I. I., and Shvartsman, A. A. (2005). DNA-BAR: distinguisher selection for DNA barcoding. Bioinformatics. 21, 3424–3426. doi: 10.1093/bioinformatics/bti547

PubMed Abstract | CrossRef Full Text | Google Scholar

Dong, W., Cheng, T., Li, C., Xu, C., Long, P., Chen, C., et al. (2014). Discriminating plants using the DNA barcode rbcLb: an appraisal based on a large data set. Mol. Ecol. Resour. 14, 336–343. doi: 10.1111/1755-0998.12185

PubMed Abstract | CrossRef Full Text | Google Scholar

Fernandez Slezak, D., Sigman, M., and Cecchi, G. A. (2018). An entropic barriers diffusion theory of decision-making in multiple alternative tasks. PLoS Comput. Biol. 14:e1005961. doi: 10.1371/journal.pcbi.1005961

PubMed Abstract | CrossRef Full Text | Google Scholar

Fleischer, R. (1996). A simple balanced search tree with O(1) worst-case update time. Int. J. Found Comput. Sci. 7, 137–149. doi: 10.1142/S0129054196000117

CrossRef Full Text | Google Scholar

Hebert, P. D., Cywinska, A., Ball, S. L., and deWaard, J. R. (2003). Biological identifications through DNA barcodes. Proc. Biol. Sci. 270, 313–321. doi: 10.1098/rspb.2002.2218

PubMed Abstract | CrossRef Full Text | Google Scholar

Kress, W. J., García-Robledo, C., Uriarte, M., and Erickson, D. L. (2015). DNA barcodes for ecology, evolution, and conservation. Trends Ecol. Evol. 30, 25–35. doi: 10.1016/j.tree.2014.10.008

PubMed Abstract | CrossRef Full Text | Google Scholar

Kumar, S., Stecher, G., and Tamura, K. (2016). MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Mol. Biol. Evol. 33, 1870–1874. doi: 10.1093/molbev/msw054

PubMed Abstract | CrossRef Full Text | Google Scholar

Liu, S. Y., Chan, C. L., Lin, O., Hu, C. S., and Chen, C. A. (2013). DNA barcoding of shark meats identify species composition and CITES-listed species from the markets in Taiwan. PLoS ONE. 8:e79373. doi: 10.1371/journal.pone.0079373

PubMed Abstract | CrossRef Full Text | Google Scholar

Meier, R., Shiyang, K., Vaidya, G., and Ng, P. K. (2006). DNA barcoding and taxonomy in Diptera: a tale of high intraspecific variability and low identification success. Syst. Biol. 55, 715–728. doi: 10.1080/10635150600969864

PubMed Abstract | CrossRef Full Text | Google Scholar

Metzker, M. L. (2010). Sequencing technologies—the next generation. Nat. Rev. Genet. 11, 31–46. doi: 10.1038/nrg2626

CrossRef Full Text | Google Scholar

Park, M. H., Jung, J. H., Jo, E., Park, K. M., Baek, Y. S., Kim, S. J., et al. (2018). Utility of mitochondrial CO1 sequences for species discrimination of Spirotrichea ciliates (Protozoa, Ciliophora). Mitochond. DNA A DNA Mapp. Seq. Anal. 30, 148–155. doi: 10.1080/24701394.2018.1464563

PubMed Abstract | CrossRef Full Text | Google Scholar

Pullat, J., and Metspalu, A. (2008). Arrayed primer extension reaction for genotyping on oligonucleotide microarray. Methods Mol. Biol. 444, 161–167. doi: 10.1007/978-1-59745-066-9_12

PubMed Abstract | CrossRef Full Text | Google Scholar

Quinlan, J. R. (1986). Induction of decision trees. Machine Learn. 1, 81–106. doi: 10.1007/BF00116251

CrossRef Full Text | Google Scholar

Roumpeka, D. D., Wallace, R. J., Escalettes, F., Fotheringham, I., and Watson, M. (2017). A review of bioinformatics tools for bio-prospecting from metagenomic sequence data. Front. Genet. 8:23. doi: 10.3389/fgene.2017.00023

PubMed Abstract | CrossRef Full Text | Google Scholar

Sarmiento-Camacho, S., and Valdez-Moreno, M. (2018). DNA barcode identification of commercial fish sold in Mexican markets. Genome. 61, 457–466. doi: 10.1139/gen-2017-0222

PubMed Abstract | CrossRef Full Text | Google Scholar

Schoch, C. L., Seifert, K. A., Huhndorf, S., Robert, V., Spouge, J. L., Levesque, C. A., et al. (2012). Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for Fungi. Proc. Natl. Acad. Sci. U.S.A. 109, 6241–6246. doi: 10.1073/pnas.1117018109

PubMed Abstract | CrossRef Full Text | Google Scholar

Seifert, K. A. (2009). Progress towards DNA barcoding of fungi. Mol. Ecol. Resour. 9(Suppl. 1), 83–89. doi: 10.1111/j.1755-0998.2009.02635.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Vandamme, S. G., Griffiths, A. M., Taylor, S. A., Di Muri, C., Hankard, E. A., Towne, J. A., et al. (2016). Sushi barcoding in the UK: another kettle of fish. PeerJ. 4:e1891. doi: 10.7717/peerj.1891

PubMed Abstract | CrossRef Full Text | Google Scholar

Ward, R. D., Hanner, R., and Hebert, P. D. (2009). The campaign to DNA barcode all fishes, FISH-BOL. J. Fish Biol. 74, 329–356. doi: 10.1111/j.1095-8649.2008.02080.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Weitschek, E., Fiscon, G., and Felici, G. (2014). Supervised DNA Barcodes species classification: analysis, comparisons and results. BioData Min. 7:4. doi: 10.1186/1756-0381-7-4

PubMed Abstract | CrossRef Full Text | Google Scholar

Weitschek, E., Van Velzen, R., Felici, G., and Bertolazzi, P. (2013). BLOG 2.0: a software system for character-based species classification with DNA Barcode sequences. What it does, how to use it. Mol. Ecol. Resour. 13, 1043–1046. doi: 10.1111/1755-0998.12073

PubMed Abstract | CrossRef Full Text | Google Scholar

Willette, D. A., Simmonds, S. E., Cheng, S. H., Esteves, S., Kane, T. L., Nuetzel, H., et al. (2017). Using DNA barcoding to track seafood mislabeling in Los Angeles restaurants. Conserv. Biol. 31, 1076–1085. doi: 10.1111/cobi.12888

PubMed Abstract | CrossRef Full Text | Google Scholar

Zhang, A. B., and Savolainen, P. (2009). BPSI2.0: a C/C++ interface program for species identification via DNA barcoding with a BP-neural network by calling the Matlab engine. Mol. Ecol. Resour. 9, 104–106. doi: 10.1111/j.1755-0998.2008.02372.x

PubMed Abstract | CrossRef Full Text | Google Scholar

Keywords: decision theory, DCST, single nucleotide polymorphism (SNP), barcoding, COI, teleost fish, species identification

Citation: Yang C-H, Wu K-C, Chuang L-Y and Chang H-W (2019) Decision Theory-Based COI-SNP Tagging Approach for 126 Scombriformes Species Tagging. Front. Genet. 10:259. doi: 10.3389/fgene.2019.00259

Received: 29 September 2018; Accepted: 08 March 2019;
Published: 03 April 2019.

Edited by:

Juan Caballero, Universidad Autónoma de Querétaro, Mexico

Reviewed by:

Mikhail P. Ponomarenko, Institute of Cytology and Genetics (RAS), Russia
Varalakshmi Vissa, Consultant, Fort Collins, CO, United States

Copyright © 2019 Yang, Wu, Chuang and Chang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Li-Yeh Chuang, Y2h1YW5nQGlzdS5lZHUudHc=
Hsueh-Wei Chang, Y2hhbmdod0BrbXUuZWR1LnR3

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.