Unbiasing Genome-Based Analyses of Selection: An Example Using Iconic Shark Species

Whilemostly confined tomarine environments, species in the taxon Chondrichthyes (cartilaginous fishes) exhibit a wide variety of body size, longevity, and habitat depth. Despite their diversity, genome analysis has concentrated on species with high public visibility (Read et al., 2017; Marra et al., 2019). Previously, a study reporting whole genome sequencing of the white shark, Carcharodon carcharias (Marra et al., 2019) suggested unique molecular evolution accounted for the maintenance of genome stability and the elongated lifespan of the whale shark, Rhincodon typus as well as the white shark. The results reported by Marra et al. (2019) included positive selection of dozens of protein-coding genes potentially involved in genome stability and wound healing. We performed a reanalysis of some of the data presented in Marra et al. (2019) using the genome resources of other shark species and report our opinion based the results of the reanalysis.


INTRODUCTION
While mostly confined to marine environments, species in the taxon Chondrichthyes (cartilaginous fishes) exhibit a wide variety of body size, longevity, and habitat depth. Despite their diversity, genome analysis has concentrated on species with high public visibility (Read et al., 2017;Marra et al., 2019). Previously, a study reporting whole genome sequencing of the white shark, Carcharodon carcharias (Marra et al., 2019) suggested unique molecular evolution accounted for the maintenance of genome stability and the elongated lifespan of the whale shark, Rhincodon typus as well as the white shark. The results reported by Marra et al. (2019) included positive selection of dozens of protein-coding genes potentially involved in genome stability and wound healing. We performed a reanalysis of some of the data presented in Marra et al. (2019) using the genome resources of other shark species and report our opinion based the results of the reanalysis. Marra et al. (2019) reported positive selection in the gene encoding Mdm4 in the whale shark lineage. Mdm4 is a key regulator of the tumor suppressor gene, p53 (Toledo and Wahl, 2007). In our view, the whale shark Mdm4 ortholog sequence used in their analysis (XP_020377040.1 in NCBI; presented in Figure 3 by Marra et al., 2019) seems to harbor an incorrect open reading frame (ORF), when it is curated by transcript sequencing (Figure 1A). The sequence used by Marra et al. (2019) shows a remarkable dissimilarity to their orthologs, as well as the curated sequence of this whale shark gene ( Figure 1A).

REANALYSIS WITH SEQUENCE CURATION
Because one of the main findings by Marra et al. (2019) is challenged by our finding of this simple ORF misidentification, we investigated the other genes that were judged to be positively selected in their study. Here, we focused on Denticleless E3 ubiquitin protein ligase homolog (Dtl), Coenzyme Q3, methyltransferase (Coq3), and Sirtuin 7 (Sirt7) genes. We could not find orthologous coding sequences of these genes in the white shark protein-coding sequences supplied by Marra et al. (2019) but were identified in the genomic sequences, guided by the record of their gene prediction in the general feature format (GFF) file described the exact coordinates and attributes of genes and transcripts. Specifically, of these genes, the coding sequences of Coq3 and Sirt7 seem to be erroneously predicted and possibly used without curation (Supplementary Data, https://doi.org/ 10.6084/m9.figshare.13521329), as shown above for the whale shark Mdm4 sequence.
Dataset S1 in the publication by Marra et al. (2019) frequently exhibits inflated values (e.g., 999) for the ratio of non-synonymous to synonymous substitutions (ω). The inflation may be caused by ORF misidentification as shown above for Mdm4 but could also be due to the inclusion of phylogenetically distant sequences (e.g., paralogs) or species (e.g., teleost fishes that diverged FIGURE 1 | Reanalysis of shark molecular sequences. (A) Multiple alignment of Mdm4 for the amino acid sequence stretch corresponding to the residue 241 to 325 of human Mdm4 (NP_002384.2 in NCBI). The residues in white letters indicate positively selected residues in the whale shark sequence identified previously (Marra et al., 2019), but most of them are neither unique to the whale shark nor included in the curated whale shark sequence. The GenBank accession ID for the curated whale shark sequence (top) supported by transcript sequencing is MN317558.1. Details of the transcript sequencing will be reported elsewhere. (B) Substitution saturation plots for transition and transversion of the coding region of the Fgg, Mdm4, Chek2, and Dtl genes chosen from those previously regarded as positively selected (Marra et al., 2019). Each dot indicates a pair of species in the dataset. The white and black dots indicate transitions and transversions, respectively. The horizontal axis indicates the distance based on the TN93 substitution model (Tamura and Nei, 1993). The vertical axis indicates the observed proportion of transition and transversion. The amino acid sequences were aligned with MAFFT v7.299b (Katoh and Standley, 2013) using the L-INS-i option. Nucleotide sequences were aligned based on the amino acid sequence alignment using the emboss tranalign tool. Unreliably aligned regions were removed using Gblocks v0.91b (Talavera and Castresana, 2007) based on the default parameters. Transversion and transition frequencies were calculated with the TN93 substitution model using the DAMBE program (Xia, 2018). The details of the sequences used for the analysis are included in Supplementary Data (https://doi.org/10.6084/m9.figshare.13521329).

(Continued)
Frontiers in Marine Science | www.frontiersin.org FIGURE 1 | (C) Phylogenetic overview of maximum recorded lifespan, genome size, and repeat content in a uniform presentation. The branch lengths are proportional to the geological times based on information retrieved from the Timetree of Life website (http://www.timetree.org/). The lifespan of the coelacanth was suggested to be over 100 years in other literature (Froese and Palomares, 2000). Repetitiveness in the genomes was quantified uniformly as previously described (Hara et al., 2018). The genome sizes are based on flow cytometry except for the human for which a total length of the genome sequences (3.2 Gbp) is conventionally referred to as its genome size (*). Because the documented lifespan of the brownbanded bamboo shark, Chiloscyllium punctatum is unavailable, that of a different species in the same genus (C. plagiosum) is included. The numbers in parentheses correspond to the following publications: (1) Allard et al. (1998); (2)  from chondrichthyan species more than 400 million years ago; Irisarri et al., 2017). The three chondrichthyan species included by Marra et al. (2019) diverged more than 150 million years ago (Irisarri et al., 2017), leaving long branches connecting taxa that would ideally be broken by including more closely related shark species with the genome sequences made available earlier (such as the brownbanded bamboo shark and the cloudy catshark; Hara et al., 2018).

INVESTIGATION OF SUBSTITUTION SATURATION
To examine the appropriateness of interspecific comparisons in the previous study (Marra et al., 2019), we estimated the frequencies of transversion and transition in the Mdm4, Dtl, Checkpoint kinase 2 (Chek2), and Fibrinogen gamma chain (Fgg) genes shown as positively selected by Marra et al. (2019). For this analysis, we employed the sequence sets used in the previous analysis (Marra et al., 2019). In particular, the Fgg gene associated with wound healing was reported by Marra et al. (2019) as a positively selected gene in both the white shark and whale shark lineages. The inclusion of distantly related species likely resulted in inflated ω values, indicated by the fact that the number of transversions in the Fgg gene exceeds the number of transitions (Figure 1B). Similarly, the other genes, Mdm4, Chek2, and Dtl, exhibited a larger number of transversions than transitions together with increasing evolutionary distance ( Figure 1B).
To investigate possible substitution saturation, we further analyzed the Fgg, Mdm4, Chek2, and Dtl genes. Indices of substitution saturation (Iss), introduced by Xia (2009), were computed for the first, second, and third codon positions of ortholog sequences using the DAMBE program (Xia, 2018). As a result, Iss exceeded the index of substitution saturation for asymmetric topologies (Iss.cAsym) at the third codon positions of these genes (for Fgg, Iss = 0.6858 and Iss.cAsym

DISCUSSION
Ortholog sequences of multiple closely related species and accurate alignments are essential for the proper detection of positively selected genes. It was previously cautioned that imprecise alignments containing erroneous ORFs likely induce ω value inflation in the detection of positively selected genes (Jordan and Goldman, 2012). Our reanalysis illustrates the use of mis-predicted ORF sequences and excessively distant sequences by Marra et al. (2019). The saturation of substitutions at the third codon positions, caused by the use of distant sequences, increases the possibility of falsely detecting positive selection (Weber et al., 2014). Overall, for the genes we examined, at least, no evidence of positive selection was obtained.
In Figure 1C, we reconstructed the schematic summary presented as Figure 1 originally by Marra et al. (2019). Therein, we included genome sizes and repeat element abundance presented in a uniform style. Marra et al. (2019), argued for the association of larger genome size and higher repeat abundance of the white shark and whale shark with their high genome stability. However, the comparison in Figure 1C, involving other shark species, shows that some shark species with relatively small body sizes and short lifespans have comparable or even larger genome sizes than the white shark and the whale shark (Hara et al., 2018). The current dataset does not support any association of genome stability with genome size or repeat abundance, which remains to be further explored when whole genomes of more species are sequenced.
In our opinion, the findings reported by Marra et al. (2019) need to be reassessed without any bias that genome analysis of those long-lived shark species will readily account for their capacities of genome stability maintenance and wound healing. This reassessment could be facilitated by the inclusion of emerging sequence information for other cartilaginous fishes (reviewed in Yamaguchi et al., 2021).

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/supplementary material.

AUTHOR CONTRIBUTIONS
KY performed data analysis. KY and SK wrote the manuscript. All authors contributed to the article and approved the submitted version.