Whole-Genome Sequencing of Sinocyclocheilus maitianheensis Reveals Phylogenetic Evolution and Immunological Variances in Various Sinocyclocheilus Fishes

An adult Sinocyclocheilus maitianheensis, a surface-dwelling golden-line barbel fish, was collected from Maitian river (Kunming City, Yunnan Province, China) for whole-genome sequencing, assembly, and annotation. We obtained a genome assembly of 1.7 Gb with a scaffold N50 of 1.4 Mb and a contig N50 of 24.7 kb. A total of 39,977 protein-coding genes were annotated. Based on a comparative phylogenetic analysis of five Sinocyclocheilus species and other five representative vertebrates with published genome sequences, we found that S. maitianheensis is close to Sinocyclocheilus anophthalmus (a cave-restricted species with similar locality). Moreover, the assembled genomes of S. maitianheensis and other four Sinocyclocheilus counterparts were used for a fourfold degenerative third-codon transversion (4dTv) analysis. The recent whole-genome duplication (WGD) event was therefore estimated to occur about 18.1 million years ago. Our results also revealed a decreased tendency of copy number in many important genes related to immunity and apoptosis in cave-restricted Sinocyclocheilus species. In summary, we report the first genome assembly of S. maitianheensis, which provides a valuable genetic resource for comparative studies on cavefish biology, species protection, and practical aquaculture of this potentially economical fish.


INTRODUCTION
Cave-restricted animals live in dark subterranean environments. They have evolved over time to adapt to the cave environments through various trait changes in morphology, behavior, and physiology (Jeffery, 2001). Cavefishes have degraded eyes, less body pigments, lower immune activities, and decrease in circadian rhythms in comparison to surface-dwelling fishes (Jeffery, 2009;Qiu et al., 2016;Yang et al., 2016;Krishnan and Rohner, 2017). As a compensation, the nonvisual sensory system of cavefishes is usually enhanced, such as development of extra taste buds and increased vibration attraction behavior (Yoshizawa et al., 2010;Yang et al., 2016). These facts of cavefishes are extremely interesting and worth to be explored with more investigations.
Previously, we have proved that cavefishes have fewer copies of major histocompatibility complex-related gene families (genes of cell surface proteins essential for acquired immune system) than surface-dwelling and semi-cave-dwelling counterparts, possibly suggesting relatively lower immune activities in cavefishes , which may be a specific strategy for cave adaptation. Cave-restricted Mexican tetra (Astyanax mexicanus) also shows a big increase in appetite, but its fatty liver did not affect this fish's health (Aspiras et al., 2015), implying that there may be some other immune-related molecular mechanisms for fighting inflammation in cavefishes (Xiong et al., 2019). A recent study (Peuß et al., 2020) proposed that organisms in various environments have developed differential immune strategies with innate immune degradation and T-cell overexpression in cavefishes. However, many of these putative molecular mechanisms are not fully understood.
S. maitianheensis lives originally in the surface of Maitian river in Kunming City, Yunnan Province, China (Supplementary Figure S1). Some Sinocyclocheilus fishes are also residents in caves. S. maitianheensis can therefore be used as a control for comparative studies on cave adaptation. Meanwhile, as a genus of state second-class protection in China, Sinocyclocheilus has been propagated with a series of artificial breeding for protection from extinction (Yin et al., 2021). Various Sinocyclocheilus species likely shared tetraploid origin and have 96 chromosomes that are twice of most teleosts (Heng et al., 2002). The genome diversity makes fishes in this genus as good models for studying cave adaptation and phylogenetic evolution. Although there are some reports on morphological and mitochondrial genomic evolution of various Sinocyclocheilus species (Wu et al., 2010;He et al., 2012;Chen Y.-Y. et al., 2018), whole genome-based comparative studies are rare, except for three representative Sinocyclocheilus fishes that we published before .
Here, we performed whole-genome sequencing, assembly, and annotation of S. maitianheensis and subsequently conducted comparative genomic analysis and immune-gene inquiry with four other Sinocyclocheilus counterparts (including surfacedwelling Sinocyclocheilus grahami, semi-cave-dwelling Sinocyclocheilus rhinocerous, and cave-restricted Sinocyclocheilus anophthalmus and Sinocyclocheilus anshuiensis; their genome sequences are publicly available).
Our main purpose is to provide a genetic resource for indepth studies on cave adaption and cavefish biology. Our study can also contribute to the species protection and exploitation of potentially economical value for S. maitianheensis.

MATERIALS AND METHODS
Sampling, Library Constructing, and Genome Sequencing An adult S. maitianheensis was collected from Maitian river in Kunming City, Yunnan Province, China, for genome and transcriptome sequencing. Genomic DNAs were extracted from muscle sample. Seven Illumina paired-end sequencing libraries (with insert sizes of 270 bp, 500 bp, 800 bp, 2 kb, 5 kb, 10 kb, and 20 kb, respectively) were constructed for a routine shotgun whole-genome sequencing in an Illumina HiSeq 2,500 platform (San Diego, CA, United States). SOAPfilter v2.2 (Li R. et al., 2009) (-z -p -g 1 -f -o clean -M 2 -f 0) was used to filter reads. Duplicate reads from polymerase chain reactions, those reads with 10 or more nonsequencing bases (Ns), adapter sequences, and bases with low quality were removed.
Total RNAs were extracted from muscle, skin, eye, liver, heart, and brain for construction of individual cDNA library. cDNAs were then sequenced in an Illumina HiSeq X platform and filtered by SOAPnuke v1.0 (Chen Y. et al., 2018) with optimized parameters [-l 10 (default: 5) -q 0.2 (default: 0.5) -n 0.05 -c 0 -Q 2 (default: 1)]. Reads with nonsequenced (N) base ratio of more than 5% or low-quality base (base quality ≤10) ratio of greater than 20% were discarded to generate a new set of higherquality reads for subsequent transcriptome-based annotation.

Genome Survey, de Novo Assembly, and Assessment
Genome size was estimated via the routine 17-mer frequency distribution analysis with the following formula: genome size K num /K depth , where k num is the number of k-mers obtained from reads, and K depth is the expected depth of k-mer at a maximum frequency (Song et al., 2016). Two Illumina short-insert libraries (500 and 800 bp) were used for this 17-mer analysis.

Repeat Annotation
Repeat sequence annotation is composed of three routine methods, including de novo annotation, homology prediction, and tandem repeat prediction. First, we used RepeatModeller v1.04 (Chen, 2004) and LTR-FINDER v1.0.6 (Xu and Wang, 2007) to construct a local de novo repeat reference, and then our assembled genome was aligned to this reference library by RepeatMasker v4.06 (Chen, 2004). In addition, RepeatMasker v4.06 (Chen, 2004) and RepeatProteinMask v4.06 (Chen, 2004) were applied for homology prediction after identification of transposable elements based on RepBase (Jurka et al., 2005). Moreover, Tandem Repeat Finder v4.09 (Benson, 1999) was separately used to predict comprehensive tandem repeats in our pipeline as previously reported Zhao et al., 2021). Finally, these results from the aforementioned three methods are integrated by our in-house perl scripts (https://github.com/liruihanguo/Repeats_integration). These scripts separately classified each type of repeat, integrated all repeats, and then removed those overlaps to obtain a nonredundant repeat set.

Phylogenetic and Divergence Time Analyses
Single-copy orthogroups were aligned using MUSCLE v3.8.31 (Edgar, 2004). Subsequently, conserved regions were obtained by Gblocks (Castresana, 2000), and the CDS regions of all singlecopy genes from each species were connected to form a supergene for extraction of the 4d sites. We also constructed a phylogenetic tree by using PhyML v3.0 with the maximum likelihood method (Guindon et al., 2010). MCMCtree in the PAML package (Yang and Rannala, 2006)  An all-to-all alignments of protein sequences from these seven genomes were applied by using BLASTp (McGinnis and Madden, 2004) with an e-value of 1e-5. Syntenic blocks between species were identified using i-ADHoRe 3.0 (Proost et al., 2012) with default parameters, and then homologous proteins were obtained. Subsequently, homologous pairs were aligned using MUSCLE (Edgar, 2004), after we retrieved these homologous protein sequences and converted them to nucleotide sequences. Lastly, we calculated and corrected the 4dTv values for each gene pair by using the HKY model in PAML package (Yang and Rannala, 2006).

Identification of Immune Genes and Pseudogenes in P38 and Mitochondrial Pathways
Fifteen apoptosis-related genes were identified in five Sinocyclocheilus genomes and other five representative vertebrate genomes with high quality to investigate the gene copy number in P38 and mitochondrial pathway. These other five vertebrates include common carp, zebrafish, Japanese medaka, Asian arowana, and Mexican tetra (A. mexicanus, GCF_000372,685.1 in NCBI). Each genome sequence was used to construct a standard aligned database in the first place. Protein sequences of tak1, tab1, ask1, fas, fasl, fadd, tnfa, cd40, cd40l, daxx, mkk4a, mkk4b, mkk6, bcl-2a, and bcl2l1 of zebrafish were downloaded from public uniport databases (Supplementary  Table S1) as the queries. These protein sequences were then aligned onto above 10 genomes by TBLASTn (e-value ≤1e-5). The alignments with aligned ratio of less than 0.5, sequences similarity of less than 50%, and redundant data were filtered out to obtain the final hit alignments. Subsequently, target apoptosisrelated genes were predicted by GeneWise v2.4.1 (Birney et al., 2004) from these 10 vertebrate genomes.
We performed a multiple-sequence alignment using the Muscle module in MEGA v7.0 (Kumar et al., 2016) to identify pseudogenes in each species indicated above. The whole open reading frames were performed with codon-based alignment to identify potential pseudogenes with irregular shifts of premature stop codon(s), codon frameshifts, or missing exon regions.

Summary of the Genome Assembly and Assessment
We generated a total of 236.7-Gb raw reads, among them approximately 179.5 Gb of clean reads were obtained after removal of low-quality data. For the transcriptome sequencing, a total of 39.3-Gb raw reads were generated. The genome size of S. maitianheensis was estimated to be about 1.8 Gb by the routine 17mer frequency distribution analysis, because the k num is 40,525,178,512, and the K depth is 23 (see Supplementary Figure S2).
After de novo assembly and gap closing, the final genome assembly was 1.7 Gb in total length, with a scaffold N50 of 1.4 Mb, a contig N50 of 24.7 kb, and the GC content of 37.6% (Tables 1 and Supplementary Table S2). The final assembled genome accounted for 94.4% of the estimated genome size (1.8 Gb). For assessment of our genome assembly, we searched a total of 3,640 BUSCO (Benchmarking Universal Single-Copy Orthologs) groups and determined that 3,536 (97.1%) were complete (with 1,643 single-copy BUSCOs and 1,893 duplicated BUSCOs), suggesting a high completeness of our genome assembly for S. maitianheensis.
We compared the genome assembly and annotation results for the five examined Sinocyclocheilus species and observed that their genome sizes are at a narrow range of 1.7-1.9 Gb (Supplementary Table S3). The largest genome is 1.9 Gb for S. anophthalmus (GCA_018155175.1 in NCBI) whose assembly was generated with both Illumina (next-generation) and PacBio (thirdgeneration) sequencing reads, whereas others were sequenced only by an Illumina platform with relatively lower scaffold N50 values. GC contents of the five examined Sinocyclocheilus species ranged from 37.2 to 38.6% (Supplementary Table S3).

Genome Annotation
For repeat annotation, we predicted 664,837,705-bp repeat elements (accounting for 39.4% of the genome assembly; Supplementary Table S4). Among them, 455.9 Mb of DNA repeat elements, 135.2 Mb of long terminal repeats (LTRs), 104.9 Mb of long interspersed nuclear elements (LINE), and 9.1 Mb of short interspersed nuclear elements (SINE) were detected (Supplementary Table S5). After homology and transcriptome-based annotations, our MAKER results integrated a total of 39,977 protein-coding genes with an average length of 15.9 kb (Supplementary Table S6).
For function annotation, 38,677 genes were aligned onto four public databases (Interpro, KEGG, TrEMBL, and Swiss-Prot) with function assignments, which account for 96.8% of the annotated genes (Supplementary Table S7). The distributions of CDS, intron, gene, and exon length in S. maitianheensis were generally similar to those in other four examined Sinocyclocheilus fishes (Supplementary Figure S3), which suggested a good reliability of the genome annotation for S. maitianheensis. However, there are still some distinctions among them, such as the higher percentages of CDS and genes at length  "WGD" represents the Sinocyclocheilus-specific WGD event (red), and "3R WGD" represents the third-round WGD event (green). (D) 4dTv distributions of selfalignments in five Sinocyclocheilus fishes, and common carp.
Frontiers in Genetics | www.frontiersin.org October 2021 | Volume 12 | Article 736500 5 approximately 1,000 bp in S. maitianheensis. Moreover, our annotation results reveal that S. maitianheensis has less protein-coding gene number (39,977) than the other four Sinocyclocheilus counterparts (49,865 for S. anophthalmus, 42,109 for S. grahami, 40,470 for S. anshuiensis, and 42,377 for S. rhinocerous). The assessment of protein-coding genes among the five fishes showed that there were more missing BUSCOs in S. maitianheensis (Supplementary Table S8), which may lead to a lower amount of gene number. However, this divergence of gene number may be caused in part by different annotation methods.

Summary of the Gene Orthogroups
A total of 26,875 orthogroups were predicted for the five examined Sinocyclocheilus fishes and six other representative vertebrate species. For S. maitianheensis, 32,150 protein-coding genes were clustered into 15,617 orthogroups, although 7,827 genes were not clustered; the numbers of multiple-copy, single-copy orthologs, and unique paralogs were 15,400, 2,508 and 981, respectively ( Figure 1A and Supplementary Table S9). We summarized the numbers of orthogroups shared with each other between the five Sinocyclocheilus in Figure 1B. There are 10,253 common orthogroups among these Sinocyclocheilus species.

Phylogenetic Position of S. maitianheensis
A total of 191 common single-copy orthogroups among all the examined species were used for construction of a phylogenetic tree. A progressive evolutionary relationship in the five Sinocyclocheilus species revealed a new phylogeny based on both genome and transcriptome data ( Figure 1C). This topology is consistent with the phylogenetic tree based on genomes  and mitochondrial genes (Zhang and Wang, 2018), but it is different from a previous report of a division into two branches, in which S. maitianheensis, S. anophthalmus, and S. anophthalmus are located in one branch, whereas S. anshuiensis and S. rhinocerous are located in another branch (Mao et al., 2021) based on the morphological trait of eyes. Therefore, our latest topology provides novel insights into detailed phylogeny for the five Sinocyclocheilus species at a genome level.
S. maitianheensis and S. anophthalmus diverged at approximately 2.7 (1.4-4.8) Mya, and the divergence time periods with S. grahami, S. anshuiensis, and S. rhinocerous were 4.3 (2.5-6.9), 7.8 (5.0-11.7), and 8.9 (5.8-13.1) Mya, respectively ( Figure 1C). Surface-dwelling S. maitianheensis has the closest relationship with cave-restricted S. anophthalmus; however, there are huge differences in morphological traits between them. Interestingly, the geographical positions of both habitats are close, located in the same Yiliang County (Kunming City, Yunnan Province, China; Supplementary Figure S1). S. anophthalmus resides in a dark environment of several caves in Jiuxiang Town (Zhao and Zhang, 2009), whereas S. maitianheensis lives in the surface of Maitian river. Divergence time of the two Sinocyclocheilus species was relatively late, approximately 2.7 (1.4-4.8) Mya. Their separation may be due to a geographical isolation generated by the continuous uplift of the Yunnan-Guizhou Plateau after Himalayan orogeny (50-40 Mya) (Yin and Harrison, 2000), and some of their ancestors swam down along the Maitian river into these surrounding caves.

Estimation of the Lineage-specific WGD in Sinocyclocheilus
Cyprinidae experienced a recent genome-wide duplication event (Larhammar and Risinger, 1994;David et al., 2003) after the third-round WGD (3R WGD, also known as teleost-specific WGD) (Jaillon et al., 2004). We performed a 4dTv analysis to estimate the timing of this recent lineage-specific WGD in Sinocyclocheilus. Self-alignments with paralogous gene pairs of S. maitianheensis, S. anshuiensis, S. rhinocerous, S. anophthalmus, S. grahami, and common carp displayed distinct peaks ( Figure 1D). Their 4dTv distances were calculated to be 0.04, 0.04, 0.04, 0.05, 0.05, and 0.06, respectively. The peak values of the five Sinocyclocheilus fishes and common carp are very close (0.04-0.06), implying that these fishes might share the recent genome-wide duplication event.
In order to compare the recent specific WGD between Sinocyclocheilus fishes and common carp, we performed a 4dTv analysis on 13,579 orthologous gene pairs between S. maitianheensis and common carp genomes (Supplementary Figure S4). The peak values in each group of C. carpio-C. carpio, S. maitianheensis-C. carpio, and S. maitianheensis-S. maitianheensis were estimated to be 0.06, 0.04, and 0.04, respectively. The nearly overlapping of peaks indicated that S. maitianheensis and common carp might have undergone the recent specific WGD together. Based on the aforementioned 4dTv analyses and the construction of phylogenetic tree, we predicted that the carp-specific WGD occurred ∼18.1 Mya, before the evolutionary separation of Sinocyclocheilus fishes and common carp (∼17.3 Mya; Figure 1D). This estimate is little earlier before the specific WGD of common carp based on the divergence time of transposable elements (Xu et al., 2019).
The time estimation of the latest WGD in Cyprinidae is contentious, from ∼8.2 to ∼16 Mya (Larhammar and Risinger, 1994;David et al., 2003;Xu et al., 2014), although these studies almost focused on the common carp. However, a recent study of common carp defined a rough time range (9.7-23 Mya) and further predicted this time point to about 12.4 Mya (Xu et al., 2019). Our result of ∼18.4 Mya in the present study is within this time range, and the 4dTv analysis of Sinocyclocheilus fishes provides more evidence for the timing of the latest genome duplication in Cyprinidae.

Copy Number Variations of Several Immune Genes in P38 and Mitochondrial Pathways
S. anophthalmus has evolved a series of traits to adapt to the caved environment. It had developed huge differences from its sister species (S. maitianheensis), such as loss of eyes and the semitransparent body (Zhao and Zhang, 2009). Therefore, compared to S. maitianheensis, S. anophthalmus is an independent cave species with many valuable traits that are waiting for in-depth explorations.
To investigate potential immunological variances in caverestricted Sinocyclocheilus species, we identified the copy number Frontiers in Genetics | www.frontiersin.org October 2021 | Volume 12 | Article 736500 6 of 15 immune genes within P38 and mitochondrial pathways in the five examined Sinocyclocheilus genomes (see detailed statistics in Supplementary Table S10). These genes include tak1, tab1, ask1, fas, fasl, fadd, tnfa, cd40, cd40l, daxx, mkk4a, mkk4b, mkk6, bcl-2a, and bcl2l1 (Supplementary Figure S5). In general, five Sinocyclocheilus species and common carp usually own twofold copies compared to the other diploid teleosts. Interestingly, we found copies of some genes in the cave-restricted fishes. For S. anophthalmus, a copy of ask1 (an apoptosis signal-regulating kinase) (Patel et al., 2019) was predicted as a pseudogene; only one copy of bcl2l1 (encoding apoptotic regulators in BCL-2 family) (Warren et al., 2019) was retained in S. anophthalmus compared with S. maitianheensis that contained two copies; one more copy of bcl-2a and no copy of mkk4a were also observed in S. anophthalmus genome. In addition, among most of these genes we studied, another cave-restricted fish Mexican tetra (A. mexicanus) had fewer gene copy numbers than the other examined fishes. These variances in gene copy number imply that the apoptotic activity might have been decreased in cave-restricted fishes, which is consistent with our previous report of relatively lower immunity in cavefishes . However, apoptotic activity is regulated by many factors, and more investigations should be done for in-depth verification.
In addition, we performed high-throughput identification of antimicrobial peptides (AMPs) (Yi et al., 2017;Mwangi et al., 2019). A total of 379, 551, 522, 545, and 552 putative AMP sequences were identified in S. maitianheensis, S. rhinocerous, S. anshuiensis, S. anophthalmus, and S. grahami genomes, respectively (Supplementary Table S11). Thrombin, histone, lectin, chemokine, scolopendin, and ubiquitin are the most abundant AMPs in the five examined Sinocyclocheilus species. The lowest number of AMP sequences in S. maitianheensis genome may be related to its least protein-coding genes in the five Sinocyclocheilus fishes.

CONCLUSION
In summary, we reported the first genome assembly of S. maitianheensis, which provides a valuable genetic resource for comparative studies on cavefish biology, species protection and practical aquaculture of this potentially economical teleost fish. This genome assembly also supplies essential genomic data for indepth genetic analysis. Based on these genomic data, we observed a close relationship between S. maitianheensis and S. anophthalmus. Some variations of gene copy number in the immune system might indicate the variation in immunity and apoptosis in cave-restricted Sinocyclocheilus species.

VALUE OF THE DATA
This is the first draft genome of a representative surface-dwelling Chinese golden-line barbel fish, Sinocyclocheilus maitianheensis. The final assembly was 1.7 Gb with a scaffold N50 of 1.4 Mb and a contig N50 of 24.7 kb.
The phylogenetic tree revealed that S. maitianheensis is close to S. anophthalmus (a cave-restricted species with similar locality). The divergence time between the two relatives is about 2.7 million years ago (Mya).
The 4dTv analysis demonstrated that the recent carp-specific WGD event occurred approximately 18.1 Mya.
A decrease in the copy number of many important immunological genes was observed in cave-restricted Sinocyclocheilus species.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

ETHICS STATEMENT
All animal experiments in this study were performed in accordance with the guidelines of the Animal Ethics Committee and were approved by the Institutional Review Board on Bioethics and Biosafety of BGI, China (No. FT18134). Written informed consent was obtained from the owners for the participation of their animals in this study.

AUTHOR CONTRIBUTIONS
Conceptualization, JY and QS; data analysis, RL and ZG; samples collection and assisted data analysis, YZ, WJ, and MW; data curation, XY, LC, and XP; writing-original draft preparation, RL and ZG; writing-review and editing, QS, CB, and XW; supervision, QS and JY; funding acquisition, JY and QS. All authors have read and approved the published version of the manuscript.