Transcriptome Sequencing of the Endangered Species Elongate Loach (Leptobotia elongata) From the Yangtze River: De novo Transcriptome Assembly, Annotation, Identification and Validation of EST-SSR Markers

Elongate loach (Leptobotia elongata) is endemic to middle and upper reaches of the Yangtze River in China. Due to overfishing and habitat destruction, this loach has become an endangered species. So far, lack of reliable genetic information and molecular markers has hindered the conservation and utilization of elongate loach resources. Therefore, we here performed an Illumina sequencing and de novo transcriptome assembly in elongate loach, and then developed polymorphic simple sequence repeat markers (SSRs). After assembly, 51,185 unigenes were obtained, with an average length of 1,496 bp. A total of 23,901 expressed sequence tag-simple sequence repeats (EST-SSRs) were identified, distributing in 14,422 unigenes, with a distribution frequency of 28.18%. Out of 16,885 designed EST-SSR primers, 150 primers (3 or 4 base repetition-dominated) were synthesized for polymorphic EST-SSR development. Then, 52 polymorphic EST-SSRs were identified, with polymorphism information contents (PIC) ranging from 0.03 to 0.88 (average 0.54). In conclusion, this was the first report of transcriptome sequencing of elongate loach. Meanwhile, we developed a set of polymorphic EST-SSRs for the loach. This study will provide an important basis, namely genetic information and polymorphic SSRs, for further population genetics and breeding studies of this endangered and economic loach in China.


INTRODUCTION
Elongate loach (Leptobotia elongata), also known as Queen loach, is the largest species in the Cobitidae family (Yuan et al., 2010). This loach is recorded in the middle and lower reaches of the Jinsha River, the upper reaches of the Yangtze River, the Minjiang River, the Jialing River, the Qujiang River and the Lancang River. It is endemic to the middle and upper reaches of the Yangtze River in China (Ding, 1994). Elongate loach contains rich nutrients (protein, fat and essential amino acids) and is of diuretic, nourishing Yin and other medicinal values, and ornamental value as well (Liang et al., 2009), making it with high market price. Meanwhile, the loach grows fast. These advantages bring a bright future for elongate loach aquaculture. Due to over-fishing, pollution of the water environment and the construction of water conservancy projects, the resources and distribution area of elongate loach have been seriously reduced (Liang et al., 2009;Yuan et al., 2010). This loach is currently listed as a highly endangered fish in the "China Species Red List" (Yue and Chen, 1998). Therefore, protection and sustainable utilization of elongate loach resource are imperative, imminent and of great significance.
Nowadays, molecular methods and means, e.g., transcriptomics and molecular markers, are often applied in the conservation and utilization of fish resources. The RNAseq technology combined with bioinformatics can generate extensive data in a very cost-effective way (Parchman et al., 2010;Ekblom and Galindo, 2011;Harkess et al., 2017;Mercati et al., 2019). It not only facilitates the acquisition of a large number of gene sequences, but is also conducive to develop expressed sequence tag (EST)-derived markers (Qiao et al., 2014;Weisberg et al., 2017;Penin et al., 2019). Simple sequence repeats (SSRs) are sequences formed by short tandem repeating of 1-6 nucleotide bases, also known as microsatellites, widely found in the nuclear genome of eukaryotes (Jarne and Lagoda, 1996). They were first discovered in the genomes of plant chloroplasts and mitochondria (Powell et al., 1995;Provan et al., 1996;Whitton et al., 1997;Procaccini and Mazzella, 1998). As a molecular marker, SSR has the advantages of high polymorphism, co-dominance, quantity and easy experimentation (Zhang et al., 2016). It is a powerful tool in researches on resource conservation, phylogeny and population genetics, and genetic breeding. Different from genomic SSRs, EST-SSRs are located in the coding regions and highly transferable to related taxa (Zhao et al., 2013). Thus, EST-SSRs are considered as effective functional markers (Li et al., 2004;Mercati et al., 2019).
There have been a few reports on SSR development of elongate loach. Liu et al. (2012) developed a total of 22 polymorphic SSRs in the loach by using the fast isolation by AFLP of sequences containing repeats protocol (FIASCO). 18 polymorphic SSRs were developed in the loach based on the method of triplex affinity capture (Lian et al., 2013). In addition, Liu et al. (2014) developed 14 polymorphic SSRs through 454 sequencing technology of elongate loach genomic DNA. In this study, RNA-seq analysis was conducted in elongate loach for the first time to enrich its genomic resources. Then, we mined sequences containing SSRs from the transcriptome dataset and 52 polymorphic EST-SSRs were validated. This study will benefit for studies of population genetics, gene functions and genetic breeding in elongate loach and other related species.

Ethics Statement
This study was conducted in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of Huazhong Agricultural University. All efforts were made to minimize suffering of the loach.

Fish Samples
In October 2019, thirty elongate loach adults used here were collected from the waters of the Minjiang River (104 • 37 12 N; 28 • 46 12 E) in China. The heart, liver, brain and muscle tissues from each elongate loach adult (n = 4) were dissected. The samples were immediately frozen in liquid nitrogen for 24 h, and then stored at −80 • C for RNA sequencing.

RNA Isolation, Library Preparation, and Sequencing
Total RNA was isolated from each sample by using TRIzol Reagent (TaKaRa, Dalian, China) according to the manufacturer's protocol. RNA purity and concentration were determined by gel electrophoresis and Nanodrop 2,000 (Thermo Fisher Scientific, Waltham, MA, United States). The RNA integrity number (RIN) was examined using Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA, United States). The isolated RNAs from the four different tissues of each loach, with RIN values greater than 9.0, were mixed with equivalent concentration as one biological sample. Since a RIN value of a live tissue was less than 9.0, three biological samples were finally obtained. The mixed RNA samples were then used for cDNA synthesis (Invitrogen, Carlsbad, CA, United States). Then the synthesized cDNA was subjected to end-repair, phosphorylation and "A" base addition according to Illumina's library construction protocol. Three cDNA libraries were constructed (namely LE1, LE2, and LE3) in this study. The cDNA libraries were sequenced on the Illumina sequencing platform (Illumina Hiseq xten/NovaSeq 6,000 sequencer Illumina, San Diego, CA, United States) by using paired-end technology. The Illumina GA pipeline (version 1.8.2, San Diego, CA, United States) was used to analyze the raw image and for base calling.

De novo Assembly and Annotations
The sequencing data was filtered with the Sickle 1 and SeqPrep 2 software to remove the linker and low-quality sequences. Then, all clean reads were de novo assembled into transcripts by Trinity (Version v2.8.5) software, and the details were described by Zhang et al. (2017). After assembly, the TGICL clustering software (Version 2.1, J. Craig Venter Institute, Rockville, MD, United States) was used to cluster and remove redundant  transcripts, and then the remaining sequences were defined as unigenes. TransRate 3 and CD-HIT 4 were then used to filter and optimize the sequences obtained. Assembly results were evaluated using BUSCO (Version 3.0.2) software. Unigenes were functionally annotated by BLASTX alignment with an E-value threshold of 10 −5 against the following databases: NCBI non-redundant protein database (Nr), swiss-prot database, protein family database (Pfam), clusters of orthologous groups of proteins database (COG), gene ontology database (GO), and kyoto encyclopedia of genes and genomes database (KEGG). The GO annotations were performed using Blast2GO based on the NCBI Nr database. The KEGG automatic annotation server was used to map these unigenes to the KEGG pathway database. The expression levels of unigenes were calculated using transcripts per million (TPM) with RSEM (Version 1.3.1) software.

Detection of EST-SSRs
The software MISA v1.0 5 was used to detect SSRs in the unigenes with the following parameters: the motif sizes ranged from one to six nucleotide repeat units (six for mono-and di-, five for tri-, tetra-, penta-, and hexa-nucleotides), and the maximum interruption distance between two SSRs was 100 bp. We used the program Premier 5.0 software (Premier Biosoft Inc., Palo Alto, CA, United States) to design SSR primers and the methods in details referred to Feng et al. (2017). In this study, out of 16,885 designed EST-SSR primers, 150 pairs of primers were randomly selected and synthesized for polymorphic EST-SSR (3 or 4 base repetition-dominated) development. Of all unigenes (51,185), 44.97% (23,019) were shorter than 500 bp, 16.98% (8,689) ranged from 501 to 1,000 bp, 14.47% (7,408) from 1,001 to 2,000 bp, and 23.58% (12,069) longer than 2,000 bp.

EST-SSRs Amplification and Genotyping
Thirty elongate loaches were used for DNA extractions. The genomic DNA was extracted from caudal fin by using ammonium acetate method (Richardson et al., 2001), diluted with deionized water, and then subjected to agarose gel electrophoresis to detect the quality. Amplification and polymorphism of the 150 EST-SSRs were tested through polymerase chain reaction (PCR) and electrophoresis, respectively. DNA amplification was performed by using a T100 Gradient Thermal cycler (Bio-Rad Laboratories, Inc., Pleasanton, CA, United States) within a 10 µL final reaction volume: 3.5 µL of double distilled water, 0.5 µL of each primer (100 µmol/L), 0.5 µL of genomic DNA (70 ng/µL), and 5 µL of 2X DreamTaq Green PCR Master Mix. PCR amplification program was listed as following: an initial denaturation at 98 • C for 5 min, then 35 cycles of 30 s at 96 • C, 10-16 s at 60 • C, 1 min at 72 • C, and final extension at 72 • C for 10 min. 2% agarose gel electrophoresis was used to evaluate the success of the PCR reaction and the polymorphisms of EST-SSRs.
The POPGENE32 (Yeh et al., 1999) software was used to assess the number of observed alleles (Na), observed heterozygosity (Ho), expected heterozygosity (He), and polymorphic information content (PIC) of each EST-SSR.

De novo Assembly and Functional Annotation of Elongate Loach Transcriptome
Before the 2000s, the annual catch of elongate loach was about 10,000 kg. By 2008, the annual catch had reduced to 2,000-3,000 kg (Duan et al., 2008). At present, elongate loach is an endangered fish species endemic to Yangtze River in China (Yuan et al., 2010). Like elongate loach, populations of many fish species in the Yangtze River are getting smaller (Zhang et al., 2020). That's why Chinese government has planned to carry out a ten-year forbidden fishing in the Yangtze River, which will be fully implemented since January 1, 2021 (Huang, 2020). Along with the obviously continuous reduction in wild resources, elongate loach is getting more and more attention. Many studies of the loach have been done. However, they mainly focus on its ecological habits, morphological characteristics, artificial reproduction, etc (Ku and Wen, 1997;Li, 2014;Zhang et al., 2018). Although the sequencing technologies have developed rapidly, different from other endangered animal species (Magalhães et al., 2019;Chen et al., 2020;Grandjean et al., 2020;Wen et al., 2020), there were few researches on genetic information of elongate loach. In this study, we developed a reference transcriptome in elongate loach. This was the first report of transcriptome sequencing in elongate loach, which could provide useful information for its protection and effective utilization.
Sequencing details of each library of elongate loach are shown in Table 1. All of the sequenced libraries were mixed for de novo assembling. After cleaning and quality checks, 202,296,650 clean reads with a total length of 29.98 Gb were obtained, which assembled into 51,185 unigenes with a mean length of 1,495.84 bp and a N50 of 2,785 bp ( Table 2). The assembled unigene lengths ranged from 201 to 48,765 bp. In addition, the TransRate score and BUSCO score were 0.36 and 84.9%, respectively. Supplementary Table 1 showed the comparison of sequencing data and assembly results by using Transdecoder, indicating that the reference transcriptome had a good quality. The length distribution of the assembled unigenes is shown in Figure 1.
All unigenes were annotated against the six functional databases: NCBI Nr, Swiss-Prot, Pfam, COG, GO, and KEGG. As indicated in Figure 2, 49.43% unigenes (25,303) were significantly matched with known genes in the Nr database, while 42.45% unigenes (21,726) matched with entries in the Swiss-Prot database, and 38.39% unigenes (19,651) matched with proteins in the Pfam database. Based on the E-value distribution, 17.21% of annotated sequences showed high scores for homology (E-value <1 × 10 −30 ), and 82.80% showed homology with E-values ranging from 0 to 1 × 10 −30 (Figure 3A). Figure 3B showed that 71.57% of sequences had similarities over 80%. These results reflected the high identities of the mapped sequences with known sequences, suggesting good assembly quality of the reference transcriptome in elongate loach. Species-based distribution analysis indicated that Sinocyclocheilus rhinocerous (22.37%) provided most of the BLASTx hits for the elongate loach unigenes (Figure 3C).
Functional prediction and classification of all unigenes were performed using their annotations with COG, GO, and KEGG databases. A total of 11, 894 unigenes (23.24%) in elongate loach were annotated in COG and grouped into 24 COG classifications (Supplementary Figure 1A). Based on Nr annotation and 60,229 functional terms, 8,970 (14.89%) elongate loach unigenes were assigned to GO term (Supplementary Figure 1B). It is classified into 59 functional groups from three aspects: biological process, cellular component and molecular function. All unigenes were analyzed with the KEGG pathway database as well. Carrying significant matches in this present study, 16,257 (31.76%) unigenes were assigned to 43 predicted metabolic pathways (Supplementary Figure 1C).
Through the reference transcriptome, we calculated the relative abundance of the assembled unigenes from four tissues of elongate loach. Cytochrome c oxidase subunit genes (COX1/3) were the highest expressed, followed by some genes related to muscle growth, including GAPDH, TPM4, and ACTC1 (Supplementary Table 2). Moreover, many ribosomal genes including RPL37, RPL23, RPS2, RPL32, RPS19, RPS23, RPL27A, and RPL15, were highly expressed in elongate loach.

Identification and Classification of EST-SSRs
A total of 23,901 EST-SSRs in elongate loach were identified in elongate loach (Table 3), which provided foundation for further investigations of population genetics and breeding of this species. Among all the unigenes, 28.18% (14,422) of them contained SSRs. Meanwhile, 4,450 unigenes had more than one SSR. The EST-SSR type in elongate loach was abundant, with a total of 54 repeat types from mononucleotide to hexanucleotides (Supplementary Table 3). 96.36% of mononucleotide repeats was A/T motif, accounting for 49.99% of the total EST-SSRs. Dinucleotide was dominated by AC/GT and AG/CT, accounting for 22.73% and 11.76%, respectively. Trinucleotide was mainly AAT/ATT and ATC/ATG, accounting for 1.58 and 1.21% respectively. The proportion of the remaining motifs did not exceed 1%. AAAC/GTTT and AAAG/CTTT motif accounted for the highest proportion in tetranucleotide. Pentanucleotide and hexanucleotide were the least of all motifs, and their proportions were 0.04% (Supplementary Table 3). Among the perfect SSRs, mononucleotide repeats were the most abundant (51.87%), followed by dinucleotide (40.28%), tri-(6.06%), tetra-(1.69%), penta-(0.084%), and hexa-(0.013%) nucleotide repeats    ( Table 4). In addition, we analyzed the numbers of repeat units and found that most EST-SSRs motifs had fewer than 15 repeats and accounted for 89.57% of all SSRs. 6-10 and 11-15 repeats harbored the most SSRs. In this study, we found that mononucleotide-repeat SSRs were the most abundant in elongate loach. This was consistent with the findings in Acipenser dabryanus (Chen et al., 2020). However, dinucleotide-repeat SSR was the main type in other fishes (Magalhães et al., 2019).

Development of Polymorphic EST-SSRs
Out of 23,901 EST-SSRs, 16,885 primer pairs were successfully designed, and 150 primers (3 or 4 base repetition-dominated) were randomly selected for amplification of genomic DNA from elongate loach. Among the 150 primer pairs, 96 pairs (64%) were able to generate clear bands by PCR, which implied a decent EST-SSR primer dataset. Considering the amplification and polymorphic results, 52 SSRs (including 4 dinucleotide-repeat SSRs, 16 trinucleotide-repeat SSRs, and 32 tetranucleotide-repeat SSRs) were selected to evaluate polymorphism levels within the elongate loach population (Table 5). Previously, a total of 54 polymorphic SSRs were developed in elongate loach (Liu et al., 2012(Liu et al., , 2014Lian et al., 2013). Among them, 29 were dinucleotiderepeat SSRs. Compared with mono-and dinucleotide-repeat SSRs, tri-and tetranucleotide-repeat SSRs were more efficient in genotyping researches (Tibihika et al., 2020). Therefore, it could be told that this study developed many efficient SSR marks to conserve and utilize the threaten and economic loach.
In October 2019, we collected an elongate loach population from the Yangtze River and found that a total of 318 alleles were detected in all the 52 polymorphic SSRs developed here. Meanwhile, the average number of alleles was 6.1. Ho varied from 0.0333 to 0.8333 with an average of 0.3986, while He from 0.3333 to 0.8757 with an average of 0.5870, were found. The value of PIC ranged from 0.0332 to 0.8381 with a mean value of 0.5414. Among the 52 SSRs obtained here, 30 SSRs were highly polymorphic (PIC > 0.5), and 14 were moderately polymorphic (0.25 < PIC < 0.5) (Botstein et al., 1980). Liu et al. (2017) reported that the average number of alleles and Ho for seven populations of elongate loach collected from July 2011 to December 2013, were respectively 11.5 and 0.74, which were obviously higher than those in our study. To a certain extent, it indicated an obvious reduction in population genetic diversity of elongate loach in the Yangtze River recently, which would be caused by over-fishing and habitat destruction. Therefore, it needs to pay more attention to the protection of elongate loach.
In summary, this study provides a reference transcriptome of multiple tissues and 52 polymorphic SSRs (3 or 4 base repetitiondominated) in elongate loach, a highly threatened fish of the Yangtze River in China. These will benefit for the protection and sustainable utilization of elongate loach.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/ Supplementary Material.

ETHICS STATEMENT
The animal study was reviewed and approved by the Institutional Animal Ethics Committee and the Use of Laboratory Animals of Huazhong Agricultural University.

AUTHOR CONTRIBUTIONS
XC and YCZ designed and supervised the experiments. YBZ, JG, and YHZ carried out the experiments. YBZ analyzed the data and wrote the manuscript. XC revised the manuscript. All authors reviewed and approved the manuscript.