Complete genome sequence analysis of Edwardsiella tarda SC002 from hatchlings of Siamese crocodile

Edwardsiella tarda is a Gram-negative, facultative anaerobic rod-shaped bacterium and the causative agent of the systemic disease “Edwardsiellosis”. It is commonly prevalent in aquatic organisms with subsequent economic loss and hence has attracted increasing attention from researchers. In this study, we investigated the complete genome sequence of a highly virulent isolate Edwardsiella tarda SC002 isolated from hatchlings of the Siamese crocodile. The genome of SC002 consisted of one circular chromosome of length 3,662,469 bp with a 57.29% G+C content and four novel plasmids. A total of 3,734 protein-coding genes, 12 genomic islands (GIs), 7 prophages, 48 interspersed repeat sequences, 248 tandem repeat sequences, a CRISPR component with a total length of 175 bp, and 171 ncRNAs (tRNA = 106, sRNA = 37, and rRNA = 28) were predicted. In addition, the coding genes of assembled genome were successfully annotated against eight general databases (NR = 3,618/3,734, COG = 2,947/3,734, KEGG = 3,485/3,734, SWISS-PROT = 2,787/3,734, GO = 2,648/3,734, Pfam = 2,648/3,734, CAZy = 130/3,734, and TCDB = 637/3,734) and four pathogenicity-related databases (ARDB = 11/3,734, CARD = 142/3,734, PHI = 538/3,734, and VFDB = 315/3,734). Pan-genome and comparative genome analyses of the complete sequenced genomes confirmed their evolutionary relationships. The present study confirmed that E. tarda SC002 is a potential pathogen bearing a bulk amount of antibiotic resistance, virulence, and pathogenic genes and its open pan-genome may enhance its host range in the future.


Introduction
Edwardsiella tarda is a versatile bacterium that infects a broad range of organisms such as fish, amphibians, reptiles, birds, and mammals (1)(2)(3). Edwardsiella tarda infections show a broad spectrum of clinical manifestations in various organisms. For instance, Miyazaki and Kaige (4) reported necrotic lesions on multiple organs of Tilapia (Tilapia nilotica), Koeboelkuti et al. (5) reported multiple subcutaneous abscesses on grass snakes, while White et al. (6) reported hemorrhagic enteritis of pelicans. Some studies also observed septicemia and bacteremia in farmed hatchling crocodiles (7,8), and meningitis, .
From the economic point of view, the Siamese crocodile (Crocodylus siamensis) is a widely bred species in the South China region. A previous examination showed that immature Siamese crocodiles are vulnerable to genetic abnormalities, pilling up and asphyxia, mouth sores, skin diseases, gastropathy, and fungal and bacterial infections. Among bacterial infections, E. tarda was considered the main Edwardsiella species of septicemia among hatchlings and was controlled by the antibiotic "oxytetracycline" (7). However, later a new outbreak of Edwardsiellosis was reported and its subsequent investigation showed tetracycline-resistant E. tarda isolates (8). In the present study, from the aforementioned outbreak, a lethal isolate "E. tarda SC002" was subjected to wholegenome sequencing (8) and their genetic properties, virulence factors, invasive properties, and drug resistance were explored. The present study provides genomic insights into E. tarda, which will be indirectly helpful for candidate vaccine preparation against outbreaks of Edwardsiellosis in the aquaculture industry.

Materials and methods
Bacterial isolate and DNA extraction Edwardsiella tarda SC002 was previously isolated from hatchlings of a diseased Siamese crocodile (8) and grown in BHI broth at 28 C overnight. After further sub-culturing up to the 40th generation, the bacterial genomic DNA from the exponential growth phase was extracted by the SDS technique. The extracted DNA was then subjected to 1% agarose gel electrophoresis for detection purposes and measured by a Qubit R 2.0 Fluorometer (Thermo Scientific).

Whole-genome sequencing procedures
The whole genome was sequenced using the Nanopore PromethION platform and the Illumina NovaSeq PE150 platform (Novogene Beijing, China). For sample preparation, 1 µg of DNA per sample was used as input material. According to the manufacturer's guidelines, sequencing libraries were prepared using NEBNext R Ultra TM DNA Library Prep Kit for Illumina (NEB, USA). In brief, samples were sliced up to 350 bp by the sonication technique. The fragments were then endpolished, poly A-tailed, and ligated to a full-length adaptor for Illumina sequencing with further PCR amplification. Eventually, the amplified products were purified (AMPure XP system) and analyzed via Agilent2100 Bioanalyzer and quantified by realtime PCR. A hybrid assembly was generated using Unicycler software on short-read Illumina PE150 sequencing and longread Nanopore sequencing data. The distribution of sequencing depth was counted, and the assembled sequence was specified as a chromosomal or a plasmid sequence according to size and alignment. The assembled sequence of whether it is a circular or a linear genome was also checked.

Genome components' prediction
Various important components of the genome, including (CDSs) coding DNA sequences, repetitive sequences, non-coding RNAs (ncRNAs), genomic islands (GIs), transposons, prophages, and clustered regularly interspaced short palindromic repeat (CRISPR) elements, were predicted. The GeneMarkS package was used for retrieval of the associated coding gene. The RepeatMasker for the prediction of interspersed repeat sequences and the Tandem repeat finder for the detection of tandem repeats (22) were applied. tRNAscan-SE and rRNAmmer were used for the prediction of transfer RNA (tRNA) and ribosomal RNA (rRNA) genes, respectively (23,24). The conserved small RNAs (snRNAs) were predicted by BLAST against the Rfam database (25). The IslandPath-DIOMB and the transposonPSI programs were applied to predict GIs, and transposons based on BLAST sequences, respectively (26). Furthermore, PHAST and CRISPRFinder were used for the identification of prophages and CRISPR elements, respectively.
. /fvets. . seven databases. The secretory proteins and membrane proteins were predicted by the SignalP database and TMHMM Server, respectively. In addition, type secretory proteins (types I-VII) were predicted by the EffectiveT3 tool. Gene clusters for secondary metabolism were evaluated by the antiSMASH tool (28). For the assessment of pathogenicity, the Pathogen-Host Interactions (PHI), the Virulence Factors Database (VFDB), and the Antibiotic Resistance Genes Database (ARDB) were used. Carbohydrate-active enzymes were identified by the Carbohydrate-Active enZYmes Database (CAZy).

Phylogenomics and comparative genomics analysis
A genomewide comparison of the genus Edwardsiella members and species from related genera in the family Enterobacteriaceae, including the Escherichia coli (E. coli) strain 97-3250 (NZ_CP027599) and the Salmonella enterica strain FDAARGOS_878 (NZ_CP065718), was performed by Mauve genome alignment using the MegAlign_17 tool. A phylogenetic tree was constructed using the maximum-likelihood (ML) technique with the statistical support of 1,000 bootstrap replicates. Further analysis for carrying out comparative genomics was based on average nucleotide identity (ANI) values. The ANI values between the two genomes were calculated by the ANI calculator of Kostas lab (29) and visualized by a heatmap generated by the package ggplot2 in the statistical software R (version 2021-09.1+372). PGAweb, an online server, was used to estimate E. tarda pan-genome; however, incomplete genomes were excluded from the analysis (30).

Results
General characteristics of the whole-genome sequence Edwardsiella tarda SC002 is comprised of a circular chromosome of 3,662,469 bp with 57.29% of G+C content ( Figure 1). The predicted genetic components are 3,734 genes, 12 genomic islands (GIs), a CRISPR component with a total length of 175 bp, and 106, 37, and 28 tRNAs, sRNAs, and rRNAs, respectively. Furthermore, 7 prophages (4 on chromosome and 3 on different plasmids), 48 interspersed repeat sequences (including LTR and one unknown repeat), and 248 tandem repeat sequences (including TRF, minisatellite, and microsatellite DNA) were predicted (Table 1). In addition, the predicted 3,734 coding DNA sequences (CDSs) have an average length of 916 bp, representing 82.83% of the total genome.
pEtSC002-2 encodes a few Tra proteins (M, T, I, and Y domains), which indicate that it is a conjugative plasmid. However, plasmid partition proteins, such as SopA, SopB, and ParB, including plasmid replication initiator protein IncFII RepA, were encoded, indicating that it is an IncFII plasmid. There were no antibiotic resistance genes found in this plasmid; however, bacteriocin and colicin V were encoded which can act as efflux pumps conferring antibiotic resistance.
pEtSC002-3 was more unique than other plasmids observed in this study, a non-circular plasmid and one that encodes the maximum abundant number of Tra proteins (A, B, C, D, E, F, G, H, I, K, L, M, N, Q, T, U, V, W, X, and Y), indicating that it has a strong conjugation potential. Furthermore, plasmid partition proteins, such as ParA, ParB, and ParM, and a plasmid replication protein, DnaC of the IncFII RepA family, were encoded, indicating that it is also an IncFII plasmid.
pEtSC002-4 found in this study was a circular conjugative plasmid with various Tra (A, B, C, D, E, F, G, H, I, K, L, N, Q, T, U, and V) and Trb (B, C) proteins. The plasmid replication IncFII RepA protein family and ParA protein were found, indicating that it is an IncFII plasmid.

Genomic plasticity and genomic islands
The G+C content shows high variability in the E. tarda SC002 genome ( Figure 1). A significant part (213,631 bp; 5.17%) of the genome (4,131,220 bp) is comprised of genomic islands (GIs, n = 12), representing a complex structure of the genome. In addition, an adequate amount (n = 48) of interspersed repeat sequences (0.138% of 4,131,220 bp) and a variable number of tandem repeats (VNTRs) (n = 151, 0.435% of 4,131,220 bp) were detected. The GIs usually encoded hypothetical proteins, followed by the transposase, the integrase, the transcription regulator, transporter proteins, plasmid partition proteins, the terminase, phage proteins, and others ( Table 2). The seven prophages (5.68% of 4,131,220 bp) were discerned, of which four were on a chromosome (average GC% = 53.19) and three were on two plasmids (average GC% = 50.56). Furthermore, prophages 4, 5, and 6 overlapped on several CDS regions of GIs 1, 4, and 5, respectively, on the chromosome. A valuable amount of ncRNA (1.3% of 4,131,220 bp) was observed as tRNA (n = 106), rRNA (n = 28), and sRNA (n = 37). A CRISPR .
/fvets. .   Based on the orthology examination, the COG database predicted 2,947 genes (78.92%), and these genes were distributed into 23 functional categories (Figure 2A). According to the COG categorization, the five most rich annotated functions were amino acid transport and metabolism (284 genes), carbohydrate transport and metabolism (248 genes), energy production and conversion (232 genes), transcription (226 genes), and translation, and ribosomal structure and biogenesis (225 genes). Furthermore, 166 hypothetical genes were discerned, which may need to be explored by further studies. In addition, 16 genes for extracellular structures, 166 genes each for transposons and prophages, and one gene were found for RNA processing and modification (Supplementary Tables S1, S2; Figure 2A).

Gene functional analysis
The KEGG database annotation represents a total of 2,510 genes. Followed by the classification of the KEGG orthology (KO) database, the annotated genes were distributed into six categories: metabolism (1,663), environmental information processing (317), genetic information processing (190), cellular processes (175), human diseases (126), and organismal systems (39) ( Figure 2B). According to the orthology results, the most populated class was represented by metabolic pathways, global and overview maps from the "Metabolism" category, with 587 genes. The second abundant class was the biosynthesis of secondary metabolites from the "Metabolism" category (267 genes), followed by microbial metabolism in diverse environments (184 genes) and biosynthesis of antibiotics (175 genes) from the "Metabolism" category. In addition, two-component systems (TCSs) and ABC transporters were discerned with 138 and 123 genes, respectively, from the "Environmental Information Processing" category (Supplementary Tables S1, S2).
According to GO analysis, a total of 2,648 protein-encoding genes were annotated ( Figure 2C). The annotated genes were mainly categorized into 43 subfunctional items. These were further distributed into three major categories: biological process (22 subfunctions), cellular component (12 subfunctions), and molecular function (9 subfunctions). However, a total of 1,525 different genes, followed by 1,500, and 532 belonging to the cellular process, metabolic process, and localization, respectively, of the biological process category were annotated (Supplementary Tables S1, S2).
Transporter Classification Database (TCDB) annotated a total of 637 genes and categorized them into 7 classes and 18 subclasses. Class 3 has the highest number of genes followed by class 2 and class 1 with 209, 202, and 89 genes, respectively ( Figure 2F). The subclassification revealed that the highest number of genes belong to subclass 2.A, followed by 3. A, 1.B, 9.B, and 3.D with 198, 167, .

Pathogenicity/virulence factor analysis
To evaluate the pathogenicity of SC002, databases, such as ARDB, CARD, PHI, and VFDB, were searched. According to Antibiotic Resistance Genes Database (ARDB) analysis, 11 genes were annotated, of which two each of tetracycline, streptomycin, and sulfonamide resistance and 1 each of aminoglycosides, bacitracin, penicillin, fluoroquinolone, and chloramphenicol resistance were found (Supplementary Table S3). The Comprehensive Antibiotic Resistance Database (CARD) analysis annotated a total of 142 genes, in which macB was the most prevalent, with eight genes followed by five evgS and four .

E ector/secretory protein profiling
A gene cluster is a pool of genes on DNA encoding similar proteins. These proteins collectively perform a generalized function. Gene cluster size is inconsistent, from a few genes to some 100 genes (31). For secondary metabolite evaluation, there are five clusters encoded by the genome. Cluster 1 was identified as homoserine lactone that consists of 17 genes, Cluster 2 was betalactone with 18 genes, Cluster 3 was bacteriocin with 11 genes, Cluster 4 was thiopeptide with 21 genes, and Cluster 5 was siderophore with 11 genes (Figure 3). Furthermore, Clusters 1-4 were encoded by chromosome, while Cluster 5 was present on pEtSC002-4 ( Table 3). All of the aforementioned clusters have 100% homology with reference species, except Cluster 4, which has 33% homology. Upon secretory protein analysis, 209 gene-encoding proteins were predicted as signaling proteins. Furthermore, 129 genes were predicted for encoding T3SS proteins, of which 16 genes were encoded by all the plasmids.

Phylogenetics and comparative genomics analysis
For evolutionary relationships, SC002 and 10 strains of the genus Edwardsiella including two outgroup organisms were compared. A phylogenetic tree was constructed based on conserved aligned blocks, which shows a clearly divergent lineage. Four E. tarda strains were clustered together and one strain of E. tarda was branched with E. piscicida species (previously known as E. tarda) ( Figure 4A). To further explore the evolutionary relationships, ANI values were discerned. The ANI values were compared to estimate genomic differences and relatedness between two genomes. According to our results ( Figure 4B), the genomes of four E. tarda strains shared ANI values ranging from 99.34 to 99.45%. These values are above the threshold of 94-96% identity, which is usually considered a speciation edge (32). However, the strain ET_1 shared ANI values of 83% with E. tarda strains and 99.85% with E. piscicida strains (formerly known as E. tarda), which clearly shows that it is a distinct species currently classified as E. tarda.
Pan-genome analysis for six E. tarda strains ( Figure 4C), shows that the species have an open pan-genome. It is comprised of 4,222 cluster genes, distributed into 2,546 core cluster genes (60%), .

Discussion
Edwardsiellosis is a global issue for the aquaculture industry in the current era. Therefore, an understanding regarding the genetic makeup of Edwardsiella is needed that may be helpful in finding ways to overcome their outbreak in the aquatic species. Accordingly, this study isolated E. tarda SC002, a virulent strain from Siamese crocodile hatchlings in the Hainan Province of China, and their whole genome was sequenced to determine its pathogenicity, virulence, and niche-related properties. The SC002 genome has relatively higher coding and non-coding genes (tRNA, rRNA, and sRNA) than the other sequenced Edwardsiella species (such as KC_Pc_HB1, C07-087, and FL95-01) and is relevant to the rapid growth of the bacterium (33). The 5S rRNA gene was found to double in our study, which is similar to that mentioned in a previous study (34). Previously, the isolate SC002 was clustered with ATCC 15945 (8), indicating that it is a genotype of mammalian origin (13,35). The G+C content of SC002 was 57.29%, which is similar to a previous study where the G+C content of Edwardsiella genomes ranges from 56.8 to 59.80% (36).
Unlike the published Edwardsiella complete genomes having one or two plasmids, the SC002 genome contained four novel plasmids (34,(37)(38)(39). Subsequently, BLAST analysis shows that the pEtSC002-1 was identical to the S. enterica plasmid (NZ_CP037959.1), while pEtSC002-2, pEtSC002-3, and pEtSC002-4 were identical to the E. tarda KC_PC_HB1 plasmid (NZ_CP023707.1) (40,41). The antibiotic resistance determinants were only observed on pEtSC002-1 where the resistance gene patterns (sul1, sul2, cml, tetA, aph33, and aph6) were similar to those of EIB202 plasmid resistome (33). The β-lactam and fluoroquinolone resistance genes were additional to pEtSC002-1, indicating that this plasmid is more virulent than the latter plasmid. Three T3SS genes were annotated on pEtSC002-1, whereas plasmid pEIB202 encodes an incomplete set of T4SS genes which may play a role in horizontal gene transfer (HGT) (34). However, the detection of two prophages on pEtSC002-1 may enhance its capability for producing genetic exchange among bacteria. Furthermore, all plasmids share conjugative plasmid features and encode for Tra, replication initiation, and plasmid partition proteins, which shows that these plasmids might belong to the IncP plasmid that was capable of undergoing replication and stable inheritance in a wide variety of Gram-negative bacteria (42).
According to our study, 12 genomic islands (GIs) were detected in SC002, whereas 11 in E. tarda FL6-60, 24 in E. piscicida, and 31 in E. ictaluri, which shows that the taxawise GIs are frequent as E. tarda < E. piscicida < E. ictaluri (34). However, 5 out of 12 GIs of SC002 encode for T3SS proteins, whereas in EIB202, two GIs: GI7 and GI17 encode for T3SS and T6SS, respectively (34), which shows that these GIs contribute to the pathogenicity of SC002.
The analysis of the SC002 genome suggests that it has the ability to deal with environmental changes due to the presence of gene regulation systems, such as two-component systems (TCSs) and quorum sensing systems. In the genome of SC002, a total of 138 TCS genes and 50 quorum-sensing genes were detected, which were more than those of the previously reported E. piscicida (34). Usually, TCS genes are adjacent; however, similar to a previous study (34), the barA/UvrY, arcA/B, and cheA/B/Y, TCSs in the present study are unusually organized. The presence of some wellknown TCSs explored previously in E. tarda, such as PhoP/Q (43) and QseB/C (44), might have a role in the pathogenicity and virulence of SC002.
Edwardsiella tarda is capable of surviving and persisting intracellularly in phagocytes followed by systemic infection (45). To resist phagocyte-mediated killing, various essential strategies, such as enzyme production including CAT, POD, and SOD, have been implicated by E. tarda to neutralize reactive oxygen species (ROS), while TTSS and T6SS contribute to invasion and subversion of the host cells (46). In the present study, 129 T3SS genes were .
A previous study revealed the mechanism of E. tarda host adaptation and interspecies discrepancies within the species of the genus Edwardsiella (35). Phylogenomics and pairwise ANI Consequently, the pathogen has a chance to expend its host niches in the future. Further study is needed to experimentally prove the function of unique genes in host specificity and evolution. E. tarda is a well-known aquatic pathogen affecting the aquaculture industry worldwide. Its target is not only to attack high-value fish species such as turbot but also to inflict injuries on birds, reptiles, and mammals. Therefore, strict measures to control the pathogen and proper vaccine candidate development are needed.

Conclusion
The whole-genome sequence of SC002 was characterized as highly virulent and multidrug resistant (MDR). A thorough examination of the genome sequence shows that the pathogen has an array of drug-resistant genes on pEtSC002-1. The conjugative and prophage determinants of the observed plasmids further considered that the contents of the genome are partly structured in various aquatic ecological niches during its life cycle. An adequate amount of TCSs' quorum sensing and T3SS genes that showed the diverse nature of the pathogen were confirmed. By exploring the understanding of the pathogenesis, virulence, and host specificity of the organism and by adopting the approach of "reverse vaccinology", this study lays the foundation for candidate vaccine development.

Ethics statement
The animal study was reviewed and approved by Hainan University Laboratory Animal Care Committee (HULACA180703). Written informed consent was obtained from the owners for the participation of their animals in this study.