uspA gene-based phylogenetic analysis and antigenic epitope prediction for Escherichia coli strains of avian origin

Pathogenic Escherichia coli (E. coli) is responsible for various local and systemic infections in animal and human populations. Conventional methods for the detection and identification of E. coli are time-consuming and less reliable for atypical strains. The uspA gene has been widely used as a target for the detection of E. coli. The present study was aimed at phylogenetic analysis of the uspA gene sequences to determine the evolutionary relationships between the strains and other members of the Enterobacteriaceae family. In addition, the unique differences in the sequences of the current study with Salmonella and Shigella species were tested using Tajima’s molecular clock test. Antigenic epitope prediction was performed to locate the B-cell epitope region of the UspA protein. Two E. coli isolates of avian origin and strains from the National Center for Biotechnology Information (NCBI) database were used for prediction. The Immune Epitope Database (IEDB) server, Bepitope, ABCpred, SVMTrip, and ElliPro server were used to identify B-cell epitopes. The 3D structure was predicted using SWISS-MODEL. Phylogenetic analysis of the isolates from the current study revealed that both OM837340 and OM837341 sequences from the current study had maximum nucleotide homology (nt) of 99.87%–100% with E. coli isolates and minimum nt homology of 84.08% with Salmonella enteritidis and S. Hissar. The isolates in the current study had a homology of 98.87%, while the homology with Shigella species was 99.25%. Seven silent mutations were observed in the coding region of the UspA protein of ECO9LTBW (current study). Modeling of the UspA protein revealed a maximum homology of 67.86% with the Protein Data Bank in Europe (PDBe), also validated by the Ramachandran plot. No significant differences were found in the coding regions of uspA of Salmonella, Shigella, and E. coli with Tajima’s test. For the E. coli isolates, a total of 24 linear B-cell and seven discontinuous epitopes were predicted using in-silico analysis. When the results of the predicted peptides were compared, two peptides, namely ARPYNA and YSDLYTGLIDVNLGDMQKRISEE, were found suitable candidates. In conclusion, the uspA gene appears to be conserved among E. coli isolates and can be used for molecular detection.


Introduction
Escherichia coli (E.coli) is a diverse group of bacteria belonging to the Enterobacteriaceae family and is associated with various diseases in humans and animals.E. coli is responsible for septicemia, peritonitis, meningitis, abscesses, and urinary tract infections (UTI) in humans (1).In animals, E. coli is associated with a variety of infections including metritis, mastitis, septicemia, neonatal diarrhea, and UTI (2).Avian pathogenic E. coli (APEC) causes avian colibacillosis responsible for huge economic loss to the poultry industry (2).The 77 min-long E. coli chromosome contains the 435 bp-long uspA gene, which encodes the UspA protein (positions 54,079 to 54,878, GenBank accession number U00039).
The genomes of bacteria, fungi, archaea, protozoa, and plants all have a conserved set of proteins belonging to the superfamily of universal stress protein (Usp) (3,4), but most of their biochemical roles and biological functions remain unknown (3).Survival of E. coli during cell growth, adhesion, and motility depends on the uspA gene, which is one of the six usp genes found in this organism (4).The uspA gene has been used worldwide to identify and confirm E. coli (2,(5)(6)(7).Stressors like heat shock, nutrient depletion, osmotic pressure, toxic substances, etc. promote the synthesis of Usp protein (8).
Peptide-based diagnostic tools are increasingly used for both human and animal diseases (9).If point-of-care diagnostics is available for treating certain infections, swift intervention and therapy could be possible.Further research and studies are needed for the development of peptide-based diagnostic tools, given the wide variety of pathogenicity and pathogenicity characteristics of APEC.
In the present work, uspA gene sequencing and phylogenetic analysis were used to characterize E. coli isolates grown from chickens with colibacillosis and their environment.To propose a peptide-based diagnostic test or vaccination, the identification of B-cell epitopes (antigenic regions that activate B-cell response) is a crucial first step.The uspA gene was chosen as it is conserved among different E. coli strains and is widely used to identify E. coli using PCR making it an ideal target for antigenic epitope studies in these bacteria (10).In addition, the gene can elicit an immune response and is critical for bacterial survival because of its involvement in stress response pathways (3).
Several online resources are accessible for the prediction of discontinuous and linear (3D or continuous in sequence) antigenic epitopes (11).The main goal of the current work is to phylogenetically analyze the uspA gene, model the UspA protein, and then predict the antigenic epitope of the protein to better understand UspA and provide a basis for further research.This information will be used as a basis for future applied research.

Source of samples
The samples used in the current study were from our previous study (2).Two avian-origin E. coli isolates viz., APEC41LFB isolated from the liver of the colibacillosis-affected (dead) bird and ECO92LTBW isolated from litter material of poultry shed were used.Details of the isolates used in the current study, including consent and ethical approval requirements can be found in the previous study (2).

Sequencing and phylogenetic analysis
The PCR product, i.e., uspA gene (884 bp), was amplified using primers described by Osek (6).The PCR products were purified using a Gel extraction kit (Qiagen, Germany), and uspA-specific primers were used as the sequencing primers (AgriGenome Labs Pvt. Ltd., India).The sequences were then submitted to GenBank and accession numbers were assigned.The ClustalW algorithm was used in the MEGA 11.0 program to align the nucleotide (nt) sequences of uspA with other sequences acquired from the NCBI database.The evolutionary relationship was deduced by building a phylogenetic tree using the maximum likelihood approach, the Kimura-2 parameter (K-2) model, complete deletion for missing data, and a rate difference of five categories in MEGA 11.0 (12,13).Tajima's molecular clock theory was examined to determine if there was any notable divergence between the sequences (14).After nucleotide alignment, the deduced amino acid sequences were captured and further processed.BioEdit was used to generate the dot plot for the nucleotides within the coding region (15).

Molecular modeling and protein structure assessment
The homology-modeling server SWISS-MODEL (16), was used for modeling of UspA protein.A three-dimensional model for UspA protein was created utilizing the list of 48 templates.The geometrical properties of the modeled protein structures were evaluated using qualitative model energy analysis (QMEAN).The models with the greatest QMEAN values were selected.For structural validation, the Ramachandran plot for the models was generated using MolProbity (17).

Peptide comparison and selection
B-cell epitopes predicted by different tools were analyzed based on the scores of four models, namely the antigenicity method of Kolaskar and Tongaonkar, the Emini surface accessibility prediction method, the beta-turn method of Chou-Fasman, and the hydrophilicity method of Parker (27,28).The peptides with scores above the threshold for each method were selected as the most suitable candidates.

Structure-based epitope prediction
The online tool ElliPro (found at: http://tools.immuneepitope.org/tools/ElliPro/iedb input) was used to predict discontinuous epitopes from 3D structures (pdb format) of proteins based on solvent accessibility and flexibility (29).The input files for the APEC41LFB were sent separately to the server in pdb format, with the lowest value set to 0.7 and the maximum distance set to 6 Å.

Ethical statement
For this study strains from our previous study (2) were used.Therefore, approval was not required from the Institutional Ethics Committee.

Results
Two E. coli isolates (one pathogenic and one non-pathogenic) were used in the current study.Isolation and identification of E. coli isolates used in the study were based on cultural, morphological, VITEK2, and uspA gene amplification by PCR.The resulting amplicon of 884 bp, and was further sequenced by Sanger sequencing.The nucleotide sequences of the uspA genes from pathogenic (APEC41LFB) and non-pathogenic (ECO92LTBW) E. coli were deposited at GenBank and assigned accession numbers OM837340 and OM837341, respectively.

Phylogenetic analysis
The sequenced information of OM837341 and OM837340 was analyzed for phylogeny and compared with the existing databases of various E. coli strains and other species to determine possible changes in nucleotide and amino acid composition.The sequence OM837341 (non-pathogenic isolate) showed a maximum of 100 percent nucleotide (nt) homology with E. coli sequences CP098219.1,AE014075.1,and OX030701.1,whereas a minimum nt homology of 85% was observed with Salmonella enteritidis and S. Hissar sequences (CP084532.1 and CP088138.1).Sequence OM837340 (pathogenic isolate) revealed a maximum homology of 99.87% with E. coli (CP115173.1 and CP097884.1)and a minimum of 84.75% homology with S. enteritidis and S. Hissar (CP084532.1 and CP088138.1).The two strains in the current study showed homology of 98.87%-99.25% with Shigella species (Figure 1).Seven silent mutations were observed in the nucleotide sequence OM837341 when compared to the standard E. coli K-12 sequence (CP097884.1)(Figure 2).

Tajima's molecular clock hypothesis test
The test of Tajima's molecular clock hypothesis was performed for sequence A (OM837341) and sequence B (OM837340) with sequence C (CP084532.1 Salmonella enterica subsp.enterica serovar Enteritidis) and sequence D (CP055124.1 Shigella flexneri) used separately to determine evolutionary divergence in the coding region of uspA gene.
With S. enteritidis serving as the outgroup, 297 identical sites, three divergent sites, two distinct differences in sequences A and B, and 390 in sequence C were detected (p = 1.00).When Shigella was used as an outgroup, 785 identical sites, 0 divergent sites, 5 unique differences in sequence A, 3 unique differences in sequence B, and 2 unique differences in sequence D were detected (p = 0.479).The null hypothesis of an equal rate in the different lineages was not rejected.

Protein structure
Homology modeling of UspA protein using the SWISS Model revealed a maximum identity of 67.86% for the current study protein with Protein Data Bank in Europe (PDBe): 1jmv chain A (x-ray diffraction 1.85 Å).The server modeled the structure with 99.92% confidence using the homology and the PDBe: 1jmv chain A as a template (Figure 3A).A Ramachandran plot was generated and no unexplained outliers were found.This indicates that the structure was of high quality (Table 1).The QMEAN score was −0.73.The Ramachandran plot showed that the modeled structure had a 97.05% favorable zone, which served as validation (Figure 3B).

Target UspA protein sequence and epitope prediction
The consensus sequence was generated by aligning different retrieved sequences using BioEdit software.The sequence obtained was 144-mer in length: "MAYKHILIAVDL SPESKVLVEKAVSMARP YNAKVSLIHVDVNYSDLYTGLIDVNLGDMQKRISEETHHALTEL STNAGYPITETLSGSGDLGQVLVDAIKKYDMDLVVCGHHQDFW SKLMSSARQLINTVHVDMLIVPLRDEEE." Similarity between the current study strains and 100 other sequences available in NCBI was examined using the basic local alignment search tool protein (BLASTp).Since the protein sequences of the all the compared isolates were completely identical, additional parameters could be inferred from any single sequence.According to the method of Kolaskar and Tongaonkar, the 144 amino acid sequence contained six antigenic peptides.The antigenic peptides had three octapeptides ranging from 8 to 11 amino acids in length.Table 2 lists the peptide lengths, their sequences, and their position along the entire length of the sequence.Figure 4 illustrates the anticipated peptides of E. coli isolates based on antigenic propensity (y-axis) and sequence position (x-axis).The average antigenic propensity value was 1.052, with minimum and maximum ranging from 0.934 to 1.216, respectively.
The beta-turn regions in UspA protein were predicted using Chou and Fasman Beta turn prediction and two regions from 75 to 81 (TNAGYPI) and 86 to 92 (SGSGDLG) were identified as constant B turn regions out of seven epitopes above the threshold.Figure 5A shows the predicted peptides for the UspA protein based on sequence position (x-axis) and surface probability (y-axis).The Emini surface accessibility prediction result predicted two sequences, 27ARPYNA32, where 29P is the surface residue or a residue that is more than 20 Å from water, and 57DMQKRISEETH67, where 59Q is the surface residue.However, the maximum surface probability value calculated by the tool was 4.604 from amino acid position 100 to 105.According to Figure 5B, peptide 106LVVCGH111 (at amino acid positions 106 to 111) had a minimum surface probability value of 0.087.Figure 5C shows the graphical representation of the results anticipated by the flexibility scale approach developed by Karplus and Schulz.The highest flexibility score was 1.108 (heptapeptide: 85 to 91 amino acids).According to results, the sequence of heptapeptide was 85LSGSGDL91, with 88S serving as the surface residue.The more organised portion was assigned a minimum score of 0.901 portion (Figure 5C).Parker's hydrophilicity prediction predicted linear epitopes based on the hydrophilicity of amino acid residues.The epitope region was located by limiting the window to seven amino acid residues.The results showed that the median value was 1.087, while the maximum and minimum ranged from 4.800 to −4.000.All results that met or exceeded the criteria, as shown in Figure 5D, were likely hydrophilic.

Epitope prediction using other tools
The results obtained using various other tools namely BepiPred linear epitope prediction or BepiPred, BepiPred 2.0 linear epitope prediction, ABCpred, and SVMTrip are summarized in Table 3.Using BepiPred and BepiPred 2.0, residues with a score above the cut-off value (0.5 as the default value) were predicted to be part of the epitope (Figures 6A,B).The trained recurrent neural network result was used by the ABCpred to rank the anticipated B-cell epitopes.The probability of a peptide being an epitope increased with the peptide score.To improve prediction performance, tri-peptide similarity and propensity scores (SVMTriP) were combined with support vector machine (SVM).Two 12-mer peptide sequence namely TGLIDVNLGDMQ and AYKHILIAVDLS, were predicted using SVMTriP.

Comparison and suitable peptide candidate
After analysis of different peptides predicted by various methods two peptides namely ARPYNA (6-mer) and YSDLYTGLIDVN LGDMQKRISEE (23-mer) were found as the most suitable peptide candidates for UspA.The peptides used in the comparison along with the scores are listed in Table 4.

Structure-based epitope prediction
Epitopes were predicted based on protein structure using the ElliPro website tool.Three discontinuous peptides were selected for UspA chain B and four for chain A (score >0.7).The maximum probability of a discontinuous epitope was calculated as 87.7% (protrusion index; PI score: 0.877).Table 5 lists the residues implicated in discontinuous epitopes along with their sequence location, number of residues, and scores, whereas Figure 7 depicts their placements on 3D structures (A through G).

Discussion
APEC is a serious concern to the poultry industry in terms of economic losses, poor welfare of birds, and antimicrobial resistance.Diagnostic and characterization methods for APEC need to be updated with rapid or point-of-care (POC) testing.A total of seven silent mutations were observed in the sequence of non-pathogenic E. coli isolate (ECO92LTBW) of the current study.Although these mutations did not alter amino acid sequences of encoded protein directly, they might influence exonic splicing efficiency resulting in change in mRNA processing of genetic information (30).The sequence OM837340 and OM837341 revealed a maximum nucleotide (nt) homology of 99.87%-100% with E. coli isolates and a minimum nt homology of 83.73%-84.08% with Salmonella species, consistent with previous findings (7).The current study strains exhibited higher closeness (99.25% nt homology) to Shigella species, also observed previously for uspA gene (31).The DNA hybridization studies in past have also observed that at the species level Shigella and E. coli are taxonomically indistinguishable (32).On the other side, the uspA of Shigella species and those of Salmonella (retrieved sequences) shared a homology of 85%.Similarly, Wang et al. (33) and Wei et al. (34) proposed that to precisely reflect the evolutionary relationship, a new nomenclature may be decided for Shigella and E. coli.However, current and abovediscussed studies have not employed alternate immunogenic genes for differentiation of aforesaid species.So, further studies using whole genome sequencing may be directed to precisely understand the relatedness of both the species.Tajima's molecular clock or rate test also signifies the same as there were no divergent sites found on aligning the uspA sequences of Shigella spp.The null hypothesis that the amount of evolutionary change in two lineages (both E. coli isolates and Shigella) is equal was not rejected and thus suggests that molecular evolution of uspA gene might be occurring at an approximately uniform rate over time.The lower homology among current study isolates as compared to retrieved sequences is also surprising and might be due to different nature (pathogenic or non-pathogenic) or different niche as OM837341 (ECO92LTBW) is of faecal origin and is non-pathogenic and OM837340 (APEC41LFB) is pathogenic isolate recovered from dead bird.This aspect may be further investigated by employing isolates of different origin and pathogenicity.The current study also suggests that entire gene   sequence including coding and non-coding region should be used for the sequence homology as protein coding region is almost conserved among different strains of E. coli and similar suggestions were also made by other researchers (31).The uspA gene has been targeted by various researchers as an identification marker for the E. coli using molecular methods (5,6,31).At a confidence level of >90%, the core model was extremely accurate with 2-4 Å RMSD from the native structure, according to the SWISS Model.The current study sequence and template employed shared 67.86% identity, which implies an extraordinarily high accuracy model.The MolProbity score for current study's UspA protein is 1.31.This score is a combined protein quality score which indicates the crystallographic resolution at which   a good-quality model is anticipated (17).The low number indicates a good quality model of protein structure.Kolaskar and Tongaonkar's technique predicted six antigenic peptides for UspA protein.This method uses the physicochemical characteristics of amino acid residues that frequently appear in experimentally determined antigenic epitopes.According to earlier studies, this approach has an experimental accuracy rate of 75% (20).Our findings are in consensus with Mishra et al. (7), where the researchers predicted seven peptides for UspA protein using diarrheagenic E. coli isolates.Based on high score among the predicted peptides, "KHILIAVDLS" might be a potential candidate for the diagnostic assay or vaccine development.This finding is also supported by the previous investigation by Mishra et al. (7).The Chou-Fasman algorithm predicted two constant beta turn regions at 75-81 and 86-92 amino acid positions.The immunodominant region of a protein is located in beta turn region and protein's beta twists are often hydrophilic and surface-accessible, and they significantly contribute to the development of antigenicity (18).However, no comparable studies exist for epitope prediction using uspA in E. coli, beta turn prediction is reported as the most frequently used propensity (27) out of several propensities or models in use.
The surface probability of a hexapeptide greater than 1.0 (threshold), as suggested by Emini et al. (19), indicates that the sequence has a higher likelihood of being detected on the surface.The highest surface probability that could be obtained for the sequence 100KKYDMD105 was 4.604, however, its comparison with other propensities (beta turn, hydrophilicity, etc.) might not prove it a suitable candidate.However, two other peptides namely 27ARPYNA32, and 57DMQKRISEETH67 might prove better candidates.The epitope with more surface probabilities may prove a crucial candidate for an E. coli peptide vaccine.Using Karplus and Schulz's flexibility scale, for peptide 85LSGSGDL91, the maximum flexibility score was 1.108.This approach generates the B factor or atomic temperature factor which represents vibrational motion of atoms within structure.A lower B factor value represents a wellorganized structure, while a greater value indicates a fluid structure (20).The creation of an E. coli diagnostic assay may benefit from the anticipated flexibility of the UspA protein.The Parker's prediction of an amino acid's hydrophilicity predicted seven regions which are based on retention duration of peptide on a reversed phase column in high performance liquid chromatography (HPLC) and values that fall within and above the threshold are probably hydrophilic (22).
BepiPred 2.0 predicted two whereas BepiPred predicted three peptides as suitable epitope regions.BepiPred 2.0 predicts B-cell epitopes from a protein sequence using a random forest algorithm trained on epitopes and non-epitope amino acids that it identifies from crystal structures, whereas BepiPred predicts the location of B-cell epitopes using a hidden Markov model and a propensity scale method (23,24).These models were also previously used by Elhag The results predicted by (A) BepiPred (Th = 0.350) and (B) BepiPred 2.0 (Th = 0.50).The area in yellow depicts the score above the threshold and green depicts the score below the threshold value.28) for prediction of antigenic epitopes for proposing a vaccine against Pseudomonas aeruginosa.Similarly, another tool ABCpred predicted six B cell epitopes.This tool works on the scores that the trained recurrent neural network assigned to each peptide.The greater the peptide's score, the more likely it is to be an epitope.Antigenic epitope prediction by ABCpred is accurate up to 65.93% (25).
Although 24 different B-cell epitopes were predicted by various tools and/or models used in the study, the results were compared based on the scores of four methods as described in the literature (27,28).The analysis revealed two most suitable candidates namely APYRNA and YSDLYTGLIDVNLGDMQKRISEE.These predicted peptides may be utilized for further investigations and development of diagnostic assays.
Eight discontinuous peptides were predicted in the current study using ElliPro.ElliPro distinguishes epitopes based on proteinantibody interactions.It establishes a correlation between a protein's solvent accessibility, antigenicity, and flexibility.The score, commonly known as the protrusion index (PI) value, displays the proportion of protein atoms engaged in antibody binding that protrude beyond the molecular bulk (ellipsoid).For the current study, 87.7% was shown to be the highest chance of a discontinuous epitope (PI score: 0.877).In comparison to other tools, ElliPro has been tested on common data of conformational antibody-protein complexes and is more userfriendly (29).There is no comparable studies using UspA protein to validate the results obtained.However, this aspect may be further investigated to draw conclusive evidence and/or develop an assay or diagnostic method.
Moreover, contact binding and epitope mapping of immunodominant epitopes on the universal stress protein could define functional characteristics with regard to pathogenicity.UspA family appears to be crucial for the bacteria to support cellular defense mechanisms and oxidative stress (3,4).Many immunological assays can target the uspA gene, and it can also be considered as a potential candidate for multiple subunit vaccines and/or as a diagnostic marker.Based on the degree of methodological correctness, the predicted peptides might be a putative candidate for use as an epitope in diagnostics.The inclusion of other peptides of virulence factors discovered in pathogenic E. coli may increase the efficacy of these as a suitable epitope for the development of multiple subunit vaccines.Also, by using lateral flow assay techniques, these peptides could serve as an antigen for the detection of certain antibodies against E. coli.

Conclusion
The two promising applications of B-cell epitope prediction are development of vaccine and diagnostics.Antigenic epitope-based peptides appear to be appealing candidates for diagnostic, preventive, and therapeutic vaccinations.The current study utilizes UspA protein to predict the antigenic epitopes and thus select the most suitable candidates.A total of 24 linear epitope peptides and seven discontinuous peptides were precited using various online tools which on further analysis yielded two most suitable candidates viz., APYRNA and YSDLYTGLI DVNLGDMQKRISEE.These peptides might prove as potential candidates for peptide based diagnostic assay for E. coli.Further applied research might be directed to develop these anticipated peptides into diagnostic assay/method.The phylogenetic analysis and molecular clock hypothesis of the uspA gene's coding region in the current study showed that Shigella and E. coli share huge similarities concerning uspA gene.The 3D structure of UspA protein validated in the current study may be utilized further for molecular docking or protein interactions studies.

FIGURE 1
FIGURE 1Phylogenetic analysis using the uspA gene's nucleotide sequences.The phylogenetic relationship was established using MEGA 11.0 software with the maximum likelihood method and the Kimura-2 parameter (K-2) model.Texas lists the GenBank accession numbers of the compared strains.Each tree node displays the bootstrap values (%).Black circles denote the present research strains (OM837340) and (OM837341).

FIGURE 2
FIGURE 2Dot plot of coding region of current study isolates with other strains from NCBI constructed using BioEdit version 7.0.E. coli strain K-12 MG1655 was used as the standard strain to detect any changes in the nucleotide bases.The changes in the bases of current study sequence OM837341 are presented in rectangular boxes.

FIGURE 3 (
FIGURE 3 (A) 3D structure of UspA protein and (B) Ramachandran plot of UspA protein.

FIGURE 4
FIGURE 4 Graphical representation of the results predicted by Kolaskar and Tongaonkar antigenicity method [Threshold (Th) = 1.052].The area in yellow depicts the score above the threshold and the green depicts the score below the threshold value.

FIGURE 5
FIGURE 5 Results of prediction methods (A) Chou and Fasman beta-turn (Threshold = 0.951); (B) Emini surface accessibility (Threshold = 1.00); (C) Karplus and Schulz (Threshold = 0.984); (D) Parker's hydrophilicity (Threshold = 1.087).The area in yellow depicts the score above the threshold and the green depicts the score below the threshold value.

FIGURE 7
FIGURE 7The figure representing discontinues epitopes (A-G) of UspA protein of E. coli.The bulk of the UspA protein is represented by green sticks for chain A and blue sticks for chain B, while the epitopes are shown as yellow surfaces (spherical balls).

TABLE 1
Results of the Ramachandran analysis using MolProbity.

TABLE 2
Predicted antigenic epitope peptides of E. coli for UspA protein with Kolaskar and Tongaonkar antigenicity method.

TABLE 4
The peptides with their Emini surface accessibility, Kolaskar and Tongaonkar antigenicity, Parker hydrophilicity, and Chou-Fasman beta turn scores.
a Peptides that passed all the four methods.

TABLE 5
Discontinuous antigenic epitopes of the UspA protein of E. coli predicted using ElliPro web-based tool.