Building comprehensive MS-friendly databases for proteomic analysis of bacterial species of unknown genetic background

In proteogenomic analysis of prokaryotes of unknown genetic background, merging different gene annotation from genomic data of all strains for a given species is a valuable strategy to help the characterization of the sample. It is also relevant for identification of important amino acid polymorphisms and validation of coding regions. We designed a bioinformatic tool which constructs fasta databases including conserved and unique sequences of strains of a given species. By using mass spectrometry data collected from 8 clinical strains from Mycobacterium tuberculosis, we checked protein identification performance of three sequence databases, one including all proteins from 65 sequenced strains; one built using our tool using the same 65 strains; and one using the assembly of model strain H37Rv. Finally, we built databases for 10 species with complete sequenced genomes and monitored features which are relevant for probabilistic-based protein identification by proteomics. We observed that as expected increase in database complexity correlates with pangenomic complexity. However Mycobacterium tuberculosis and Bortedella pertusis generated very complex databases even having low pangenomic complexity or no pangenome at all respectively. This indicate that differences in gene annotation is higher than average between strains of those species.


ABSTRACT
In proteogenomic analysis of prokaryotes of unknown genetic background, merging different gene annotation from genomic data of all strains for a given species is a valuable strategy to help the characterization of the sample. It is also relevant for identification of important amino acid polymorphisms and validation of coding regions. We designed a bioinformatic tool which constructs fasta databases including conserved and unique sequences of strains of a given species. By using mass spectrometry data collected from 8 clinical strains from Mycobacterium tuberculosis, we checked protein identification performance of three sequence databases, one including all proteins from 65 sequenced strains; one built using our tool using the same 65 strains; and one using the assembly of model strain H37Rv. Finally, we built databases for 10 species with complete sequenced genomes and monitored features which are relevant for probabilistic-based protein identification by proteomics. We observed that as expected increase in database complexity correlates with pangenomic complexity. However Mycobacterium tuberculosis and Bortedella pertusis generated very complex databases even having low pangenomic complexity or no pangenome at all respectively. This indicate that differences in gene annotation is higher than average between strains of those species.

INTRODUCTION
While genomic annotation approaches undertook considerable advances in the past years (1-3), correct gene prediction is still challenging even for prokaryotes. Particularly the correct assignment of coding open reading frames (ORFs), the presence and classification of pseudogenes, and very often observed is also differences in translational starting site (TSS) choices (4-6). Using amino acid sequence information, usually collected through bottom-up mass spectrometry (MS)-based proteomic approaches, has become an accurate manner to further validate and confirm proposed gene models, the so called Proteogenomics approach (for recent reviews see (7,8)).
In proteogenomics, researchers will often construct customized protein sequence databases which will be inspected against peptide sequence data collected by MS (9). Database customization is often achieved using two different strategies: if the genomic information of the prokaryote under study is known, a 6frame translation is routinely employed (10, 11); if genomic information is unknown, an alternative is to construct a database using ab initio gene predictions from related strains of the same species, taking into consideration variations caused by SNPs,indels,divergent TSS choice,etc (12,13). These approaches are not mutually exclusive, as gene annotation from related strains can be used to further optimize 6-frame translation approaches (14).
However, peptide identification in MS-based proteomics is often performed through probabilistic calculations between the observed peptide fragmentation pattern and theoretical data from a sequence database. Furthermore, database size is one of many parameters which can alter the search space and consequently the probabilistic calculations performed during peptide identification (15). Therefore, building databases using either 6-frame translations or sequence merging approaches will result in larger databases which, at some point, might become detrimental for the proteomic analysis (16). Particularly for approaches merging annotation from related strains, database size becomes a more evident issue as the amount of complete sequenced genomes available increases. Such issue might as well contribute differently depending on the species under study, for example, we observed in previous work that building such databases for Mycobacterium tuberculosis (Mtb) complex is not incremented heavily as new strains are considered (17), since genetic similarity between strains is high and suggest clonal expansion, and there is no reports of examples of horizontal gene transfer (18,19). But this will surely not be the case for species where a complex pangenome is observed, i.e, the genome of all strains of a given species contains only a set of genes from a larger pool of accessory genes for that species (20,21).
In addition, this wil be relevant as proteomic analysis will eventually become more routinely used to investigate bacterial communities (in similar manner as metagenomic analysis). Proper identification of peptides originated from more than one species will require merging of more databases, as already illustrated (22). It is then critical to address the impact of such approach in the size of customized databases used in MS-based proteomics, as well as in the performance for peptide identification.
We developed a computational tool which can merge genomic information from different strains, and create fasta databases following "MS-friendly" rules (13, 23). The tool defines unique annotated proteins as well as protein homologues across strains, adding all unique sequence information in the final database on a non-redundant manner. To test the approach, we performed a proteomic identification of 8 clinical strains of Mtb, using three databases: the routinely used laboratory strain H37Rv, and a concatenated database of all 65 available strains with or without processing by our tool. Our data shows an increase number of ORFs identifications when compared to single database search against the goldstandard strain, and similar performance when compared to a concatenated database without processing. Finally, to quantify database parameters, we then generated databases for ten different bacterial species with complete sequenced genomes, using increasing number of strains per round, and quantified the number of entries and unique peptide sequences for each species. As expected, species with large pangenomes such as Escherichia coli showed the larger database size increase per number of strains used, and database size in general correlated with pangenome size. The exceptions were Mtb, which has the 2 nd smaller pangenome size, yet it had the 4 th in database size increment rate, and Bortedella pertussis, which has no pangenome described and it was the 7 th in database increase, with a increment rate similar than Staphylococcus aereus which has a core genome corresponding to 34% of the total genetic pool of the species.

Species selection
Ten bacterial species were selected for database construction: Acinetobacter baumannii (89)

Script design and availability
The tool was designed in PERL (version 5.24.1) and is present as two modules: rand.pl) provides the sequence alignment and creates the outputs with unique entries and homologues; create_bd.pl process the homologues output, and create the final database and the log file. In house BLAST installation (version 2.7.1 or later) is required. All script outputs are saved as txt files in the data folder.

Data processing
Basically, to assign gene homology, our script initially gathers the protein sequence data from all strains and then performs pair wise alignments using BLASTP (27) through a Bidirectional Best Hit (BHH) method (28). The script performs in a manner that the strain with most number of protein entries is initially selected as the query dataset and consequently aligned to the remaining strains sequence datasets (subjects). Two sequences from different strains will be defined as homologues if: i) they are the best hit possible for all alignments performed; ii) sequence identity is equal or higher than 50%; iii) sequence similarity is equal or higher than 70%; and iv) sequence coverage is equal or higher than 70%. These have to be correlated on both directions (BHH) of the alignment.
When homologues are defined between query and subject strains, our script will proceed by indexing homologue IDs for later characterizations of polymorphisms. It will then define "unique" protein sequences from the query dataset (i.e., a sequence without a defined homologue in any of the subject strains), and add it to the partial version of the final database output (see Figure 1).
In parallel, all entries from the subject strains which were properly aligned to a query sequence will be removed from the respective strains datasets. A new round of alignment will be performed after excluding the query strain and selecting one of the previous subject datasets as a new query strain. This will be performed successively until all strains are used as query. At this point, the script will had then defined all "unique" protein entries from all datasets which did not aligned to any other annotated sequence under the selected parameters.
Regarding all indexed homologues, our script then defines the longest version of the protein as reference (not necessarily from same strain used as query for the BHH step above). The reference sequence is copied integrally into the partial version of the final output. The script performs an in silico trypsin digest of the protein sequences, and compare amino acid composition of all generated tryptic peptides. Peptides shorter than 7 or longer than 35 amino acids are excluded. Whenever different tryptic peptide sequences are observed in the nonreference dataset (due to differences of TSS choices, SAAs or indels which result in frame shifts), they are added into the final output under an artificially created protein entry (13). Amino acid changes result from poorly annotated sequences, where X or U are present in the sequence as an amino acid, are not considered.
Finally, a log file is created, reporting and classifying all differences observed, describing the amino acid changes and also in which strains those were observed.

Database analysis
To investigate how each dataset contributes to the final database size, for each species we constructed databases using 5, 10, 15, 30 and 65 randomly selected strains. This was performed a total of three times to decrease the chances that a randomly selected strain with very unique features might interfere with the final result. Each database had its MS search space size measured regarding number of entries and number of unique tryptic peptide sequences.

Mycobacterial Cell Culturing
Stock cultures of Mtb strains TB179, TB861, TB1430, TB1593, TB1659, TB1841, TB1945 and TB2666 were inoculated into mycobacterial growth indicator tubes and incubated until positive growth was detected using the Bactec 460 TB system (BD Biosciences). Approximately 0.2 mL was inoculated onto Löwenstein-Jensen medium and incubated over 6 weeks with weekly aeration until colony formation. Colonies were transferred into 20 mL of supplemented 7H9 Middlebrook medium (BD Biosciences) containing 0.2% (v/v) glycerol (Merck Laboratories), 0.1% Tween 80 (Merck Laboratories), and 10% dextrose, catalase. Once the culture reached an A600 of 0.9, one mL was inoculated into 80 mL of supplemented 7H9 Middlebrook medium and incubated until an A600 between 0.6 and 0.7 was reached. All steps were performed at 37 °C until Mtb cells in mid-log growth phase were used for whole cell lysate protein extraction.

Preparation of Crude Mycobacterial Extracts
Mycobacterial cells were collected by centrifugation (10 min at 2500 x g) at 4°C and resuspended in 1 mL of cold lysis buffer containing 10 mMTris-HCl, pH 7.4 (Merck Laboratories), 0.1% Tween 80 (Sigma-Aldrich), one tablet per 25 mL Complete protease inhibitor mixture (Roche Applied Science), and one tablet per 10 mL phosphatase inhibitor mixture (Roche Applied Science). Cells were transferred into 2 mL cryogenic tubes with O-rings, and the pellet was collected after centrifugation (5 min at 6000 x g) at 21 °C. An equal volume of 0.1 mm glass beads (Biospec Products Inc., Bartlesville, OK) was added to the pelleted cells. In addition, 300 µ L of cold lysis buffer including 10 µ L of 2 units/ml RNase-free DNase I (New England Biolabs) was added, and the cell walls were lysed mechanically by bead beating for 20 s in aRibolyser (Bio101 Savant, Vista, CA) at a speed of 6.4.
Thereafter the cells were cooled on ice for 1 min. The lysis procedure was repeated three times. The lysate was clarified by centrifugation (10,000 x g for5 min) at 21 °C, and the supernatant containing the whole cell lysate proteins was retained. Thereafter the lysate was filter-sterilized through a 0.22 our tool using the same 65 strains. MaxQuant analysis included an initial search with a precursor mass tolerance of 20 ppm which were used for mass recalibration (33). In the main Andromeda search precursor mass and fragment mass were searched with initial mass tolerance of 6 ppm and 0.5 Da, respectively. The search included variable modifications of oxidation (Met), N-terminal acetylation (protein).
Carbamidomethyl cysteine was added as a fixed modification. Minimal peptide length was set to 7 amino acids and a maximum of two miscleavages was allowed.
The false discovery rate (FDR) was set to 0.01 for peptide and protein identifications. In the case of identified peptides that are shared between two proteins, these are combined and reported as one protein group. Protein and peptide datasets were filtered to eliminate the identifications from the reverse database, and common contaminants.

RESULTS
Our tool was designed to use gene assembly data from Genbank from selected bacterial strain genomes to construct protein sequence databases which considered all possible sequence variations (SAP, TSS choice, for example) but on a non-redundant manner (i.e., without extensive entry usage for very similar sequences). On average, the assemblies used on this work had at least 910 genes (Chlamydia trachomatis, average value to all strains) to 6092 genes (Pseudomonas aeruginosa). The analyzed species were selected not solely due to higher number of strains with complete genome sequences available. They were also selected considering the complexity of their pangenomes. Nine of the selected species have pangenomes which can be very diverse (for example, Escherichia coli, 2,459 core genes to approximately 26,000 pangenomic genes) or just a fraction of the core genome (such as Chlamydia trachomatis and Mtb, which have only one accessory gene to every five core genes) (20, 21, 34). Figure 1 illustrates the tool workflow: briefly, a query strain is selected, and a pair wise comparison is performed with the remaining subject strains using a bidirectional approach. Unique entries from query strain (i.e., sequences which no other protein entries in subject strains had the minimal identity, similarity and 1 1 coverage percentage thresholds) are saved in the final database. Protein sequences across strains that share the required levels of identity and similarity are further processed. The longest version is used as a reference (regardless of which strain is originated from) and saved into the final database. Tryptic peptides between reference and non-reference sequences are compared. When a tryptic peptide from a non-reference sequence cannot be matched to any of the reference sequence (due to a different TSS choice or a SNP for example), a new protein entry is created in the final database. This new entry will contain all non-reference tryptic peptides that do not match the reference sequence.
Once this is finished, the query strain is excluded from the strain list, and all proteins from subject strains which were matched to a protein in the query strain (and therefore already processed to define polymorphisms) are excluded from the subject strains datasets. New pair wise comparisons will be performed using a new query strain. This process will be repeated until no strains are left to be used as query, or until no protein entries are left in yet-to-be selected query strains. A log file is also created, reporting all entries that were matched and classifying all tryptic peptides which were added to the final database, showing the type of event (TSS choice, SNP etc) and in case of SNP, the resulting amino acid substitution.
Since our group possessed unpublished MS data for clinical strains of Mtb, we optimized our tool using this data and databases created from 65 strains with complete genome for this species. The processed Mtb database created by the tool contained 15,996 protein entries, from those 4,506 are entries containing polymorphic peptides. The remaining 11,490 entries contain reference sequences from homologue comparisons and sequences which were unique to each strain. As the median value for a single Mtb strain is 4,056 annotated proteins, our data suggest that each of the 64 strains processed after the query strain incremented the database with approximately 114 unique sequences. Considering that Mtb genome sequences are highly similar between strains and its pangenome is relatively small, most of this increment should be a result of differences in genomic annotation. In comparison, a merely concatenated Mtb database using all sequences from the 65 strains would create a database with approximately 265,000 protein entries. Our MS dataset was then challenged for peptide/protein identification using three databases: one was the assembly of the strain H37Rv, which is the laboratory strain more extensively annotated (35)  "uncharacterized" sequences, such as the region in yellow boxes in Figure 2. We classified those as uncharacterized because, considering our tool does not compare genomic information of the strains yet, we can only assume that such changes in primary structure result from indels which could be short and lead to a frameshift, or are large enough to create a new peptide sequence and stop codon.
All polymorphic peptides and their identification features are shown in Supplementary Table I. Multiple TSS polymorphisms for the same protein were also observed even for the same strains. Figure  where the N-terminal HGVTEHGR is slightly overrepresented than MIIDLHVQR.
Finally, the DB1 results also included 5 polymorphic peptides from the artificial entries which could not be found in the log file created in parallel to the database.
Closer inspection of these peptides revealed that they all contain missing cleavage sites.  Figure S4, and as expected, the identified spectra has poor score and sequence coverage. Supplementary Table II lists all 5 peptides.
When comparing the results from DB1 and DB2 searches, the number of peptides identified was very similar: the DB1 search resulted in the identification of 33,501 tryptic peptides, while DB2 resulted in 34,102 peptides identified.
Therefore, our processed database identified 601 peptides less than DB2, a mere difference of 1.75%. This difference is very close to the 1% variation expected to from the probabilistic nature of the identification approaches used by MaxQuant (selected as a 1% allowed false-discovery rate), which means both performances should be considered very similar. In addition, a close inspection of those peptides indicated that a good number of the peptides identified in the DB2 search were peptides which should had been added on DB1 as non-reference peptides, but they weren't due to the parameters selected in our processing. For example, we did not consider possible miscleavages when comparing reference/non-reference homologues, the in silico trypsin digest generated peptides at 100% enzyme efficiency. Some of the identification was also non-reference peptides with more than 35 amino acids, which we excluded in our processing.
Finally, we further created databases from another 9 species, in order to measure if the rate of increase in the database size could be impacted by unique features of each genome, such as size of a pangenome. For a fair comparison among species, we built databases using a maximum of 65 strains, since Mtb had only 65 strains with complete sequenced genomes available. To assist the visualization of how the database size is incremented as more strains are used, we also created databases using 5, 10, 20 and 30 randomly selected strains of each species. And to avoid outlier behavior due to a very unique strain annotation being randomly selected, we created every database three times in total. We then counted total number of protein entries and tryptic peptides in each database.
Supplementary Table 3

DISCUSSION
Constructing customized databases is a key step in proteogenomics (9), or for the characterization of samples with unknown genetic background such as bacterial clinical strains (17,29). Such databases allow the validation of coding regions and confirmation of sequence polymorphisms normally omitted from reference databases. This is a critical characterization, considering that even genomes with two decades of considerable investigation still have gene annotation issues (37). This is true even for prokaryotes (4, 5), with their simpler genomic structure and higher coding density.
We had previously developed a tool named MSMSpdbb which creates MSfriendly databases for prokaryote species using strain annotations (12), however at the time we used database rules which are nowadays incompatible to peptide search engines. Further motivation to create an up-to-date tool came with the publication of the tool iPtgxDB from Omasitis and colleagues (13) As such set of test samples were analyzed using three different databases, some observations became clear: first, performance (i.e., number of protein and peptide hits) was superior using DB1 or DB2 compared to using a single annotation from a model strain by at least 12% in number of proteins identified. In addition to identification coverage, the observations of multiple polymorphisms also justify the use of multiple strain databases. Surprisingly we even identified multiple N-terminal predictions for the same protein, in some cases observed in the same strain. While normally the identification of a protein N-terminal peptide in proteogenomics is used to validate and confirm the TSS of the protein, our data suggest that excluding the remaining TSS choices from the database might be undesirable, considering they might be later identified when additional strains are analyzed by MS. Overall, DB1 and DB2 performance was very similar, and differences were negligible even though DB2 did perform slightly better. However, while it was not our concern to benchmark computational performance for this work, the searches for DB1 took half of the computational time than DB2 searches.
As more genomic information is used, or as more complex the databases become to incorporate multiple species for metaproteomics, it is not absurd to predict that processing entries as we did for DB1 will be clearly advantageous in order to save computational power.
When databases for ten species were created, surprisingly Mtb and B.
pertussis databases size increased at rates higher than expected based on the low complexity of their core and pangenomic genes. Previous data from Mtb suggested that database size increment was not heavy, when at the time only five Mtb strains and three strains of M. bovis were considered (17). Our data here however shows that Mtb database size increment was one of the most prominent, arguably the most relevant if its small pangenome is considered. This suggests that as additional strains were sequenced, variations in gene annotation performed were more acute then observed for the other species that were investigated here.
Something similar could be assumed about B. pertusis databases, which has no pangenome and database increase was higher than three species with characterized pangenomes. However, C. trachomatis, L. monocytogenes and C.
jejuni have smaller genomes than compared to B. pertussis, all also having higher density of coding regions (38-40). More nucleotide regions in the genome marked as non-coding will offer more opportunity for different strategies to generate conflicting gene annotation data, which could explain B. pertussis database size increase in our analysis. The database size increase for Burkholderia pseudomallei and Burkholderia mallei was not taken into consideration by us on this analysis, because while we decided to merge both species as one based on genotyping data (25), large scale rearrangements in B. mallei (41) might be interfering with our analysis.
Finally, another motivation to this work has been that, up to date, little has been done with relative good protein identification coverage regarding bacterial metaproteomics, with one rare recent exception (22). In general, the databases employed in those researches are built through simply concatenating all annotated proteins for the species under investigation. Even though we had no access to the database used in the reference above, it is simple to estimate that the database used probably had close to one million entries, taking into account an average of 2,000 proteins for all 462 genomes described in the Human Oral Microbiome repository (42). As spectra identification happens at peptide level in bottom-up proteomics and most peptides in such databases will have several identical copies in many homologue proteins, such concatenated databases are largely composed by peptides which aren't unique enough to characterize which species are in the sample. This will result in very large databases with redundant information that consequently demand a large computational capacity during the analysis. The strategy described here to streamline annotations from different strains of the same bacterial species can be also used for metaproteomic databases, aiming to reduce their redundancy and optimize species identification using unique peptide sequences.  I  n  f  e  c  t  i  o  n  ,  g  e  n  e  t  i  c  s  a  n  d  e  v  o  l  u  t  i  o  n  :  j  o  u  r  n  a  l  o  f  m  o  l  e  c  u  l  a  r  e  p  i  d  e  m  i  o  l  o  g  y  a  n  d  e  v  o  l  u  t  i  o  n  a  r  y  g  e  n  e  t  i  c  s  i  n  i  n  f  e  c  t  i  o  u  s  d  i  s  e  a  s  e  s   1  1  ,  5  8  7  -5  9  7  2  0  .  R  a  s  k  o  ,  D  .  A  .  ,  R  o  s  o  v  i  t  z  ,  M  .  J  .  ,  M  y  e  r  s  ,  G  .  S  .  ,  M  o  n  g  o  d  i  n  ,  E  .  F  .  ,  F  r  i  c  k  e  ,  W  .  F  .  ,  G  a  j  e  r  ,  P  .  ,  C  r  a  b  t  r  e  e  ,  J  .  ,  S  e  b  a  i  h  i  a  ,  M  .  ,  T  h  o  m  s  o  n  ,  N  .  R  .  ,  C  h  a  u  d  h  u  r  i  ,  R  .  ,  H  e  n  d  e  r  s  o  n  ,  I  .  R  .  ,  S  p  e  r  a  n  d  i  o  ,  V  .  ,  a  n  d  R  a  v  e  l  ,  J  .  (  2  0  0 (  2  0  1  1  )  P  r  o  t  e  o  g  e  n  o  m  i  c  a  n  a  l  y  s  i  s  o  f  M  y  c  o  b  a  c  t  e  r  i  u  m  t  u  b  e  r  c  u  l  o  s  i  s  b  y  h  i  g  h  r  e  s  o  l  u  t  i  o  n  m  a  s  s  s  p  e  c  t  r  o  m  e  t  r  y  .   M  o  l  e  c  u  l  a  r  &  c  e  l  l  u  l  a  r  p  r  o  t  e  o  m  i  c  s  :  M  C  P   1  0  ,  M  1  1  1  0  1  1  6  2  7  3  7  .  A  b  a  s  c  a  l  ,  F  .  ,  J  u  a  n  ,  D  .  ,  J  u  n  g  r  e  i  s  ,  I  .  ,  M  a  r  t  i  n  e  z  ,  L  .  ,  R  i  g  a  u  ,  M  .  ,  R  o  d  r  i  g  u  e  z  ,  J  .  M  .  ,  V  a  z  q  u  e  z  ,  J  .  ,  a  n  d  T  r  e  s  s  ,  M  .  L  .  (  2  0  1  8  )  L  o  o  s  e  e  n  d  s  :  a  l  m  o  s  t  o  n  e  i  n  f  i  v  e  h  u  m  a  n  g  e  n  e  s  s  t  i  l  l  h  a  v  e  u  n  r  e  s  o  l  v  e  d  c  o  d  i  n  g  s  t  a  t  u  s  .   N  u  c  l  e  i  c  a  c  i  d  s  r  e  s  e  a  r  c     were also observed at high levels for strain S1593.  Supplementary Table III.  T  a  b  l  e  I   -D  a  t  a  b  a  s  e  s  i  z  e  i  n  c  r  e  a  s  e  a  n  d  p  a  n  g  e  n  o  m  e  s  i  z  e  i  n  t  e  n  s  e  l  e  c  t  e  d  s  p  e  c  i  e  s   1  s  t  r  a  i  n  F  i  n  a  l  D  B  R  e  f  g  a  i  n  p  e  r  s  t  r  a  i  n  D  B  i  n  c  r  e  a  s  e  p  a  n  g  e  n  o  m  e  ?  P  a  n  g  e  n  o  m  e  s  i  z  e  (  %  /  1