Edited by: George Tsiamis, University of Patras, Greece
Reviewed by: Jayaseelan Murugaiyan, SRM University, India; Miriam Cordovana, Bruker Daltonik GmbH, Germany
This article was submitted to Systems Microbiology, a section of the journal Frontiers in Microbiology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Identification of microorganisms by MALDI-TOF mass spectrometry is a very efficient method with high throughput, speed, and accuracy. However, it is significantly limited by the absence of a universal database of reference mass spectra. This problem can be solved by creating an Internet platform for open databases of protein spectra of microorganisms. Choosing the optimal mathematical apparatus is the pivotal issue for this task. In our previous study we proposed the geometric approach for processing mass spectrometry data, which represented a mass spectrum as a vector in a multidimensional Euclidean space. This algorithm was implemented in a Jacob4 stand-alone package. We demonstrated its efficiency in delimiting two closely related species of the
Due to the advent of the matrix-assisted laser desorption ionization time of flight mass-spectrometry (MALDI TOF MS) and because microorganism identification by means of reference mass spectra libraries has become possible, it is now necessary to choose or develop a new mathematical algorithm for the analysis of mass spectrometry data. Initially, a wide range of mathematical approaches was available for a comparison of mass spectrometry data; these approaches have been tested and optimized on applications related to the identification of organic compounds by gas chromatography with mass spectrometry (Crawford and Morrison,
The effectiveness and accuracy of identification of microorganisms directly depend on the availability of a representative database. Existing commercial products are implemented as stand-alone software packages, require regular paid updates, and are tied to a certain type of equipment. Deposition of reference mass spectra into the database is performed by the software developer, whereas in-house databases of the users are hardly accessible to the rest of the user community. Furthermore, commercial databases employ incompatible formats of data storage and methods for spectra processing. The nonprofit sector is also characterized by stand-alone software packages, such as Mass-Up (López-Fernández et al.,
To create a publicly available online platform, there should be a single common storage medium accessible to users of mass spectrometers from different vendors, as well as a mathematical algorithm that enables processing, classification, and comparison of spectra. To this end, such a Web service should work with the original raw data, while the computational burden is carried by the server side of the project. This arrangement rules out the errors that can arise when different methods and parameters are used for data processing, and it enables investigators to preserve all of the original information contained in the spectra. Most software packages for working with mass spectra can export all the data as a txt/csv file, which helps to address the question of a uniform data storage format.
The key task in creating an online platform is the choice of a data-processing method and of parameters that ensure the highest accuracy of identification, not inferior to that of commercial products. In a previous work involving the geometric approach, we successfully discriminated two closely related species:
In this work, we compare the features of several algorithms for data processing that are based on the geometric approach and incorporate either peak-picking analysis (PPA) or full-spectrum analysis (FSA). The methods of mass spectra analysis that involve peak picking identify a set of mass peaks in a spectrum, and the resultant set of peaks serves as a fingerprint for identifying an organism. In the geometric approach to the analysis, the obtained peak list is converted into a multidimensional vector, which is equivalent to describing a spectrum curve at fixed intervals on the m/z axis. Therefore, it makes sense to study the feasibility of comparing spectra by means of the full data, without the peak-picking procedure. This approach does not cause a data loss, but it remains sensitive to noise. Another problem is a greater (than that for peak picking) volume of stored information and a greater amount of calculations. Nonetheless, modern advances in computational technologies have to some extent reduced this problem (Cejnar et al.,
Seventy-four strains from the collections of ICG SB RAS (Collection of biotechnological microorganisms as a source of novel promising objects for biotechnology and bioengineering of Institute of Cytology and Genetics, Siberian Branch of Russian Academy of Sciences), UNIQEM (Collection of Unique and Extremophilic Microorganisms of various physiological groups for biotechnological purposes of the Research Center of Biotechnology RAS), IEGM (Collection of Institute of Ecology and Genetics of Microorganisms, Ural Branch of Russian Academy of Sciences), KMM (Collection of Marine Microorganisms, Pacific Institute of Bioorganic Chemistry, Far Eastern Branch of Russian Academy of Sciences), and VKM (All-Russian Collection of Microorganisms, G.K.Skryabin Institute of Biochemistry and Physiology of Microorganisms, of Russian Academy of Sciences) were used. This set of strains represents the following species: the
The studied strains were isolated from geographically and environmentally diverse locations, from Kamchatka hot springs to saline lakes of Southern Siberia. We isolated microbial strains from high-temperature petroleum reservoirs, freshwater and saline water bodies, thermal springs, tailings dams, rhizospheres of higher plants, etc. The samples were taken from both pristine and polluted locations. Strains were cultivated on diverse media: LB, PCA, malt agar, potato agar, at 28–60°C.
The isolated strains include those found in air, soil, and water; thermophilic and mesophilic; acidophilic, neutrophilic, and alkaliphilic; halophilic and freshwater.
Despite being phylogenetically closely related, the studied strains obviously possess a very broad diversity of phenotypic characters.
For this analysis, a full microbiological loop of a given culture was transferred into a 1.5 ml Eppendorf tube and resuspended in 300 μl of deionized water. For inactivation of bacterial cells, 900 μl of ethanol was added with thorough mixing. The cells were collected by centrifugation for 2 min at 15,600 × g, the supernatant was discarded, and the pellet was dried for 5 min in a vacuum concentrator (Eppendorf Concentrator Plus). The cell walls were disrupted by the addition of 50 μl of 70% formic acid. For extraction of proteins, to the resulting mixture, 50 μl of acetonitrile was added followed by thorough mixing. The obtained mixture was centrifuged for 2 min at 15,600 × g, and the supernatant was transferred into a fresh microfuge tube for subsequent mass-spectrometric analysis.
For this analysis, 1 μl of the protein extract was applied to a stainless-steel plate and allowed to dry at room temperature. After that, the sample was overlaid with 1 μl of the matrix [6 mg/ml α-cyano-4-hydroxy-cinnamic acid in an acetonitrile/water/trifluoroacetic acid mixture (50:47.5:2.5, v/v/v)].
The spectra were recorded on an Ultraflex III MALDI TOF/TOF mass spectrometer (Bruker Daltonics). They were acquired in linear positive mode at a laser frequency of 100 Hz in a mass range of 2,000–20,000 Da. The voltage on the accelerating electrode was 25 kV, voltage IS2 was 23.45 kV, and the voltage on the lens was 6 kV, without delayed extraction.
External calibration was conducted by means of precise mass values of known proteins of
To build the reference database, 12 colonies of each strain were chosen randomly. For creation of the test database, three colonies of each strain were collected. For each colony, three spectra were recorded, each by summing 500 laser shots (5 × 100 shots from different positions of the target cell). The mass spectra were inspected visually.
Raw spectra were exported with the mMass software as a plain-text table (m/z;I). For the reference dataset, all replicates were exported independently. For the test dataset, averaged data were exported from mMass to speed up the R script. The data were processed with an R script in R 3.4.3, using the publicly available packages MALDIquant and baseline. An outline of the processing of mass spectrometry data is presented in
An outline of the processing procedures and analyses of mass spectrometry data for the algorithms based on full-spectrum analysis (FSA) or peak-picking analysis (PPA). Results of the FMS/FMM test are presented for various combinations of normalization, centroid intensity transform, and distances. The following codes and abbreviations were used to denote algorithms and procedures: Max, normalization to maximal intensity; TIC, normalization to the sum of intensities; Length, normalization to the root of the sum of squares of intensities; NT, no intensity transform; SRT, square root transform; PP1, setting the intensities of peaks to 1.0. The full name of an algorithm is composed of the analysis method (FSA or PPA), code of normalization, code of intensity transform, and the metric (ED or Jc), e.g., PPA-Length-NT-Jc.
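The naming scheme in this caption can be checked against the count of 30 algorithms reported in the Results. The sketch below enumerates the combinations under two assumptions that the text implies but does not state outright: that PP1 applies only to PPA (peak lists), and that names follow the order seen in the text, such as PPA-Length-NT-Jc.

```python
from itertools import product

NORMALIZATIONS = ["Max", "TIC", "Length"]
METRICS = ["ED", "Jc"]
# Assumption: PP1 (peak intensities set to 1.0) is defined only for peak lists,
# so FSA is combined with NT and SRT only.
TRANSFORMS = {"FSA": ["NT", "SRT"], "PPA": ["NT", "SRT", "PP1"]}


def algorithm_names():
    """Return every method-normalization-transform-metric combination."""
    names = []
    for method, transforms in TRANSFORMS.items():
        for norm, transform, metric in product(NORMALIZATIONS, transforms, METRICS):
            names.append(f"{method}-{norm}-{transform}-{metric}")
    return names


names = algorithm_names()
print(len(names))  # 12 FSA variants + 18 PPA variants
```

Under these assumptions the enumeration yields 12 FSA-based and 18 PPA-based variants, which agrees with the total of 30 algorithms.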
Exported mass spectra were converted into a data array (m/zi; Ii) with a fixed step of one Da along the
For PPA, we chose the windowed local maximum procedure for peak searching with the following parameters: intensity threshold 1, 5, and 10%, half-window size = 1, and signal-to-noise-ratio = 3. Reconstruction of peak lists in the form of a vector was carried out using a polynomial function as described elsewhere (Starostin et al.,
In the case of PPA, we applied no intensity transform (NT), a square root transform (SRT), or a reduction of all peaks' intensities to 1.0 (PP1).
To get rid of low-intensity noise in the case of FSA algorithms, intensities below a certain threshold were set to zero.
Normalization was carried out to the maximal intensity (hereafter Max), to the sum of intensities (TIC), or to the square root of the sum of squares of intensities (Length).
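The noise filter, intensity transforms, and normalizations described above can be sketched as follows. This is an illustration rather than the authors' R script: the function names are hypothetical, and the noise threshold is assumed to be a fraction of the maximal intensity of the spectrum, which the text does not specify.

```python
import math


def filter_noise(intensities, threshold_fraction=0.005):
    """Set intensities below a relative threshold to zero.

    Assumption: the threshold is relative to the maximal intensity.
    """
    cut = threshold_fraction * max(intensities)
    return [x if x >= cut else 0.0 for x in intensities]


def transform(intensities, kind="NT"):
    """Apply an intensity transform: NT (none), SRT (square root), or PP1."""
    if kind == "NT":
        return list(intensities)
    if kind == "SRT":
        return [math.sqrt(x) for x in intensities]
    if kind == "PP1":
        return [1.0 if x > 0 else 0.0 for x in intensities]
    raise ValueError(f"unknown transform: {kind}")


def normalize(intensities, kind="Length"):
    """Scale a spectrum vector by Max, TIC (sum), or Length (Euclidean norm)."""
    if kind == "Max":
        scale = max(intensities)
    elif kind == "TIC":
        scale = sum(intensities)
    elif kind == "Length":
        scale = math.sqrt(sum(x * x for x in intensities))
    else:
        raise ValueError(f"unknown normalization: {kind}")
    if scale == 0:
        raise ValueError("cannot normalize an all-zero spectrum")
    return [x / scale for x in intensities]


spectrum = [3.0, 4.0, 0.0]
print(normalize(spectrum, "Max"))
print(normalize(spectrum, "TIC"))
print(normalize(spectrum, "Length"))
```

After Length normalization every spectrum is a unit vector, so the Euclidean distance between two spectra depends only on the angle between them and not on their absolute signal level.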
Similarity/dissimilarity between centroids were measured by Euclidean distance (ED) and Jaccard (Jaccard,
The effectiveness of PPA-based and FSA-based algorithms is assessed via the lowest number of false matches (FMM), when the best match occurs between the centroids of different species, and via the greatest number of matches at the strain level (FMS), when the test and reference centroids separated by the shortest distance belong to the same strain.
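The FMS/FMM test defined above can be sketched as a nearest-neighbor count over a table of distances. The function name, the label layout as (species, strain) pairs, and the toy distances are hypothetical and serve only to illustrate the two counters.

```python
def fms_fmm(distances, test_labels, ref_labels):
    """Count strain-level matches (FMS) and false species matches (FMM).

    distances[i][j] is the distance from test centroid i to reference
    centroid j; labels are (species, strain) pairs.
    """
    fms = fmm = 0
    for row, (t_species, t_strain) in zip(distances, test_labels):
        best = min(range(len(row)), key=row.__getitem__)
        r_species, r_strain = ref_labels[best]
        if r_species != t_species:
            fmm += 1  # best match belongs to another species
        elif r_strain == t_strain:
            fms += 1  # best match is the same strain
    return fms, fmm


# Hypothetical labels and distances for three reference and three test centroids.
ref_labels = [("sp_A", "A1"), ("sp_A", "A2"), ("sp_B", "B1")]
test_labels = [("sp_A", "A1"), ("sp_A", "A2"), ("sp_B", "B1")]
distances = [
    [0.1, 0.3, 0.9],  # closest to A1: strain-level match
    [0.2, 0.4, 0.8],  # closest to A1: right species, other strain
    [0.7, 0.5, 0.6],  # closest to A2: wrong species
]
result = fms_fmm(distances, test_labels, ref_labels)
print(result)  # (1, 1)
```

A best match within the correct species but to a different strain increments neither counter, which is why FMS and FMM together need not add up to the number of test samples.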
To this end, we utilized four subsets of strains belonging to species
A test sample was identified by the first (closest) reference sample in the ranked list if the distance to that reference sample was less than the cutoff. StrainID denotes a best match that conforms to the cutoff, provided that the test sample and the reference sample belong to the same strain. If the first sample passes the cutoff but belongs to a wrong species, then the test sample is regarded as falsely identified: Miss ID. The average percentage of “false” samples relative to the total number of samples that pass the cutoff is defined as the percentage of false positive results: FP. If the distance to the best match is greater than the cutoff, then the sample is regarded as unidentified: NoID. The average percentage of samples belonging to the species of the tested sample that do not pass the cutoff is considered a false negative result: FN. The computation of the above criteria is depicted in
Illustrated computation of the parameters of effectiveness of the methods with the use of a cutoff. In this fictional example, the database contains eight reference entries (eight samples): five
To perform cross-validation, the test samples were identified one after another using the reference database. From the results of the comparison with the reference database, we excluded the case when both the test sample and reference sample belonged to one strain. Identification was conducted via the FMS/FMM test and the cutoff.
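The cutoff-based identification and the cross-validation exclusion described above can be sketched as one function. Several details here are assumptions made for illustration: a single cutoff value is passed in, a distance exactly equal to the cutoff counts as NoID, and the label "SpeciesID" (correct species, different strain) is a hypothetical name that does not appear in the text.

```python
def identify(row, ref_labels, test_label, cutoff, exclude_same_strain=False):
    """Classify one test sample from its distances to the reference entries.

    row[j] is the distance to reference entry j; labels are
    (species, strain) pairs. With exclude_same_strain=True, reference
    entries of the test sample's own strain are ignored (cross-validation).
    """
    candidates = [
        (distance, label)
        for distance, label in zip(row, ref_labels)
        if not (exclude_same_strain and label == test_label)
    ]
    if not candidates:
        return "NoID"
    distance, (species, strain) = min(candidates, key=lambda c: c[0])
    if distance >= cutoff:  # assumption: equality counts as unidentified
        return "NoID"
    if species != test_label[0]:
        return "Miss ID"
    if strain == test_label[1]:
        return "StrainID"
    return "SpeciesID"  # hypothetical label: right species, other strain


ref_labels = [("sp_A", "A1"), ("sp_A", "A2"), ("sp_B", "B1")]
sample = ("sp_A", "A1")
print(identify([0.1, 0.3, 0.9], ref_labels, sample, cutoff=0.5))
print(identify([0.1, 0.3, 0.9], ref_labels, sample, cutoff=0.5,
               exclude_same_strain=True))
print(identify([0.7, 0.8, 0.6], ref_labels, sample, cutoff=0.5))
print(identify([0.7, 0.5, 0.2], ref_labels, sample, cutoff=0.5))
```

In cross-validation mode a StrainID outcome is impossible by construction, so the informative outcomes are a correct species-level match, Miss ID, and NoID.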
The procedures of subtraction of baseline (IRLS, FSS and SNIP) were evaluated. Result quality was inspected visually. Optimal results were obtained using SNIP with the window parameter set to 10. Parameter optimization of the smoothing procedure and determination of the optimal range of m/z were carried out by the FMS/FMM test. The best results were obtained by means of dynamic rolling mean with the starting and final window set to 1 and 3, respectively, and the m/z range of 3–15 kDa.
For the peak-picking procedure, the filter parameter was optimized by minimal relative intensity. The study of this parameter's role helped to assess the contribution of low-intensity peaks to the species specificity of spectra. Threshold values 1, 5, and 10% were assessed. The results of identification of a set of test samples for these parameters were as follows:
1%. ED (FMS/FMM): 20/4, Jc (FMS/FMM): 29/0
5%. ED (FMS/FMM): 21/5, Jc (FMS/FMM): 26/0
10%. ED (FMS/FMM): 19/6, Jc (FMS/FMM): 28/2
When the threshold exceeded 1%, the accuracy of identification diminished, pointing to the importance of low-intensity peaks for the species specificity of a mass spectrum. Judging by the results, the optimal threshold was 1%.
For FSA-based algorithms, we applied the procedure of filtration of low-intensity noise. Thresholds 0, 0.5, 1, and 5% were evaluated. The results of identification of a set of test samples for these parameters were as follows:
0%. ED (FMS/FMM): 22/4, Jc (FMS/FMM): 22/2
0.5%. ED (FMS/FMM): 23/4, Jc (FMS/FMM): 26/0
1%. ED (FMS/FMM): 22/4, Jc (FMS/FMM): 26/0
5%. ED (FMS/FMM): 20/6, Jc (FMS/FMM): 24/1
Thus, the optimal threshold was 0.5–1.0%, and in subsequent analyses, 0.5% was chosen for the filtration of low-intensity noise.
The overall processing algorithm consisted of sequential operations: preprocessing, computation of a centroid, normalization, intensity transformation, and calculation of distances. On the basis of the several types of normalization and intensity transform and the two methods for calculation of the distance (ED and Jc), we obtained 30 algorithms for the processing and comparison of mass spectrometry data. The initial evaluation of these algorithms and the search for the most effective ones were performed by the FMS/FMM test. The results are shown in
In FSA, the best approach is Length normalization, and the worst is Max normalization, regardless of the metric used or the method for intensity transform. With SRT, in the case of Max normalization, the worst results were obtained: 21/5 for ED and 16/5 for Jc. With TIC or Length normalization, SRT was found to be more effective for ED, and no transform was more effective for Jc. In the case of ED, the best algorithm was Length-SRT, whereas for Jc, the best algorithm was Length-NT; both yielded the 29/0 result. These algorithms were selected for subsequent analyses.
In PPA, the results were noticeably different between ED and Jc. In the case of ED, the highest effectiveness was achieved with Length normalization, and the lowest with Max normalization. The reduction of peak intensities to 1.0 notably worsened accuracy for Max normalization and TIC normalization but yielded the best result with Length normalization: 29/0. The square root of intensity gave good results with TIC or Length normalization, and the worst result with Max normalization: 15/9. It can be concluded that for this metric, Length normalization is the best choice, and in the case of the intensity transform, all three types are worthwhile. For Jc, the variance of the results was substantially lower than that with ED: from 26/2 to 31/0. The transform via the square root worsened the results. Within one type of transform of intensities, all three types of normalization yielded similar results. For subsequent analyses, methods Length-PP1 and Length-NT were chosen as the best options for the metrics ED and Jc, respectively.
As an external control, the FMS/FMM test was performed on the results of identification in Biotyper 3 software. The obtained value was 29/0, which is worse than that for the most effective methods of PPA and comparable with the results of FSA-based methods.
For each of the four selected algorithms, the cutoff was computed. To evaluate selectivity, the criteria were first tested on the reference database. In this case, the reference database entries were identified by means of the same database. FP and FN are presented in
Results of identification of the test set of samples by means of the calculated cutoffs and the reference collection of samples.
Test 1 | 6.9 | 61.5 | 29 | 27 | 0 | 0 | 5 |
Test 2 | 11.1 | 59.5 | 29 | 26 | 4 | 4 | 6 |
Crossvalid. | 12.6 | 71.3 | 6 | 4 | 13 | ||
Test 1 | 2.7 | 52.5 | 31 | 30 | 0 | 0 | 2 |
Test 2 | 3.6 | 52.1 | 31 | 30 | 0 | 0 | 4 |
Crossvalid. | 2.7 | 62.1 | 4 | 1 | 10 | ||
Test 1 | 0.9 | 76.4 | 29 | 18 | 0 | 0 | 25 |
Test 2 | 2.2 | 72.3 | 29 | 18 | 1 | 1 | 28 |
Crossvalid. | 1.9 | 76.5 | 3 | 1 | 29 | ||
Test 1 | 3.4 | 63 | 29 | 27 | 0 | 0 | 9 |
Test 2 | 4.7 | 62 | 29 | 22 | 1 | 1 | 16 |
Crossvalid. | 4.3 | 67 | 4 | 1 | 16 | ||
Test 1 | 1.3 | 46 | 29 | 27 | 0 | 0 | 3 |
Test 2 | 3 | 41.5 | 29 | 27 | 0 | 0 | 3 |
Crossvalid. | 2.1 | 51.3 | 4 | 0 | 11 |
Next, we carried out the identification of a set of test samples with the use of the cutoff (
Algorithm PPA-Length-NT-Jc yielded the best results: 30 instances of identification at the strain level in the absence of instances of incorrect identification. This performance is better than that of Biotyper 3. FSA-Length-SRT-ED gave the worst result: 18 and 0, respectively. A comparison of Test 1 and Test 2 revealed that the majority of methods could successfully discriminate closely related strains of
For this purpose, a sample corresponding to a test sample was excluded from the reference database. Because the species
The aims of this work were to compare two approaches to the analysis of mass spectrometry data and to develop a convenient and reliable cutoff for the identification of unknown samples. As an experimental model, we used the strains mostly belonging to the genus
A solution to the problem of finding an algorithm for the processing and analysis of mass spectrometry data is relevant and important for a publicly available online platform designed for working with an open database. The algorithms currently in use are based on a peak-picking procedure. This approach allows for the removal of noise and of matrix effects and is most productive in terms of calculation speed. Nevertheless, it causes an unavoidable loss of data for the following reasons:
Poorly resolved peaks may be registered as one peak, or only the more intense peak will be taken into account.
A restrictive threshold may exclude relevant low-intensity peaks.
Restrictive statistical criteria can discard peaks of low frequency.
In the work with clinical strains, commercial platforms employ standardized conditions of growth, sample preparation from cultures, equipment, and software. Under such conditions, the drawbacks of peak picking are minimized by the high reproducibility of mass spectra. By contrast, in the work with natural strains, which feature substantial diversity of growth conditions, it becomes impossible to find optimal conditions for growth and sample preparation during high-throughput screening. This state of affairs will decrease reproducibility. “Nonstandard” results will be caused by the isolation of cultures on specific and sometimes toxic substrates or under specific conditions for growth: e.g., extreme pH levels, temperatures, and ionic strength. Another aspect of this problem is the reproducibility of results on various brands and models of mass spectrometers.
A possible solution to this problem is algorithms based on FSA, which minimize the data loss. This approach mostly resolves the issue of poorly resolved peaks. During the calculation of a centroid, peaks of low frequency will be taken into account to the extent that they contribute to the centroid. The trouble with low-intensity peaks still depends on the effectiveness of the algorithms for noise filtration and is equally inherent in both FSA and PPA.
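The remark about low-frequency peaks can be illustrated with a small worked example. The sketch assumes that the centroid is the arithmetic mean of the replicate spectrum vectors, which is the usual reading of a centroid in the geometric approach but is not spelled out in this section; the intensities are invented.

```python
def centroid(spectra):
    """Return the element-wise mean of equally long spectrum vectors."""
    n = len(spectra)
    return [sum(column) / n for column in zip(*spectra)]


# Four hypothetical replicates; the middle m/z bin holds a peak that
# appears in only one of them.
replicates = [
    [10.0, 0.0, 5.0],
    [10.0, 0.0, 5.0],
    [10.0, 0.0, 5.0],
    [10.0, 8.0, 5.0],
]
c = centroid(replicates)
print(c)  # [10.0, 2.0, 5.0]
```

The peak seen in one replicate out of four is retained in the centroid at a quarter of its intensity, whereas a frequency-based filter in a peak-picking workflow could discard it entirely.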
Aside from the comparison of two principal approaches, we studied the influence of various procedures for the processing of mass spectrometry data on identification accuracy. As measures for the comparison, we employed ED and the Jc coefficient; the latter is a correlation-type coefficient that can be converted to the 1-Jc metric.
Our initial analysis based on the comparison of indicators FMS/FMM revealed an advantage of Jc as a metric for data comparison in the case of PPA-based algorithms. In the case of FSA-based algorithms, this metric usually either had an advantage over ED or was comparable to it. The geometric approach in the proposed version of the method constructs a full spectrum from peak lists or directly transforms raw mass spectrometry data onto a coordinate plane as a multidimensional vector. Consequently, correlation analysis is an effective method for dissecting such data. Conversion into the 1-Jc metric enables investigators to effectively apply “geometric” methods of cluster analysis.
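The two measures can be sketched for spectrum vectors as follows. The exact form of Jc used in the study is not given in this section; the sketch assumes the extended (Tanimoto) form of the Jaccard coefficient for real-valued vectors, which reduces to the classical set-based Jaccard index for 0/1 vectors.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance (ED) between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_coefficient(a, b):
    """Extended Jaccard (Tanimoto) coefficient; an assumed form of Jc."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denominator == 0:
        raise ValueError("Jc is undefined for two all-zero vectors")
    return dot / denominator


def jaccard_distance(a, b):
    """The 1-Jc dissimilarity."""
    return 1.0 - jaccard_coefficient(a, b)


a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]
print(euclidean_distance(a, b))
print(jaccard_coefficient(a, b))  # one shared peak out of three in the union
print(jaccard_distance(a, a))
```

For the binary vectors in the example, the coefficient equals the number of shared peaks divided by the number of peaks present in either spectrum, which connects this form to PP1-transformed peak lists.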
Depending on the metric in question, Jc or ED, different types of normalization had different effects on the results of identification of the strains. In the case of ED, the accuracy of identification decreased in the order Length, TIC, and Max. In the case of 1-Jc, the impact of normalization was insignificant.
The square root transform allows increasing the relative contribution of low-intensity peaks. This transform appreciably improved the results with Length or TIC normalization, suggesting that these peaks are important for accurate identification.
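The effect of the square root transform on low-intensity peaks can be shown with a small numeric example. The intensities are invented and the helper name is hypothetical.

```python
import math


def relative_share(intensities):
    """Return each intensity as a fraction of the total."""
    total = sum(intensities)
    return [x / total for x in intensities]


raw = [100.0, 4.0, 1.0]
srt = [math.sqrt(x) for x in raw]  # [10.0, 2.0, 1.0]

raw_share = relative_share(raw)
srt_share = relative_share(srt)
print([round(s, 3) for s in raw_share])
print([round(s, 3) for s in srt_share])
```

The weakest peak accounts for 1/105 of the total signal before the transform and 1/13 after it, so differences in minor peaks weigh more heavily in any distance computed afterwards.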
Setting the intensities of peaks to 1.0 before their transformation into a multidimensional vector removes relative intensity from consideration and thus allows researchers to assess its contribution to the species specificity of the spectra. With normalization to maximal intensity or to the sum of intensities, this transform caused a notable decrease in the accuracy of the algorithms. By contrast, with normalization to Length, the accuracy was similar to or better than that of known effective methods. This finding points to the predominant role of the m/z values of mass peaks in the species specificity of mass spectra; a simple list of mass peaks is sufficient for satisfactory identification at the species level. Nonetheless, relative intensity of the peaks remains a relevant parameter that makes the method more reliable. A more detailed examination shows that the PPA-Length-PP1-ED method fails at the identification of closely related species: Test 2 and cross-validation with the exclusion of a biological replicate. In our previous work, we learned that the differences between species
The method proposed here for computation of the cutoff derives from the hypothesis that the average distance between centroids of strains belonging to one species will be a characteristic feature. Our comparison of average intraspecies differences for some taxa (the
The nonmetric multidimensional scaling (NMDS) plot shows the positions of centroids of test samples (blue dots) and reference samples (red dots) relative to each other.
In brief, implementation of the geometric approach substantially improves analytical characteristics for the identification of microorganisms. This principle is especially important when it is impossible to standardize the methods of cultivation, sample preparation, and acquisition of primary mass spectrometry information as strictly as in the case of clinical diagnostics. The evaluated algorithms based on either PPA or FSA showed comparable or greater effectiveness than Biotyper 3.1 software did. FSA-based algorithms are somewhat worse than PPA-based ones. We attribute this finding to their sensitivity to low-intensity noise; however, it seems that FSA-based algorithms may have an advantage when reference and test spectra have different resolution. A possible solution to this problem is to find a more effective algorithm for noise filtration. Both approaches may serve as the basis for the creation of an open online platform for microorganism identification. The proposed technique for the computation of the cutoff and the metrics also showed high accuracy of identification. A special advantage of the proposed method for cutoff calculation is the simplicity of computation, which will help to rapidly adjust its values as the database grows.
In the course of this work, a database of reference mass spectra was compiled consisting of 74 strains. The algorithm for the processing and analysis of mass spectrometry data is implemented as an R script and will serve as the basis for the mathematical part of the online platform helping to work with mass spectrometry data. Raw data and the R script are available at
The original contributions generated for the study are publicly available. This data can be found here:
KS, ED, and SP wrote the main manuscript text. KS performed mass-spectrometry analysis, computations, and results analysis. VE, NE, KS, and ED developed the mathematical algorithm. NE developed the R script. AB performed microbiological works. VS performed 16S rRNA sequencing. SP led the project. All authors contributed to the article and approved the submitted version.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at: