Mass Spectra-Based Framework for Automated Structural Elucidation of Metabolome Data to Explore Phytochemical Diversity

A novel framework for automated elucidation of metabolite structures in liquid chromatography–mass spectrometry (LC-MS) metabolome data was constructed by integrating databases. High-resolution tandem mass spectra data automatically acquired from each metabolite signal were used for database searches. Three distinct databases, KNApSAcK, ReSpect, and the PRIMe standard compound database, were employed for the structural elucidation. The outputs were retrieved using the CAS metabolite identifier for identification and putative annotation. A simple metabolite ontology system was also introduced to attain putative characterization of the metabolite signals. The automated method was applied to the metabolome data sets obtained from the rosette leaves of 20 Arabidopsis accessions. Phenotypic variations in novel Arabidopsis metabolites among these accessions could be investigated using this method.

Introduction
The ability to produce various secondary metabolites has evolved in plants for the purpose of self-defense, environmental adaptation, and interaction with other organisms. Because humans utilize phytochemicals as a rich resource for various purposes such as the production of pharmaceuticals, further understanding of the genetic background behind the diversity of secondary metabolites produced by plants will facilitate more intensive application of these compounds. Recent progress in gene sequencing has enabled generation of a large volume of data on genetic polymorphisms that is related to natural variations in phytochemicals (Clark et al., 2007; Ossowski et al., 2008; Zeller et al., 2008). Accordingly, it is expected that novel genes and functions of plant secondary metabolism as well as those involved in evolution could be investigated based on the association between genotypes and metabolic phenotypes (metabolotypes; Plantegenet et al., 2009; Weigel and Mott, 2009). Since the metabolotype data required for such analyses is both qualitative (structure of secondary metabolites) and quantitative (amount of metabolite), metabolic profiling analysis using liquid chromatography-tandem mass spectrometry (LC-MS) has been used to obtain comprehensive profiles of plant secondary metabolites (De Vos et al., 2007). While qualitative data describing hundreds of metabolite signals have routinely been acquired during analysis (Keurentjes et al., 2006), structural elucidation of the observed signals using LC-MS is still difficult (Moco et al., 2006; Bottcher et al., 2007; Iijima et al., 2008; Matsuda et al., 2010a).
The structure-related information available from LC-MS analysis includes the retention time, exact mass number, and tandem mass spectrum (MS/MS spectrum). The structure associated with each metabolite signal has been estimated by searching databases containing reference data using the information obtained from LC-MS analysis (Kind and Fiehn, 2010; Neumann and Bocker, 2010). The amount of information obtained from database searches varies among metabolite peaks; therefore, four levels of structural elucidation have been standardized by the Metabolomics Standards Initiative (MSI) as follows (Fiehn et al., 2007): (1) Identified: a minimum of two independent data points relative to an authentic compound analyzed under identical experimental conditions. (2) Putatively annotated: without chemical reference standards, based on physicochemical properties and/or spectral similarity with public/commercial spectral libraries. (3) Putatively characterized: based on characteristic physicochemical properties of a chemical class of compounds, or spectral similarity to known compounds of a chemical class. (4) Unknown. Based on the standardized format, a framework for automated structural elucidation is required to explore the structural diversity of phytochemicals. However, several technical issues must be solved before database-assisted elucidation of metabolite structures (Kind and Fiehn, 2010; Neumann and Bocker, 2010). One bottleneck is represented by a shortage of standard compounds and their associated MS/MS spectra data. Owing to the poor availability of plant secondary metabolites, only a very low percentage of the observed metabolite signals can be assigned by comparison of the chromatographic behavior with chemical reference standards (Matsuda et al., 2010a). Although great effort has been put into construction of the MS/MS spectral databases (Moco et al., 2006; Wishart et al., 2007; Horai et al., 2010), further enrichment is required for structural elucidation of the wider range of metabolites. Another difficulty is the low reproducibility of the structure-related information. For instance, the fragment patterns in MS/MS spectra depend on the mass spectrometers and their operating conditions. The error derived from the analysis also exists in the high-resolution mass spectral data (Mihaleva et al., 2008; Matsuda et al., 2009b). Owing to these technical problems, elucidation of the structure associated with signals corresponding to metabolomes is time consuming, which has hampered the investigation of phytochemical diversity across plant species or ecotypes.
In this study, a novel framework for the automated elucidation of metabolite structures in LC-MS metabolome data was constructed by integrating three different databases. To overcome the aforementioned problems, the MS/MS spectra databases were enriched using literature-reported information. Additionally, the high-resolution MS/MS spectra data were redundantly acquired from each metabolite signal to improve the quality of structure-related information that was used to search the databases. The outputs were retrieved using the CAS metabolite identifier for identification and putative annotation. A simple metabolite ontology system was also introduced to enable putative characterization of the metabolite signals. The automated method developed here was applied to metabolome data sets obtained from the rosette leaves of 20 Arabidopsis accessions, from which phenotypic variations in novel Arabidopsis metabolites among these accessions could be investigated.

Metabolome analysis using LC-ESI-Q-Tof/MS
The collected sample tissues were weighed and stored at −80°C until analysis. The frozen tissues of independent plants were homogenized in five volumes of 80% aqueous methanol containing 0.1% acetic acid, 0.5 mg/l of lidocaine, and d-camphor sulfonic acid (Tokyo Kasei, Tokyo, Japan) using a mixer mill (MM 300, Retsch) with a zirconia bead for 6 min at 20 Hz. Next, the samples were centrifuged at 15,000 g for 10 min and filtered (Ultrafree-MC filter, 0.2 μm; Millipore, Bedford, MA, USA). The sample extracts were then applied to an HLB μElution plate (Waters, Milford, MA, USA) that had been equilibrated with 80% aqueous methanol containing 0.1% acetic acid. The eluates (3 μl) were subsequently subjected to metabolome analysis by LC coupled with electrospray quadrupole time-of-flight tandem MS using an Acquity BEH ODS column (LC-ESI-Q-Tof/MS, HPLC: Waters Acquity UPLC system; MS: Waters Q-Tof Premier).
The metabolome analysis and data processing were conducted according to a previously described method (Matsuda et al., 2009c, 2010a). Briefly, the metabolome data were obtained in the negative ion mode (m/z 100-2,000; dwell time: 0.45 s; interscan delay: 0.05 s, centroid), from which a data matrix was generated with the aid of MetAlign (De Vos et al., 2007; Lommen, 2009). To reduce redundancy in the data matrix, signals derived from fragment ions were removed by the following procedure. A metabolite signal was removed from the matrix when another intense peak eluted at a similar retention time (within the retention time threshold of <0.5 s) and the highest correlation coefficient with such a peak was above the threshold value (>0.8). The analysis was conducted using five biological replicates of 20 accessions, from which a data matrix composed of 703 signals (peaks) was obtained (Table S1 in Supplementary Material). The number of signals does not reflect the exact number of detected metabolites because of the complex nature of the metabolome data.
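The redundancy filter described above can be sketched as follows. This is a minimal illustration, not the in-house implementation: the dictionary keys, the use of seconds for retention time, the Pearson coefficient, and the reading of "intense" as "larger summed intensity" are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length intensity vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def remove_redundant_signals(signals, rt_tol=0.5, r_min=0.8):
    """Keep a signal unless a co-eluting, more intense signal correlates with it.

    signals: list of dicts with hypothetical keys 'id', 'rt' (seconds) and
    'intensities' (one value per sample).
    """
    kept = []
    for s in signals:
        s_total = sum(s["intensities"])
        best_r = None  # highest correlation with any co-eluting, more intense peak
        for other in signals:
            if other is s:
                continue
            if abs(other["rt"] - s["rt"]) >= rt_tol:
                continue
            if sum(other["intensities"]) <= s_total:
                continue
            r = pearson(s["intensities"], other["intensities"])
            if best_r is None or r > best_r:
                best_r = r
        if best_r is None or best_r <= r_min:
            kept.append(s)
    return kept

# Invented example data: 'fragment' co-elutes and co-varies with 'parent'.
signals = [
    {"id": "parent", "rt": 120.0, "intensities": [100, 200, 300, 400]},
    {"id": "fragment", "rt": 120.2, "intensities": [10, 21, 29, 41]},
    {"id": "unrelated", "rt": 120.3, "intensities": [50, 10, 40, 5]},
    {"id": "distant", "rt": 300.0, "intensities": [1, 2, 3, 4]},
]
kept_ids = [s["id"] for s in remove_redundant_signals(signals)]
```

In this toy data only the co-eluting, co-varying signal is dropped; a signal that co-elutes but does not correlate with the intense peak is retained.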
To construct MS2T libraries, the extracts of five ecotypes were mixed and used for MS2T data acquisition. The analyses were repeatedly conducted for four mixtures by previously described methods (Matsuda et al., 2009c). Each MS2T entry was assigned a unique accession code, such as ATH10n03690, in which ATH10n is the name of the library and 03690 is the entry number. All data obtained in this study are available at the PRIMe website (Akiyama et al., 2008).

Databases and software
The ReSpect (RIKEN MS/MS spectra database for phytochemicals; 2011 January version), KNApSAcK (2010.12.24 version; Shinbo et al., 2006; Takahashi et al., 2008), and PRIMe standard compound database (2009 November version) were used in this study. The genetic polymorphism data from 20 Arabidopsis accessions were downloaded from the TAIR web site (Clark et al., 2007; Poole, 2007). All data processing procedures were conducted using in-house scripts written in Perl. Structural elucidation was performed as a batch search for all metabolite signals.
In the automated structural elucidation procedure, several thresholds were required to conduct the database searches. The thresholds used in this study are described in Figures 2 and 3. To search the MS/MS spectra, similarity scores were determined by the dot product method with a mass tolerance of 0.5 Da (Stein and Scott, 1994). Two spectra were considered similar when the similarity score was greater than 0.6. For hierarchical clustering analysis, log2-transformed, Z-scored signal intensity data were processed using MEV version 4.4 (Saeed et al., 2003, 2006).
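As an illustration of the spectral comparison, the sketch below computes a plain cosine (normalized dot product) score between two centroided spectra, pairing peaks within 0.5 Da and applying the 0.6 cutoff. The greedy one-to-one peak pairing and the absence of any intensity weighting are assumptions; the cited method and the in-house scripts may differ in these details.

```python
import math

def cosine_score(spec_a, spec_b, tol=0.5):
    """Cosine similarity of two spectra given as lists of (m/z, intensity).

    Peaks are paired one-to-one when their m/z values differ by at most
    `tol` Da; intense peaks of spec_a are paired first (assumed strategy).
    """
    used = set()
    dot = 0.0
    for mz_a, int_a in sorted(spec_a, key=lambda p: -p[1]):
        best = None
        for j, (mz_b, int_b) in enumerate(spec_b):
            if j in used or abs(mz_a - mz_b) > tol:
                continue
            if best is None or int_b > spec_b[best][1]:
                best = j
        if best is not None:
            used.add(best)
            dot += int_a * spec_b[best][1]
    norm_a = math.sqrt(sum(i * i for _, i in spec_a))
    norm_b = math.sqrt(sum(i * i for _, i in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_similar(spec_a, spec_b, tol=0.5, cutoff=0.6):
    """Apply the similarity cutoff quoted in the text (score > 0.6)."""
    return cosine_score(spec_a, spec_b, tol) > cutoff

# Invented spectra for demonstration only.
query = [(100.0, 1.0), (200.0, 1.0)]
reference = [(100.2, 1.0)]
score = cosine_score(query, reference)
```

Because each peak is used at most once, the score stays between 0 and 1 for non-negative intensities, so the 0.6 cutoff can be applied directly.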

Results
Acquisition of metabolome data from 20 Arabidopsis accessions
To investigate variations in the composition of secondary metabolites among Arabidopsis strains (accessions), metabolic profile data were obtained from the rosette leaves of 20 accessions of Arabidopsis by LC-ESI-Q-Tof/MS analysis (Matsuda et al., 2009c, 2010a). The 20 diverse accessions evaluated herein were previously selected by Clark et al. (2007) to investigate the genetic variations within the population of Arabidopsis. The analysis was conducted using five biological replicates of 20 accessions, from which a data matrix composed of 703 metabolite signals (peaks) was obtained (Table S1 in Supplementary Material). Here, the dataset was designated as AtMetExpress 20 Ecotypes and each metabolite signal was addressed by a unique ID, such as aen00884. Hierarchical clustering analysis of the dataset revealed that there were large variations in the metabolic profiles across 20 accessions, which should be derived from those genetic polymorphisms (Figure 1). To acquire information for structural elucidation of those metabolite signals, MS/MS spectra data were obtained from identical extracts by using the automated data acquisition methods described in Section "Materials and Methods." Since the analyses were conducted repeatedly, multiple MS/MS spectra data were recorded for each metabolite signal (Matsuda et al., 2009c). Consequently, MS/MS spectral tag (MS2T) libraries containing 126,889 accessions were constructed (Table 1). Each MS2T entry was assigned a unique ID such as ATH67n06391. Based on the MS2T data, the structure of each metabolite signal was elucidated by searching the databases.
For the case of a metabolite signal assigned as aen00884 (Rt 3.82 and m/z 422), the MS2T library contains 19 MS2T accessions acquired from the identical metabolite signal with various spectral quality (Figure 2A). In other words, the metabolite signal was tagged with 19 accessions of corresponding MS2Ts.
Each MS2T accession consists of the exact mass number of the precursor ion and MS/MS spectra data. Thus, MS/MS spectra data were submitted to the ReSpect database to identify metabolites producing similar MS/MS spectra. In the case of the MS2T accession ATH67n06391, the MS/MS spectrum was similar to that of 13 compounds, whose CAS numbers were obtained as search results (Figure 2B). Additionally, the exact mass number of the precursor ion was used to search the KNApSAcK database to find metabolites possessing a highly similar mass number, by which the CAS number of 1 metabolite was obtained. A common CAS number (499-30-9, 2-phenylethylglucosinolate) observed in the outputs of both the ReSpect and KNApSAcK searches indicated that it is a candidate structure of the metabolite signal deduced from the MS2T data. To improve the search quality, the procedure was repeated for all 19 MS2T accessions, and the same results were observed for 11 MS2Ts. Since identical metabolites were elucidated using two distinct search methods with high reproducibility (>50%), it is likely that the metabolite signal was derived from 2-phenylethylglucosinolate or its structural isomers. Based on the MSI standard, the metabolite signal could be putatively annotated using the automated structure elucidation procedure (Figure 2D). Furthermore, an automated search of the PRIMe standard compound database revealed that the authentic compound of 2-phenylethylglucosinolate, CAS 499-30-9, was also detected at a similar retention time and mass number as the queried metabolite signal.
Since three distinct pieces of information, including the MS/MS spectra, exact mass number, and chromatographic behavior, were matched to the identical metabolite, the metabolite signal was identified as 2-phenylethylglucosinolate (Figure 2F).
Among the 703 metabolite signals in the AtMetExpress 20 ecotype dataset, 25 and 106 peaks could be identified and putatively annotated, respectively, using the procedure described above (Table S1 in Supplementary Material). Additionally, comparison with the manually curated results produced in our previous study (Matsuda et al., 2010a) revealed no significant error among the 32 commonly annotated metabolite signals.
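The cross-check between the two searches and the reproducibility criterion can be sketched as below. This is a simplified illustration under stated assumptions: each MS2T is represented by two sets of CAS numbers (ReSpect hits and KNApSAcK hits), the retention time and m/z comparison against the standard compound database is assumed to have been done beforehand and is passed in as a set of matching CAS numbers, and the placeholder identifiers are invented.

```python
from collections import Counter

def annotate_signal(ms2t_hits, matched_standards, min_fraction=0.5):
    """Return (MSI-style level, CAS number) for one metabolite signal.

    ms2t_hits: list of (respect_cas_set, knapsack_cas_set), one per MS2T.
    matched_standards: CAS numbers of authentic compounds that matched the
    signal's retention time and m/z (hypothetical precomputed input).
    """
    votes = Counter()
    for respect_cas, knapsack_cas in ms2t_hits:
        # Only CAS numbers returned by both searches count for this MS2T.
        votes.update(respect_cas & knapsack_cas)
    if not votes:
        return ("unknown", None)
    cas, count = votes.most_common(1)[0]
    if count / len(ms2t_hits) <= min_fraction:
        return ("unknown", None)
    if cas in matched_standards:
        return ("identified", cas)
    return ("putatively annotated", cas)

# Example shaped like the case in the text: 19 MS2Ts, 11 of which give the
# same CAS number in both searches. 'placeholder-*' values are invented.
hits = [({"499-30-9", "placeholder-A"}, {"499-30-9"})] * 11
hits += [({"placeholder-B"}, {"499-30-9"})] * 8

with_standard = annotate_signal(hits, matched_standards={"499-30-9"})
without_standard = annotate_signal(hits, matched_standards=set())
```

With 11 of 19 MS2Ts in agreement (about 58%), the signal passes the >50% criterion; whether it is reported as identified or putatively annotated then depends only on the standard compound match.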

Processing of putatively characterized metabolites
In addition to the identification and putative annotation using the CAS metabolite identifiers, putative characterization of the metabolite signals was conducted by introducing the metabolite ontology system. The procedure is explained using the metabolite signal described above as an example (peak ID: aen008844). For each MS2T accession tagged to the metabolite signal, MS/MS spectra data and the exact mass number were used for ReSpect (Figure 3A) and KNApSAcK (Figure 3B) searches. The compound ontology information instead of CAS identifiers was obtained as outputs in these procedures. The outputs of KNApSAcK and ReSpect searches were compared to identify a common result, which is a compound ontology estimated from the MS2T accession. Repeated searching for 19 MS2T accessions of aen008844 resulted in 11 MS2Ts being identified as glucosinolate based on

Preparation of standard compound databases and the compound ontology system
Three distinct databases, KNApSAcK, ReSpect, and the PRIMe standard compound database, were employed for the structural elucidation (Table 1). ReSpect is a new web data resource that incorporates records from existing literature as well as the MS/MS data from our standard compounds. This database contains 8,444 records corresponding to 3,595 metabolites. ReSpect is the first tool for annotation of phytochemicals that is based on downloadable MS/MS data resources and databases (Sawada et al., in preparation). KNApSAcK is a comprehensive species-metabolite relationship database developed by the Kanaya lab in NAIST (Shinbo et al., 2006; Takahashi et al., 2008). KNApSAcK contains the structural data of 50,048 metabolites and 101,500 metabolite-species pairs. In this study, KNApSAcK was used to elucidate molecular formulas of candidate metabolites from the high-resolution mass spectra data. The PRIMe standard compound database contains retention time and m/z data of 600 authentic compounds acquired using an identical analytical method (Matsuda et al., 2009c). For the automated metabolite annotations, accessions in these databases were assigned corresponding CAS identifiers.
Since CAS identifiers address structurally confirmed metabolites (Matsuda et al., 2009a), the metabolite annotation procedure based on the identifier cannot deal with information describing partially characterized metabolites. For example, metabolite structures were often estimated only to the level of a compound class such as "kaempferol glycoside" or "amino acid derivative" (Bottcher et al., 2007; Iijima et al., 2008; Matsuda et al., 2010a). In the case of gene annotation, each gene is tentatively annotated by gene ontology terms that are manually assigned or automatically estimated from sequence similarities. Although detailed compound ontology systems and vocabularies have been developed using several databases such as ChEBI and KEGG (Degtyarenko et al., 2008; Kanehisa et al., 2008; Matsuda et al., 2009a), a simple compound ontology system was newly introduced in this study to cover the wide range of phytochemicals. Here, entries in the PRIMe databases were classified within three levels, ranging from basic (Class 1) to detailed (Class 3), by considering the basic skeleton and modified parts of metabolites (Table S2 in Supplementary Material). The ontology terms prepared in this study are not comprehensive, since the classification system was arbitrarily prepared by manually curating the entries of the ReSpect MS/MS spectra database to assist structural elucidation of metabolome data. For instance, partially characterized metabolites could be classified as follows: kaempferol-3,7-dirhamnoside is a member of Class 1: flavonoid, Class 2: flavonol, and Class 3: kaempferol glycoside; tryptophan is a member of Class 1: amino acid and Class 2: tryptophan; and pinoresinol-dihexoside is a member of Class 1: phenylpropanoid, Class 2: lignan, and Class 3: pinoresinol glycoside.
These metabolite classifications have been assigned to all accessions in the ReSpect and PRIMe standard compound databases. A detailed classification study is currently in progress for KNApSAcK, and 60% of the accessions in this database have been assigned to Class 1 or 2.
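One simple way to hold such a three-level classification is a mapping from compound name to a (Class 1, Class 2, Class 3) tuple, as sketched below with the three examples given above. The data layout and helper names are assumptions for illustration and do not describe how the PRIMe databases store these terms.

```python
# Class 1 to Class 3 terms for the examples quoted in the text;
# None marks a level that has not been assigned.
ONTOLOGY = {
    "kaempferol-3,7-dirhamnoside": ("flavonoid", "flavonol", "kaempferol glycoside"),
    "tryptophan": ("amino acid", "tryptophan", None),
    "pinoresinol-dihexoside": ("phenylpropanoid", "lignan", "pinoresinol glycoside"),
}

def classes_of(compound):
    """Return the assigned class terms, from basic to detailed."""
    return [term for term in ONTOLOGY[compound] if term is not None]

def deepest_shared_level(compound_a, compound_b):
    """Number of leading class levels (0 to 3) on which two compounds agree."""
    shared = 0
    for term_a, term_b in zip(ONTOLOGY[compound_a], ONTOLOGY[compound_b]):
        if term_a is None or term_a != term_b:
            break
        shared += 1
    return shared

tryptophan_classes = classes_of("tryptophan")
```

Keeping the levels ordered from basic to detailed makes it straightforward to fall back to a coarser class when the detailed one is not supported.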

Identification and putative annotation using CAS identifiers
Based on the MS2T libraries and reference databases, the metabolite signals in the AtMetExpress 20 ecotypes dataset were identified or putatively annotated using the following automated procedure. For

pattern. An additional KNApSAcK search suggested that a plausible candidate of the metabolite is fraxin (CAS 524-32-1), although the position of glycosylation is unclear. Using a similar procedure, a metabolite putatively characterized as Class 1: phenylpropanoid (aen012096: Rt 3.848 min, m/z 501) was found to be malonylhexosyl-sinapate (Figure 4B).
Structural elucidation of the putative phenylpropanoid aen006925 (Rt 5.762, m/z 367) indicated that this metabolite is a hexoside of an unknown aglycone (Figure 4C). Because the molecular formula of the aglycone was deduced to be C11H9O4 (m/z 205.0502 obsd, m/z 205.0500 theor), the aglycone should be a methylated hydroxy-coumarin (according to the presence of four oxygen atoms, the aglycone should contain at least two hydroxy groups on the coumarin moiety), or dimethoxycoumarin. Thus, the compound aen006925 can be a glycoside (or C-glycoside) of these two aglycones, both of which are novel Arabidopsis metabolites. While strict structural elucidation must be conducted following the protocols accepted for natural product chemistry (Nakabayashi et al., 2009; Matsuda et al., 2010b), the results presented here demonstrate that a portion of the phytochemical diversity in Arabidopsis could be elucidated from MS/MS spectra via automated structural elucidation.
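The agreement between the observed and theoretical m/z values quoted for the aglycone fragment can be expressed as a relative mass error in parts per million. The short sketch below uses only the two values given in the text; the 5 ppm acceptance limit is a hypothetical parameter chosen for the example, not a threshold reported in this study.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def within_tolerance(observed_mz, theoretical_mz, max_ppm=5.0):
    """True when the absolute mass error does not exceed max_ppm (assumed limit)."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= max_ppm

# Values quoted in the text for the C11H9O4 fragment.
error = ppm_error(205.0502, 205.0500)
```

For these values the error is close to 1 ppm, which is consistent with the formula assignment being driven by the high-resolution measurement.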
the Class 1 ontology. The Class 2 ontology benzylglucosinolate was not accepted, because the result was estimated from only 2 MS2T accessions. Using the procedure, the metabolite signal was successfully characterized as glucosinolate based on the Class 1 ontology (Figure 3C).
This procedure was conducted for all metabolite signals of the AtMetExpress 20 Ecotype dataset, and 188 among 703 metabolite signals were automatically characterized. In the case of Class 1 ontology, 1 alkaloid, 7 amino acids, 33 flavonoids, 68 glucosinolates, 47 phenylpropanoids, 4 terpenoids, and 28 other characterizations were assigned to the metabolome data (Table S2 in Supplementary Material).
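The ontology-based characterization in this section amounts to a vote over the MS2Ts tagged to a signal. The sketch below assumes that each MS2T has already been reduced to the set of (class level, term) pairs common to the ReSpect and KNApSAcK outputs, and that a term is accepted when more than half of the MS2Ts support it. The actual acceptance thresholds are those given in Figure 3, so the 50% value here is only an assumption borrowed from the CAS-based procedure.

```python
from collections import Counter

def characterize(ms2t_terms, min_fraction=0.5):
    """Return {class level: term} for terms supported by enough MS2Ts.

    ms2t_terms: one set of (level, term) pairs per MS2T, holding the
    ontology terms found by both database searches for that MS2T.
    """
    total = len(ms2t_terms)
    votes = Counter()
    for terms in ms2t_terms:
        votes.update(terms)
    accepted = {}
    for (level, term), count in votes.items():
        if count / total > min_fraction:
            accepted[level] = term
    return accepted

# Example shaped like the case in the text: 19 MS2Ts, 11 supporting the
# Class 1 term and only 2 of those also supporting a Class 2 term.
ms2t_terms = [{(1, "glucosinolate"), (2, "benzylglucosinolate")}] * 2
ms2t_terms += [{(1, "glucosinolate")}] * 9
ms2t_terms += [set()] * 8

result = characterize(ms2t_terms)
```

Here the Class 1 term is accepted (11 of 19 MS2Ts) while the Class 2 term is rejected (2 of 19), mirroring the outcome described for the example signal.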

Structural elucidation of metabolite signals using the database search results
Based on the results obtained using the automated methods, the structures of the novel Arabidopsis metabolites were manually elucidated. Among the putatively characterized metabolite signals, the metabolite signal aen006966 (Rt 4.051 min and m/z 369) was putatively characterized as being in Class 1: phenylpropanoid. The MS/MS spectral data for ATH67n05643 (Figure 4A) indicated that the metabolite would be a coumarin hexoside based on the fragment

Figure 2 | Procedure for peak identification and putative annotation using CAS identifiers. For the case of a metabolite signal, aen00884, the MS2T library contains 19 MS2T accessions acquired from the identical metabolite signal (A). MS/MS spectra data in MS2T were submitted to the ReSpect database (B). The exact mass number data of the precursor ion was used to search the KNApSAcK database (C). Since a common CAS number (499-30-9, 2-phenylethylglucosinolate) in the outputs of both searches was observed for 11 MS2Ts, the metabolite signal was putatively annotated as 2-phenylethylglucosinolate (D). Since the authentic compound of 2-phenylethylglucosinolate, CAS 499-30-9, was also detected at a similar retention time and mass number (E), the metabolite signal was identified as 2-phenylethylglucosinolate (F).

Phenotypic variations across Arabidopsis accessions
The structural elucidation based on the compound ontology information enabled us to deal with putatively characterized metabolite signals such as glucosinolates and flavonoids without strict metabolite identification or annotation. Here, the natural variations in accumulation levels among 20 Arabidopsis accessions were compared for metabolites belonging to lignans, amino acids, flavonoids, and glucosinolates (Figure 5). The metabolites assigned as lignans (Figure 5A) and amino acids (Figure 5B) were constitutively accumulated with small natural variations, suggesting that the production of those metabolites is essential for Arabidopsis (Matsuda et al., 2010a). Indeed, more than 10 genes encoding the dirigent protein for lignan biosynthesis are present in the Arabidopsis genome (Burlat et al., 2001; Davin and Lewis, 2005; Nakatsubo et al., 2008). This redundancy would contribute to the constitutive production of lignans, although the details regarding their physiological role in the growth of Arabidopsis remain unknown. In contrast, the metabolites identified as flavonoids (Figure 5C) and glucosinolates (Figure 5D) tended to show larger natural variations among the 20 accessions. These results suggest that the levels of flavonoids and glucosinolates in rosette leaves are controlled by genetic polymorphisms, which would contribute to the adaptation of each accession to local environments (Li et al., 2008; Bednarek and Osbourn, 2009; Janowitz et al., 2009; Sawada et al., 2009; Manzaneda et al., 2010; De Kraker and Gershenzon, 2011).
To investigate the association between large variations in metabolic phenotypes and genetic polymorphisms, we considered the levels of 3-hydroxy-n-propylglucosinolate (aen007244) among 20 accessions. Despite significant production in Bor-4, Tsu-1, Bay-0, and Ler-1, the glucosinolate was not detected in other accessions, including Col-0 (Figure 6A). Single nucleotide polymorphisms (SNPs) that commonly occurred in Bor-4, Tsu-1, Bay-0, and Ler-1 and did not occur in other accessions were searched against the re-sequence data produced by Clark et al. (2007). The results revealed that 80 of the 96 corresponding SNPs formed a linkage disequilibrium (LD) block along the long arm of chromosome 4 (Figure 6B). Among the 28 ORFs in the 11-kb region (from At4g02870 to At4g03090), there is an enzyme gene responsible for the last step of hydroxyalkylglucosinolate biosynthesis (AOP3, At4g03050). Although the biological meaning of the LD is unclear, the association between the natural variations in the 3-hydroxy-n-propylglucosinolate levels and genetic polymorphisms in the AOP3 gene has been reported (Kliebenstein et al., 2001; Wentzell et al., 2007).

Discussion
A framework for the automated structural elucidation of LC-MS metabolome data was developed to investigate the structural diversity of phytochemicals. Although the framework requires a large amount of structure-related information (MS2T library) and intensive searches of large databases (Figures 2 and 3), the processing of the AtMetExpress 20 ecotype dataset (Figure 1) demonstrated that the method is able to reasonably estimate metabolite structures. By referring to the automatically assigned information, the effort required for the manual curation of metabolome data could be drastically reduced (Figure 4), which accelerated the investigation of natural variations in the Arabidopsis secondary metabolites (Figures 5 and 6). These results demonstrated that the framework is effective for the structural elucidation of LC-MS metabolome data, although several technical improvements are required for more comprehensive annotation of the metabolites.
Since the MS/MS spectra database is one of the most important kernels in the framework (Figures 3 and 4), the search results are highly dependent on the database quality. For example, processing of the AtMetExpress 20 ecotype dataset failed to identify metabolites belonging to alkaloids and terpenoids, probably because the current version of ReSpect contains few entries for those metabolites, in contrast to the rich flavonoid and glucosinolate data. This bias is derived from the available standard compounds and published MS/MS spectra data. However, the data dependency indicated that further enrichment of the MS/MS spectra database by the addition of alkaloids, terpenoids, and other phytochemicals could directly improve the results of the structural elucidation. To promote the integration and sharing of spectral data, all ReSpect contents were opened to the public from the PRIMe website (Table 1).
Structures elucidated by an automated method are expected to contain incorrect hits derived from errors in mass analyses, indicating that the false discovery rate (FDR) of large-scale search results must be evaluated (Matsuda et al., 2009b; Saito and Matsuda, 2010). In the case of the homology searches of gene sequences, the levels of FDR could be controlled using a probability-based searching algorithm such as BLAST (Altschul and Erickson, 1985). In this study, the cosine product (dot product) method was employed to search MS/MS spectra because it is robust enough to identify identical spectra (Stein and Scott, 1994). A drawback of this method is FDR control, since the similarity score is not based on probability. To reduce false positives, the output obtained from the ReSpect search was compared with that derived from KNApSAcK to identify common results (Figures 2 and 3). The cross-check strategy should reduce false-positive hits, but many metabolite signals were assigned with no structural information. In the case of the AtMetExpress 20 Ecotype dataset, 94% of 703 metabolite signals were tagged by at least one MS2T, and the metabolite structures could be estimated to some extent for approximately 30% of the signals (Table S1 in Supplementary Material). Further development of a probability-based algorithm to determine the similarity between MS/MS spectra is required to increase the numbers of structurally elucidated metabolite signals while controlling FDR (Mylonas et al., 2009).
In the framework developed herein, putative characterization of the metabolite signal could be attained by introducing a new simple ontology system to cover the wider range of plant metabolites. Additionally, the performance of the ontology system was demonstrated for the AtMetExpress 20 ecotype datasets, which revealed the diversity of secondary metabolites in Arabidopsis based on structural elucidation using the putatively characterized information. The comparison of levels of putatively characterized metabolites revealed the genetic background of metabolotype variations, which would facilitate the analysis of these associations with genetic polymorphism and evolution.

Acknowledgment
We would like to thank Drs. K. Hanada, K. Akiyama, T. Sakurai, R. Niida, and A. Takahashi (RIKEN PSC) for their useful comments regarding this manuscript and their technical support. This work was partly supported by a grant from the Ministry of Agriculture, Forestry and Fisheries of Japan (Genomics for Agricultural Innovation, NVR-0005).

Supplementary material
The Supplementary Material for this article can be found online at http://www.frontiersin.org/Plant_Physiology/10.3389/fpls.2011.00040/abstract/