Edited by: Marta Álvarez, Instituto Español de Oceanografía (IEO), Spain
Reviewed by: John Robert Helms, Morningside College, United States; Arvind Singh, Physical Research Laboratory, India; Helena Osterholz, University of Oldenburg, Germany
*Correspondence: Daniel Petras; Lihini I. Aluwihare; Pieter C. Dorrestein
This article was submitted to Marine Biogeochemistry, a section of the journal Frontiers in Marine Science
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Dissolved organic matter (DOM) is arguably one of the most complex exometabolomes on earth and comprises thousands of compounds that together contribute more than 600 × 10^15 g carbon. This reservoir is primarily the product of interactions within the upper ocean's microbial food web, yet abiotic processes that occur over millennia have also modified many of its molecules. The compounds within this reservoir play important roles in determining the rate and extent of element exchange between inorganic reservoirs and the marine biosphere, while also mediating microbe-microbe interactions. As such, there has been a widespread effort to characterize DOM using high-resolution analytical methods including nuclear magnetic resonance spectroscopy (NMR) and mass spectrometry (MS). To date, molecular information in DOM has been primarily obtained through calculated molecular formulas from exact mass. This approach has the advantage of being non-targeted, accessing the inherent complexity of DOM. Molecular structures, however, remain elusive, and the most commonly used instruments are costly. More recently, tandem mass spectrometry has been employed to more precisely identify DOM components through comparison to library mass spectra. Here we describe a data acquisition and analysis workflow that expands the repertoire of high-resolution analytical approaches available to access the complexity of DOM molecules that are amenable to electrospray ionization (ESI) MS. We couple liquid chromatographic separation with tandem MS (LC-MS/MS) and a data analysis pipeline that integrates peak extraction from extracted ion chromatograms (XIC), molecular formula calculation and molecular networking. This provides more precise structural characterization.
Although only around 1% of detectable DOM compounds can be annotated through publicly available spectral libraries, community-wide participation in populating and annotating DOM datasets could rapidly increase the annotation rate and should be broadly encouraged. Our analysis also identifies shortcomings of the current data analysis workflow that need to be addressed by the community in the future. This work will lay the foundation for an integrative, non-targeted molecular analysis of DOM which, together with next generation sequencing, meta-proteomics and physical data, will pave the way to a more comprehensive understanding of the role of DOM in structuring marine ecosystems.
On the surface of the ocean, unicellular photosynthetic organisms fix as much atmospheric CO2 into organic carbon as their terrestrial, multicellular counterparts, despite the standing biomass of marine primary producers being just 1% of the terrestrial biosphere (Siegenthaler and Sarmiento,
From an analytical perspective DOM poses a special challenge. A single sample can be comprised of tens of thousands of individual molecules that together rarely exceed 1 mg C/L. The true chemical complexity of DOM is unknown because extraction methods capable of isolating this fraction from the much more abundant salts in seawater are not 100% efficient. The most widely used method is solid phase extraction (SPE) using the sorbent PPL, a proprietary functionalized, reversed phase, hydrophobic, styrene-divinylbenzene polymer (Dittmar et al.,
Recently, studies of DOM have focused on targeted molecules relevant for particular biogeochemical processes. These molecules have been identified in culture experiments and/or detected in field samples to highlight some of the important microbial interactions in the surface ocean (Amin et al.,
Still, thousands of unidentified molecules are present in DOM and uncovering their roles in elemental cycling and marine microbial ecology requires an unconstrained and non-targeted approach. Non-targeted studies aim to examine temporal and spatial variability of all detectable metabolites (specific to isolation method and analytical method). With the appropriate data analysis tools, this approach has the power to identify relevant metabolites and “metabolic interdependencies” at a faster pace (Sogin et al.,
In non-targeted LC-MS/MS experiments, tandem mass spectra are often acquired in data dependent acquisition (DDA), where the mass spectrometer decides in real time, based on MS1 survey scans, which ions to submit for subsequent MS/MS scans. This approach, paired with the high acquisition speed (>1 Hz) of state-of-the-art instruments, results in thousands of spectra per LC-MS/MS run. For reliable data analysis and reproducible interpretation of the results, bioinformatic workflows including comprehensive databases and statistical significance estimation are crucial (da Silva et al.,
Our analytical workflow, shown in Figure
Overview of data acquisition and data analysis workflow. After sampling of sea water (1), dissolved organic matter is extracted by solid phase extraction (2) and samples are analyzed by non-targeted high-resolution liquid chromatography tandem mass spectrometry (HR LC-MS/MS) in data dependent acquisition mode (3). Following raw data acquisition (4), extracted ion chromatograms (XICs) are created and relative feature intensities are defined through integration of peak areas (5). For molecular annotation of features, molecular formulas based on exact masses of the XICs from negative and positive ionization experiments are calculated and spectrum library comparison of MS/MS spectra as well as spectral networking is performed (5). To facilitate the analysis of data, multivariate statistics such as Principal Coordinate Analysis (PCoA) can be performed in order to display sample-sample distance, and metabolite data can be interpreted in the context of oceanographic features and microbial community composition (6).
The overall goal of this study was to devise an LC-MS/MS data acquisition and analysis workflow that holds the potential to access the molecular level complexity of the marine DOM reservoir. The biggest barrier to any comprehensive molecular level survey of DOM composition is that a salt-free, concentrated sample is required, which means that DOM must be isolated from seawater (Dittmar et al.,
We assessed extraction efficiency by measuring DOC concentrations in the bulk seawater sample and permeate of each replicate. Detailed results are shown in Figure
The simplest assessment of DOM MS data was performed by comparing overall intensities of total ion currents (TIC) (Figure
Total Ion Current (TIC) of different extraction volumes in positive
The overall number of features (defined as a single peak in an extracted ion chromatogram; XIC) in each sample was determined based on thresholds and deconvolution settings in the software tool used (MZmine2) and the strict requirement that features must appear in the XIC of 4 of 5 replicates. After blank subtraction, the number of features increased with injection concentration, and more features were identified in positive mode than in negative mode. The overall sum of features observed across all groups was 13,987 in positive mode and 7,328 in negative mode. The number of positive (negative) features (Figure
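The replicate requirement and blank subtraction described above can be sketched in a few lines. This is a minimal illustration, not the MZmine2 implementation: it assumes features have already been aligned across runs and carry hypothetical string identifiers, it reads "4 of 5 replicates" as "at least 4", and it simplifies blank subtraction to removing any feature that is also present in the blank.

```python
from collections import Counter


def reproducible_features(replicate_feature_sets, blank_features, min_replicates=4):
    """Keep features seen in at least `min_replicates` replicates and absent from the blank.

    replicate_feature_sets: iterable of sets of aligned feature IDs (hypothetical IDs).
    blank_features: set of feature IDs detected in the process blank.
    """
    counts = Counter()
    for features in replicate_feature_sets:
        # set() guards against a feature being listed twice in one replicate
        counts.update(set(features))
    return {
        feature
        for feature, n in counts.items()
        if n >= min_replicates and feature not in blank_features
    }


# Hypothetical feature IDs for five replicates of one sample group
replicates = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "b", "x"},
]
blank = {"a"}
print(reproducible_features(replicates, blank))
```

In this toy input, feature "a" is reproducible but also occurs in the blank, "c" appears in only three replicates, and "x" in one, so only "b" is retained.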
The two samples with the highest extraction efficiency (100 mL/0.2 g and 1,000 mL/1 g) also had the highest proportion of unique MS features in positive mode (45 and 41%, respectively, compared with 22 and 33% for the other two samples, absolute values shown in Table
From a global dataset perspective, the number of features with assigned formulas within a 5 ppm mass error tolerance increased with injected sample concentration. This finding is consistent with the hypothesis that low S/N ratios will be omitted from the molecular formula assignment calculation at lower injection concentrations. For instance, the number of features with assigned formulas within a 5 ppm mass error tolerance is slightly lower for the smaller extraction volumes (shown in Figure
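The 5 ppm tolerance used for formula assignment is a relative mass error. A minimal sketch of the check follows; the function names and the example m/z values are hypothetical and chosen only to illustrate the arithmetic.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Signed relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(measured_mz, theoretical_mz, tol_ppm=5.0):
    """True if the measured m/z lies within +/- tol_ppm of the theoretical m/z."""
    return abs(ppm_error(measured_mz, theoretical_mz)) <= tol_ppm


# Hypothetical values: a theoretical ion at m/z 300.0000
print(within_tolerance(300.0012, 300.0))  # about 4 ppm
print(within_tolerance(300.0018, 300.0))  # about 6 ppm
```

At m/z 300, a 5 ppm window corresponds to only ±0.0015 m/z units, which is why accurate mass calibration matters for the number of assignable formulas.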
Visualization of molecular diversity by Van Krevelen diagram and feature distribution over sample and chemical space. In
To increase the confidence of molecular formula annotation, we made our data analysis more stringent through the alignment of XICs from both positive and negative ionization modes. This alignment groups different ion species from the same molecule e.g., adducts (M+H+, M+Na+, M-H+, and M+Cl−). If two or more matching ion species were aligned, a consensus molecular formula (the highest ranked common formula) was created. The overall number of consensus features of all volume groups was 3,060 and resulted in 2,600 molecular formulas (shown in the Supporting Information). To display the chemical space of the molecular formulas observed in the different groups, we created Van Krevelen diagrams, displaying the H/C vs. O/C ratios of the molecular formulas, differentiated between positive mode, negative mode and consensus formulas (Figure
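The Van Krevelen coordinates mentioned above are simple element ratios computed from each molecular formula. The sketch below assumes formulas are given as plain strings without brackets, charges, or isotope labels; the parser is a simplified illustration and not the in-house script used in this study.

```python
import re


def parse_formula(formula):
    """Count elements in a simple formula string such as 'C8H15NO6'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def van_krevelen_coordinates(formula):
    """Return (O/C, H/C) for a molecular formula."""
    counts = parse_formula(formula)
    carbon = counts.get("C", 0)
    if carbon == 0:
        raise ValueError("formula contains no carbon")
    return counts.get("O", 0) / carbon, counts.get("H", 0) / carbon


# N-acetylglucosamine, C8H15NO6
o_to_c, h_to_c = van_krevelen_coordinates("C8H15NO6")
print(o_to_c, h_to_c)
```

Plotting H/C against O/C for every assigned formula gives the diagram; formulas sharing the same ratios fall on the same point regardless of molecular size.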
In order to loosely categorize “likely” and “unlikely” H/C and O/C ratios in DOM, we also mapped out the distributions of previously characterized metabolite formulas in Van Krevelen space using structures < 500 Da from the Supernatural database (Banerjee et al.,
Statistical comparison to known compounds from large scale libraries can help to increase the confidence of molecular formula assignments. The frequency of individual molecular formulas in the Supernatural database provides an empirical basis to demonstrate that many molecules share the same molecular formula, but not the same chemical structure. The frequency of molecular formulas in this database ranges from many unique formulas to formulas shared by several hundred structures, with an average of 5.5 structures per molecular formula. Chromatographic retention times (e.g., Figures
For this reason, we examined the efficacy of tandem MS in data dependent acquisition (DDA) mode to provide molecular level information for DOM in an untargeted context. Unlike MS1 feature detection and molecular formula assignment, MS/MS spectra do not depend on chromatographic reproducibility and allow comparison between different instrument platforms if the same fragmentation methods and similar fragmentation energies are used. Tandem mass spectra matching to library entries are considered level two annotations according to the 2007 metabolomics standards initiative (Sumner et al.,
The spectral network of all samples acquired is shown in Figure
Molecular Network. Global spectral network from all sample groups is shown in
The overall library annotation rate of the dataset was 0.5%, and around 1% for the two 1,000 mL groups (1.48 and 1.86 μmol injected C). The low annotation efficiency is likely the result of a combination of factors, all of which need to be addressed in future work.
One important reason for the low annotation efficiency is that more than one molecule is isolated prior to fragmentation in the collision cell, which often results in chimeric spectra (i.e., DOM MS/MS spectra are often a combination of fragments from multiple molecules with very similar masses that could not be separated by the unit resolution of the quadrupole and, naturally, yield lower matching scores to library MS/MS spectra). The general field of metabolomics has been grappling with the issue of chimeric spectra. Besides technical improvements in chromatographic or gas phase precursor separation (multi-dimensional chromatography or ion mobility), repetitive or large scale analysis of different samples could help to bypass this problem. Here, a possible solution could be an alignment of numerous chimeric spectra and searching for consensus fragments. However, the bioinformatic tools for the detection (Lawson et al.,
This still poses a universal barrier to untargeted MS/MS analyses. This problem is independent from the data acquisition parameters and can be solved by expanding the chemical space in spectral libraries. Community driven databases such as GNPS (Wang et al.,
Besides looking for perfect matches, spectral libraries were searched for spectral similarity/analog hits (the same way sequence libraries are used to search for homolog hits with BLAST Altschul et al.,
Following this logic, we could observe a subnetwork of amino-sugars (Figure
To further evaluate the effect of concentration on spectral quality, we examined the distribution of MS/MS library matches across the different injection concentrations. Acetamido-oxohexanoic acid, for example, was found in at least 4 of 5 replicates across all samples except for the lowest concentration injections. N-Acetyl-glucosamine and B27A19 were present in the three highest concentration samples as well, but only occurred in some of the replicates. The underlying reason could be the automatic triggering of MS/MS acquisition through data dependent acquisition (DDA). If a precursor ion is not among the most abundant ions in a survey scan then it will not be selected for subsequent MS/MS. By using a dynamic exclusion list for precursors that had already been submitted, we assumed that DDA would consider most of the ions, depending on the complexity of a given time point in the LC run, and the scan speed of the mass spectrometer. Nevertheless, for medium and low abundant compounds, the machine might not have had enough time to acquire MS/MS scans, and for some compounds, small shifts in chromatographic profiles may have changed the order of MS/MS selection, which triggered MS/MS acquisition in some but not all samples. Repeated measurements can alleviate this bias and increase annotation rates.
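The precursor selection bias described above can be illustrated with a toy model of top-N selection with dynamic exclusion. This is a deliberately simplified sketch, not the instrument's firmware logic: it assumes exact m/z matching for exclusion, no exclusion time window, and an unchanging survey scan; all m/z and intensity values are hypothetical.

```python
def select_precursors(survey_scan, excluded, top_n=5):
    """Pick the top_n most intense ions not yet fragmented and add them to the exclusion set.

    survey_scan: dict mapping precursor m/z to intensity (hypothetical values).
    excluded: set of m/z values already submitted for MS/MS (modified in place).
    """
    candidates = sorted(
        ((intensity, mz) for mz, intensity in survey_scan.items() if mz not in excluded),
        reverse=True,
    )
    chosen = [mz for _, mz in candidates[:top_n]]
    excluded.update(chosen)
    return chosen


# Four co-eluting ions, two MS/MS scans per duty cycle, three cycles
scan = {200.1: 1.0e6, 300.2: 5.0e5, 400.3: 1.0e5, 500.4: 5.0e4}
excluded = set()
cycles = [select_precursors(scan, excluded, top_n=2) for _ in range(3)]
print(cycles)
```

The two least abundant ions are fragmented only in the second cycle. If they elute for less than two duty cycles, or if a small retention time shift changes the ranking, they receive no MS/MS scan in that run, which is consistent with annotations appearing in some replicates but not others.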
To test the initial assumption of intensity dependency, we plotted the sample frequency (number of samples contributing to one consensus spectrum) of all network nodes, shown in Figure
Network Node Frequency Plot. Frequency of MS/MS features in all 20 samples. Frequency is hereby defined as the number of samples contributing to one consensus spectrum. Each dot represents one network node. Library annotations of selected nodes are shown above. The order of nodes was sorted first by frequency and in a second level by precursor intensity which is drawn on the second y-axis. The average intensity of frequency bins is shown as a solid black line.
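The node frequency and the sort order used in this plot can be reproduced with a short sketch. The node and sample identifiers and the intensities below are hypothetical; the input layout (one list of contributing samples per network node) is an assumption about how the network table might be exported.

```python
def node_frequencies(node_samples):
    """Number of distinct samples contributing to each node's consensus spectrum."""
    return {node: len(set(samples)) for node, samples in node_samples.items()}


def sort_nodes(node_samples, precursor_intensity):
    """Order nodes by frequency, then by precursor intensity, both descending."""
    freq = node_frequencies(node_samples)
    return sorted(freq, key=lambda node: (-freq[node], -precursor_intensity[node]))


def singleton_fraction(node_samples):
    """Fraction of nodes that are found in exactly one sample."""
    freq = node_frequencies(node_samples)
    return sum(1 for n in freq.values() if n == 1) / len(freq)


# Hypothetical nodes, samples, and precursor intensities
nodes = {
    "n1": ["s1", "s2", "s3"],
    "n2": ["s1"],
    "n3": ["s2", "s2", "s4", "s5"],  # repeated spectra from s2 count once
}
intensity = {"n1": 1.0e5, "n2": 1.0e6, "n3": 2.0e5}
print(sort_nodes(nodes, intensity))
print(singleton_fraction(nodes))
```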
If we inspect the nodes that occur at lower frequencies, we can see that the average precursor intensity decreases accordingly, and that only around 10% of the nodes are shared by 10 or more samples (mainly from the 1,000 mL groups with 1.46 and 1.86 μmol injected C). Of the low abundance compounds, more than 50% of all nodes are only found in one sample. Pantothenic acid, a cofactor involved in fatty acid and secondary metabolite biosynthesis, provides an example of such a compound. In addition to the problem of precursor selection described above, chimeric spectra, as noted previously, may be another reason for the high number of unique nodes. Chimeric spectra can result in an artificial diversification of MS/MS patterns, which then appear as unique nodes. The repetitive analysis of the same sample can help to bypass this problem, for example through an alignment of chimeric spectra and a search for consensus fragments. However, the bioinformatic tools for the detection (Lawson et al.,
Our results show the successful implementation and assessment of a non-targeted LC-MS/MS workflow for the analysis of DOM. We tested different sample volumes and sample volume to cartridge bed mass ratios. Both MS1 and MS/MS results indicate that the 1,000 mL sample groups with higher total carbon concentration showed the most features and the most database annotations. Given the generally low variability of DOC concentrations in sea water (40–80 μmol L−1; Hansell,
Surface water from the Scripps Pier (La Jolla, USA) was collected with a bucket and transported in a 20 L PTFE carboy on February 2nd 2017 (10:05:39 (PST): temperature 14.58°C, chlorophyll
Before use, the cartridges were rinsed and activated with one cartridge volume of methanol (LC-MS grade, Fisher Chemical, Belgium) and refilled with methanol for conditioning overnight (see Figure
DOC concentrations were analyzed by high-temperature catalytic combustion using a TOC-VCPH/CPN Total Organic Carbon Analyzer equipped with an ASI-V autosampler (Shimadzu, Japan). Standard solutions ranging from 10 to 100 μmol C L−1 were used for calibration, and Deep Atlantic Seawater reference material (DSR, D. A. Hansell, University of Miami, Florida, USA) as well as Deep Pacific Seawater (CCE P1604) and Scripps Pier Water reference material were measured to control for instrumental precision (1 μmol L−1) and accuracy (1.5 μmol L−1). Aliquots of the acidified filtrate (pre-extraction) were sampled for quantification of DOC. To calculate DOC concentrations of extracts, 250 μL of the methanol extracts were isolated based on weight and evaporated overnight at 50°C before re-dissolving in 15 mL ultrapure water at pH 2 for DOC analysis. Additionally, the last 40 mL of permeate from each SPE extraction was taken to determine the DOC concentration of the PPL flow through (post-extraction). Extraction efficiency was then calculated by subtracting this permeate DOC concentration from the pre-filtered DOC concentration in the seawater entering the SPE column.
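The permeate-based efficiency calculation can be written as a one-line formula. The sketch below assumes the difference is normalized to the seawater DOC concentration so that the result is a fraction; the text only states the subtraction, so the normalization and the example concentrations are assumptions for illustration.

```python
def extraction_efficiency(doc_seawater_umol_l, doc_permeate_umol_l):
    """Fraction of DOC retained on the sorbent, estimated from the permeate.

    Both arguments are DOC concentrations in umol C per litre.
    """
    if doc_seawater_umol_l <= 0:
        raise ValueError("seawater DOC concentration must be positive")
    retained = doc_seawater_umol_l - doc_permeate_umol_l
    return retained / doc_seawater_umol_l


# Hypothetical concentrations, not measurements from this study
efficiency = extraction_efficiency(70.0, 35.0)
print(f"{efficiency:.0%}")
```

Because only the last 40 mL of permeate is measured, this estimate reflects retention at the end of loading and may differ from an efficiency derived from the DOC content of the methanol extract.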
DOM samples were re-dissolved in 100 μL methanol and 1% formic acid, of which 10 μL were injected into an ultra-high performance liquid chromatography (UPLC) system coupled to a Q-Exactive orbitrap mass spectrometer (Thermo Fisher Scientific, Bremen, Germany) in three independent runs, first in high resolution positive mode, then in high resolution positive DDA MS/MS mode and finally in UHR negative mode. For the chromatographic separation, a C18 core-shell column (Kinetex, 100 × 2 mm, 1.8 μm particle size, 100 Å pore size, Phenomenex, Torrance, USA) with a flow rate of 0.5 mL/min (Solvent A: H2O + 0.1% formic acid (FA), Solvent B: Acetonitrile (ACN) + 0.1% FA) was used. After injection, the samples were eluted with the following gradient: 0 to 0.5 min, 5% B; 0.5 to 8 min, 5 to 50% B; 8 to 10 min, 50 to 99% B; followed by a 2 min washout phase at 99% B and a 3 min re-equilibration phase at 5% B. For positive mode measurements, the electrospray ionization (ESI) parameters were set to 52 L/min sheath gas flow, 14 L/min auxiliary gas flow, 0 L/min sweep gas flow and 400°C auxiliary gas temperature. The spray voltage was set to 3.5 kV and the inlet capillary to 320°C. A 50 V S-lens level was applied. MS scan range was set to 150–1,500 m/z with a resolution at m/z 200 (Rm/z 200) of 140,000 with one micro-scan in positive mode. The maximum ion injection time was set to 100 ms with automated gain control (AGC) of 1.0E6. MS/MS spectra were recorded in data dependent acquisition (DDA) mode. Both MS1 survey scans (150–1,500 m/z) and up to 5 MS/MS scans of the most abundant ions per duty cycle were measured with Rm/z 200 of 17,500 with one micro-scan in positive mode. The maximum ion injection time was set to 100 ms with automated gain control (AGC) targets set to 1.0E6 for survey scans and 3.0E5 for MS/MS with minimum 10% C-trap filling. MS/MS precursor selection windows were set to m/z 1. Normalized collision energy was set to a stepwise increase from 20 to 30 to 40% with
Thermo .raw datasets were converted to .mzXML in centroid mode using MSConvert (Chambers et al.,
As the first step of data analysis, MS1 feature extraction was performed with MZmine2 (Pluskal et al.,
Molecular formulas of MS1 features (XIC) < 500 m/z were calculated with an in-house R script applying the Rdisop Bioconductor package (
Following the filtering process presented above, molecular formulas were subsequently filtered by presence of matching ions detected in both positive and negative modes. For that, the most common ESI ion species (Huang et al.,
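Matching ion species between polarities amounts to checking whether a positive-mode and a negative-mode m/z value point to the same neutral mass. The sketch below is an illustration under stated assumptions and not the in-house code: it covers only the four ion species named in the text, uses approximate rounded monoisotopic mass shifts, applies a 5 ppm tolerance, and omits the retention time agreement that a real alignment would also need. The example m/z values are hypothetical.

```python
# Approximate monoisotopic m/z shifts relative to the neutral molecule M (rounded values)
POSITIVE_ADDUCTS = {"[M+H]+": 1.007276, "[M+Na]+": 22.98922}
NEGATIVE_ADDUCTS = {"[M-H]-": -1.007276, "[M+Cl]-": 34.96940}


def match_ion_species(pos_mz, neg_mz, tol_ppm=5.0):
    """Return adduct pairs for which both ions imply the same neutral mass."""
    matches = []
    for pos_name, pos_shift in POSITIVE_ADDUCTS.items():
        for neg_name, neg_shift in NEGATIVE_ADDUCTS.items():
            mass_from_pos = pos_mz - pos_shift
            mass_from_neg = neg_mz - neg_shift
            error_ppm = abs(mass_from_pos - mass_from_neg) / mass_from_pos * 1e6
            if error_ppm <= tol_ppm:
                matches.append((pos_name, neg_name))
    return matches


# Hypothetical features consistent with a neutral mass near 221.0899
print(match_ion_species(222.097215, 220.082663))
```

A pair that matches under exactly one adduct combination narrows the candidate neutral mass, and hence the candidate formulas, more than either ionization mode alone.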
MS/MS spectra were analyzed with
All LC-MS/MS data can be found on the Mass spectrometry Interactive Virtual Environment (MassIVE) at
Molecular Networking Data and all results of the Spectra Library Comparison can be found at the Global Natural Product Social Molecular Networking (GNPS) website with the links:
The code to perform ion species matching and molecular formula calculation is available at:
DP, IK, LA, and PD designed the study. DP and IK collected the samples. DP and IK extracted and prepared the samples. DP acquired the mass spectrometry data. RD implemented the code for ion species matching and molecular formula calculation. DP and BS performed the data analysis. DP, IK, RD, BS, AH, CN, LK, and LA interpreted the results. DP, IK, LA, and PD wrote the manuscript. All authors read, discussed and approved the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This work was supported by the National Institutes of Health with grant numbers P41 GM103484, S10RR029121, GM097509, the Deutsche Forschungsgemeinschaft with a postdoctoral research fellowship to DP with grant number PE 2600/1, grants from the US National Science Foundation OCE–1538567 to LWK, OCE–1538393 to CEN and OCE–1313747 to LA, and a UCSD Frontiers of Innovation Scholars program grant to LA and PD to fund IK's participation in this work. We furthermore would like to thank Bryce Inman for assistance with graphic design of Figure
The Supplementary Material for this article can be found online at: