Assessing the Effectiveness of Direct Data Merging Strategy in Long-Term and Large-Scale Pharmacometabonomics

Because of the extended period of clinic data collection and huge size of analyzed samples, the long-term and large-scale pharmacometabonomics profiling is frequently encountered in the discovery of drug/target and the guidance of personalized medicine. So far, integration of the results (ReIn) from multiple experiments in a large-scale metabolomic profiling has become a widely used strategy for enhancing the reliability and robustness of analytical results, and the strategy of direct data merging (DiMe) among experiments is also proposed to increase statistical power, reduce experimental bias, enhance reproducibility and improve overall biological understanding. However, compared with the ReIn, the DiMe has not yet been widely adopted in current metabolomics studies, due to the difficulty in removing unwanted variations and the inexistence of prior knowledges on the performance of the available merging methods. It is therefore urgently needed to clarify whether DiMe can enhance the performance of metabolic profiling or not. Herein, the performance of DiMe on 4 pairs of benchmark datasets was comprehensively assessed by multiple criteria (classification capacity, robustness and false discovery rate). As a result, integration/merging-based strategies (ReIn and DiMe) were found to perform better under all criteria than those strategies based on single experiment. Moreover, DiMe was discovered to outperform ReIn in classification capacity and robustness, while the ReIn showed superior capacity in controlling false discovery rate. In conclusion, these findings provided valuable guidance to the selection of suitable analytical strategy for current metabolomics.


INTRODUCTION
Liquid chromatography-mass spectrometry has been widely applied in pharmaceutical and clinical metabolomics to comprehensively reveal metabolic alteration in given biological system (Paglia and Astarita, 2017;Fu et al., 2018;Tang et al., 2018;Yang et al., 2019), identify biomarkers and therapeutic targets for a variety of complex diseases (Zhu et al., 2009;Yang et al., 2016;Hu et al., 2017;Li et al., 2017cLi et al., , 2019 and illuminate mechanism of action of drugs or drug candidates (Chen et al., 2017;Li X.X. et al., 2018;Xue et al., 2018b;Zhang et al., 2018). Because of the extended period of clinical data collection and huge size of analyzed samples, the long-term and large-scale metabolomic profiling is frequently encountered in current medical study to identify physiological perturbation in various living systems Zheng et al., 2018), analyze time-dependency of metabolic alteration (He et al., 2015;Han et al., 2018) and evaluate therapy and patient stratification in personalized medicine (Li et al., 2017a;Wang et al., 2017a). Data from large-scale metabolomics are generally collected over long period varying from months to years and must be divided into batches, which requires a comprehensive consideration of all data of various batches or studies (Brunius et al., 2016;Li Y.H. et al., 2018). So far, ReIn of multiple experiments in large-scale metabolomics has been applied to enhance the reliability and robustness in cancer-related metabolites profiling (Goveia et al., 2016;Xue et al., 2018a) and marker discovery for prediabetes or diabetes patients (Guasch-Ferre et al., 2016;Wang et al., 2017b).
However, ReIn precludes the reanalysis of original data due to the lack of quantitative metabolomics data and inevitably results in inadequate statistical power (Goveia et al., 2016). Due to the necessity of quantitative data, a database named MetaboLights providing such information has been established (Kale et al., 2016), which makes the reanalysis or integrated analysis of the quantitative data possible and convenient (Haug et al., 2013). Based on our comprehensive investigation on all metabolomics studies in MetaboLights (Figure 1), the sample sizes of the majority (>65%) and almost half (>45%) of these studies are less than 100 and 50, respectively. As reported, a total cohort of over 100 samples is essential for the identification of a maximum of statistically significant variations in any metabolic exploration (Billoir et al., 2015). Since the bias of current metabolic explorations is reported to come frequently from the inadequacy of studied samples (Zhang et al., 2006;Subramanian, 2016), there is an urgent need to maximally enlarge the sample size and in turn enhance the statistical power of a given metabolomics study (Button et al., 2013).
Till now, DiMe strategy has been adopted in OMIC studies which effectively enlarges the size of studied samples (Lazar et al., 2013;Li et al., 2014;Switnicki et al., 2016). In particular, new breast cancer biomarkers are identified by combining RNA-seq gene expression data (Switnicki et al., 2016); novel alternative splicing is found by collectively analyzing multiple RNA-seq datasets (Li et al., 2014); the removal of batch effects from transcriptomics data is investigated by microarray data integration (Lazar et al., 2013). Due to the enlargement of studied samples, DiMe demonstrates potential enhancements in the accuracy, consistency and robustness of OMIC data analysis (Larsson et al., 2006;Goveia et al., 2016), and is proposed to significantly increase statistical power, reduce experimental bias, enhance reproducibility and improve overall biological understanding . However, compared with ReIn, the DiMe of multiple experiments has not yet been widely used in current metabolomics studies, which may be attributed to two major factors Li et al., 2017b). The first is the difficulty in removing the unwanted variations among experiments and inexistence of prior knowledges on the performance of the available merging methods . In other word, it is still elusive whether the DiMe can effectively enhance the performance of metabolic profiling (Soto-Iglesias et al., 2016). The second is the existence of multiple criteria to assess the performance of DiMe and the great difficulty of selecting the optimal one Valikangas et al., 2018). As reported, a multiple criteria evaluation is more effective than the single one in assessing the reliability of integration (Lee and Smith, 2012), and a collective consideration of multiple criteria is therefore recommended to thoroughly evaluate the applied strategy from different perspectives Valikangas et al., 2018). All in all, because of the distinct underlying theory of these criteria, it is very essential to systematically assess the performance of DiMe strategy by collectively considering all criteria.
In the study, comprehensive evaluation of different analytical strategies was conducted by assessing their classification capacity, robustness and false discovery rate. First, based on a systematic review of MetaboLights a number of benchmark studies were identified to accomplish this assessment. Then, the integration/merging-based strategies (ReIn and DiMe) together with the strategies based on single experiment were collectively evaluated by multiple criteria. In conclusion, these findings provided a valuable guidance to the selection of suitable analytical strategy in a given metabolomics study.

Collection of Metabolomics Datasets to Assess the Performance of DiMe Strategy
A systematic search in the MetaboLights database (Haug et al., 2013) was collectively conducted to discover benchmark datasets for the performance assessment of DiMe. First, the MetaboLights was searched by the keyword "mass spectrometry, " which resulted in 339 projects (September 16, 2018). Second, several criteria were used to ensure the availability and processability of raw metabolomics data, which included (a) complete set of raw data files, (b) well-defined parameters (mz value, range of retention time), (c) enough samples (>10) in each experiment, (d) same classes of both cases and controls in different experiments, and (e) clear description on the sample groups. The application of the above criteria to those 339 projects resulted in eight benchmark metabolomics datasets of varied sample sizes. In particular, these eight datasets included (1) a UPLC-QTOF MS dataset based on the serums of 59 patients of HCC and 129 CIR patients collected at Georgetown University Hospital (GUH) and run in positive mode from an experiment conducted in May 2010 , (2) a metabolomics benchmark dataset of the MS positive mode based on the serums of 13 HCC and 50 CIR patients collected at GUH in July 2010 , (3) a UPLC-QTOF MS dataset based on the serums of 59 HCC and 129 CIR patients collected at GUH and run in negative mode from an experiment conducted in May 2010 , (4) the benchmark dataset of MS negative mode based on the serums of 13 HCC and 50 CIR patients collected at GUH in July 2010 , (5) the UPLC-QTOF MS dataset based on the serums of 20 HCC and 25 CIR patients collected from Egypt and run in positive mode , (6) the metabolomics benchmark dataset of the MS positive mode based on the serums of 20 HCC and 24 CIR patients collected in Egypt ), (7) UPLC-QTOF MS dataset based on the serums of 20 HCC and 25 CIR patients collected in Egypt and run in negative mode , and (8) the benchmark dataset of MS negative mode based on the serums of 20 HCC and 24 CIR patients collected from Egypt .

Direct Data Merge (DiMe) Strategy Used in This Study Based on the m/z Values
The workflow of the DiMe strategy applied in this work was systematically illustrated in Figure 2a. In this study, four pairs of metabolomics benchmark datasets were adopted to assess the performance of DiMe strategy, which included the pair of experimental dataset (1) and dataset (2) from MTBLS17 ESI+ (Haug et al., 2013), the pair of experimental dataset (3) and dataset (4) from MTBLS17 ESI- (Haug et al., 2013), the pair of experimental dataset (5) and dataset (6) from MTBLS19 ESI+ (Haug et al., 2013), and the pair of experimental dataset (7), and dataset (8) from MTBLS19 ESI- (Haug et al., 2013). In each experimental dataset, the peak detection, retention time (RT) correction and peak alignment were first applied to the UHPLC/Q-TOF-MS raw data (in CDF format) using the xcmsSet, group and rector functions in XCMS package (Smith et al., 2006) by setting both fwhm and bw equal to ten . Then, two datasets in each pair were merged based on their m/z values with tolerance of 0.05 ppm (Zhang et al., 2014). In particular, the common peaks within above tolerance between two datasets was selected, based on which these datasets were merged into a large one.
Prior to the biomarker identification, the datasets were frequently pretreated in current metabolomics study (De Livera et al., 2012;Zhu et al., 2018;Zuo et al., 2018). Herein, the pretreatment of merged dataset was then conducted, which included the missing value imputation using k-Nearest Neighbor (KNN) method and data normalization using MSTUS. The KNN method imputed values based on K features similar to the features with missing values (Shah et al., 2017). Among the available imputation methods, the KNN algorithm was reported as the most robust one for analyzing MS-based metabolomic data (Di Guida et al., 2016). By assuming that the number of increased and decreased metabolic signals is relatively equivalent, the MSTUS adopted the total signal of metabolites that was shared by all samples (Warrack et al., 2009). MSTUS was referred as one of the best choices for overcoming sample variability in urinary metabolomics and was used to identify diagnostic and prognostic biomarkers (Chen et al., 2013;Mathe et al., 2014). Therefore, the KNN algorithm and the MSTUS method were adopted in this study to impute the missing signal of metabolite and transform/normalize the data matrix. After the above preparation, the training, testing and independent test datasets were further constructed based on the random sampling of the merged dataset. These three datasets were prepared for assessing the identification precision and classification capacity of DiMe strategy (described in the last section of "Materials and Methods"). Furthermore, another 10 datasets were generated by the random sampling of half of the merged dataset for 10 times, which were further used for evaluating the robustness of DiMe strategy (described in the last section of "Materials and Methods").
After all those steps prepared above, the PLSDA was used to identify the differential metabolic peaks between distinct sample groups within each merged dataset. Particularly, the differential peaks were identified by VIP >1 and p-value < 0.05 (Fan et al., 2016), which were subsequently annotated based on human metabolome database (HMDB) (Wishart et al., 2013) by setting m/z tolerance equal to 20 ppm (Peng and Li, 2013). Those resulting metabolites annotated were the metabolic biomarkers finally identified. All in all, the workflow of DiMe strategy applied in this study was systematically illustrated in Figure 2a.

Results Integration (ReIn) Strategy Used in This Study Based on the Identified Biomarkers
The workflow of the ReIn strategy applied in this work was systematically illustrated in Figure 2b. The same four pairs of metabolomics benchmark datasets as used in DiMe strategy were used in this analysis. For the experimental dataset in each pair, peak detection, RT correction and peak alignment were first conducted using the xcmsSet, group and rector functions in XCMS package (Smith et al., 2006) by setting fwhm and bw to ten . Second, the pretreatment of each experimental dataset was conducted using KNN for missing value imputation and MSTUS for data normalization. Third, the training, testing and independent test datasets were constructed by random sampling each pretreated experimental dataset. These three datasets were prepared for assessing the identification precision and classification capacity of the ReIn strategy (described in the last section of "Materials and Methods"). Meanwhile, another 10 datasets were generated by the random sampling of half of the pretreated experimental dataset for 10 times, which were applied for the evaluation of robustness of the ReIn strategy (described in the last section of "Materials and Methods"). Fourth, PLSDA was used to identify the differential metabolic peaks between distinct sample groups within each dataset (VIP>1 and p-value < 0.05). The resulting metabolites annotated based on HMDB by setting the m/z tolerance equal to 20 ppm were the metabolic biomarkers finally identified. Finally, the metabolites annotated from two experimental datasets were collectively considered for assessing identification precision of the ReIn strategy, the classification models constructed based on experimental datasets were integrated for evaluating ReIn's classification capacity, and the robustness of the ReIn strategy was also collectively determined by the average overlap values between two experiments. All in all, the workflow of ReIn strategy applied in this study was systematically illustrated in Figure 2b.

Multiple Criteria Used for the Performance Assessment of the Strategies Applied
Three well-established criteria for the performance assessment of the strategies applied were adopted in this study, which included the identification precision, classification capacity and robustness. As reported, these three criteria were independent from each other , which was required to be collectively considered during the performance assessments . In other words, these three criteria were mutually complemental from different perspectives, and all were important for assessing the performance of the analytical strategy applied in metabolomic studies . Therefore, all these criteria were adopted in this study for performance assessment.

Identification Precision
Recent studies emphasized the importance of the experimentally validated true markers in evaluating the identification precision of analytical strategies Cai et al., 2017;Li et al., 2017b). These well-established true metabolic markers were then used as a golden standard to assess the identification precision based on the EF (Zhang et al., 2011;Liu et al., 2014). The EF was used to measure the enhanced chances of true marker identification by a given analytical strategy over the random selection of true markers from all metabolites (Zhang et al., 2011;Liu et al., 2014). In this study, a comprehensive literature review on the experimentally validated true markers differentiating HCC patients from those with CIR was first conducted. Then, the EF of each analytical strategy was calculated based on Eq. 1: EF = true marker identification rate from all markers identified true marker identification rate by random selection (1) EF denoted the level of enhancement in true marker identification rate (Zhang et al., 2011). EF = 1 meant no better than random selection. The larger EF, the greater the likelihood to find true marker.

Classification Capacity
Based on three datasets after "dataset construction" (Figure 2), the SVM was first applied to construct the classification model based on both training and testing datasets together with the biomarkers identified by Student's t-test (p-value < 0.05). Then, independent test set was used to assess the classification capacity of constructed model, which was evaluated by the ROC analysis together with the measurement of AUC (Kohl et al., 2012). The AUC values were widely considered to be one of the most objective and valid metrics for the performance evaluation of biomarker discovery (Xia et al., 2015). Moreover, the classification capacity was frequently assessed by four popular metrics including the SEN, SPE, accuracy (ACC), MCC. Particularly, SEN was defined by the percentage of true positive samples correctly identified as "positive" (shown in Eq. 2); SPE denoted the proportion of true negative samples that were correctly predicted as "negative" (shown in Eq. 3); ACC indicated the number of true samples (positive plus negative) divided by the number of all studied samples (shown in Eq. 4); MCC reflected the stability of classification capacity, which described the correlation between a predictive value and an actual value (shown in Eq. 5).

Robustness
First, ten sub-datasets were generated by the random sampling of half of the pretreated experimental/merged dataset for ten times. Second, the biomarkers were identified using Student's t-test (p-value < 0.05) for each dataset, and ten lists of biomarkers were discovered. Third, for any 2 marker lists, the fraction of shared marker appearing on both lists were used to measure the similarity of these two lists. Particularly, overlap value was calculated (shown in Eq. 6) based on marker lists a and b. The closer the overlap value equal to 1, the more robust the markers discovered in that study (Wang et al., 2014). For each experimental/merged dataset, 45 (C 1 0 2 ) overlap values denoting all possible combinations between any two sub-datasets were thus calculated and analyzed here.
where a and b indicated two maker lists, and Na and Nb denoted the number of markers in each list.

Robustness Assessment of the Markers Identified by Different Analytical Strategies
Apart from prediction capacity evaluated simultaneously by classification correctness and prediction stability, the robustness of identified metabolic markers was widely accepted to be another important metric with underlying theory distinct from that of prediction capacity Valikangas et al., 2018). So far, overlap value had been recognized as the quantitative measure of the robustness of the identified markers (Wang et al., 2014). The higher overlap values represented the more robust metabolic markers identified from a particular dataset by a given strategy. In this study, a sub-dataset was first generated by randomly selecting 50% of both cases and controls in each benchmark dataset, and ten iterations of this selection procedure resulted in ten sub-datasets. For each sub-dataset, a list of differentially expressed metabolic markers were then identified by Student's t-test (p-value < 0.05), and the value 4 | A variety of metabolite biomarkers differentiating the patients of hepatocellular carcinoma (HCC) from those of cirrhosis (CIR) identified during the past ten years.

No.
True of overlap between any two sub-datasets was calculated using their corresponding lists of markers identified. In total, there were 45 (C 2 10 ) overlap values denoting all possible combinations between any two sub-datasets. Finally, the overlap values of four different analytical strategies were compared. As shown in Table 2, the total numbers of markers identified by ten sub-datasets together with the median values of overlap were provided. It was obvious that the total numbers of identified markers among ten sub-datasets varied significantly (from 11 to 334). Moreover, although there was great difference among the median overlap values (from 0.15 to 0.40), the median overlap of DiMe was found consistently larger than that of the other three strategies.
Compared with the median value of overlap, the statistical difference of 45 overlap values between different analytical strategies was more meaningful to reveal the level of robustness for each strategy. Thus, comprehensive statistical comparison of robustness among different strategies was conducted and illustrated in Figure 4. The overlap values of SiE1, SiE2, ReIn, and DiMe were colored in light green, dark green, blue, and orange, respectively. Apart from the enhanced median values of overlap by DiMe, all overlap values of DiMe were found statistically higher (p-value < 0.05) compared with that of the other strategies. In particular, as illustrated in Figure 4, the statistical differences between DiMe and other strategies (pvalue) were always lower than 0.05 within the range from 4.25E-16 to 1.81E-02. Moreover, the majority of the overlap values of DiMe were larger than 0.3, while that of the other strategies were lower than 0.3. These findings indicated that the DiMe strategy performed better than others in the robustness of the identified markers. Additionally, Table 3 demonstrated the information of markers simultaneously discovered by N (N ≥ 6, ≥ 7, ≥ 8, ≥ 9, = 10) sub-datasets, which included the number and percentage of markers co-identified by these N datasets. It was very clearly to see that the robustness of metabolic markers identified by DiMe was much better than other three strategies in terms of both the number and the percentage of co-identified markers. Particularly, the percentages of markers identified by over five sub-datasets using DiMe were within 3.25%∼5.07%, while that using SiE1 and SiE2 were 0.87%∼2.74% and 0.93%∼2.06%, respectively. Moreover, the percentages of markers identified by all sub-datasets using DiMe were within 0.00%∼0.41%, while that using SiE1 and SiE2 were 0.00%∼0.25% and 0.00%∼0.21%, respectively.

Evaluation on the False Discovery Rates by Experimentally Validated True Markers
Recent studies emphasized the importance of spike-in metabolites and experimentally validated true markers in evaluating the false discovery rates of analytical strategy Cai et al., 2017;Li et al., 2017b). These wellestablished true metabolic markers were frequently used as the golden standard to assess the false discovery rates based on their identification EF (Zhang et al., 2011;Liu et al., 2014). Hence, a comprehensive literature review on the experimentally validated true markers differentiating HCC patients from those with CIR was first conducted in this study. As a result, thirteen discriminative markers between HCC and CIR patients were identified ( Table 4). As shown, some metabolic markers (like glycochenodeoxycholic acid) were identified from serum samples combining TOF MS/MS with UPLC-SRM-MS/MS based on the internal standard isotope dilution (Tan et al., 2012;Xiao et al., 2012;Kimhofer et al., 2015), and some other markers (like 16:0 lysophosphatidic acid and phenylalanine) were detected by the targeted analysis based on UPLC-ESI-TQMS (Patterson et al., 2011) and LC-MRM-MS/MS (Baniasadi et al., 2013). Carnitine and creatinine were first discovered by analyzing urinary 1H MRS data (Shariff et al., 2010), but carnitine was also identified as true marker in serum samples . Since the four benchmark datasets analyzed in this study were serum-based data, these experimentally validated true metabolic markers (twelve biomarkers in total, except creatinine, Table 4) were therefore used here to evaluate the false discovery rates of each analytical strategy. Table 5 provided the number of the true makers covered by both detected and identified metabolites. For each experimental dataset (MTBLS17-NEG, MTBLS17-POS, MTBLS19-NEG, and MTBLS19-POS), there were variations in their number of true markers covered by the detected metabolites. In particular, the detected metabolites in MTBLS17-POS contained the highest number of true markers (11 for all strategies) and that in MTBLS17-NEG covered the most variated numbers of true markers among four strategies (from 5 to 9). Furthermore, the number of true markers identified by strategies SiE1 and SiE2 was found to be basically no less than that of ReIn and DiMe, which represented the relatively equal abilities in true marker identification among different strategies. However, as shown in Table 5, the EF of both SiE1 and SiE2 was consistently lower than that of ReIn and DiMe, which indicated that, compared with ReIn and DiMe, the total numbers of true markers discovered by SiE1 and SiE2 were more at the cost of discovering numerous false metabolites. Moreover, among those integration/merging-based strategies (ReIn and DiMe), the EF values of ReIn in three experimental datasets (MTBLS17-POS, MTBLS19-NEG, and MTBLS19-POS) were found to be obviously higher than those of DiMe strategy, which reflected the superior ability of ReIn strategy in controlling false discovery rate. However, in one extreme case (MTBLS17-NEG), the EF of ReIn was lower than that of DiMe. Careful investigation of Table 5 revealed that only one true marker was identified by ReIn, which led to a huge decline in its EF values. Therefore, although ReIn demonstrated superior ability to control false discovery rate, its application could be limited by its relatively small number of true markers identified.

CONCLUSION
Based on the systematic review of MetaboLights, a comprehensive evaluation of different analytical strategies was conducted by assessing the classification capacity, robustness and false discovery rate. As a result, the integration/merging-based strategies (ReIn & DiMe) performed better than strategies based on single experiment (SiE1 & SiE2). Moreover, DiMe strategy was found to outperform ReIn in classification capacity and robustness, while ReIn demonstrated superior capacity in controlling false discovery rate. In summary, these findings may facilitate current metabolomics study in classification capacity, identification precision, and robustness.

AUTHOR CONTRIBUTIONS
FZ conceived the idea and supervised the work. XC, QY, and BL performed the research. XC, QY, BL, JT, XZ, SL, FL, JH, YL, YQ, and WX prepared the program and analyzed the data. FZ and JH wrote the manuscript. All authors have read and approved this manuscript.