Machine Learning of Serum Metabolic Patterns Encodes Asymptomatic SARS-CoV-2 Infection

Asymptomatic COVID-19 has become one of the biggest challenges for controlling the spread of the SARS-CoV-2. Diagnosis of asymptomatic COVID-19 mainly depends on quantitative reverse transcription PCR (qRT-PCR), which is typically time-consuming and requires expensive reagents. The application is limited in countries that lack sufficient resources to handle large-scale assay during the COVID-19 outbreak. Here, we demonstrated a new approach to detect the asymptomatic SARS-CoV-2 infection using serum metabolic patterns combined with ensemble learning. The direct patterns of metabolites and lipids were extracted by matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) within 1 s with simple sample preparation. A new ensemble learning model was developed using stacking strategy with a new voting algorithm. This approach was validated in a large cohort of 274 samples (92 asymptomatic COVID-19 and 182 healthy control), and provided the high accuracy of 93.4%, with only 5% false negative and 7% false positive rates. We also identified a biomarker panel of ten metabolites and lipids, as well as the altered metabolic pathways during asymptomatic SARS-CoV-2 Infection. The proposed rapid and low-cost approach holds promise to apply in the large-scale asymptomatic COVID-19 screening.


Cohort recruitment and data collection
We retrospectively recruited a total of 92 asymptomatic COVID-19 patients who were under quarantine in Wuhan from January to March, 2020. They were diagnosed as asymptomatic COVID-19 according to the Chinese Government Diagnosis and Treatment Guideline (Trial 6th version) (NHCPRC, 2020). Briefly, asymptomatic COVID-19 referred to those people who meet the following two clinical criteria: 1) exhibited no typical clinical symptoms, 2) tested either positive for SARS-CoV-2 nucleic acid test in respiratory specimens or seropositive for IgM antibody test. All enrolled patients were confirmed positive for SARS-CoV-2 nucleic acid except for three individuals whose RT-PCR test turned from positive to negative at the blood collection time, they were also classified as asymptomatic due to the seropositive of IgM or IgG. According to the epidemiologic investigations, none of these patients developed symptoms until they were tested negative for SARS-CoV-2. For each confirmed patient, two healthy people of the same gender and approximate age was matched as controls. These healthy controls come from the physical examination population in Wuhan Prevention and Treatment Center for Occupational Diseases during the sample collection period.
For identification of SARS-CoV-2 infections, throat swabs were collected and tested by real-time polymerase chain reaction (RT-PCR) using virus nucleotide acid extraction kit (Shanghai Zhijiang, China, NO. P20200201) and detection kit (triple fluorescence PCR, Shanghai Zhijiang, China, NO. P20200203) according to manufacturer instructions. Briefly, target genes including RdRp, E and N were simultaneously amplified and tested during RT-PCR. Patients were diagnosed as positive if RdRp gene was positive (Ct < 43), and one of E or N was positive (Ct < 43). Patients were also diagnosed as positive if two sequential tests of RdRp were positive while E and N were negative. IgG and IgM against SARS-CoV-2 were detected in serum samples using chemiluminescence immunoassay kits (Orienter Biotechnology Co., Ltd, Sichuan, China) and Access2 automatic microparticle chemiluminescence immunoassay system (BechmanCoulter, California, USA).
Antibody levels were expressed as the ratio of the chemiluminescence signal over the cutoff value (S/CO). The result was defined as positive if the S/CO value is higher than 1.00.

This study was approved by the Ethics Review Commission of WuHan Prevention and Treatment
Center for Occupational Diseases (reference no. 202002).

Extraction of metabolites
Peripheral blood samples for all participants were collected using serum separation tubes after an overnight fast. Serum was separated by centrifugation at 1,500 g for 10 min and then stored at − 80 o C after standard diagnostic tests. Before used, the frozen serum was slowly thawed at 4 o C overnight.
For virus inactivation and metabolites extraction, pre-chilled ethanol was added to each sample to make a final solution of 75% (v/v) ethanol. The mixture was shaken vigorously for 5 min to ensure inactivation of virus. The supernatant was collected by centrifuged at 1,2000 g for 15 min. A pooled sample was generated by taking equal aliquot of each experimental sample to serve as a technical replicate which was run multiple times throughout the experiment. Extracted water samples served as blanks. All samples were stored at − 80 o C until analysis.

MALDI-MS analysis
Mass spectrometric analyses were performed with a MALDI time-of-flight instrument (Autoflex Max, Bruker) with a pulsed Nd: YAG laser (355 nm), operating in negative-ion reflection mode. The high voltages for ion source 1 and 2 are 19.1 kV and 16.8 kV, respectively. The high voltages for reflector and reflector 2 are 21.1 kV and 9.7 kV, respectively. The matrix solution was prepared with N-(1naphthyl) ethylenediamine dihydrochloride (NEDC, 10 mg/mL) at 30:70 (v/v) ethanol and deionized water. One microliter of each extracted serum with mixed with 1 μL of matrix, and 1 μL of mixture was spotted on a MALDI steel plate and air-drying. Spectra were auto-generated by summing 2000 single spectra (5 × 400 shots) with the laser frequency of 2000 Hz in the range between 0-450 Da or 450-1000 Da by shooting the laser at random positions on the target spot.
Quality control (QC) samples were prepared by pooling the extracts of each sample. Four QC samples were analyzed at the start of batch and then one QC sample was analyzed at every six cohort samples.
In total, 50 of QC samples were analyzed with the same MALDI MS conditions as the cohort samples, and the mass spectra was processed with the same procedures.

Spectral preprocessing
MALDI mass spectra were processed with Flex analysis software, and the peak list with signal-tonoise more than 3 were exported for further analysis. The intensity was normalized using the total ion current (TIC) calibration method. The peaks of each sample were extracted to '.xlsx' files, and then converted to '.CSV' files using Excel. The following processes are implemented by R software. The '.txt' raw files were used to calculate the TIC of each sample. Then, the intensity of each extracted peak in '.CSV' file divided the corresponding TIC to generate the relative intensity. Peaks were binned with the tolerance of 0.05. And only peaks which were present in more than 80% spectra of all samples would be retained. The blank value of each m/z were replaced by the one-tenth of the minimum nonzero value in all samples. Through the above spectral preprocessing, there were 184 features in m/z range 0-450 and 68 features extracted in m/z range 450-1000. Then we integrated all features together and delete the features which had m/z less than 100 to generate the final intensity matrix with 219 features. The intensity matrix would be used for statistical analysis, Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP) and machine learning.

Statistical analysis
Two-tailed Wilcoxon rank sum test with BH corrections was used to measure whether each feature had a significant difference between asymptomatic and healthy. If the adjusted P value < 0.05, the feature would be considered to have significant difference between two kinds of samples. The statistical analysis was performed using R (version 4.0.2).

PCA
PCA is a linear unsupervised machine learning algorithm that can be used for data dimension reduction. PCA was performed on all samples using Python package scikit-learn (n_components = 0.9) (version 0.22.1) and matplotlib (version 3.1.3).

Machine learning
Five machine learning algorithms were separately used to classify the asymptomatic and control samples, including Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Multi-Layer Perceptron (MLP) and XGBoost (XGB). Before imported to model, the intensity matrix was preprocessed using different methods.
Combined zero-mean normalization and PCA (n_components = 0.99) were used for SVM and KNN.
Only zero-mean normalization was used for MLP, RF and XGB used the intensity matrix without preprocessing.
Feature selection was carried out through model-based ranking. We put the intensity matrix with 219 features into RF and XGB respectively. Then calculate the mean value of feature importance obtained from ten training, and select 97 features ranking top 130 in both RF and XGB.
For each model, fivefold (outer) nested repeated (ten times) tenfold (inner) cross-validation (with randomized stratified splitting) was used for hyperparameters optimization and performance evaluation. The hyperparameters of each algorithm were optimized in the inner loop by grid search.
These hyperparameters included C and gamma for SVM, n_neighbors and weights for KNN, n_estimators and max_depth for RF, the number of nodes for each layer for MLP (three layers), max_depth and learning_rate for XGB. Then through repeated tenfold cross-validation in the inner loop, the models obtained with the best results were reported to the outer loop, according to the average under curve (AUC) of the receiver operating characteristic (ROC) curve. The performance of each classifier was comprehensively evaluated by several indicators in outer loop, namely accuracy, sensitivity, specificity, AUC of ROC curve and AUC of precision-recall (PR) curve.
Considering the low sensitivity of each separate classifier, we further tried stacking scheme to combine different machine learning algorithms. At least two of SVM, XGB and MLP were randomly     The compounds C00350 and C00416 represent phosphatidylethanolamine (PE) and phosphatidic acid (PA) that were in the data and used in topology analysis, respectively. The compounds with blue color represent metabolites that were not in the data and used as background in topology analysis. This plot was generated in MetaboAnalyst (https://www.metaboanalyst.ca/).