Improving the Diagnosis of Phenylketonuria by Using a Machine Learning–Based Screening Model of Neonatal MRM Data

Phenylketonuria (PKU) is a common genetic metabolic disorder that affects the infant's nerve development and manifests as abnormal behavior and developmental delay as the child grows. Currently, a triple–quadrupole mass spectrometer (TQ-MS) is a common high-accuracy clinical PKU screening method. However, there is high false-positive rate associated with this modality, and its reduction can provide a diagnostic and economic benefit to both pediatric patients and health providers. Machine learning methods have the advantage of utilizing high-dimensional and complex features, which can be obtained from the patient's metabolic patterns and interrogated for clinically relevant knowledge. In this study, using TQ-MS screening data of more than 600,000 patients collected at the Newborn Screening Center of Shanghai Children's Hospital, we derived a dataset containing 256 PKU-suspected cases. We then developed a machine learning logistic regression analysis model with the aim to minimize false-positive rates in the results of the initial PKU test. The model attained a 95–100% sensitivity, the specificity was improved 53.14%, and positive predictive value increased from 19.14 to 32.16%. Our study shows that machine learning models may be used as a pediatric diagnosis aid tool to reduce the number of suspected cases and to help eliminate patient recall. Our study can serve as a future reference for the selection and evaluation of computational screening methods.


INTRODUCTION
Phenylketonuria (OMIM:261600) (PKU) is a common inborn genetic metabolic disorder, which affects the infant's neural development and manifests as abnormal behavior and developmental delay, which becomes apparent as the child grows. In China, the incidence of PKU has a wide range [between 1/3,420 Wang et al., 2015 and1/26,668 Fan et al., 2009], and the prevalence estimates are still increasing (Gu and Wang, 2004). There are several medical tests that are used for PKU neonatal screening such as the Guthrie test (Guthrie and Susi, 1963), and tests utilizing highperformance liquid chromatography (Moretti et al., 1990) and tandem mass spectrometry (MS/MS) (Rashed et al., 1995). In most countries around the world, PKU diagnosis is performed by evaluating phenylalanine (Phe) and tyrosine (Tyr) levels in neonatal dry blood spots (DBSs) by MS/MS (Blau et al., 2014). Following a positive initial test result, the presence of Phe must be confirmed with immediacy by a repeat screening.
A triple-quadrupole mass spectrometer [multiple reaction monitoring (MRM)] is used to measure 44 metabolites in neonatal blood samples, and in clinical practice, more than 30 types of genetic metabolic diseases (including PKU) are diagnosed by using these biomarkers. However, the initial PKU screening by MRM is characterized by a high false-positive rate. Thus, clinicians have to recall a large number of false-positive patients for rescreening and further testing, which increases the pressure of medical resources and creates an additional economic and time burden on patients. Furthermore, an estimated 13% positive predictive value (PPV) in PKU neonatal screening (Zhang et al., 2019) indicates that the pathophysiology of the disease is not uniquely driven by elevated level of Phe and Tyr. Additionally, the large number of metabolites captured in experimental MRM data creates inherent complexity, which makes the overall meaningful signals non-trivial to assess by a manual process for a sizable patient population.
Machine learning methods have the advantage of utilizing high-dimensional and complex features, which can be obtained from the patient's metabolic patterns and interrogated to reduce the PKU diagnostic test's false-positive rate. New factors can also be discovered to aid PKU diagnosis without a priori knowledge. In recent years, several machine learning techniques have been used to map metabolomics databases (Baumgartner et al., 2004a), and machine learning methods have been used to construct classification models with high diagnostic prediction (Baumgartner et al., 2004b(Baumgartner et al., , 2005Chen et al., 2013) [further review Cuperlovic-Culf, 2018]. Most of those models are utilized to predict normal vs. disease states and use metabolic patterns in newborn screening MRM data to develop the classifier. At the same time, such models may be well-positioned to discover potential disease biomarkers from MRM-based high-dimensional metabolic data (Mendes, 2002). For example, Baumgartner and colleagues developed several classification models (Mendes, 2002) and identified metabolites with abnormally changed concentrations (Baumgartner et al., 2005). Moreover, some of the models have been able to predict cases within the suspected group, which were diagnosed initially as positive (exceeding screening cutoff values in initial screen by a widely used cutoffs scheme) but finally diagnosed as negative cases, and cases that had values over the screening cutoffs and diagnosed subsequently as positive cases (Chen et al., 2013).
However, such computational models are yet to be widely applied in pediatric clinical practice. Initial clinical screening for PKU is well-established and mature process; thus, the more pressing problem to be solved for this rare but treatable metabolic disorder is to develop and fine-tune classification models that pinpoint false-positives and reduce the number of suspected cases to be subject to subsequent screening and verification while ensuring that no false-negatives occur.
To address the above stated unmet need in PKU clinical screening, we employed feature selection strategies and utilized logistic regression analysis (LRA) techniques, together with metabolic data of more than 600,000 newborn screenings, to develop machine learning-based screening models for PKU. Our goal is to minimize false-positive cases, maximize specificity (S p ), and to serve as a guiding reference for the selection and evaluation of future clinically relevant screening methods.

Patient Metabolic Data
Dried blood samples are routinely collected from newborn babies at the Newborn Screening Center of Shanghai Children's Hospital for the purpose of PKU screening. Three to four DBSs are blotted from the infant's heel, and each blood spot is ∼1 cm in diameter. Subsequently, a tandem mass spectrometry system (MRM) is used, including mass spectrometer (MSMS, Waters Quattro micro, Milford, Massachusetts, USA), a highperformance liquid analyzer (Waters 1525 u), an automatic sampling system (Waters 2777 Sample manager), and a nonderivative tandem mass spectrometry kit (NeoBase TM Nonderivatized MSMS Kit; PerkinElmer, Waltham, Massachusetts, USA) to measure 11 amino acids, 32 acylcarnitines, and succinylacetone ( Table 1), and the values are entered in the hospital's computer system. For this historical retrospective study, we obtained records for 633,997 newborns, which were collected from 2010 to 2018. Each record contained measurements for the level of the 44 metabolites and a clinician-entered binary field indicating the patient's overall PKU diagnosis. The data were sourced from the hospital data warehouse, and to protect the privacy and anonymity of the patients, all patient-identifying information was removed and obfuscated prior proceeding with the analytical treatment.
Full dataset comprised 633,997 patient cases, including 326,508 boys and 307,489 girls. The average age at the time of blood collection was 3.6 days (range, 2-30 days), and the average weight was 3.3 kg (range, 1.73-4.89 kg) ( Table 2). We applied a popular PKU screening cutoff level of Phe >120 µmol/L, and patients fulfilling this criterion were included in the reduced dataset of 262 records. Six records were removed Because of data duplication for a final dataset of 256 records of PKU-suspected cases. This screening cutoff is also applied in the course of clinical practice at the Shanghai Children's Hospital; thus, these 256 newborns were also recalled for additional DBS testing at the time that the PKU suspicion was established. Newborns with DBS screening value again above the screening cutoff were then requested to participate in a confirmation test (including urine tests, blood tests, genetic tests). Of the 256 suspected cases, 49 were finally diagnosed with PKU. Thus, our dataset utilized in the model development in this study consisted of 49 positivelabeled examples (PKU-suspicion confirmed) and 207 negativelabeled examples (PKU-suspicion rejected). Figure 1 visualizes the model development process in this work.

Features and Feature Selection Strategy
Starting with the variables representing the level of the metabolites measured by MRM, we constructed additional Hexadecenoyl-carnitine (C16:1) Decenoyl-carnitine (C10:2) Ketones (1) Succinylacetone (SA) variables by mathematical operation (step 1). All samples are randomly divided into a training set (3/4) and a test set (1/4). A combination of analytical methods was applied on the training set (steps 2-4) to reduce the feature set with the goal to exclude highly correlated and irrelevant features in order to improve the model performance, prevent overfitting, and select the optimal feature combination for building an optimized model.
Step 1: Additional feature variables x ′ were constructed such that: where x ′ represents the new feature variable, x i , x j is the variable representing metabolite measured by MRM, and "/" represents the ratio of the two variables. We considered this as a suitable strategy, because the Phe/Tyr ratio is a widely used clinical indicator, and metabolite level ratios have been used in previous studies (Chen et al., 2013). The expanded feature set contained more than 700 candidate features.
Step 2: Learned vector quantization (LVQ) (Kohonen, 1998(Kohonen, , 2001) is a self-organizing neural network model, based on supervised learning, which consists of a competition layer and a linear layer. The LVQ algorithm, as implemented in the caret (Max, 2008) R package, was applied to rank the features importance (calculated by the varImp() function). The top two ranked features with the highest receiver operating characteristic (ROC) curve variable importance were selected. In addition, two diagnostic standard features for clinical biomarker diagnosis of PKU were added. These features were (Met/Phe, Phe/Tyr, Phe, Tyr).
Step 3: A linear relationship between the variables is measured by using the Pearson correlation coefficient (Pearson, 1895). Pearson correlation analysis was used to further adjust feature selection and remove highly correlated features.
Step 4: We used the Wilcoxon rank sum test to evaluate whether the metabolite concentration represented by the selected features was significantly different in the positive and negative labeled sets.
Step 5: Logistic regression analysis is widely used in biomedical applications (Pearson, 1895). To increase our model's clinical interpretability, we constructed a classification model on diagnostic flags using LRA.

Model Training
We constructed four LRA models (LRA1-LRA4) from different feature set combinations ( Table 3) and calculated the Addictive Net Reclassification Index (Add NRI) and Absolute Net Reclassification Index (Abs NRI) (Hosmer and Lemeshow, 2000) for the comparison of each model. The models were computed utilizing the R glm function.

Comparison With Previous Work
To compare our results with existing results in the literature, we calculated an additional fifth model (LRA5) ( Table 4), which utilized the optimal feature set developed in a 2013 study by Chen et al. (2013).

Cross-Validation on Testing Set and Validation on an Independent Dataset
We used a 10-fold cross-validation method to evaluate the stability of the classification model and determine that the model has achieved sufficient statistical performance on the testing dataset. We used S n , S p , PPV (precision), negative predictive value (NPV), accuracy (Acc), and area under curve (AUC) as measurements to evaluate the discriminatory power of the classification models. These metrics were calculated as follows: to validate the model, including 37 PKU patients and 74 falsepositive patients. The male-to-female ratio was ∼1:1.13; the average age at the time of blood collection was 4.5 days, and the average weight was 3.4 kg.

Metabolic Dataset Exploration and Visualization
Using the machine learning visualization methods t-SNE (Li et al., 2017), we calculated a visualization of the structure of the dataset. The three-dimensional figure computed by t-SNE ( Figure 2) illustrates that there are false-positives interspersed within the negative class space, and those can be excluded using machine learning-based analysis.

Feature Selection
The two top-ranked features by LVQ were Met/Phe, Phe/Tyr (Figure 3), and Phe, and Tyr as clinical biomarkers was considered. In addition, correlation analysis with cutoff >0.8 was used to remove highly correlated features, and as a result, Phe/Tyr was excluded. By applying a Wilcoxon test, we evaluated the means of the positive and negative groups for each corresponding feature for statistically significant difference. Table 5 summarized the results of the means computations and test results.

Model Performance
Classification models (LRA1-LRA5) ( Table 6) were trained on the training dataset containing n positively labeled cases of disorder (PKU-positive: n = 39) and m negatively labeled cases (PKU-negative, m = 156). The 156 cases were originally clinically suspect for PKU but were diagnosed as PKUnegative in additional clinical screening. Figure 4 summarizes the comparison results of the performance of each model. We calculated and compared reclassification of risk between PKU patients and false-positive patients in the LRA1-LRA5 models to determine the performance of different models in screening PKU false-positive samples. The optimal model LRA3 with the optimal feature set was characterized with the results of risk reclassification (Figures 4B-D). The features included in this model were traditional biomarkers Phe, and Tyr, and the new potential biomarker Met/Phe. More PKU patients are all subject to a higher risk assessment, and more non-PKU patients were reclassified to a lower risk assessment.
In this analysis, both LRA1 and LRA2 models ( Table 3) were constructed using the traditional clinical screening markers of PKU. The feature(s) included in LRA1 was Phe, and in LRA2, Phe and Tyr. We started by comparing LRA1 and LRA2. According to the classification threshold of the resulting event, we set the low risk to <0.0270, the medium risk to between 0.0270 and 0.101, the high risk to >0.101 and then calculated Add NRI and Abs NRI (Figure 4A). Model LRA2 performed better than LRA1. Choosing the better performing model from the above results for comparison with LRA3 (Figure 4B), we also set thresholds according to the resulting events (low risk <0.0270, medium risk 0.0270-0.0307, and high risk > 0.0307). The results of risk reclassification showed that one PKU patient received a higher risk assessment, and another 26 false-positive patients received a low risk assessment.
We compared LRA3 with LRA4 (which included the features Met/Phe) ( Figure 4C) and LRA5 [which was selected from literature (Chen et al., 2013) and included the features Met, Phe, C4, Ala, Eu × Tyr, and C16:1] (Figure 4D). The cutoff values of the resulting event used in the LRA4 comparison were low risk <0.0307, medium risk 0.0307-0.0336, and high risk >0.0336, and in the comparison with LRA5: low risk <0.0067, medium risk 0.0067-0.0307, and high risk >0.0307. The results of the risk reclassification show that there is no obvious difference between the performance of LRA3 and LRA4, but compared with LRA5, all 39 PKU patient samples are subject to a higher risk assessment in the LRA3 model (Figures 4C,D).

Cross-Validation and Independent Validation
A 10-fold cross-validation was used to examine and evaluate the classification performance of the LRA1-LRA5 models developed in this study (Figure 5 and Table 7). Except for the mean S n of LRA1 and LRA4 models <95%, the other LRA models ensure that S n is >95% ( Figure 5B); the mean S p improved compared to traditional screening methods to values ranging from 28.03% (LRA5) to 53.14% (LRA3). Positive predictive value increased from 19.14% to values ranging between 23.67% (LRA5) and 32.16% (LRA3).  Between-model comparison showed that all five models are feasible; however, LRA3 with feature set (Met/Phe, Phe, Tyr, and median and mean of AUC = 0.9313 and 0.9237) (Figure 5A) exhibited improved performance compared to the others as assessed by AUC. In addition, from a standpoint of improved screening performance, LRA3 showed improvement compared to the other models. Its S p had the highest mean S p in the model with S n > 95% (Table 7).
Next, in order to verify the reliability and applicability of our LRA models, independent data of control and diagnostic cases that satisfy the level of Phe >120 µmol/L were used to validate the model. All models expressed 100% S n , and the LRA3 model still had the highest screening performance with S p of 39.19%.

DISCUSSION
The motivation for our analysis was to develop a suitable logistic regression-based machine learning model using metabolomics data, tuned to minimize the number of false-positives in PKU diagnosis during the PKU screening process. Our additional goal was to drive biomarkers discovery and thus to provide and improve precision medicine approaches in rare genetic diseases such as PKU and serve as a reference point for future implementations.
In this work, we sourced pediatric patients' metabolic data from MRM screening, applied the LVQ method to perform feature importance ordering, and used correlation analysis and logistic regression to establish an optimal classification algorithm. Overall, the results show that despite inherently noisy clinical data, meaningful features can be extracted from metabolic data to screen for false-positives in PKU. Significantly, reducing falsepositives can lighten the workload of the screening medical professionals, improve detection efficiency, and reduce the cost and inspection time expenditure of pediatric patients and their caregivers. Supplementary screening models based on MRM data can also provide more efficient cohort identification for prospective studies.

Data, Study Design, and Populations
Our data approach is distinctive from previous studies. In our design, the control group consists of high-risk individuals, which   greatly reduces the unbalance problem of the dataset and reduces the possibility of overfitting of the model. In contrast, other works employing development of decision trees (DT), support vector machines (SVM), artificial neural networks (ANN), k-nearest neighbor classifier (k-NN), discriminant analysis (DA), and LRA models use normal and disorder patient cases in order to perform such classification in neonatal screening (Baumgartner et al., 2004a(Baumgartner et al., ,b, 2005. Our work utilizes a dataset that is Chinese-focused, an approach motivated by the potential differences between Chinese and other ethnic groups. For example, the mutation spectrum of PKU in Chinese population is similar to other Asian populations but significantly different from European populations (Song et al., 2005). Furthermore, the prevalence of PKU varies geographically and ethnically from race to race; PKU birth prevalence per 10,000 live births was estimated to be 1.14 (0.96-1.33) among white, 0.11 (0.02-0.37) among black, and 0.29 (0.10-0.63) among Asian ethnic groups (Hardelid et al., 2008). The prevalence of PKU ranged from 0.005 to 0.0167% in Arab countries (El-Metwally et al., 2018). In China, the prevalence was estimated as 1 in 3,795 (Shi et al., 2012). Our work helps bridge this knowledge gap and provide insights that are applicable to the context of the Chinese health system.

Feature Selection Strategy
Feature selection usually plays a key role in machine learning to exclude attributes, which may cause overfitting results in classification analysis and reduce interpretability (Bagherzadeh-Khiabani et al., 2016). The genetic metabolic disease data based on mass spectrometry have characteristics such as limited number of samples, many features, and noise interference, to name a few. Because of unrelated and redundant attributes, the traditional unsupervised dimensionality reduction methods do not use the label information effectively, so the subspaces they find may not be the most separable in the data (Liu et al., 2017). On the one hand, the feature subset selection can identify and remove as many irrelevant and redundant variables as possible, thereby reducing the data dimensions. By selecting only the relevant attributes of the data, the machine learning prediction accuracy and classification performance can be improved (Saeys et al., 2007;Walter and Tiemeier, 2009). On the other hand, it is also valuable for pediatric clinicians to know disease-influencing variables; those new variables can be validated and can become part of an updated screening plan. The selection of variables by experts and literature review, however, may introduce bias (Matalon and Michals, 1991).
In our models, we proposed to better understand which signals might be closely related to PKU. We applied a feature selection strategy that calculates the Euclidean distance between input sample and weight vector until it finds the prototype vector closest to the sample. All the variables were ranked according to the ROC curve variable importance in the LVQ algorithm, most of the top two features are indicated or used as screening markers for PKU, supporting the validity of this strategy. When new features are included in the model, NRI can be used to compare the performance between the original model and the model after incorporating the new features. For example, we compared LRA1 (Phe) and LRA2 (Phe, Tyr); here, LRA1 is the old model, and LRA2 is the new model (Figure 4). The new proposed features differ from traditional clinical indictors; thus, clinical validation will be necessary to establish the accuracy of these features. Some related evidence has already been reported (Chen et al., 2013).

Model Differences
So far, several screening and classification models using machine learning methods have been reported for PKU (Baumgartner et al., 2004a(Baumgartner et al., ,b, 2005Chen et al., 2013). When applied to our dataset, these models performed with difference in results and achieved a low S p ranging from 0.2752 to 0.4510. Our work chooses the LRA method, which is a traditional clinical model with high clinical interpretability. Our LRA model shows improved performance compared to existing models, with a cross-validation S p = 0.5314 ± 0.0800. Another difference is that other studies have focused on constructing primary screening models. Such models perform less effectively if applied to our dataset.

Odds Ratio Comparison
Odd ratios (ORs) is a commonly used indicator in case-control studies in epidemiology, reflecting the strength of the association between disease and exposure. A value of OR >1 indicates that the factor is a risk factor; if OR <1, the exposure to the factor is protective, and if the OR = 1, this indicates that the factor does not contribute to the occurrence of the disease. Table 5 shows the features OR of our model. An increase in the concentration of Phe may cause disease, which corresponds well to OR >1, whereas Tyr shows an OR < 1 with decreasing levels. The new marker Met/Phe may be a factor rather similar as Tyr, which can be confirmed by clinical verification.

Future Direction
The application of machine learning in metabolic diseases research continues to evolve and improve. Additional pediatric clinical data, such as the child's height, weight, gestational age, and aspects of family history and -omics data such as genomics, transcriptomics, and proteomics data can be incorporated in model development, which together with the analysis of correlation between multigroup data and clinical outcomes, is our goal in future work. Additionally, our data are sourced from a single center in Shanghai. In the future, a shared platform incorporating data from multiple sources and centers would be beneficial for medically relevant discovery based on a heterogeneous population. A basis for such platform would be the development of a standardized terminology system and a harmonization of instrumentation and diagnostic measures such as cutoff values in various clinical sites.
Machine learning methods are still in an emerging stage in the research and application in rare genetic metabolic diseases, and there are still many unsolved problems to be explored in the future. For example, whether it is equally possible to use machine learning applications in other rare genetic diseases or to discover new biomarkers or even new unknown metabolic pathways and biochemical reactions remains to be explored by future research.

SUMMARY
In this study, we applied multiple LRA models, a supervised machine learning algorithm for constructing method applicable in pediatric diagnostic screening in PKU utilizing highdimensional metabolic data. These models achieved performance of guaranteed S n >95%, achieved AUC higher than 90%, and improved S p and PPV on both the test set and independent test set. We reported a new marker of PKU-Met/Phe. The model with a feature set combining this new marker with traditional biomarkers (Phe, Tyr) can reduce more than half of the falsepositives. Our study can serve as a relevant reference for the selection and evaluation of PKU screening methods in pediatric medical practice.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
This study was approved by the Ethics Review Committee of the Shanghai Children's Hospital, Shanghai Jiao Tong University (Approval No. 2019R075-E01). Informed consent was obtained from a parent or guardian of each patient before newborn screening and the study.

AUTHOR CONTRIBUTIONS
ZZ initiated this research project. ZZ, YW, JGuo, and GT processed the data. ZZ and JGu designed this study. ZZ, GG, XC, GT, and HL conducted the statistical modeling and performed the data analysis. ZZ, GG, HL, and GT designed, wrote, and reviewed the manuscript. All authors read and approved the final manuscript.