The Modified Heidelberg and the AI Appendicitis Score Are Superior to Current Scores in Predicting Appendicitis in Children: A Two-Center Cohort Study

Background: Acute appendicitis is the most frequent reason for abdominal surgery in children. Since diagnosis can be challenging, various scoring systems have been published. The aim of this study was to evaluate, validate, and improve different appendicitis scores in a very large cohort of children with abdominal pain. Methods: Retrospective analysis of all children who were hospitalized due to suspected appendicitis at the Pediatric Surgery Department of the Altonaer Children's Hospital and University Medical Center Hamburg-Eppendorf from 01/2018 until 11/2019. Four different appendicitis scores (Heidelberg Appendicitis Score, Alvarado Score, Pediatric Appendicitis Score and Tzanakis Score) were applied to all data sets. Furthermore, the best score was improved, and artificial intelligence (AI) was applied and compared with the current scores. Results: In 23 months, 463 patients were included in the study. Of those, 348 (75.2%) were operated on for suspected appendicitis, and in 336 (96.6%) patients the diagnosis was confirmed histopathologically. The best predictors of appendicitis (simple and perforated) were rebound tenderness, cough/hopping tenderness, ultrasound, and laboratory results. After modification, the HAS provided excellent results for simple appendicitis (PPV 95.0%, NPV 70.0%) and very good results for perforated appendicitis (PPV 34.4%, NPV 93.8%), outperforming all other appendicitis scores. Discussion: The modified HAS and the AI score show excellent predictive capabilities and may be used to identify most cases of appendicitis and, more importantly, to rule out perforated appendicitis. The new scores outperform all other scores and are simple to apply. The modified HAS comprises five features that can all be assessed in the emergency department, as opposed to current scores, which are relatively complex to use in a clinical setting because they include up to eight features with various weighting factors.
In conclusion, the modified HAS and the AI score may be used to identify children with appendicitis, yet prospective studies to validate our findings in large multi-center cohorts are needed.

Keywords: appendicitis, children, diagnosis, prediction, scores

BACKGROUND
Abdominal pain is one of the most common reasons for pediatric emergency presentation, and acute appendicitis represents the most frequent reason for abdominal surgery in children (1). The incidence of appendicitis is reported to be 151/100,000 person-years in Western Europe (2). Diagnosis can be very challenging, particularly in the early stages of appendicitis, when clinical manifestations may be less typical, and in younger children (3). The reported negative appendectomy rate in patients with clinically diagnosed suspected acute appendicitis is about 15-20% and can be associated with considerable morbidity (4). Rapid diagnosis is important because increased time between onset of symptoms and surgical intervention is associated with an increased risk of appendiceal perforation and therefore with increased morbidity (5).
Thus, different elements of history, clinical examination, as well as laboratory and radiologic findings are used to diagnose appendicitis. In order to further improve diagnosis, various algorithms and scoring systems have been developed. The most widely used scores are the Heidelberg Appendicitis Score (HAS), the Pediatric Appendicitis Score (PAS), the Alvarado Score and the Tzanakis Score (6)(7)(8)(9). While both the HAS and the PAS are designed for pediatric patients (7,9), the Alvarado and the Tzanakis Score were conceptualized in a broader patient population (6,8). The HAS appears to be the simplest to apply, but has not been validated other than in its initial cohort. Even though the different scores are suitable to predict appendicitis, their practicability is limited due to the multitude of different features. The clinician has to assess up to eight factors, which are weighted unequally, making the scores difficult to use in an emergency department setting. The aim of the current study was to evaluate and validate the different scoring systems in two large pediatric surgery centers.

Study Cohort Characteristics
This retrospective cohort study included all children (age 1-17) who were hospitalized for suspected appendicitis, as assumed by a triage nurse, at the Department of Pediatric Surgery at the Altonaer Children's Hospital in Hamburg (AKK) and the University Medical Center Hamburg-Eppendorf (UKE) between January 2018 and November 2019. In the interdisciplinary emergency department, the nurses direct patients either to the surgical or to the pediatric department. The study is a sub-analysis of two prospective cohort studies that are in accordance with the guidelines of the Medical Research Ethics Committee of Hamburg (Ethik-Kommission der Ärztekammer Hamburg, PV5459, and PV5891).

Study Cohort Outcomes
Patients were selected from the hospital database by ICD-10 codes for abdominal pain, appendicitis and functional intestinal disorder. Children with chronic medical conditions were excluded (i.e., chronic constipation and Hirschsprung's disease). Medical files, including patient charts, operating theater records, and office notes, were reviewed, and routinely obtained characteristics were recorded. These included demographic data (age and sex), clinical history (duration of abdominal pain, nausea/vomiting, stool consistency, dysuria, and pyrexia), physical signs (tenderness in the right lower quadrant, rebound tenderness, cough/hopping tenderness in the right lower quadrant, and psoas sign), laboratory results (white blood cell count (WBC) >11 × 10⁹/L, C-reactive protein (CRP) >20 mg/L, neutrophilia >75% or >7.9 × 10⁹/L), urinalysis (nitrite, ketone, and leucocytes), ultrasound findings (appendix outer diameter >6 mm, surrounding tissue involvement, appendix wall hyper-perfusion, free fluids, wall edema, and signs of constipation), intraoperative findings and histopathology of the appendix. All items listed were evaluated and utilized for the scores. Physical examinations were performed by a resident and/or a senior physician of pediatric surgery, and ultrasound examinations were performed by a resident of Pediatric Surgery or Pediatric Radiology. All removed appendices underwent a standardized histopathological examination at the same pathology department. The differentiation of distinct stages of appendicitis was based on histopathological findings, which is usual practice in studies of appendicitis (10). The assessment of perforation was based on the operating theater records. Four different appendicitis scores, namely the HAS (7), the Alvarado score (6), the PAS (9) and the Tzanakis score (8), were assessed in all children (compare Table 1).

Development of the Scores
In addition to current scores, a modified Heidelberg Appendicitis Score (mHAS) based on a classification and regression tree (CART) analysis and an AI-based score were developed. Statistics were performed using SPSS Statistics 26 (IBM, NY, USA) and R 4.0 (Foundation for Statistical Computing, Vienna, Austria). The HAS was improved using CART analysis, resulting in the "modified HAS." CART has some advantages over logistic regression analysis, such as the ability to utilize large numbers of predictor variables and non-reliance on the underlying distributions for statistical inference (11). For validation, the data were randomly divided into two parts: 70% of the data were used for training and 30% for testing. To avoid over-fitting, the decision tree was pruned for a small risk value. The AI-based approach was calculated using the R package Random Forest (12). Random Forest is a method of regression which can capture non-linear relationships by averaging the predictions of multiple decision trees (13). For validation, the entire data set was randomly split into two parts, with 70% of the data used for training and 30% for testing. This was performed 20 times with a new random distribution of the data each time, to reduce the influence of outliers, and the average of the results was taken. In order to prevent over-fitting of the model to the data of the current study, which could limit generalization in future real-world use, the random forest method uses multiple trees, each of which is trained on a different bootstrapped dataset. In our case, 200 trees were used.
For both analyses (AI and modified HAS), all features listed in the methods section were utilized. Data are presented as mean (standard deviation). The level of significance was set to 0.05.
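The repeated 70/30 validation described above can be illustrated with a minimal Python sketch. This is not the analysis code of the study, which was run in SPSS and R. The helper names (`split_70_30`, `repeated_holdout`), the placeholder model, the accuracy metric, and the synthetic toy records are all assumptions made for illustration; a real analysis would plug a random forest or a pruned decision tree into `fit`.

```python
import random
from statistics import mean


def split_70_30(records, rng, train_fraction=0.7):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def repeated_holdout(records, fit, score, repeats=20, seed=0):
    """Average a test-set metric over several independent random splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        train, test = split_70_30(records, rng)
        model = fit(train)                  # train on 70% of the data
        results.append(score(model, test))  # evaluate on the held-out 30%
    return mean(results)


# Placeholder model (hypothetical): call appendicitis whenever the single
# binary feature is positive. It deliberately ignores the training data.
def fit(train):
    return lambda features: features["us_positive"]


def accuracy(model, test):
    """Fraction of test records for which the prediction matches the label."""
    return mean(1.0 if model(x) == y else 0.0 for x, y in test)


# Synthetic toy records, NOT study data: (features, has_appendicitis).
toy = [({"us_positive": i % 3 != 0}, i % 3 != 0) for i in range(100)]
print(repeated_holdout(toy, fit, accuracy))
```

Averaging over 20 independent splits makes the reported metric less dependent on any single, possibly unrepresentative, partition of the cohort.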

RESULTS
In total, 550 patients were included in the current study. After exclusion of children with chronic medical conditions, 463 patients fulfilled the inclusion criteria. Of those, 348/463 were operated on for suspected appendicitis, and in 336/348 of these patients appendicitis was confirmed histopathologically. The negative appendectomy rate was thus 12/348 (3.4%). In 234/348 (67.2%) children appendicitis was simple (phlegmonous and gangrenous), and 102/348 (29.3%) children had perforated appendicitis. Children without appendicitis (127/463) suffered from constipation, mesenteric lymphadenitis, gastroenteritis, unspecific abdominal pain, ovarian cysts, or adnexitis. In these children, the diagnosis was confirmed by observation. Mean age of the cohort was 10.9 (3.7) years, and 264/463 (57.0%) were female. Children with simple appendicitis were significantly younger than patients with perforated appendicitis. The clinical feature from the patient's history with the highest predictive value was anorexia; if present, it had high positive predictive values (Table 2). Clinical signs such as tenderness in the right lower quadrant, rebound tenderness, and cough/hopping tenderness had excellent predictive values (Table 2). WBC, CRP, neutrophils, and urine ketones were considerably elevated in most, but not all, children with appendicitis (Table 2). Moreover, most children with appendicitis had pathological ultrasound findings (Table 2).
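The cohort proportions reported above follow directly from the stated counts. The short Python sketch below recomputes them as a consistency check; the variable names and the `pct` helper are illustrative choices, and all counts are taken from the text.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Counts as reported in the text.
included = 463
operated = 348
confirmed = 336
simple = 234
perforated = 102
female = 264

negative_appendectomies = operated - confirmed  # operated, histology negative
no_appendicitis = included - confirmed          # all children without appendicitis

# Simple and perforated cases together make up all confirmed cases.
assert simple + perforated == confirmed

print("Operated:", pct(operated, included), "%")
print("Confirmed among operated:", pct(confirmed, operated), "%")
print("Negative appendectomy rate:", pct(negative_appendectomies, operated), "%")
print("Simple:", pct(simple, operated), "%")
print("Perforated:", pct(perforated, operated), "%")
print("Female:", pct(female, included), "%")
print("Children without appendicitis:", no_appendicitis)
```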
Clinical data was available in all patients. Laboratory and ultrasound data were available in 456/463 patients. For the scores, only patients with complete data samples (456/463 patients) were included. Applying the current scores revealed low to moderate sensitivity (31.0-80.4%) with moderate to high specificity (78.7-96.1%) for simple appendicitis and a slightly improved sensitivity (40.2-84.3%) with low to moderate specificity (20.1-73.5%) for perforated appendicitis (Table 3). Thus, a CART analysis was performed, and five factors quite similar to those included in the HAS were identified. In the modified HAS, pain quality, which had been present in only 7/463 patients, was "replaced" by WBC and CRP. The modified HAS consists of US demonstrating APP (importance 0.18), CRP > 20 mg/L (importance 0.08), rebound tenderness (importance 0.01), WBC > 11 × 10⁹/L (importance 0.006), and tenderness in the right lower quadrant (importance 0.001). The feature importance indicates which features contribute most to the decision making in the model. In contrast, the AI-based approach resulted in only four factors. Interestingly, unlike in all other scores, tenderness of the right lower quadrant did not seem to be a decisive determinant of acute appendicitis using this method. The AI score consists of US demonstrating APP (importance 0.37), CRP > 20 mg/L (importance 0.25), rebound tenderness (importance 0.21), and WBC > 11 × 10⁹/L (importance 0.08).
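The overlap between the two feature sets can be made explicit with a small Python sketch. The importances are copied from the text; the dictionary names and the `ranked` helper are assumptions for illustration. Note that importances from a CART model and from a random forest are on different scales, so only the ranking within each model is compared here, not the absolute values.

```python
# Feature importances as reported in the text.
mhas_importance = {
    "US demonstrating APP": 0.18,
    "CRP > 20 mg/L": 0.08,
    "rebound tenderness": 0.01,
    "WBC > 11 x 10^9/L": 0.006,
    "tenderness in the right lower quadrant": 0.001,
}
ai_importance = {
    "US demonstrating APP": 0.37,
    "CRP > 20 mg/L": 0.25,
    "rebound tenderness": 0.21,
    "WBC > 11 x 10^9/L": 0.08,
}


def ranked(importance):
    """Feature names sorted from most to least important."""
    return sorted(importance, key=importance.get, reverse=True)


# Features used by the modified HAS but not by the AI score.
only_in_mhas = set(mhas_importance) - set(ai_importance)

# Ranking of the shared features within the modified HAS.
shared_in_mhas_order = [f for f in ranked(mhas_importance) if f in ai_importance]

print("Only in modified HAS:", only_in_mhas)
print("Same ranking of shared features:", shared_in_mhas_order == ranked(ai_importance))
```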
Both new scores achieved excellent diagnostic values for simple appendicitis and identified almost all cases of perforated appendicitis (Tables 3, 4). If only children with a positive AI score had undergone appendectomy, only 2/102 (2.0%) children with perforated appendicitis would have been missed, and 3/102 (2.9%) children if the modified HAS had been applied. In contrast, the Alvarado score would have missed 16/102 (15.7%), the PAS 61/102 (59.8%), the Tzanakis score 19/102 (18.6%), and the original HAS 60/102 (58.8%). Both new scores also yield a decent specificity (modified HAS: 70.9%, AI score: 70.1%), although the other scores are superior in this regard. Given the excellent sensitivity, the new scores outperform all currently available ones (Tables 3, 4).
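The numbers of missed perforated cases translate directly into sensitivity for perforated appendicitis, since every missed case is a false negative among the 102 perforated cases. The following Python sketch performs this arithmetic; the counts are taken from the text, while the variable names and the `sensitivity` helper are illustrative.

```python
PERFORATED_CASES = 102

# Perforated cases that each score would have missed, as reported in the text.
missed = {
    "AI score": 2,
    "modified HAS": 3,
    "Alvarado score": 16,
    "Tzanakis score": 19,
    "original HAS": 60,
    "PAS": 61,
}


def sensitivity(missed_cases, total_cases):
    """Sensitivity in percent: detected cases divided by all true cases."""
    return 100 * (total_cases - missed_cases) / total_cases


# List the scores from fewest to most missed cases.
for name, n in sorted(missed.items(), key=lambda item: item[1]):
    sens = sensitivity(n, PERFORATED_CASES)
    print(f"{name}: missed {n}/{PERFORATED_CASES}, sensitivity {sens:.1f}%")
```

The resulting values for the four established scores span 40.2% to 84.3%, which matches the sensitivity range reported for perforated appendicitis.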

DISCUSSION
Different scoring systems have been published to aid in diagnosing acute appendicitis. Among these, the Alvarado score, the Tzanakis score, and the PAS have been validated most often. For these scores, varying values of sensitivity (71.9-100%) and specificity (66.6-100%) were described in previous studies (14)(15)(16)(17). Although in principle suited to predicting acute appendicitis, these scores lack clinical practicability due to their complexity. For instance, the Alvarado score and the PAS comprise eight different features, which are weighted differently, complicating their applicability even further. The HAS comprises only five features without any differentiated weighting and is therefore very easy to apply. Unfortunately, in the current study pain quality was rarely present. Additionally, it might be difficult to assess in younger children (18). On account of this considerable shortcoming, two new appendicitis scores were developed. Both scores outperform all previous scores by far with regard to sensitivity while retaining a decent specificity, which should support their clinical application.
The modified HAS and AI score strengthen the case for each other, as they were generated by entirely different techniques but include almost the same items. They yield a nearly identical sensitivity and specificity. However, in contrast to all other scores, the AI-based approach does not consider tenderness of the right lower quadrant to be a significant predictor of acute appendicitis. Because of this unexpected finding, we have serious doubts that this new AI score would find acceptance amongst physicians. Tenderness of the right lower quadrant can be assessed very easily and quickly, even by a physician without much experience, and is therefore a very suitable indicator in clinical practice. Furthermore, reproducing the random forest-based determination of the important predictors of appendicitis is rather challenging for the reader or the interested clinician, whereas CART analysis is straightforward and can easily be reproduced and adjusted if desired. Therefore, we primarily suggest promoting the use of the modified HAS rather than the AI score.
In general, the modified HAS and AI score may suggest that clinical signs are less important than previously assumed in the diagnosis of pediatric appendicitis (19). This may be explained by atypical presentation of appendicitis in young children and limited communication skills in this age group (20).
Both new scores include CRP and WBC. They are nonspecific markers of inflammation whose diagnostic value regarding appendicitis has been evaluated in numerous studies (21)(22)(23)(24). Previous publications have stated a 100% negative predictive value for acute appendicitis if both WBC and CRP are normal, whereas other reports found appendicitis in the presence of normal CRP, WBC, and neutrophils (22,25). One recent study found the absence of neutrophilia to be the most important item for identifying children with a low risk of having appendicitis (26). Furthermore, in some publications, measurements of CRP and/or WBC revealed additional diagnostic value regarding the severity of appendicitis (e.g., advanced and perforated), whereas other publications negate such correlations (22,23,27). Some authors hypothesized that these conflicting findings are the result of the different pathogenesis of simple and complicated appendicitis. Individual differences in immunity have been proposed, as indicated by increased neutrophil and monocyte counts in complicated appendicitis and eosinophilia in simple appendicitis (28).
Both the original and the modified HAS include ultrasound (US) as an essential component. In our cohort, US demonstrating appendicitis seems to be one of the best predictors of appendicitis in children with abdominal pain. US has been used to aid in diagnosing acute appendicitis since the 1980s and has already been evaluated extensively in pediatric populations (29)(30)(31)(32)(33). US has been found to be the most cost-effective diagnostic approach in children with suspected appendicitis (34). The drawback of US is that its application can potentially delay surgical treatment and that its diagnostic value is quite dependent on the experience of the user (29,30,32,35). In the current and in previous studies, embedding US in diagnostic algorithms or scoring systems improved sensitivity and specificity (30). To some extent, US outperforms the clinical assessment of emergency physicians in diagnosing acute appendicitis (36).

Limitations
Most limitations of the current study are inherent in a retrospective study. Furthermore, at both centers, nurse-directed triage pre-selected patients with a high probability of appendicitis for the surgical department, which could affect the results of the current study and limit the generalizability of the two new scores. Hence, the two new scores should be evaluated in a cohort of patients with abdominal pain in order to validate our findings. Moreover, as in most studies that rely on clinical features, another limitation is inter-observer variability, as experience may significantly affect the examiner's interpretation of the clinical findings (16). However, the modified HAS and the AI score should be very robust in this regard, as tenderness in the right lower quadrant and rebound tenderness are very basic factors which can be assessed easily. Finally, US is very user-dependent, and in the absence of a specialized pediatric radiologist, trained pediatricians or pediatric surgeons are needed to assess the factors of the modified HAS. Since the modifications of the score refer to the results in our cohort, prospective validation should be performed in future studies.
In summary, in our cohort all presented scoring systems are suited to predicting acute appendicitis, with considerable restrictions regarding clinical practicability. Since the modified Heidelberg Appendicitis Score and the Artificial Intelligence score demand the fewest clinical features without differentiated weighting, they provide the best clinical practicability while providing good predictability for both appendicitis in general and perforated appendicitis. We recommend the modified HAS, as it has excellent predictive capabilities, is easy to assess, and is more likely to be adopted by clinicians than the AI score, which does not include a key symptom of appendicitis. The modified HAS resembles the current diagnostic work-up in most centers treating appendicitis in children. However, our findings should be validated in prospective, multicenter studies.

DATA AVAILABILITY STATEMENT
The original contributions generated in the study are included in the article/supplementary materials; further inquiries can be directed to the corresponding author.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Ethik-Kommission der Ärztekammer Hamburg, PV5459, and PV5891. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
CS and JE designed the study, collected data, analyzed the data, and drafted and revised the paper. MK, JH, C-MJ, TG, and KR collected data and drafted and revised the paper. MB designed the study, collected data, analyzed the data, performed statistics, and drafted and revised the paper. All authors contributed to the article and approved the submitted version.