Inter- and Intraobserver Variation in the Assessment of Preoperative Colostograms in Male Anorectal Malformations: An ARM-Net Consortium Survey

Aim: Male patients with anorectal malformations (ARM) are classified according to presence and level of the recto-urinary fistula. This is traditionally established by a preoperative high-pressure distal colostogram that may be variably interpreted by different surgeons. The aim of this study was to evaluate the inter- and intraobserver variation in the assessment by pediatric surgeons of preoperative colostograms with respect to the level of the recto-urinary fistula.
Materials and Methods: Sixteen pediatric surgeons from 14 European centers belonging to the ARM-Net Consortium twice scored 130 images of distal colostograms taken in sagittal projection at a median age of 66 days of life (range: 4–1,106 days). Surgeons were asked to classify the fistula as bulbar, prostatic, bladder-neck, no fistula, or "unclear anatomy." Their assessments were compared with the intraoperative findings (kappa) for two scoring rounds with an interval of 6 months (intraobserver variation). Agreement among the surgeons' scores (interobserver variation) was also calculated using Krippendorff's alpha. A kappa over 0.75 is considered excellent, between 0.40 and 0.75 fair to good, and below 0.40 poor. Surgeons were asked to rate the quality of the images as "poor" or "good" and to provide their years of experience in ARM treatment.
Results: Agreement between the image-based rating of surgeons and the intraoperative findings ranges from 0.06 to 0.45 (mean 0.31). Interobserver agreement is higher (Krippendorff's alpha between 0.40 and 0.45). Years of experience in ARM treatment does not seem to influence the scoring. The mean intraobserver kappa between the two rounds is 0.64. Overall, the quality of the images is considered poor. Images categorized as having good quality result in a statistically significantly higher kappa (mean: 0.36 and 0.37 in the first and second round, respectively) than images in the bad-quality group (mean: 0.25 and 0.23, respectively).
Conclusions: There is poor agreement among experienced pediatric colorectal surgeons on preoperative colostograms. Techniques and analyses of images need to be improved in order to generate a homogeneous series of patients and make comparison of outcomes reliable.


INTRODUCTION
The majority of patients born with anorectal malformations (ARM) have a fistula between the blind-ending colon and the lower urinary tract in males or the genital apparatus in females. The classifications for ARM proposed over the years are based on specific anatomic characteristics, such as the level of the fistula in males and females (1-3). An appropriate classification is important because different levels of fistula correlate with different colorectal outcomes. The ability to group patients based on similar characteristics is of great importance to standardize follow-up and to make the series of different centers comparable (4). Indeed, one of the problems when dealing with adolescent and adult patients is to understand the original anatomic defect and preceding treatments in order to correctly address the complications (5-8).
In male patients, the presence and level of the urinary fistula are radiologically identified before surgical reconstruction by means of a high-pressure distal colostogram. Based on this diagnostic study, male patients are labeled as bulbar, prostatic, bladder-neck, or no fistula, and surgery is planned accordingly (4). This sort of "label" defines the male patient throughout his follow-up and possible complications.
Although distal colostograms have been performed for many years all over the world and, very recently, specific papers have been published (9-11), images may be variably described and interpreted by radiologists and pediatric surgeons. There are multiple reasons for the different interpretations of the images, and these include heterogeneous techniques, positioning of the patient, type of contrast media, experience of the radiologist, presence of the surgeon during imaging, quality of images, and complexity of cases. Because of all the above-mentioned variables, radiologists and/or surgeons can interpret the same radiologic study differently, and as a consequence, patients and the corresponding clinical assessment during follow-up are not always comparable among different centers.
In 2010, a group of European clinicians founded the ARM-Net Consortium with the purpose of collaborating in genetic, epidemiological, and clinical research, setting up an anonymized registry of new ARM patients from the participating centers, and improving the care for these patients (12,13). More than 1900 ARM patients from 30 European pediatric surgical centers have been registered so far, thus making it the second largest cohort of ARM patients after the single-center series of Peña and Levitt (14).
The purpose of this study was to collect radiological images of distal colostograms of male patients with ARM and circulate them among pediatric surgeons of the ARM-Net Consortium (15) in order to verify the concordance of interpretations to highlight pitfalls of images and assessment and, ultimately, to suggest indications for the proper execution of such an important diagnostic tool.

MATERIALS AND METHODS
Study Design
This is a diagnostic study on anonymized images (https://is.gd/colostogram) that assesses the validity of the surgical preoperative classification based on colostogram images. To exclude information bias, images were randomly mixed, and raters had no access to the origin of the images. The second rating was performed after approximately 6 months. The study was approved by the Committee for Clinical Research of Cà Foncello Hospital (Treviso, Italy) with number 847/CE Marca.

Colostogram Images
Sixteen pediatric surgeons from 14 ARM-Net Consortium centers participated in this study. Thirteen surgeons were asked to send images of distal colostograms of male patients who underwent surgery for ARM with either recto-bulbar, recto-prostatic, bladder-neck, or no fistula. The fistula was defined as bulbar when it was positioned at or below the level of the external sphincter and prostatic when it was between the external sphincter and the bladder neck.
FIGURE 1 | Recto-bulbar fistula: the distal colon is filled with contrast (asterisk), the entire urethra is visible (thin arrows), the recto-bulbar fistula is clearly recognizable (fat arrow), and the anal dimple is marked (arrowhead).
One image per patient, taken in sagittal projection and judged as the most representative of the patient, was submitted. Participants were instructed that, on an ideal study, the distal colon should be completely filled with contrast with the entire urethra and fistula clearly visible (Figure 1). All pictures were anonymized for patient and center before being sent to all participants. For each image, the final diagnosis of the type of fistula was provided based on the intraoperative finding as reported in the surgical report.
All surgeons were also asked to provide information on the age of the child at the time of imaging, the type of operation performed, and their years of experience in ARM treatment and to justify why they rated colostograms as "poor" or "good" quality.
Images were judged as poor quality because of insufficient contrast, too low pressure, or a lacking cystourethrogram. Surgeons found it difficult to classify images if any of the following situations occurred: the sacrum and/or distal urethra were not shown, the fistula was not clearly visible, the distal colon was short or part of the urethra was missing, the Foley catheter was in the urethra during imaging, the picture was rotated, or the perineal marker was missing. Finally, some surgeons claimed they would have needed more pictures to reliably classify the image.

Sample Size Calculation
According to Cantor (16), for all kappa-like agreement coefficients, the number of subjects that needs to be included in a study depends on the acceptable relative error and on the difference between the overall agreement probability and the chance-agreement probability. For the latter, we conservatively assumed 0. We anticipated that the raters would agree about 50% of the time, giving a probability difference of 0.5. With a relative error of 20%, we therefore needed at least 100 images (n = 44 with a relative error of 30%). With a lower inter-rater agreement of 0.4, 156 (20% relative error) or 69 images (30% relative error) would be needed, and with an agreement of 0.3, at least 278 or 123 images, respectively, would be the minimal size for valid results. Hence, 130 images should give a valid answer.
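The sample sizes in this section can be checked with a short calculation. The sketch below assumes the simple approximation n = 1 / (r² · d²), where r is the relative error and d is the difference between the overall and the chance-agreement probability. This formula is not quoted in the text; it is an assumption that reproduces the values of 100, 44, 156, 69, and 123 given above. The function name is hypothetical.

```python
def required_subjects(relative_error: float, agreement_gap: float) -> int:
    """Approximate number of subjects for a kappa-like coefficient.

    Assumes n = 1 / (relative_error**2 * agreement_gap**2), where
    agreement_gap is the overall agreement probability minus the
    chance-agreement probability. The formula is an assumption made for
    illustration; it is not quoted from the source.
    """
    if relative_error <= 0 or agreement_gap <= 0:
        raise ValueError("both arguments must be positive")
    # Round to the nearest whole subject, as the quoted figures appear to be.
    return round(1.0 / (relative_error ** 2 * agreement_gap ** 2))


if __name__ == "__main__":
    for gap in (0.5, 0.4):
        for err in (0.2, 0.3):
            n = required_subjects(err, gap)
            print(f"gap={gap}, relative error={err}: n={n}")
```

Because the chance-agreement probability was conservatively set to 0, the gap equals the anticipated agreement itself, so halving the acceptable relative error roughly quadruples the number of images needed.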

Data Management
Survey data were collected online using Research Electronic Data Capture (REDCap) tools hosted at the Department of Child and Adolescent Psychiatry and Psychotherapy, University Medical Center of the Johannes Gutenberg University Mainz, Germany (17). REDCap is a secure, web-based application designed to support data capture for research studies. Each surgeon received a password and independently scored all images, including their own, anonymously, and blinded to others. They were asked to classify the fistula as bulbar, prostatic, bladder-neck, no fistula, or unclear anatomy.
The surgeons scored the images twice with an interval of ∼6 months between the two scoring rounds. In the second round of scoring, they also had to rate the quality of the image as low (score = 1), medium (score = 2), or high (score = 3), or as "cannot decide about quality" (score = 0). In both rounds, the information on the surgical report or the origin of the image was concealed, and in the second round, the voting from the first round was omitted to exclude information bias.
Each surgeon was also asked to provide information on their number of years of experience in scoring images of colostograms and the number of images scored per year.

Statistical Analyses
To measure the level of agreement between the scoring of the surgeons and the intraoperative findings, the kappa was calculated for both the first and the second round of scoring. The difference between the scoring of the images in the first and second rounds for each surgeon is called the intraobserver variation (kappa). The mean intraobserver variation was calculated for the 16 pediatric surgeons. Finally, the agreement among the scoring of different surgeons was also calculated in both rounds as a measure of the interobserver variation. For this measurement, we used an online calculator for Krippendorff's alpha that also supported missing data in multiple raters (18,19). All missing data remained in the data set, although two images from the first round and three from the second round were excluded from this analysis because more than 85% of the ratings were missing.
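As an illustration of the two agreement measures named above, the following sketch computes Cohen's kappa for one rater against a reference and Krippendorff's alpha for nominal categories with missing ratings. It is a minimal sketch and not the online calculator used in the study; the function names and the use of `None` for a missing rating are assumptions made for this example.

```python
from collections import Counter
from itertools import permutations


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two nominal rating lists; pairs with None are skipped."""
    pairs = [(a, b) for a, b in zip(ratings_a, ratings_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        raise ValueError("no complete rating pairs")
    observed = sum(a == b for a, b in pairs) / n
    freq_a = Counter(a for a, _ in pairs)
    freq_b = Counter(b for _, b in pairs)
    # Chance agreement from the two marginal distributions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa undefined: chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha (nominal data) for a list of units.

    Each unit is the list of ratings given to one image; None marks a
    missing rating. Units with fewer than two ratings cannot be paired
    and are ignored.
    """
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue
        # Every ordered pair of ratings within the unit, weighted by 1/(m-1).
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1.0 / (m - 1)
    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable ratings")
    observed_disagreement = sum(
        w for (c, k), w in coincidences.items() if c != k
    )
    expected_disagreement = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected_disagreement == 0:
        raise ValueError("alpha undefined: only one category was used")
    return 1.0 - observed_disagreement / expected_disagreement
```

In the setting described here, `cohen_kappa` would take one surgeon's scores and the intraoperative findings (or the same surgeon's scores from both rounds), whereas `krippendorff_alpha_nominal` would take, for each image, the scores of all surgeons in one round.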
We performed additional analyses by dividing the images based on quality to see whether high-quality images scored better than low-quality images. The average quality of each image was calculated. When the average was medium or higher (score ≥ 2), the quality of the image was categorized as good; if it was lower than medium (score < 2), it was categorized as bad. The mean kappa and intraobserver variation were compared between the group of images of good quality and those of bad quality using Student's t-test because the data were normally distributed (Shapiro-Wilk test). A kappa over 0.75 is usually considered to be excellent, between 0.40 and 0.75 is fair to good, and below 0.40 is poor. Statistical analyses were performed using SPSS Statistics, Version 22.0 for Windows (IBM SPSS Inc., Chicago, IL, USA).
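The categorization rules in this paragraph can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, and the handling of "cannot decide about quality" scores (0) and missing scores is an assumption, because the text does not state whether they entered the average.

```python
from statistics import mean


def interpret_kappa(kappa: float) -> str:
    """Map a kappa value to the interpretation bands stated in the text."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"


def image_quality(scores):
    """Categorize one image from its quality scores (1=low, 2=medium, 3=high).

    Assumption: scores of 0 ("cannot decide about quality") and missing
    scores (None) are left out of the average; the source does not say how
    they were handled. Returns None when no usable score exists.
    """
    usable = [s for s in scores if s in (1, 2, 3)]
    if not usable:
        return None
    return "good" if mean(usable) >= 2 else "bad"


if __name__ == "__main__":
    print(interpret_kappa(0.31))         # band for the mean kappa reported in the abstract
    print(image_quality([1, 2, 3, 0]))   # average of usable scores is 2
```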

RESULTS
In total, 135 distal colostograms were submitted by the participating centers. Five pictures were excluded due to evident low quality or as a second picture of the same patient. Therefore, 130 images were analyzed. The number of images per center ranged from 3 to 15.
The median age of patients at the time of the colostograms was 66 days of life (range: 4-1,106 days). The images were taken between the years 2000 and 2015 but mainly (70%) after 2010. Most surgeons (n = 11) had more than 10 years' experience in scoring images of ARM patients. Seven surgeons reported that they usually scored the images with a radiologist, five with other surgeons, three on their own, and one surgeon did not specify.
Agreement between the image-based rating of the surgeons and the surgical reports was low in both the first (kappa ranging from 0.06 to 0.43, mean 0.31) and second (0.14-0.45, mean 0.31) rounds (Table 1). The interobserver agreement was generally higher but still quite low (Krippendorff's alpha of 0.40 and 0.45, respectively). The intraobserver agreement between the first and second rounds for the same surgeon was higher, with a mean kappa of 0.64 (SD: 0.11; range: 0.41-0.77). Years of experience in ARM treatment did not seem to influence the scoring.
More than half (n = 69) of the total number of images were categorized as poor quality. These colostograms were mainly performed before 2010 (Table 2). The type of fistula did not substantially differ between images of good and bad quality. Agreement between surgeons' scoring and surgical reports was still low in the group of good-quality images (mean 0.36 and 0.37 in the first and second round, respectively) but significantly higher than for the bad-quality images (mean 0.25 and 0.23, respectively). The intraobserver kappa was higher with the good-quality images, but the difference from the bad-quality images did not reach statistical significance. The same applied to the interobserver Krippendorff's alpha, which was 0.46 for good- and 0.35 for bad-quality images in the first round and 0.49 and 0.41, respectively, in the second round.

DISCUSSION
This study shows poor agreement among experienced pediatric colorectal surgeons scoring the preoperative colostograms of male ARM patients even when the quality of images was assessed as medium or good. The reasons for poor agreement may be a combination of technical aspects, complexity of cases, and experience of both the radiologist and the pediatric surgeon in performing the study. To overcome the problem of personal interpretation of images, a more expensive and sophisticated technique, such as pelvic-perineal MRI, has been proposed (20). Conversely, in developing countries with limited access to contrast studies, the trans-perineal ultrasound-guided colostogram with saline has been proposed as an alternative method. However, that is an even more operator-dependent procedure (21,22). Hosokawa et al. also propose the use of ultrasound in combination with a voiding cystourethrogram in male neonates undergoing primary repair without colostomy with the aim of detecting and locating the fistula as precisely as possible (23), and recently, the importance of the preoperative colostogram has been emphasized in dedicated publications (9-11). These last papers address both radiologists and pediatric surgeons and provide a series of technical details as well as the most common pitfalls, which are extremely important and useful for correct clinical practice and fill the gap in knowledge highlighted in the present study.
From a technical point of view, when performing the colostogram, it is important to (a) fully distend the colon with the contrast medium, using a balloon-tip catheter with the inflated balloon inserted into the distal stoma, in order to visualize the fistula and to avoid a false-negative finding of no fistula; (b) administer enough contrast to display both bladder and urethra and, therefore, determine the exact level of the fistula; (c) mark the anal dimple in order to calculate the distance between the fistula and the perineum; (d) visualize the entire sacrum; and (e) correctly position the patient in the sagittal view to avoid colon and bladder overprojection (11). The radiologists' experience in performing this contrast study may also play a key role and can only be acquired through the reporting of many cases. However, even in referral centers, the number of male ARM patients needing a colostogram is quite low; therefore, in each center, a team of pediatric radiologists and surgeons should be dedicated to these special cases, and diagnostic interpretations should be performed by multiple raters. Dedicated teams, in turn, should refer to standards of diagnosis provided by the international networks, such as the European Reference Network e-UROGEN (http://eurogen-ern.eu). It is the surgeons' responsibility to ensure the best for every ARM patient, and this includes rejecting suboptimal or, even worse, inadequate pictures for interpretation.
The importance of correctly classifying patients cannot be overestimated. When the anatomy of the patient is incorrectly defined before the surgical reconstruction, significant and severe consequences may occur. First, surgeons confident with laparoscopy may be induced to use this technique whenever the fistula is incorrectly described at a higher level, thus adding unnecessary known risks (24,25). Second, based on incorrect preoperative information, the surgeon might look for the fistula at the wrong level along the urethra, again causing avoidable complications. Finally, the outcome is expected to differ among the groups of patients, namely, to be better, in terms of continence, for patients with bulbar fistulas and worse for those classified as prostatic and bladder-neck fistulas (26-29). Inadequate interpretation of the severity of malformations may lead to incorrect information being conveyed to parents and incorrect follow-up care. Moreover, the exchange of information about the outcome of patients with ARM among different centers might be incorrect and misleading if the wrong classifications are used.
A major strength of the study is the adequate sample size, available data set, and replicable evaluation for training purposes (https://is.gd/colostogram).
However, some limitations were also encountered. We considered the surgical report to be the gold standard when comparing the scoring of the colostogram. However, even this cannot be considered a perfect standard because, during surgery, it may sometimes be difficult to determine the exact level of the fistula. Moreover, it was clear from the scoring by the surgeons that a few of them tended to score one specific type of fistula, such as the bulbar fistula, more often and the others much less often. This may have biased the intraoperative finding as well. When looking at the scoring of the images, the agreement was somewhat higher for those images the surgeons provided themselves than for the total group of images, but it still did not exceed the threshold for excellent agreement (kappa of 0.75; data not shown). The fact that the mean intraobserver kappa was 0.64 and, therefore, not perfect suggests that the surgeons had some doubts about the type of fistula and provided different answers at different times. A few centers provided almost exclusively good-quality images, but surgeons from these centers did not score better than others. In the setting of this study, the surgeon had to score the images alone although, in clinical practice, most surgeons perform the diagnostic assessment of the image together with other surgeons, and this might also have caused a bias. The fact that a single image per patient was analyzed certainly limited the ability to accurately assess the location of the fistula along the urethra. In clinical practice, indeed, multiple images and views are reviewed to better understand the location of the fistula. This study was designed and performed by pediatric surgeons without involvement of radiologists. This contributed to making the study more uniform, but it might also have reduced the accuracy of the image assessment as radiological expertise was not taken into consideration.
Finally, an explanation for the poor interobserver agreement could be that the location of the fistula does not always fall neatly into the three categories that were used, as some fistulas occur in the transition areas from bulbar to prostatic urethra or from prostatic urethra to bladder neck, making the precise urethral location difficult to determine.
This is the first multicentric study that investigates the validity of a very widespread practice: the preoperative, high-pressure distal colostogram for male patients affected by ARM. The poor agreement among pediatric colorectal surgeons and the questions raised by the participants call for an improvement of images and analyses in order to provide more valid and valuable preoperative information. Moreover, it is important to generate a homogeneous series of patients and make comparison of outcomes among studies reliable. Training is necessary for pediatric surgeons to interpret the images as well as for radiologists to provide and interpret the radiological studies.