Salivary Gland Ultrasonography in Sjögren's Syndrome: A European Multicenter Reliability Exercise for the HarmonicSS Project

Objectives: Salivary gland ultrasonography (SGUS) is increasingly applied for the management of primary Sjögren's syndrome (pSS). This study aims to: (i) compare the reliability between two SGUS scores; (ii) test the reliability among sonographers with different levels of experience. Methods: In the reliability exercise, two four-grade semi-quantitative SGUS scoring systems, namely De Vita et al. and OMERACT, were tested. The sonographers involved in work-package 7 of the HarmonicSS project from nine countries in Europe were invited to participate. Different levels of sonographers were identified on the basis of their SGUS experience and of the knowledge of the tested scores. A dedicated atlas was used as support for SGUS scoring. Results: Twenty sonographers participated in the two rounds of the reliability exercise. The intra-rater reliability for both scores was almost perfect, with a Light's kappa of 0.86 for the De Vita et al. score and 0.87 for the OMERACT score. The inter-rater reliability for the De Vita et al. and the OMERACT score was substantial with Light's Kappa of 0.75 and 0.77, respectively. Furthermore, no significant difference was noticed among sonographers with different levels of experience. Conclusion: The two tested SGUS scores are reliable for the evaluation of major salivary glands in pSS, and even less-expert sonographers could be reliable if adequately instructed.


INTRODUCTION
Primary Sjögren's syndrome (pSS) is a systemic autoimmune and lymphoproliferative disease, mainly involving the salivary glands (SGs) (1). In pSS, the SG inflammatory process ultimately results in glandular structural damage (2,3). Active glandular lesions are characterized by inflammation and lymphoproliferation, with varying degrees of glandular damage by fibrosis, fatty accumulation, and loss of acinar and ductal parenchyma (4)(5)(6). These pathological abnormalities, for whose characterization SG biopsy is the gold standard technique, lead to the typical glandular inhomogeneity detected by salivary gland ultrasonography (SGUS), with hypo/anechoic areas and hyperechoic bands (7)(8)(9)(10). So far, parenchymal inhomogeneity has proved to be the main sonographic feature on which SGUS scores in pSS are built (7,8,11). In 1992, De Vita et al. first developed a comprehensive sonographic score of major SGs in pSS, defining inhomogeneity, by means of a discriminant analysis, as the main SGUS abnormality associated with pSS; the resulting semiquantitative score ranged from 0 to 3 in each gland, from normal-appearing morphology to severe inhomogeneity (7). Several scoring systems have been proposed subsequently, and most of them used glandular inhomogeneity as the key SGUS abnormality. In addition, several clinicians now routinely use SGUS to assess patients with suspected or established pSS (9,10,12). However, even though many authors strongly believe that SGUS is relevant for the management of pSS, definite recommendations are still lacking, and this technique is not yet part of the classification criteria for pSS. This is mainly due to: (i) the absence of consensus on elementary SGUS lesions and scoring when pSS classification criteria were set up (12); (ii) the evidence of significant intra- and inter-rater disagreement (12)(13)(14); (iii) the use of old pSS cohorts for validation of pSS classification criteria, when SGUS was not yet fully developed (15).
To overcome these issues, the Outcome Measures in Rheumatology (OMERACT) working group on the use of ultrasonography in pSS recently generated, after a three-round Delphi process, the definitions for the SGUS elementary lesions in pSS and the scanning procedure (16). Lastly, the same OMERACT group developed a four-grade SGUS score, which showed excellent intra-rater reliability and good inter-rater reliability between experts (16). With the possible addition of SGUS as a new criterion, the 2016 ACR-EULAR criteria for the classification of pSS may improve in sensitivity without a change in specificity (17).

Abbreviations: ACR-EULAR, American College of Rheumatology/European League Against Rheumatism; GRAAS, Guidelines for Reporting Reliability and Agreement Studies; NU-E, non-users and experts; NU-NE, non-users and non-experts; OMERACT, Outcome Measures in Rheumatology Clinical Trials; OMI, outcome measure instruments; PG, parotid gland; pSS, primary Sjögren's syndrome; SGs, salivary glands; SGUS, salivary gland ultrasonography; SMG, submandibular gland; U-E, users and experts; U-NE, users and non-experts.
On the other hand, very few data exist to support the reliability of SGUS in pSS, and all the published studies involved only experts and well-trained sonographers. Therefore, it definitely remains to be investigated how reliability varies along with the observer training level and experience, and this still represents the major obstacle for a wider acceptance of SGUS in pSS evaluation (13).
Furthermore, the images from previous studies were not publicly available, making it rather difficult for subsequent studies to reproduce and/or objectively compare their findings. Accordingly, the leading pSS experts (35 partners from 13 countries) have recently started the HarmonicSS (http://harmonicss.eu) initiative to bring together independently reported cohorts and multicentric data, including SGUS in a dedicated work-package (WP), namely WP7. This study was created within the HarmonicSS initiative, and it is preliminary to further studies on the application of artificial intelligence in SGUS. At the current stage of the initiative, this study aims to: (i) evaluate the reliability among sonographers with different levels of expertise in SGUS; (ii) compare the reliability performance of two different semi-quantitative SGUS scores that are widely used and easy to perform, and that represent the two "extremes" in terms of year of publication (i.e., the scores by De Vita et al., 1992, and by OMERACT, 2019); (iii) provide a data set that will be publicly available, to serve as a standardized benchmark for further studies.

MATERIALS AND METHODS
The Guidelines for Reporting Reliability and Agreement Studies (GRAAS) were followed for the preparation of the manuscript (18).

Salivary Gland Ultrasound Scores
A simple, semi-quantitative 0-3 SGUS scoring system was recently identified by a systematic review and meta-analysis as more appropriate for diagnostic purposes in pSS, in terms of specificity and heterogeneity, than the other scoring systems available (e.g., 0-16 and 0-48) (19). One of the aims of work-package 7 (WP7) of the HarmonicSS project is to develop and improve the role of SGUS in pSS management. Participants and coordinators of WP7 agreed to use two four-grade semi-quantitative scoring systems, namely the De Vita et al. score (7) and the OMERACT score (16), for the assessment of major SG morphology in pSS patients enrolled in the HarmonicSS project.
The score by De Vita et al. is the longest-standing score available in the literature for pSS, is easy to perform, and includes both hypo/anechoic areas and hyperechoic bands as the main sonographic features defining parenchymal SG inhomogeneity, whereas in the OMERACT score parenchymal inhomogeneity is supported only by the presence of hypo/anechoic areas. The De Vita et al. score comprises: grade 0, normal-appearing parenchyma; grade 1, mild inhomogeneity with isolated and small hypo/anechoic areas, without hyperechoic bands; grade 2, evident inhomogeneity with multiple scattered hypo/anechoic areas and/or few hyperechoic bands; grade 3, severe/gross inhomogeneity due to large and confluent hypo/anechoic areas and/or diffuse hyperechoic bands (Figure 1). In this exercise, as in the recent studies where the De Vita et al. score was applied (20)(21)(22), grade 1, initially described only by the term "mild inhomogeneity," was better specified as a diffuse or localized micro-areolar structure. The OMERACT score is the most recent one, proposed in 2019 according to guidelines for selecting outcome measure instruments (OMI) (23), and is defined as follows: grade 0, normal-appearing SG parenchyma; grade 1, minimal change: mild inhomogeneity without hypo/anechoic areas; grade 2, moderate change: moderate inhomogeneity with focal hypo/anechoic areas; grade 3, severe change: diffuse inhomogeneity with hypo/anechoic areas occupying the entire gland surface (Figure 1).

Participants
Twenty-seven sonographers involved in WP7 of the HarmonicSS project (https://www.harmonicss.eu/the-project/project-structure/) from nine countries in Europe (Austria, France, Germany, Italy, Norway, Serbia, Slovenia, The Netherlands, and United Kingdom) were invited to participate. The years of experience in SGUS, the number of pSS patients evaluated per year, and the scores usually used in clinical practice were collected for each participant. Sonographers with at least 6 years of experience in SGUS were identified as experts, while users were defined as those who had already applied the De Vita et al. and/or OMERACT scores in clinical practice or for research purposes. Four different levels of sonographers were then identified as follows: users and experts (U-E group), non-users and experts (NU-E group), users and non-experts (U-NE group), and non-users and non-experts (NU-NE group). A dedicated SGUS atlas was sent to all participants, covering general SGUS issues and providing definitions and examples for each grade of glandular inhomogeneity for both scores.
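The four-level stratification described above can be written down as a small rule. The following is a minimal sketch, assuming that each participant is summarized by years of SGUS experience and a yes/no flag for prior use of the tested scores; the function name, the constant name, and the example participants are hypothetical and are not taken from the study materials.

```python
EXPERT_MIN_YEARS = 6  # threshold stated in the text: at least 6 years of SGUS experience


def sonographer_group(years_sgus: float, uses_tested_scores: bool) -> str:
    """Return the group label (U-E, NU-E, U-NE, or NU-NE) for one sonographer."""
    expert = years_sgus >= EXPERT_MIN_YEARS
    if uses_tested_scores:
        return "U-E" if expert else "U-NE"
    return "NU-E" if expert else "NU-NE"


# Hypothetical participants, for illustration only (not study data):
# name -> (years of SGUS experience, already uses De Vita et al. and/or OMERACT score)
participants = {"rater_a": (10, True), "rater_b": (2, False)}
groups = {name: sonographer_group(y, u) for name, (y, u) in participants.items()}
```

Under these assumptions, a sonographer with exactly 6 years of experience counts as an expert, in line with the "at least 6 years" wording.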

Reliability Exercise and HarmonicSS Data Set
A pool of 225 static sonographic images (83 normal-appearing images, 42 images with mild inhomogeneity, 47 images with moderate inhomogeneity, and 53 images with severe inhomogeneity) of major SGs [114 parotid glands (PGs) and 111 submandibular glands (SMGs)], from 150 patients with suspected or definite pSS, was independently scored in two rounds. The sonographic images had previously been collected and de-identified from the databases of four rheumatologists involved in the exercise (AZ, AH, VM, ODL) and were different from those presented in the atlas. Four different ultrasound machines were used to store images, i.e., Samsung RS85, Philips Epiq, GE Logiq E9, and ESAOTE MyLab70. For both scoring rounds, each observer was provided with an anonymized and uniquely randomized data set to ensure that the scorings performed in this study could not be influenced by others' scorings. The reliability exercise was carried out remotely by using the HarmonicSS web-based platform (https://private.harmonicss.eu). For each round, the participants had to apply the De Vita et al. score and the OMERACT score to each of the 225 images. The described data set, together with accompanying script files and instructions for their usage, is publicly available in the GitHub repository (https://github.com/ArsoVukicevic/Assessment-of-pSS-fromSGUSimages/tree/master/3%20HarmonicSS%20benchmark%20dataset), which will be further managed by the HarmonicSS group and the authors of this study.
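The text does not describe how the per-observer randomization was implemented on the platform. As an illustration only, the sketch below shows one possible way to give each rater a reproducible, rater-specific image order in each round, and it also checks that the reported image counts add up to 225. The seeding scheme, the function name, and the rater identifier are assumptions.

```python
import random

# Image counts reported in the text.
GRADE_COUNTS = {"normal": 83, "mild": 42, "moderate": 47, "severe": 53}
GLAND_COUNTS = {"PG": 114, "SMG": 111}
N_IMAGES = 225

# Both breakdowns should cover the whole image pool.
assert sum(GRADE_COUNTS.values()) == N_IMAGES
assert sum(GLAND_COUNTS.values()) == N_IMAGES


def randomized_order(image_ids, rater_id, round_no):
    """Return a rater- and round-specific shuffled copy of image_ids.

    Assumption: the order is derived from a seed built from the rater and
    round, so it can be regenerated later; the real platform may differ.
    """
    rng = random.Random(f"{rater_id}:{round_no}")
    order = list(image_ids)
    rng.shuffle(order)
    return order


image_ids = list(range(1, N_IMAGES + 1))
order = randomized_order(image_ids, rater_id="rater_01", round_no=1)
```

Every image still appears exactly once per rater and round; only the presentation order changes.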

Statistical Analysis
Inter-rater reliability was assessed by using kappa statistics, computing the unweighted, linear weighted, and squared weighted (Fleiss-Cohen weights) kappa coefficients for each pair of raters and for both scores. Weighted kappa (i.e., linear and squared) was computed in addition to unweighted kappa because weighting schemes take into account the closeness of agreement between categories. The weights are presented in Supplementary Table 1. The mean, median, 1st and 3rd quartile, minimum (min), and maximum (max) kappa values were calculated, and Light's kappa was taken as the mean kappa value. To assess intra-rater reliability, we computed the unweighted, linear weighted, and squared weighted kappa coefficients between the two readings by each rater for both scores. We then computed Light's kappa (mean of intra-rater kappa values), median, 1st and 3rd quartile, min, and max kappa values. The bootstrap percentile method was used to compute the 95% confidence interval (CI) of every Light's, Fleiss, and Cohen kappa. Furthermore, we converted every 0-3 score database into a 0-1 score database: scores 0 and 1 were considered normal-appearing and converted to 0, whereas scores 2 and 3 were considered pathological and converted to 1. Every analysis was repeated on the 0-1 score database. We stratified squared weighted kappa coefficients by level of SGUS experience and knowledge of the tested scoring systems, obtaining kappa for the four levels combining experience and use information: U-E, U-NE, NU-E, and NU-NE.
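The kappa computations described above can be illustrated with a short, dependency-free sketch. It is not the authors' analysis code: the function names are hypothetical, the linear and squared weights use the standard normalized forms |i - j| / (k - 1) and its square (the weights actually used are those in Supplementary Table 1), and Light's kappa is taken as the mean of the pairwise kappa values, as stated in the text.

```python
from itertools import combinations
from statistics import mean


def weighted_kappa(a, b, n_cat=4, weights="quadratic"):
    """Cohen's kappa between two lists of grades coded 0..n_cat-1.

    weights: "unweighted", "linear", or "quadratic" (squared weights).
    Returns NaN when chance-expected disagreement is zero (kappa undefined).
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    if weights not in ("unweighted", "linear", "quadratic"):
        raise ValueError("unknown weighting scheme")
    n = len(a)
    # Joint proportions of (grade by rater A, grade by rater B).
    joint = [[0.0] * n_cat for _ in range(n_cat)]
    for x, y in zip(a, b):
        joint[x][y] += 1.0 / n
    row = [sum(joint[i]) for i in range(n_cat)]
    col = [sum(joint[i][j] for i in range(n_cat)) for j in range(n_cat)]

    def disagreement(i, j):
        if weights == "unweighted":
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (n_cat - 1)
        return d if weights == "linear" else d * d

    cells = [(i, j) for i in range(n_cat) for j in range(n_cat)]
    observed = sum(disagreement(i, j) * joint[i][j] for i, j in cells)
    expected = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    if expected == 0.0:
        return float("nan")
    return 1.0 - observed / expected


def lights_kappa(ratings, **kwargs):
    """Mean of the pairwise kappa values over all pairs of raters."""
    return mean(weighted_kappa(r1, r2, **kwargs) for r1, r2 in combinations(ratings, 2))


def dichotomize(grades):
    """Map grades 0-1 to 0 (normal-appearing) and grades 2-3 to 1 (pathological)."""
    return [0 if g <= 1 else 1 for g in grades]
```

For intra-rater reliability, the same `weighted_kappa` function would be applied to the two readings of each rater and the resulting values averaged. For the dichotomized analysis, `dichotomize` would be applied first and `n_cat=2` passed to the kappa function.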
The results were then compared by evaluating the overlap between their 95% confidence intervals. Kappa coefficients were interpreted according to Landis and Koch (24).
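As a rough illustration of these last steps, the sketch below shows a generic bootstrap percentile interval obtained by resampling images with replacement, a check for overlap between two confidence intervals, and the Landis and Koch interpretation bands. The function names, the number of resamples, and the fixed seed are assumptions; `statistic` stands for any function of the resampled image-level ratings, such as a kappa coefficient.

```python
import random


def bootstrap_percentile_ci(items, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample items with replacement n_boot times."""
    rng = random.Random(seed)
    n = len(items)
    stats = []
    for _ in range(n_boot):
        sample = [items[rng.randrange(n)] for _ in range(n)]
        stats.append(statistic(sample))
    stats.sort()
    lo_idx = int(n_boot * alpha / 2)
    hi_idx = n_boot - 1 - lo_idx
    return stats[lo_idx], stats[hi_idx]


def intervals_overlap(ci_a, ci_b):
    """True if two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def landis_koch(kappa):
    """Verbal label for a kappa value according to Landis and Koch."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Illustration with made-up rating pairs and simple percent agreement
# as the statistic (not study data, and not a kappa coefficient).
pairs = [(0, 0), (1, 1), (2, 3), (3, 3), (2, 2), (0, 1)]
agreement = lambda sample: sum(x == y for x, y in sample) / len(sample)
ci = bootstrap_percentile_ci(pairs, agreement)
```

With these bands, the values reported in the abstract map to "almost perfect" for intra-rater reliability (0.86 and 0.87) and "substantial" for inter-rater reliability (0.75 and 0.77).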

RESULTS

Participants Scores
The mean De Vita et al. score and OMERACT score of the first round are reported in Supplementary Figure 1 [...] (Table 3). U-E group, users and experts; U-NE group, users and non-experts; NU-E group, non-users and experts; NU-NE group, non-users and non-experts.

Intra-Rater Reliability Among Sonographers With Different Levels of Experience and Use of SGUS Scores
The [...] (Table 4).

Inter-Rater Reliability Among Sonographers With Different Levels of Experience and Use of SGUS Scores
The Light's kappa of the four groups showed a substantial level of inter-rater reliability for the De Vita et al. and the OMERACT scores in both rounds.

DISCUSSION
In this web-based reliability exercise, two different semiquantitative SGUS scores for pSS proved to be reliable for the sonographic evaluation of the major SGs. In addition, regardless of the level of the sonographer's experience, almost perfect intra-rater reliability and substantial inter-rater reliability were reached. Finally, images and data of the present study will be publicly available to facilitate further investigations. Few previous studies evaluated the reproducibility of SGUS in pSS, usually with few experts and with variable results (8,9,11). Recently, the authors of the OMERACT score of SGUS in pSS reported good reliability for their score among 25 experts (16). This study involved a comparable number of raters, but with different levels of expertise. In order to use SGUS as an OMI and as an item for pSS classification, its reliability must be tested, and it is recommended that the weighted kappa should be >0.7 (25). Over the past years, SGUS has received growing interest as it is a non-invasive and easily performed technique for the management of pSS (26). Furthermore, in clinical practice, SGUS semi-quantitative scores are easy to apply and have a good discriminatory power between pathological and normal-appearing major SGs (7,8,10). This exercise tested the application of two different, easy-to-apply, 0-3 semi-quantitative scores for SGUS, namely the ones developed by De Vita et al. in 1992 and by OMERACT in 2019. The former includes features of both inflammation (i.e., hypo/anechoic areas) and damage (i.e., hyperechoic bands), whereas the latter, including mainly features of glandular inflammation, was built following the recent OMI recommendations. In this study, both scores showed almost perfect intra-rater reliability and substantial inter-rater reliability among 20 sonographers with different levels of experience and knowledge of SGUS scores.
The number of sonographers involved and their stratification based on experience are the main strengths of this study. In this reliability exercise, being a non-expert and/or a non-user did not significantly impact the level of agreement among raters. Importantly, however, raters were given support for SGUS evaluation and scoring by means of a dedicated atlas of images. We did not investigate whether the less expert sonographers were those mainly using the atlas, since this was hardly feasible. Further studies, in any case, should better define the optimal way to support SGUS rating in pSS. The automatic scoring of SGUS by image segmentation and artificial intelligence is also being evaluated in HarmonicSS.
As already highlighted by the OMERACT study, reliability differed between PGs and SMGs in this study as well, and was worse in the SMGs than in the PGs for both tested scores (16). This could be partly expected, since mild inhomogeneity (e.g., grade 1 for both scores) of the SMGs can be difficult to differentiate from the normal gland. The main limitation of the study was the absence of a patient-based exercise; ideally, reliability testing should also be performed in the clinical setting with the patient, and not on images alone. In this scenario, practical sonographic skills could make the difference among groups with different levels of experience, but a high number of sonographers is needed, making this type of exercise challenging to plan. Furthermore, the presence of only two raters in the NU-NE group could be another study limitation. In this multicenter reliability exercise, groups with an equal number of participants could not be defined a priori, since the choice of sonographers was made by each center involved in the European project.
In conclusion, this study focused on SGUS reliability, i.e., the main limit to a wider use of SGUS in pSS. Both tested SGUS scores proved to be reliable for the evaluation of pSS patients, which reinforces the reliability of SGUS highlighted by the OMERACT study (16). Furthermore, in this study the agreement was independent of the sonographers' years of experience and of their previous use of the tested scores. Overall, in our opinion, this study provides major evidence to further encourage the use of SGUS in pSS in the near future.

PATIENT AND PUBLIC INVOLVEMENT IN RESEARCH
EULAR PARE and SSF-Sjögren's Syndrome Foundation have an advisory role in HarmonicSS and will continuously monitor and evaluate the project (outcomes) in terms of impact to patients. Also, EULAR PARE and SSF have a major role in the dissemination of the results to patient associations and the public. https://www.harmonicss.eu/patients-advisory-group/.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by CEUR-2017-Os-027-ASUIUD. The patients/participants provided their written informed consent to participate in this study.