Quality appraisal of clinical practice guidelines for motor neuron diseases or related disorders using the AGREE II instrument

Objectives: This study aimed to systematically assess the quality of clinical practice guidelines (CPGs) for motor neuron diseases (MNDs) or related disorders and identify the gaps that limit evidence-based practice.
Methods: Four scientific databases and six guideline repositories were searched for eligible CPGs. Three researchers assessed the eligible CPGs using the Appraisal of Guidelines Research and Evaluation II instrument. The distribution of the level of evidence and strength of recommendation of these CPGs was determined. Univariate regression analysis was used to explore the characteristics affecting the quality of CPGs.
Results: Fifteen CPGs met the eligibility criteria: 10 were for MND and 5 were for spinal muscular atrophy. The mean overall rating score was 44.5%, and only 3 of 15 CPGs were of high quality. The domains that achieved low mean scores were applicability (24.4%), rigor of development (39.9%), and stakeholder involvement (40.3%). Most recommendations were based on low-quality evidence and had a weak strength. The CPGs that were updated, meant for adults, and evidence based, and used a CPG quality tool and a grading system were associated with higher scores in certain specific domains and overall rating.
Conclusion: The overall quality of CPGs for MNDs or related disorders was poor, and recommendations were largely based on low-quality evidence. Many areas still need improvement to develop high-quality CPGs, and the use of CPG quality tools should be emphasized. A great deal of research on MNDs or related disorders is still needed to fill the large evidence gap.


Introduction
Motor neuron diseases (MNDs) or related disorders, such as MND, spinal muscular atrophy (SMA), and post-polio progressive muscular atrophy (PPMA), are a group of disorders characterized by progressive weakness secondary to degeneration of the motor neurons (1). MND is caused by the loss of motor neurons, leading to muscle atrophy, paralysis, and ultimately death within 3-5 years after the onset of the disease (2). Although the treatment options for MND are limited and patient care primarily focuses on controlling symptoms and optimizing functioning and quality of life (3), better multidisciplinary care and a better understanding of interventions may allow patients with MND to live longer (4). The data from the Global Burden of Disease study show that the burden of MND is increasing (5).
SMA is an autosomal recessive disease caused by biallelic mutations in the survival motor neuron 1 gene, causing progressive muscle weakness and atrophy (6). It imposes a significant burden on patients, caregivers, and the healthcare system (7), but the financial burden could be significantly reduced by expanding newborn screening for SMA in combination with early treatment interventions for newborns with SMA (8). For example, newborn screening for patients with SMA is associated with earlier diagnosis and intervention, and motor milestones are often achieved at an earlier age, compared with clinically diagnosed patients (8). PPMA, also known as post-polio syndrome (PPS), is a chronic progressive disorder that may appear decades after the initial acute polio infection (9), affecting 20-40% of polio survivors and manifesting as neuromuscular complications (10). Although the incidence of paralytic poliomyelitis has declined significantly (11), PPS will remain a major health problem for many years (12). At present, the diagnosis and management of MNDs or related disorders remain a challenge for clinicians and vary widely in practice.
Clinical practice guidelines (CPGs) are one of the foundations of efforts to improve healthcare (13). CPGs are defined as "statements that include recommendations intended to optimize patient care that are informed by a systematic review of evidence and an assessment of benefits and harms of alternative care options" (14). CPGs help improve the quality of health care, such as informing clinicians about treatment decisions for patients and determining the appropriate standard of care, thus identifying gaps between evidence and practice (15).
The usefulness of CPGs depends primarily on quality, rigorous methodology, and transparent development processes (16). Better CPG quality appears to be associated with more positive treatment outcomes (17). Some common problems with CPGs include the sheer volume available, the large amount of documentation that is difficult to assimilate or use, lack of clear supporting evidence, exclusion of key stakeholders, insufficient editorial independence, and poor applicability (18-20). In addition, determining the level of evidence on which the recommendations are based is important (21,22). However, a systematic appraisal of CPGs for MNDs or related disorders remains unreported, and the distribution of the level of evidence on which these recommendations are based has not been described.
Various tools have been developed to assess the quality of CPGs (23). The Appraisal of Guidelines Research and Evaluation (AGREE) Instrument was first published in 2003 and updated in 2009 (AGREE II). It is designed to assess the methodological quality of CPGs and also provide methodological strategies for the development of new CPGs (24). The AGREE II instrument is currently the preferred tool worldwide for the quality assessment of CPGs and can be used to assess CPGs of multiple diseases (15, 24-27).
Therefore, the present study aimed to assess the quality of CPGs for MNDs or related disorders using the AGREE II instrument, identify the distribution of the level of evidence and strength of recommendations of these CPGs, and identify the potential factors influencing the quality of CPGs. The findings of this study would help identify the gaps that hinder evidence-based practice and highlight potential opportunities for improvement.

Study selection and data extraction
Two researchers (Jia-Yin Ou and Jun-Jun Liu) independently performed the study selection and data extraction. Any disagreements between the two were resolved through discussion or consultation with a third researcher (Jing Xu). For study selection, the search results were first exported into the EndNote X7 literature management software (Thomson Reuters Corporation, CA, United States), and duplicates were excluded. Two researchers screened the titles and abstracts to exclude clearly irrelevant studies, and then read the full texts of the remaining studies to determine whether they should be included. For each CPG finally included, the accompanying technical and supporting documents were thoroughly searched to better inform our evaluation. The data were extracted into a specially designed spreadsheet. The extracted variables included the year of publication, disease, development organization, first author (if applicable), country/region of origin, version, age range of target population (adult/children/all ages), development method, search dates covered, CPG quality tool used, CPG methodologist included, title, funding sources, and accompanying documents. The country/region of origin was described as "Europe" if the CPG was jointly developed by multiple European countries, and "international" if it was jointly developed by multiple countries from different continents. The development method was described as "evidence based" if the CPG performed a systematic search and evaluation of evidence and made recommendations based on the evaluation results during the CPG development process; otherwise, it was described as "consensus based."
Frontiers in Neurology 03 frontiersin.org

Quality assessment
Three researchers (Jia-Yin Ou, Jun-Jun Liu, and Jing Xu) assessed the quality of each CPG independently using the AGREE II tool under the guidance of a methodological expert (Liming Lu). The AGREE II tool included 23 key items organized in 6 domains followed by 2 global rating items ("Overall Assessment"). The six domains included scope and purpose, stakeholder involvement, rigor of development, clarity of presentation, applicability, and editorial independence. Each item was rated on a 7-point scale, with 1 indicating strongly disagree and 7 indicating strongly agree. Strongly disagree meant no information was relevant to an item, and strongly agree meant the quality of reporting was exceptional and the full criteria and considerations were met for an item. A score between 2 and 6 was assigned when the reporting did not meet the full criteria or considerations for an item. The score of each domain was calculated using the following formula: domain score = (obtained score - minimum possible score) / (maximum possible score - minimum possible score) × 100%. Consistent with a previous study (28), a domain score or overall rating ≤40% was considered a low rating, >40% and ≤70% a moderate rating, and >70% a high rating.
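The scaling formula and rating bands described above can be illustrated with a minimal Python sketch. It assumes the usual AGREE II convention that the minimum and maximum possible scores equal 1 and 7, respectively, multiplied by the number of items and the number of appraisers. The function names and the example ratings are hypothetical and are not taken from the study data.

```python
from typing import Sequence


def scaled_domain_score(ratings: Sequence[Sequence[int]]) -> float:
    """Return the scaled domain score as a percentage (0-100).

    ratings[a][i] is the 1-7 rating given by appraiser a to item i
    of a single AGREE II domain.
    """
    n_appraisers = len(ratings)
    n_items = len(ratings[0])
    obtained = sum(sum(row) for row in ratings)
    minimum = 1 * n_items * n_appraisers  # every item rated 1 (assumed convention)
    maximum = 7 * n_items * n_appraisers  # every item rated 7 (assumed convention)
    return (obtained - minimum) / (maximum - minimum) * 100.0


def rating_category(score_pct: float) -> str:
    """Map a percentage score to the low/moderate/high bands used in the text."""
    if score_pct <= 40:
        return "low"
    if score_pct <= 70:
        return "moderate"
    return "high"


# Hypothetical ratings: 3 appraisers x 3 items (illustrative values only).
example = [[5, 6, 6], [6, 6, 7], [4, 5, 5]]
score = scaled_domain_score(example)
print(f"{score:.1f}% -> {rating_category(score)}")
```

Because the score is scaled against the range of possible totals rather than the raw maximum, a domain in which every item is rated 1 maps to 0% rather than to 1/7 of the maximum.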
In the overall assessment, the first global rating item ("overall rating") was scored on a 7-point scale and then calculated as a percentage, which was the same method used to calculate domain scores, as described in previous studies (15,29). For the second global rating item, a CPG was rated as high quality when the score of three domains considered as the most important was ≥50% of the maximum possible score, consistent with previous studies (15,30,31). The three domains were as follows: stakeholder involvement (domain 2), rigor of development (domain 3), and editorial independence (domain 6).
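A minimal Python sketch of the two global rating items follows. It assumes that the "≥50%" criterion is applied to each of the three key domains individually, which is one reading of the rule above, and that the overall rating is scaled from one 7-point rating per appraiser. The dictionary keys, function names, and example values are hypothetical.

```python
from typing import Mapping, Sequence

# Domains 2, 3, and 6, as named in the text (key names are hypothetical).
KEY_DOMAINS = (
    "stakeholder_involvement",
    "rigor_of_development",
    "editorial_independence",
)


def overall_rating_pct(overall_item_ratings: Sequence[int]) -> float:
    """Scale the 7-point 'overall rating' item (one rating per appraiser) to 0-100%."""
    n = len(overall_item_ratings)
    obtained = sum(overall_item_ratings)
    return (obtained - 1 * n) / (7 * n - 1 * n) * 100.0


def is_high_quality(domain_scores: Mapping[str, float], threshold: float = 50.0) -> bool:
    """Assumed reading: every key domain must score at least `threshold` percent."""
    return all(domain_scores[d] >= threshold for d in KEY_DOMAINS)


# Hypothetical guideline (illustrative values only).
scores = {
    "scope_and_purpose": 80.0,
    "stakeholder_involvement": 55.0,
    "rigor_of_development": 62.0,
    "clarity_of_presentation": 75.0,
    "applicability": 20.0,
    "editorial_independence": 50.0,
}
print(is_high_quality(scores))
print(round(overall_rating_pct([5, 6, 5]), 1))
```

Under this reading, a low applicability score does not prevent a high-quality rating in the main analysis, because applicability is not one of the three key domains.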
Before the assessment, researchers received online training using the AGREE II Online Training Tool. Then, a meeting was held to discuss the specific assessment criteria of AGREE II, and researchers assessed some CPGs of different levels and discussed the results with each other. The formal assessment was performed only when the intraclass correlation coefficient (ICC) was >0.8. During the assessment process, researchers carefully read the document of each CPG and its accompanying documents or information on the Internet to make an accurate judgment. If the researchers' score on an item differed significantly (more than 2 points), a consensus was reached through discussion.
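The discrepancy rule in the last sentence can be sketched in Python as follows. The sketch assumes that "differed by more than 2 points" refers to the gap between the highest and lowest rating any appraiser gave to an item; the function name, item labels, and ratings are hypothetical.

```python
from typing import Dict, List, Sequence


def items_needing_discussion(
    ratings_by_item: Dict[str, Sequence[int]], max_gap: int = 2
) -> List[str]:
    """Return the items whose ratings differ by more than `max_gap` points.

    ratings_by_item maps an item label to the 1-7 ratings given by each appraiser.
    """
    return [
        item
        for item, ratings in ratings_by_item.items()
        if max(ratings) - min(ratings) > max_gap
    ]


# Hypothetical ratings from three appraisers (illustrative values only).
ratings = {
    "item_1": [5, 6, 6],   # gap of 1: no discussion needed
    "item_7": [2, 5, 4],   # gap of 3: flagged for consensus discussion
    "item_23": [7, 5, 6],  # gap of exactly 2: not flagged
}
print(items_needing_discussion(ratings))
```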

Statistical analysis
The researchers' AGREE score was entered into Microsoft Excel (Microsoft, WA, United States), and the standardized score of each domain and the overall score of each CPG were calculated. Continuous variables were expressed as mean ± standard deviation (SD) (normal distribution) or median (Q1-Q3) (skewed distribution), and categorical variables were expressed as frequencies and percentages. The independent-sample t test/nonparametric tests (Kruskal-Wallis test)/chi-square tests/Fisher exact tests were used to compare the differences between the two groups. As the overall scores of AGREE II domain, overall rating, and item of included CPGs had both normal and nonnormal distributions, consistent with the previous studies (27), the mean (SD) and median (Q1-Q3) of overall scores were both presented for the convenience of observation and comparison. The number of each level of evidence and the strength of recommendation of each CPG were evaluated. The univariate linear regression model and the logistic regression model were used to assess the associations between the characteristics and each AGREE II domain score and overall assessment of included CPGs. An ICC with a 95% confidence interval (CI) with a two-way random-effects model was used to detect the inter-rater agreement to ensure that researchers' understandings of each item were basically the same, and ICCs for each domain and overall rating scores were calculated. Consistent with previous studies (25,26) and according to Landis and Koch (32), the degree of agreement between 0.01 and 0.20 was considered minor, between 0.21 and 0.40 as fair, between 0.41 and 0.60 as moderate, between 0.61 and 0.80 as substantial, and between 0.81 and 1.00 as very good.
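The inter-rater agreement step can be illustrated with a minimal Python sketch. The text specifies a two-way random-effects model but not the exact ICC form, so the sketch assumes the single-rater, absolute-agreement form, often written ICC(2,1); it omits the 95% CI. The agreement labels follow the bands quoted above, treated as contiguous. Function names and example scores are hypothetical.

```python
from typing import Sequence


def icc_2_1(scores: Sequence[Sequence[float]]) -> float:
    """Two-way random-effects, absolute-agreement, single-rater ICC (assumed form).

    scores[i][j] is the score given to guideline i by rater j.
    """
    n = len(scores)       # number of rated guidelines
    k = len(scores[0])    # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def agreement_label(icc: float) -> str:
    """Label an ICC using the bands quoted in the text."""
    if icc > 0.80:
        return "very good"
    if icc > 0.60:
        return "substantial"
    if icc > 0.40:
        return "moderate"
    if icc > 0.20:
        return "fair"
    if icc > 0.0:
        return "minor"
    return "not classified"  # values at or below 0 are not covered by the quoted bands


# Hypothetical domain scores (%) for five guidelines from three raters.
example = [[20, 25, 22], [45, 50, 48], [70, 72, 75], [33, 30, 35], [90, 88, 91]]
value = icc_2_1(example)
print(round(value, 2), agreement_label(value))
```

In practice, a statistical package would normally be used so that the confidence interval is reported alongside the point estimate.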
In the sensitivity analysis, a series of analyses were performed to test the robustness of our findings. First, other criteria were used for the overall assessment to identify whether the overall rating score and the number of high-quality CPGs differed from the initial results. For the first global rating item ("overall rating"), the score of each CPG was based on the average score of the six domains, consistent with previous studies (27,33). A stricter standard was used for the second global rating item, consistent with previous studies (27,34). A CPG was classified as of high quality if the score of domain 3 (rigor of development) was >70% and the scores of all other domains and overall rating were > 50%. Second, the univariate regression analysis, restricted to CPGs published after 2015 or evidence based or MND as the target disease, was performed separately. This helped assess the association between characteristics and each AGREE II domain score and overall assessment of included CPGs, determine whether these associations were consistent across different types of CPGs, and reduce confounders.
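The alternative criteria used in the sensitivity analysis can be sketched in Python as follows. The text does not specify which overall rating enters the stricter rule, so the sketch takes it as a parameter. Dictionary keys, function names, and example values are hypothetical.

```python
from statistics import mean
from typing import Mapping

DOMAINS = (
    "scope_and_purpose",
    "stakeholder_involvement",
    "rigor_of_development",
    "clarity_of_presentation",
    "applicability",
    "editorial_independence",
)


def overall_rating_from_domains(domain_scores: Mapping[str, float]) -> float:
    """Alternative overall rating: the average of the six domain scores (%)."""
    return mean(domain_scores[d] for d in DOMAINS)


def is_high_quality_strict(domain_scores: Mapping[str, float], overall_rating: float) -> bool:
    """Stricter rule: domain 3 > 70%, all other domains and overall rating > 50%."""
    if domain_scores["rigor_of_development"] <= 70:
        return False
    others_ok = all(
        domain_scores[d] > 50 for d in DOMAINS if d != "rigor_of_development"
    )
    return others_ok and overall_rating > 50


# Hypothetical guideline with strong rigor but weak applicability.
scores = {
    "scope_and_purpose": 90.0,
    "stakeholder_involvement": 80.0,
    "rigor_of_development": 85.0,
    "clarity_of_presentation": 75.0,
    "applicability": 45.0,
    "editorial_independence": 60.0,
}
overall = overall_rating_from_domains(scores)
print(overall, is_high_quality_strict(scores, overall))
```

The example shows how a guideline with high rigor of development can still fail the stricter rule when a single other domain, here applicability, scores 50% or less.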

Study selection
A total of 5,899 records were retrieved initially, and 15 CPGs were finally included for assessment after screening by title, abstract, and full text (Figure 1). Two CPGs were each published in two parts but were each considered as one CPG (6, 35-37).

Characteristics of CPGs
The characteristics of included CPGs are displayed in Table 1 and Supplementary Table S1. Supplementary Table S2 shows the results of descriptive statistics of these characteristics. Nine CPGs were published after 2015. Eight CPGs were developed by a medical society and five by an expert panel. Ten CPGs were developed for MND and five for SMA; however, no CPG existed for PPMA. Five CPGs were originally from individual European countries, five were from individual North American countries, and the remaining were international, Europe, and Brazil. Six CPGs were updated. Three CPGs were for adults, one for children, and five for all ages; the rest did not specify the age group. Eight CPGs were considered evidence based and used a grading system. Seven CPGs reported the search dates. Only three CPGs used the CPG quality tool, such as the AGREE II tool. One CPG reported the inclusion of methodologists in the development team. Five CPGs were funded, four were not, and the remainder did not report the funding status. Compared with SMA CPGs, MND CPGs were significantly more likely to state the search dates (70.0% vs. 0.0%, p = 0.026) (Supplementary Table S3). Table 2 displays the overall mean (SD) and median (Q1-Q3) scores for each AGREE II domain, item, and overall rating of included CPGs. Table S4). Table 3 displays the AGREE II domain scores and overall assessment of each CPG. Figure 2 displays the distribution of the degree of score across each domain and the overall rating. In the domain scope and purpose, seven CPGs received high ratings, which were scored >70%. Six CPGs received moderate ratings, which were scored >40% and ≤70%. The National Institute for Health and Clinical Excellence (NICE) CPG received the highest scores (100%), and the Canadian Thoracic Society (CTS), Brazilian Medical Association, and the 2020 Canada CPG also got high scores, which were all >90%.
The distribution of the rating degree was similar in the stakeholder involvement and rigor of development domains, with eight CPGs receiving low ratings, which were ≤40%. The 2020 Canada CPG received the highest scores in these two domains (domain 2: 89%, domain 3: 91%). In addition, the NICE CPG also received high ratings in both domains (domain 2: 70%, domain 3: 76%). The CTS received the second highest score in domain 3 (90%). Regarding the domain clarity of presentation, the results were satisfactory. More than half of the CPGs received high ratings, and none received low ratings. In the domain applicability, thirteen CPGs received low ratings and only two CPGs received high ratings, and these two CPGs were the CTS (85%) and NICE CPG (74%). In the editorial independence domain, five CPGs received high ratings (33.3%) and six received moderate ratings (40.0%). Many CPGs lacked funding information and statements of competing interests. In the overall rating, eight CPGs received low ratings, and only the CTS (89%), NICE (78%), and 2020 Canada CPG (83%) received high ratings. Among the 15 CPGs, 3 were of high quality, which was consistent with the result of the overall rating. These three CPGs were all for MND, while the NICE and CTS CPGs were for adults, and the 2020 Canada CPG was for all ages. The CTS and 2020 Canada CPGs were from Canada, and the NICE CPG was from the UK. All three CPGs were evidence based and used the CPG quality tool.

Quality assessment of CPGs
The inter-rater reliability was very good for all domains and the overall rating (Table 4). Supplementary File 2 displays the individual scoring of the AGREE II tool for each CPG.

Level of evidence and strength of recommendation of CPGs
Among the 15 CPGs, 8 were evidence based. Table 5 displays the grading systems used and the distribution of the level of evidence and strength of recommendation among these CPGs. Two CPGs used the adapted Grading of Recommendations Assessment, Development and Evaluation (GRADE) system, one used the combined Oxford Centre for Evidence-based Medicine and GRADE system, one used the American Academy of Pediatrics criteria, two used the adapted American Academy of Neurology (AAN) criteria, one used the AAN criteria, and one did not report the name of the grading system. Four CPGs graded evidence by a body of evidence, whereas four CPGs graded only individual studies. One CPG did not grade the strength of recommendations, and two CPGs did not report the strength of specific recommendations. Although the CPGs used different grading systems, their grading criteria for the level of evidence and strength of recommendations were similar. Among the CPGs that graded specific evidence, most of the evidence (85.7%) was graded as C, low, D, very low, EC, III, or IV level, indicating low-quality evidence. Similarly, among the CPGs that graded specific recommendations, most of the recommendations (88.2%) were graded as C, GCPP, U, 2, or weak strength, indicating weak recommendations. A comparison between CPGs using and not using the grading system showed the same results as that between consensus-based and evidence-based CPGs.
Figure 2. Distribution of the degree of score across each domain and overall rating. CPG, clinical practice guideline.

Relationship between characteristics and AGREE II domain score and overall assessment of CPGs
We also found that the year of publication, type of development organization, country/region of origin, and funding sources were not associated with the AGREE II domain score and overall assessment of included CPGs.

Sensitivity analysis
Supplementary Table S5 displays the overall assessment of included CPGs using other criteria. The mean overall rating score for all CPGs was 51.5% [SD = 19.9%, median = 51.0% (Q1-Q3, 36.5-61.0%)]. For the second global rating item, only the CTS CPG and the NICE CPG were of high quality under the stricter evaluation criteria. The 2020 Canada CPG was rated as low quality because its domain 5 score was less than 50%. Supplementary Tables S6-S8, respectively, display the relationship between characteristics and AGREE II domain score and overall assessment of included CPGs published after 2015, evidence-based CPGs, and CPGs for MND. The relationships between characteristics and each AGREE II domain score and overall assessment in these different types of CPGs were generally consistent with the main analyses, that is, the results in all included CPGs.

Discussion
This study was novel in assessing the quality of CPGs for MNDs or related disorders using the AGREE II instrument and identifying the distribution of the level of evidence among these CPGs. The quality of CPGs for MNDs or related disorders varied widely. The overall quality of these CPGs was generally poor, and only three CPGs were rated as high quality. Moreover, under a more stringent assessment standard, only two CPGs were of high quality. Among the six domains, the highest score was for clarity of presentation (domain 4) and the lowest was for applicability (domain 5). Eight CPGs were considered evidence based, and most of the evidence (85.7%) on which these CPGs' recommendations were based was low-quality evidence. Despite the improvement in the quality of CPGs in recent years, contemporary CPGs for MNDs or related disorders still lacked a consolidated evidence basis to provide recommendations for clinical practice. We also identified some factors that could affect the quality of CPGs, which could also serve as aspects for improvement.
Consistent with other studies (15,30,31), we used a less-stringent criterion in the overall assessment to judge whether a CPG was of high or low quality, but most CPGs were of poor quality. The development of CPGs requires significant resource consumption. Spending resources on low-quality CPGs with ineffective care recommendations is a waste and confusing for users (15). Coexisting problems are the uneven geographical distribution of CPGs and the duplication of CPGs. Most CPGs were from Europe and North America, only two CPGs were international, and one was from Brazil. Only Canada and the United Kingdom had high-quality CPGs, while other countries lacked them. All CPGs were for MND and SMA, of which SMA CPGs were all of low quality. No CPGs existed for PPS, although the prospect for the future is a continuous and ever-increasing demand for rehabilitation programs and management of patients with PPS (48,49); currently around 18 million people are still affected by paralytic poliomyelitis (11). Dedicating resources to develop fewer, higher quality, and less-"redundant" CPGs can help reduce inefficient resource usage and user confusion (15,51). Cooperation between countries and associations should be strengthened to reduce overlapping efforts and focus efforts and resources on developing high-quality CPGs and areas that need to be addressed.
Another problem is the inconsistent terminology used by CPG developers to describe the MND condition for which the CPGs are used. CPGs for MND include some CPGs for "amyotrophic lateral sclerosis (ALS)" and some for "MND." This terminology can be confusing. MND is mainly divided into four categories: ALS, progressive bulbar palsy, progressive amyotrophy, and primary lateral sclerosis. When patients present with both upper and lower motor neuron signs, the disease is referred to as ALS, which is the most common form of MND (52). However, in the United Kingdom and Australia, MND is used as a general term for these disorders and also refers to the ALS subtype, whereas in the United States, ALS is more commonly used as a general term and also denotes the ALS subtype (52). Consistent terminology is needed to define MND conditions, regardless of developer/professional group, to reduce CPG duplication and user confusion.
Many aspects of CPG development need to be improved. The domains of AGREE II that these aspects relate to are stakeholder involvement (domain 2), rigor of development (domain 3), applicability (domain 5), and editorial independence (domain 6). The finding that CPGs had problems in these domains was consistent with other studies (15,19,51). Among these, the problems of poor applicability and editorial independence of CPGs persisted, although the overall quality of CPGs in different health fields improved (51).
Stakeholder involvement reflects how well the CPG represents the views of its intended users, including patients. During the CPG development process, patients' views, preferences, and expectations regarding care have become increasingly important (53). However, the overall score was low, with 53.3% of CPG scores ≤40% and only one CPG scoring greater than 70%. Most CPGs do not provide details about the involvement of patients or their representatives. Patients are important stakeholders and should be involved in the development process of CPGs, although this may introduce patient biases about costs, cultural background, and expectations (54).
Rigor of development is the most critical domain, as it significantly affects confidence in implementing CPGs (55), while nonsystematic development tends to lead to poor-quality CPGs (56). The strong heterogeneity found among the scores of included CPGs (Table 2) attested to the existence of significant gaps in the methodological development of the CPGs. Seven CPGs were not considered evidence based and did not use the grading system, or describe literature search and selection methods. The low score might be due to insufficient methodological consultation (55) or unfamiliarity with CPG development standards, poor reporting (19), or poor performance of the external peer review and update process (53).
The domain applicability considers factors that may affect guideline implementation, including identifying facilitators and barriers, providing tools for applying recommendations, identifying potentially relevant resources, and auditing standards (24). Thirteen CPGs received low ratings, scoring ≤40%. The low applicability significantly hindered the implementation of CPG recommendations. CPG developers should conduct pilot tests to ensure feasibility before publication (57). In addition, all CPGs were published only as articles in journals. New ways to increase user adoption and usability should be considered by CPG developers (15), such as the use of smartphone applications (58,59) or digital CPG platforms (60).
Editorial independence is also an important domain of CPG quality (61), and with only two statements required, it should be easy to score high (15). However, the overall score in this domain was not high [mean = 55.9% (SD, 34.4%)], with only five CPGs receiving high ratings, scoring >70%. Six CPGs did not state funding sources. Conflicts of interest are a common source of bias (62) and are often underestimated (63). CPG developers should clearly report conflicts of interest, including rigorous review processes and transparent review rules (64).
Overall, the NICE, CTS, and 2020 Canada CPGs (3, 41, 65) had the highest overall rating scores and were also rated as high quality. Based on this, these three CPGs should be favored by clinicians and policy makers, and are worthy of application in clinical practice.
Besides focusing on improving the transparency and methodological rigor of the CPG development process, CPGs should rely more on the growing body of evidence (26). However, nearly half of the CPGs for MNDs or related disorders were not considered evidence based, and most of the recommendations (85.7%) in the evidence-based CPGs were based on low-quality evidence, which was largely from observational studies or expert consensus. This finding constituted an obstacle to establishing CPGs for MND or related disorders, as recommendations were based on low-quality evidence, and it also showed a gap between clinical practice evidence and current medical research. Further research is needed on managing MND or related disorders, and more evidence-based recommendations would be extremely important to improve the standard of care for patients with MND or related disorders. Moreover, given that certain clinical questions may not be addressed by high-quality research, expert consensus can fill these knowledge gaps. Although expert consensus may lack support from high-quality evidence, it can still provide valuable information. Consequently, CPG developers should pay more attention to the rigor and standardization of consensus methods and explore how to guide users to adopt expert consensus accurately.
Some characteristics and factors were found to be associated with the quality of CPGs. Specifically, among the CPGs for MND or related disorders, CPGs that were for MND, stated search dates, and included a CPG methodologist were associated with higher scores in some specific domains, whereas CPGs that were updated, for adults, and evidence based, and used a CPG quality tool and a grading system were associated with higher scores in both some specific domains and overall rating. In addition, the three CPGs that used the CPG quality tools were all rated as high quality, two of which used the AGREE II instrument.
Therefore, the use of CPG quality tools, especially the AGREE II instrument, needs to be emphasized and improved during the development of CPGs. In addition, attention should also be paid to the use of the grading system, which has the greatest impact on the domain 3 score. The GRADE system is used to assess the level of evidence, while the AGREE II instrument is used to guide the CPG development process and set reporting standards (27), and they complement each other. Moreover, methodologists should be brought into the development team and should pay attention to CPG development details, such as providing the search date range of literature evidence.
This study had several limitations. First, the search might have missed some CPGs, although the authors systematically searched major scientific databases and online guideline repositories. Second, only CPGs published in English were included, which might have excluded high-quality non-English CPGs, resulting in a lack of representation of CPGs from less-developed countries. Third, the assessment of CPGs might reflect the researchers' perspectives; although our research team was multidisciplinary (including neurologists and other specialists, methodologists, and other researchers), this limitation was unavoidable. Fourth, the AGREE II scoring system relied on the understandability and comprehensiveness of CPGs' reporting and did not reflect the quality and strength of evidence (29). However, this study additionally analyzed the distribution of the level of evidence among included CPGs. Fifth, the AGREE II instrument did not provide a clear cut-off point to distinguish between high-quality and low-quality CPGs. To this end, based on some previous studies, we used a more widely used and less-stringent method to distinguish between high-quality and low-quality CPGs as the main analysis. We also used a more stringent method to distinguish between high-quality and low-quality CPGs as a sensitivity analysis. Sixth, due to the small sample size, only univariate analysis was used to explore the relationship between CPG characteristics and AGREE II scores. However, we tried to reduce confounders by limiting the analysis to CPGs with certain characteristics, as a sensitivity analysis.

Conclusion
The quality of CPGs for MNDs or related disorders varied widely, and the overall quality was poor. No CPG existed for PPMA. Most recommendations were based on low-quality evidence. Many areas still need improvement, especially in the domains of stakeholder involvement, rigor of development, and applicability. CPGs for MNDs or related disorders should formulate recommendations with high-quality evidence and should be developed through rigorous methodology and transparency to minimize bias from external sources, and CPG quality tools should be used. In addition, a substantial number of studies on MNDs are still needed to fill the large evidence gap in the CPGs for these diseases.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding authors.