Interrater reliability of the Fugl-Meyer Motor assessment in stroke patients: a quality management project within the ESTREL study

Introduction: The Fugl-Meyer Motor Assessment (FMMA) is recommended for evaluating stroke motor recovery in clinical practice and research. However, its widespread use requires refined reliability data, particularly across different health professions. We therefore investigated the interrater reliability of the FMMA scored by a physical therapist and a physician using video recordings of stroke patients.
Methods: The FMMA videos of 50 individuals 3 months post stroke (28 females, mean age 71.64 years, median National Institutes of Health Stroke Scale score 3.00) participating in the ESTREL trial (Enhancement of Stroke Rehabilitation with Levodopa: a randomized placebo-controlled trial) were independently scored by two experienced assessors (i.e., a physical therapist and a physician) with specific training to ensure consistency. As primary endpoint, the interrater reliability was calculated for the total scores of the entire FMMA and the total scores of the FMMA for the upper and lower extremities using intraclass correlation coefficients (ICC). In addition, Spearman’s rank order correlation coefficients (Spearman’s rho) were calculated for the total score and subscale levels. Secondary endpoints included the FMMA item scores using percentage agreement, weighted Cohen’s kappa coefficients, and Gwet’s AC1/AC2 coefficients.
Results: ICCs were 0.98 (95% confidence intervals (CI) 0.96–0.99) for the total scores of the entire FMMA, 0.98 (95% CI 0.96–0.99) for the total scores of the FMMA for the upper extremity, and 0.85 (95% CI 0.70–0.92) for the total scores of the FMMA for the lower extremity. Spearman’s rho ranged from 0.61 to 0.94 for total and subscale scores. The interrater reliability at the item level of the FMMA showed (i) percentage agreement values with a median of 77% (range 44–100%), (ii) weighted Cohen’s kappa coefficients with a median of 0.69 (range 0.00–0.98) and (iii) Gwet’s AC1/AC2 coefficients with a median of 0.84 (range 0.42–0.98).
Discussion and conclusion: The FMMA appears to be a highly reliable measuring instrument at the overall score level for assessors from different health professions. The FMMA total scores seem to be suitable for the quantitative measurement of stroke recovery in both clinical practice and research, although there is potential for improvement at the item level.


Introduction
Motor impairment is one of the most important disabilities associated with stroke and can significantly affect the quality of life (1). Muscle weakness, abnormal synergy, and spasticity are among the motor deficits commonly assessed in stroke patients (2). Considering the repair processes, measuring motor recovery after stroke is very important. The Fugl-Meyer Motor Assessment (FMMA) (3) is strongly recommended as a clinical and research tool for the evaluation of changes in motor impairment after stroke (4). It was a key component of the assessment recommendations for improving the methodology of adult rehabilitation and recovery trials (5) and clinical motor rehabilitation (6), which should be repeated at different measurement time points. The inclusion of the upper extremity FMMA (FMMA-UE) in further recommendations for outcome measurement after stroke has confirmed its importance (7, 8).
The maximum total score per side is 66 points for the FMMA-UE and 34 points for the lower extremity FMMA (FMMA-LE) (4). The FMMA items are rated on an ordinal scale with the scores 0 = cannot perform, 1 = performs partially and 2 = performs fully (4). The practical implementation of the test and the assessment of its individual items require standardized, sound training as well as routine. These aspects can be promoted by a uniform test version in the different languages of the respective countries of application. Upon completion of the present project, standardized FMMA test forms translated into more than 10 different languages were available [e.g., at https://www.gu.se/en/neuroscience-physiology/fugl-meyerassessment (9)]. However, to the best of our knowledge, no standardized, validated German version of the test is currently available. Therefore, we developed an adapted German version of the assessment, based on the original article and protocols of the University of Gothenburg (3, 10, 11). The corresponding assessment forms can be found in Supplementary Table S1. The interprofessional application of this German version of the FMMA in clinical trials requires that good psychometric properties be demonstrated as part of the validation process.
A high interrater reliability of the German version of the FMMA across different health professions is essential for the use of the assessment in clinical studies, but also for its application in daily rehabilitation practice. The English version of the FMMA showed excellent intra- and interrater reliability (4). Platz et al. (12) found a very high interrater reliability of the FMMA-UE with intraclass correlation coefficients (ICC) based on video recordings. In the Sullivan et al. (13) study, interrater agreement between expert and therapist raters using video recordings was high for the FMMA total scores with an ICC value of 0.98 as well as for total scores of the FMMA-UE with 0.99 and moderate to high for the FMMA-LE total scores with 0.91. Based in part on the strong evidence for validity, reliability, responsiveness, and clinical utility, the FMMA-UE was incorporated into the core set of European evidence-based recommendations for Clinical Assessment of Upper Limb In Neurorehabilitation (CAULIN) (7).
In this context, refined reliability data and the availability of transculturally adapted, validated FMMA versions in different languages are even more important. Investigating the interrater reliability of new FMMA versions using sufficiently large samples is a relevant component in this regard. Therefore, we aimed to investigate (i) the interrater reliability of the German FMMA across health professions and (ii) the comparability of the psychometric properties of the German FMMA with those of the English version.

Project objectives and design
The aim of this research project was to study the interrater reliability of the German version of the FMMA in terms of its consistent and accurate application across health professions. The FMMA is used in the ongoing Swiss multicentre ESTREL trial (Enhancement of Stroke Rehabilitation with Levodopa: a randomized placebo-controlled trial, BASEC-number 2018-02021, ClinicalTrials.gov NCT03735901) (14), in which the current reliability study with a cross-sectional design was embedded.

Study population and procedure
All patients in this study had a video recorded FMMA at their regular three-month visit as part of their participation in ESTREL (14, 15). In brief, ESTREL investigates whether Levodopa, compared to placebo, given in addition to standardized rehabilitation based on the principles of motor learning, is associated with a patient-relevant enhancement of functional recovery in acute ischemic or haemorrhagic stroke patients, as measured by the FMMA after 3 months (14, 15).
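The present project followed the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) (16). As a preparatory step, an extensive literature search on relevant FMMA publications in the English-speaking world was conducted. After the selection of adequate reference literature, different FMMA versions were analyzed in detail and their contents were precisely compared. Between December 2019 and May 2021, FMMA video recordings of the three-month visit of ESTREL stroke patients were performed at the two best recruiting centres, Basel and Zurich, Switzerland. The FMMA was applied in an outpatient visit setting, in most cases by the first author (KW). The sampling method of the recordings was consecutive, following a standardized procedure. Eligibility criteria: We took the first 72 available FMMA videos from ESTREL participants who were eligible for the trial (14). Of these, 62 videos were identified in Basel and 10 in Zurich. The first author (KW) performed a quality check of all collected videos based on the criteria of (i) completeness, (ii) visibility of the entire examination, and (iii) potential source of bias. Video recordings were excluded if (a) the FMMA was incomplete, (b) an FMMA subscale was not fully visible, or (c) the evaluation sheet with the FMMA ratings was visible on the video. In addition, recordings were excluded if one of the assessors of the videos was the FMMA examiner being videotaped. A flowchart of the video selection process is presented in Figure 1.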

Independent assessors
Two independent assessors, one from each participating centre, rated the FMMA videos. Rating was limited to the hemiparetic side in each case. The assessors were one research physician (LM) and one research physiotherapist (AS) from the two different centres, each with a master's degree and clinical experience, who met the following criteria: First, both assessors had participated at least twice in a standardized, four-hour in-person FMMA training course by an FMMA expert (JH), based on the German version of the FMMA (see Figure 2, FMMA training). Second, both assessors had applied the German version at least 50 times on stroke patients in a standardized setting.
The assessors scored the videos separately in space and time, and independently of each other and other study personnel. The scores were directly entered in coded electronic case report forms (eCRF) of the German version of the FMMA within the secure web application REDCap (Research Electronic Data Capture) (17). Regarding clinical information, the assessors were unaware of the initial stroke severity, including the FMMA scores at baseline, but were not blinded to the medical history of the subjects in the video recordings. Both assessors had the same access to on-site training and additional video tutorials for recapitulation. They were given additional guidance and explanation on how to proceed in special situations where FMMA items could not be completed for non-stroke-related reasons (e.g., due to pain) or where items were incomplete on video. The flow chart of the study procedure can be found in Figure 3.

Data recording and confidentiality
The videos were recorded with a GoPro camera (GoPro Inc., San Mateo, California, US). Camera positions (heights, [...] The informed consent form for the ESTREL trial specifies that the FMMA tests may only be recorded and used for internal research purposes in order to conduct a thorough evaluation.
FIGURE 1 Flow chart of the selection process for FMMA video recordings. FMMA, Fugl-Meyer Motor Assessment; n, number of subjects.

Statistical reliability analysis
A sample size of 50 subjects was recommended for reliability studies in order to reasonably determine kappa values (19). In our project, we followed this recommendation, as well as appropriate reference studies that included between 10 and 60 individuals after stroke in their reliability analyses (12, 13, 20-26).
Our primary endpoint was the interrater reliability of the FMMA, calculated for the total scores of the entire FMMA and the total scores of the FMMA for the affected extremities using ICCs with the corresponding 95% confidence intervals (CI). The following ICC form fitted the model best: two-way mixed effects, absolute agreement, multiple raters/measurements (27, 28). ICC values were also calculated for the FMMA subscales, as these parameters were recommended for use with continuous variables (19). For comparison with the reference literature, ICCs were calculated for all FMMA subscales. Since it is questionable whether the ICC, as a parameter for continuous variables (19), is suitable for variables with few levels, the ICC was not considered as the only parameter for the coordination subscales (three items/0-6 levels) of the FMMA and for the wrist (five items/0-10 levels) and hand (seven items/0-14 levels) subscales of the FMMA-UE. For these subscales, weighted Cohen's kappa values with associated CIs were calculated. Since the data were non-parametric, Spearman's rank-order correlation coefficients (Spearman's rho values) with appropriate p-values and 95% CI were calculated to document the strength of association for total and subscale evaluations between assessor 1 and assessor 2.
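To illustrate the ICC form named above, the following is a minimal sketch in Python (the study itself used R, and this is not the authors' code). It assumes the average-measures, absolute-agreement formula ICC = (MSR - MSE) / (MSR + (MSC - MSE)/n), with n subjects in rows and k raters in columns; the function name and the toy scores are hypothetical, and the 95% CI is not computed here.

```python
import numpy as np

def icc_absolute_agreement_average(scores):
    """Two-way, absolute-agreement, average-measures ICC.

    scores: array-like of shape (n_subjects, k_raters).
    Hypothetical helper for illustration; confidence intervals are omitted.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    # Sums of squares of the two-way ANOVA without replication
    ss_subjects = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
    ss_raters = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
    ss_total = ((x - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_subjects - ss_raters
    # Mean squares for subjects (rows), raters (columns), and residual error
    msr = ss_subjects / (n - 1)
    msc = ss_raters / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

# Toy total scores for five patients rated by two assessors (invented data)
toy_totals = [[60, 61], [45, 44], [30, 33], [66, 66], [12, 10]]
print(round(icc_absolute_agreement_average(toy_totals), 3))
```

Because absolute agreement is requested, a constant offset between the two assessors lowers the coefficient, which would not be the case for a consistency-type ICC.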
Several statistical procedures formed the secondary endpoints for assessing the reliability of the FMMA at the item level: (i) Percentage agreement values between the two ratings were calculated for all 50 FMMA individual tasks of the affected extremities. (ii) Weighted Cohen's kappa (29, 30) values and corresponding 95% CI were obtained from the FMMA ordinal variables. (iii) Gwet's AC1/AC2 coefficients with corresponding 95% CIs were calculated at the item level in addition to the weighted Cohen's kappa values.
Statistical procedures to determine all endpoints were performed using RStudio software, version 1.2.1335.
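The three item-level statistics can be sketched for a single FMMA item as follows. This is an illustrative Python sketch, not the analysis code of the study: it assumes linear weights for Cohen's kappa (the weighting scheme is not stated here), computes only the unweighted AC1 rather than AC2, omits confidence intervals, and uses invented ratings and a hypothetical function name.

```python
import numpy as np

def item_agreement(rater1, rater2, categories=(0, 1, 2)):
    """Return (percentage agreement, linear weighted kappa, Gwet's AC1)."""
    q = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    table = np.zeros((q, q))
    for a, b in zip(rater1, rater2):
        table[index[int(a)], index[int(b)]] += 1
    p = table / table.sum()                 # joint proportions
    row, col = p.sum(axis=1), p.sum(axis=0)  # marginals of rater 1 and rater 2

    percent = np.trace(p)                   # observed exact agreement

    # Linear weights (assumption): 1 on the diagonal, 0.5 one step apart, 0 two apart
    i, j = np.indices((q, q))
    w = 1 - np.abs(i - j) / (q - 1)
    po_w = (w * p).sum()
    pe_w = (w * np.outer(row, col)).sum()
    kappa_w = (po_w - pe_w) / (1 - pe_w) if pe_w < 1 else float("nan")

    # Gwet's AC1: chance agreement based on the average category prevalence
    pi = (row + col) / 2
    pe_gwet = (pi * (1 - pi)).sum() / (q - 1)
    ac1 = (percent - pe_gwet) / (1 - pe_gwet)
    return percent, kappa_w, ac1

# Invented item: assessor 1 scores all ten patients 2, assessor 2 deviates once
a1 = [2] * 10
a2 = [2] * 9 + [1]
percent, kappa_w, ac1 = item_agreement(a1, a2)
```

In this invented example the two assessors agree on nine of ten patients, yet the weighted kappa is zero because almost all ratings fall into one category, while AC1 stays high. This is the kind of prevalence effect that motivates reporting Gwet's coefficients alongside kappa.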

Evaluation of parameters
Reliability parameters were categorized according to appropriate classifications (see Supplementary Tables S2, S3): We applied the 95% CIs of the ICC estimates for interpretation instead of the ICC estimates themselves (27) and used the Landis and Koch (31) classification for the weighted Cohen's kappa and the Gwet's AC1/AC2 values to compare the results of the German FMMA with those of previously published studies.
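As a rough illustration of how such classifications can be applied, the sketch below maps an ICC confidence interval to the commonly cited Koo and Li categories and a kappa or AC1/AC2 value to the Landis and Koch categories. The cut-offs and their boundary handling are assumptions taken from the usual statements of these schemes; the exact tables used in this project are those in Supplementary Tables S2 and S3, and the function names are hypothetical.

```python
def koo_li_label(value):
    """Koo and Li category for a single ICC value (boundary handling assumed)."""
    if value < 0.5:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value < 0.9:
        return "good"
    return "excellent"

def koo_li_ci_label(lower, upper):
    """Classify an ICC by the bounds of its 95% CI rather than the point estimate."""
    low, high = koo_li_label(lower), koo_li_label(upper)
    return low if low == high else f"{low} to {high}"

def landis_koch_label(value):
    """Landis and Koch strength of agreement for kappa-type coefficients."""
    if value < 0.0:
        return "poor"
    if value <= 0.20:
        return "slight"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "substantial"
    return "almost perfect"

print(koo_li_ci_label(0.96, 0.99))   # CI reported for the entire FMMA total score
print(landis_koch_label(0.84))       # median item-level AC1/AC2 reported in the abstract
```

Classifying by the interval bounds is more conservative than classifying by the point estimate, since a wide interval can span several categories.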

Results
Fifty video recordings were eligible to study the interrater reliability of the German FMMA version (Figure 1). There were no missing data that affected the statistical analysis.

Patient characteristics
Fifty individuals with stroke were recorded 3 months ±14 days after randomization in the ESTREL trial. Twenty-eight of the participants were female, the mean age was 71.64 years, and the median National Institutes of Health Stroke Scale (NIHSS) score was 3.00. All patient demographic and clinical characteristics are presented in Table 1.

Descriptive findings of the Fugl-Meyer Motor assessment
Between June and September 2021, 50 FMMA ratings were carried out by both assessors. The difference in median total scores between assessors was less than three points for the FMMA-UE and one point for the FMMA-LE. The median values with the corresponding first and third quantiles of the two assessors' total FMMA scores are shown in Table 2.

Primary endpoint
All interrater reliability parameters at the overall score and subscale levels of the FMMA are shown in Table 3.
For all total scores (FMMA-UE, FMMA-LE and entire FMMA) as well as for the proximal part subscale of the FMMA-UE and the hip, knee, ankle subscale of the FMMA-LE, the ICC values were between 0.80 (95% CI 0.64-0.88) for volitional movement within flexor and extensor synergies of the lower extremity and 0.98 (95% CI 0.96-0.99) for the total scores of the FMMA-UE. The ICC for the total scores of the entire FMMA was very similar at 0.98 (95% CI 0.96-0.99). Using Koo and Li's (27) classification for the 95% CI of the ICC values, the reliability of the meaningful subscales as well as that of the total scores (values written bold in Table 3) was classified as moderate to excellent. Weighted Cohen's kappa values ranged from 0.62 (95% CI 0.42-0.83) for the coordination subscales of the FMMA-LE to 0.91 (95% CI 0.91-0.91) for the hand subscales of the FMMA-UE. Using Landis and Koch (1977) (31) benchmarking for kappa statistics, the strength of agreement was found to be moderate to almost perfect.
The Spearman's rank-order correlation coefficients for the total score and subscale levels ranged from 0.61 to 0.94 (median 0.91), with the lowest value for the hip, knee, ankle subscales of the FMMA-LE (values <0.7). The highest values were obtained for the FMMA-UE total scores, the total scores of the entire FMMA, and the wrist, hand, and the coordination subscales of the FMMA-UE (values >0.9). All p-values of Spearman's rank-order correlation coefficients were smaller than 0.001.

Secondary endpoints
All item-based interrater reliability parameters of the German version of the FMMA-UE are summarized in Table 4 and those of the [...] CI 0.00-0.00) for item Ib. as well as for item Va., and the highest was 0.74 (95% CI 0.74-0.74) for one component of the tasks performed in a sitting position. The median was 0.47. Thus, the degree of agreement was slight to substantial. In most cases, the Gwet's AC1/AC2 coefficients were higher than the weighted Cohen's kappa coefficients. Based on these Gwet's AC1/AC2 values (median for the items of the entire FMMA 0.84, range 0.42-0.98; median for the items of the FMMA-UE 0.82, range 0.58-0.98; median for the items of the FMMA-LE 0.87, range 0.42-0.97), moderate to almost perfect agreement according to the classification of Landis and Koch (1977) (31) was found for the FMMA-UE, while it was also moderate to almost perfect for the lower extremity.

Discussion
The results indicate the following key findings: (i) The total scores of the entire FMMA show excellent interrater reliability of the German FMMA version across different health professions. This makes it suitable for quantitative measurement of stroke recovery in both clinical practice and research. (ii) Interrater reliability at the item level was lower than in comparable studies with FMMA versions in other languages, leaving room for potential improvement in this area.

Interrater reliability at the overall score level
For the total scores of the entire FMMA, which includes both the FMMA-UE and FMMA-LE, the ICC was 0.98 (95% CI 0.96-0.99), which is considered excellent (27). This finding is consistent with studies that investigated the interrater reliability of the English version of the FMMA in different settings (12, 13, 26).

Item-level interrater reliability
The percentage agreement values in the present study, which ranged from 44 to 98% for the FMMA-UE and from 44 to 100% for the FMMA-LE, were lower than those of the Colombian Spanish version of Hernández et al. (2019, 2020) (24, 25), which ranged from 88 to 100% for the FMMA-UE and FMMA-LE. The level of agreement for the items of the FMMA-UE and FMMA-LE in the transcultural/cross-cultural translation and validation studies was above 70% for an Italian version (23) and for a Danish version (22). Both working groups classified an agreement of ≥70% as satisfactory (22, 23). In contrast, the agreement values for our German version were below 70% for eight items of the FMMA-UE (seven of them within the proximal part subscale) and five items of the FMMA-LE (two of them within the coordination subscale).
Particularly noticeable are the lower percentage agreement values of the respective three items from the coordination subscales of the FMMA-UE and FMMA-LE compared to the data of the above-mentioned articles. In this study, the FMMA-UE coordination item values ranged from 48 to 96% (with the lowest value for dysmetria followed by tremor) and FMMA-LE values ranged from 44 to 82%, whereas the reference studies reported FMMA-UE coordination item values of at least 80% and FMMA-LE values of at least 70% (22-25). These discrepancies raise the question of whether the items of the coordination subscale of the German FMMA should be defined more specifically.
Another explanation for the lower interrater reliability values at the item level in the present project could be that the assessors of the reference studies were therapists (22-25). In the present project, the assessors were a physician and a physiotherapist. At the item level, profession-specific differences in rating may well be apparent.

Implications for research and clinical practice
According to expert recommendations (5, 6, 8), the FMMA should be implemented as an important assessment for the body function and structure domain of the International Classification of Functioning, Disability and Health (ICF) throughout the continuum of stroke care to optimize the quality of the rehabilitation pathway. The results of the present project make a small but important contribution towards this goal.
To ensure a consistent and uniform application of the FMMA, a clear, standardized training and refresher training structure as well as a lively exchange between assessors during the training process are of great importance. These elements are largely similar to the procedures used in our training setting. Therefore, and in line with See et al. (26), we recommend the creation of instructional videos as well as test patient videos to compare scoring as a supplement to in-person FMMA training in small groups with an expert and standardized assessment forms.
Based on the proposed measures, the assessment forms of the German version of the FMMA can be further developed and the training structure can be adapted. In the future, international standardization and harmonization of FMMA protocols might be useful.

Strengths
To the best of our knowledge, this is the first reliability study with a cross-sectional design at a predefined measurement time point in which two assessors evaluated the interrater reliability of a German FMMA version using video recordings of 50 individuals after stroke. Except for the studies by Hernández et al. (24, 25) with 60 stroke patients, all selected reference studies with similar populations had smaller sample sizes (12, 13, 21-23, 26, 32). Furthermore, the consistency of the ICC values across different calculation methods indicated the robustness of our key findings.
The assessors of the current project belong to two different health professions (a physician and a physical therapist), which can be seen as a strength considering that the FMMA is meant to be used more widely in the future. Therefore, and for the envisioned higher acceptance of the FMMA as a key motor recovery assessment tool, a high reliability across different professions is essential. Another strength was that the video recordings could be evaluated remotely, avoiding repetitions of the FMMA, which would have introduced the risk of bias due to learning effects. Furthermore, the video approach may allow centralized adjudication within the multicentre ESTREL trial and could improve the quality of future stroke recovery and rehabilitation studies.

Limitations
We are aware of the following limitations of our project. Firstly, the generalisability of the findings based on video recordings with only two assessors has not been demonstrated. A future statistical reliability analysis should incorporate the original FMMA scores from the ESTREL database obtained from real measurements at the time of videotaping in the presence of patients. In this way, the original on-site FMMA ratings might be compared with the ratings of the two assessors based on the videotaped FMMA. This would allow additional comparison of interrater reliability with that reported in the literature based on FMMA ratings in the presence of patients.
Secondly, different statistical approaches to calculating interrater reliability were described in the literature. The parallel calculation of Gwet's AC1/AC2 coefficients for the item level of the FMMA can be considered a useful complement to the weighted Cohen's kappa coefficients. The statistic of Gwet, in turn, is not well known because studies of interrater reliability in the current research field rarely report these coefficients. The comparability of study results is important in this context, which is why the use of Svensson's method (33), for example, should be considered in future cross-cultural translations and adaptations. Likewise, the determination of systematic disagreement would be an interesting approach. For example, the tasks actively performed by patients with combined movement levels and directions, which are difficult to assess from only one perspective, tended to reflect more systematic inconsistencies. This was evident in some large movement tasks as well as in the evaluation of dysmetria of the FMMA-UE, but also in the standing items and coordination tasks (dysmetria and tremor) of the FMMA-LE. Thirdly, the results of our study are not necessarily applicable to populations other than stroke patients and to assessors from other health professions (e.g., study nurses).

Conclusion
The FMMA appears to be a highly reliable measuring instrument at the overall score level for assessors from different health professions. The FMMA total scores seem to be suitable for the quantitative measurement of stroke recovery in both clinical practice and research, although there is potential for improvement at the item level.

FIGURE 3 Flow chart of the study procedure. FMMA, Fugl-Meyer Motor Assessment.

TABLE 2
FMMA median and quantile values of the two assessors.

TABLE 1
Patient demographic and clinical characteristics.
FAC, Functional Ambulation Categories; mRS, modified Rankin Scale; n, number of subjects; NIHSS, National Institutes of Health Stroke Scale; Q, quantile; SD, standard deviation.

TABLE 3
Interrater reliability parameters of the German version of the FMMA at the overall score level.