- 1 The University of Arizona College of Nursing, Tucson, AZ, United States
- 2 The University of Arizona Mel and Enid Zuckerman College of Public Health, Tucson, AZ, United States
- 3 Indiana University School of Medicine, Indianapolis, IN, United States
- 4 Regenstrief Institute Inc, Indianapolis, IN, United States
- 5 VA Center for Health Information and Communication, Indianapolis, IN, United States
Background: Menstrual pain affects 45%–95% of reproductive-age females and increases the risk of other chronic pain conditions. Psychometrically sound measurement tools are essential for advancing research and clinical care in menstrual pain. Numerical rating scales (NRS) are widely used to measure pain severity. However, the minimally important difference (MID) and responsiveness to change of the NRS in the context of menstrual pain are not well understood. Understanding MID and responsiveness to change helps guide the evaluation of treatment efficacy and clinical decision-making. This study evaluated the MID and responsiveness to change in the NRS, ranging from 0 to 10, for menstrual pain severity.
Methods: Participants who were menstruating (aged 14–42, N = 100) completed two surveys 24 h apart. In both surveys, we measured menstrual pain severity (worst, least, and average menstrual pain in the past 24 h, and current menstrual pain) on a 0 (no pain) to 10 (extremely severe) NRS. MIDs were estimated using distribution-based approaches (standard error of measurement and effect size) and anchor-based approaches (using symptom interference and retrospective recall of change as anchors). Responsiveness to change was evaluated using standardized response means and area-under-the-curve analysis.
Results: The MID estimates were close to 1 point. The NRS of menstrual pain severity was responsive to menstrual pain improvement (standardized response means ranged from 0.44 to 0.61, p < 0.001 for between-group comparisons). Area-under-the-curve estimates ranged from 0.66 to 0.70.
Conclusions: The findings can inform the design and interpretation of studies testing interventions for menstrual pain, while also guiding clinicians in monitoring and adjusting treatment.
Introduction
Menstrual pain, or dysmenorrhea, is characterized by abdominal or pelvic pain just before or during menstruation. It affects 45%–95% of reproductive-age females and leads to impaired sleep, decreased physical activity, and absences from school and work (1, 2). Additionally, dysmenorrhea is linked to an increased risk of developing other chronic pain conditions later in life, such as irritable bowel syndrome, fibromyalgia, and non-cyclic chronic pelvic pain (3, 4). Various interventions, both pharmacological and non-pharmacological, have been proposed and tested for the treatment of dysmenorrhea (1, 5, 6). To assess the effectiveness of interventions for dysmenorrhea, it is essential to use psychometrically sound measures. Two important psychometric domains are the minimally important difference (MID) and responsiveness to change. MID is the smallest change in a health outcome that prompts a clinician to consider altering the treatment plan (7). The MID estimate helps researchers/clinicians interpret the magnitude of treatment effects, distinguish between statistically significant and clinically important changes, and serves as a benchmark for calculating statistical power (8). Responsiveness indicates how well a measure detects changes over time (8). Using measures with high responsiveness helps avoid false-negative results, reduces the need for larger sample sizes, and guides clinicians in monitoring and adjusting treatment.
Numerical rating scales (NRS) for pain severity, typically ranging from 0 to 10, are widely used in pain research, including dysmenorrhea studies (9). Research suggests that the MID for NRS ranges from 0.8 to 4 for acute pain and from 1 to 2.5 points for chronic pain (10–12), indicating changes beyond these MID thresholds are generally perceived as meaningful by patients. However, MID can be context-dependent (8). What is considered a meaningful change in one population may not apply to another (8), especially in the case of menstrual pain, which is episodic in nature. Existing studies on MID for NRS for menstrual pain severity are scarce. Two studies conducted outside the United States suggested that an approximately 3-point change represents a clinically important difference for menstrual or pelvic pain (13, 14), and a study of patients with moderate-to-severe endometriosis proposed a 4-point threshold as the clinically important difference for menstrual pain (15). However, these studies focused on clinically important differences rather than minimally important differences, which may represent smaller yet meaningful changes from the patient perspective. In addition, they either relied on highly selective clinical samples (e.g., individuals with moderate-to-severe, surgically diagnosed endometriosis) or had relatively small sample sizes (typically 200–300 participants). Given that MID estimates can be context-specific, there remains a clear need for research to establish MID values for menstrual pain.
The 0–10 NRS for pain severity has been shown to be responsive, and more so than other common tools such as the visual analogue scale and verbal rating scale (16). However, the responsiveness of the NRS in the context of menstrual pain has rarely been evaluated. Pain severity can be measured using various references, including current pain, average pain, worst pain, and least pain. Each reference provides different insights into the pain experience. For instance, measuring current pain gives an immediate snapshot of the pain severity, while average pain provides an overview of the pain experienced over a period of time. Worst pain captures the highest level of pain experienced, and least pain reflects the lowest level of pain severity. Comparing the responsiveness of these references is important for selecting the appropriate reference for research studies.
While NRS are commonly used in pain and menstrual pain studies, there has been limited research on their MID and responsiveness to change in the context of menstrual pain. This study aims to fill this gap by estimating the MID and evaluating the responsiveness to change for pain severity measured by a 0–10 NRS.
Methods
Design
We used a short-term longitudinal repeated measures design with a 24-hour follow-up. Participants completed a baseline survey on days 1–3 of their menstrual cycle, when menstrual pain is typically most intense, to reduce floor effects and maximize the chance of detecting a change (17). Menstrual pain can vary significantly from day to day during menstruation, especially during the first few days, making a 24-hour window clinically relevant for detecting change (17, 18). A 24-hour window also minimizes recall bias, improving the reliability of MID and responsiveness estimates (9). The study was approved by the institutional review board at Indiana University.
Sampling
Eligibility criteria were: (1) female, (2) aged 14–42, (3) able to read and converse in English, (4) currently living in the United States, (5) having had menstrual pain in the past 6 months at enrollment, and (6) menstruating and in days 1–3 of the menstrual cycle at the time of the baseline survey. The lower age limit reduced the likelihood of infrequent menstruation after menarche, while the upper age limit reduced the likelihood of enrolling perimenopausal females. The excluded age groups (≤13 and ≥43) often experience unpredictable menstrual cycles and pain patterns that may not reflect typical dysmenorrhea, potentially confounding the interpretation of psychometric results.
Participants were recruited from an opt-in survey panel managed by Qualtrics (Provo, UT). Invitations were emailed to 65,625 females aged 14–42. Participants who were on day 4 or later of their menstrual cycle were excluded to minimize floor effects, as they were less likely to be experiencing intense menstrual pain (17).
Interested individuals clicked a survey link (n = 1,654) and provided self-reported responses on age, sex, menstrual cycle day, and whether they had experienced menstrual pain in the past six months. Eligible individuals were shown a study information sheet. Those who decided to participate completed the survey. To ensure data quality, we embedded three attention checks (i.e., “trap questions”) in the survey and excluded participants who failed any of them. For example, one attention check stated: “This is an attention check; please select ‘very much’ for this statement.” We also excluded responses completed in less than one-third of the group's median completion time. To ensure adequate representation of participants in the early phase of menstruation, we implemented recruitment quotas to over-sample individuals (targeting at least one-third) who were on days 1–3 of their menstrual cycle. Out of the eligible individuals (n = 1,032), 836 responded to the survey, and 686 provided legitimate responses at baseline (19). Of the 260 participants surveyed on days 1–3 of their menstrual cycle at baseline, 100 provided valid follow-up responses, constituting the sample for analyses. Participant recruitment and data collection were conducted in 2019. There are no specific sample size guidelines for estimating MID and responsiveness. However, a sample size of 100 is recommended to generate precise reliability coefficients for MID estimates (20).
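The data-quality screen described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the record keys (`seconds`, `checks_passed`) and the example records are hypothetical, and the median is assumed to be taken over all submitted responses, which the text does not specify.

```python
from statistics import median

def keep_valid_responses(responses, min_fraction=1 / 3):
    """Drop responses that fail any attention check or were completed too fast.

    `responses` is a list of dicts with hypothetical keys:
      'seconds'       - completion time in seconds
      'checks_passed' - list of booleans, one per attention check
    """
    # Assumption: the cutoff is one-third of the median over all submitted responses.
    cutoff = min_fraction * median(r["seconds"] for r in responses)
    return [
        r for r in responses
        if all(r["checks_passed"]) and r["seconds"] >= cutoff
    ]

# Made-up records for illustration only.
sample = [
    {"id": 1, "seconds": 600, "checks_passed": [True, True, True]},
    {"id": 2, "seconds": 150, "checks_passed": [True, True, True]},   # too fast
    {"id": 3, "seconds": 540, "checks_passed": [True, False, True]},  # failed a check
    {"id": 4, "seconds": 660, "checks_passed": [True, True, True]},
]
valid_ids = [r["id"] for r in keep_valid_responses(sample)]
```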
Measures
The baseline survey included self-reported demographics, menstrual and gynecological information, menstrual pain severity, and dysmenorrhea symptom interference. The follow-up survey (24 h later) included menstrual pain severity, dysmenorrhea symptom interference, and a retrospective global rating of change.
Menstrual pain severity
The NRS was used to assess menstrual pain severity at baseline and follow-up. Its reliability and validity are supported in the literature on pain and menstrual pain (9, 13, 16). We evaluated the severity of menstrual pain by asking four questions: In the last 24 h, what number best describes your menstrual pain at its worst? At its least? On average? Right now? For each question, the scale ranged from 0 (“No pain”) to 10 (“Extremely severe”).
Dysmenorrhea symptom interference scale (DSI)
The DSI scale measures how dysmenorrhea symptoms interfere with daily activities. Its reliability, validity, and responsiveness are supported by previous research (19, 21, 22). We used the on-menses version of the scale at both baseline and follow-up. Participants were asked, over the last 24 h, how much their menstrual pain and menstrual gastrointestinal symptoms interfered with nine aspects of their daily lives, including physical activities, sleep, daily activities, work, concentration, enjoyment of life, leisure activities, social activities, and mood. Each item was rated on a scale of 1 (“Not at all”) to 5 (“Very much”). The DSI scale score was calculated as the average of the ratings across the nine items, with higher scores indicating greater interference (19).
Retrospective global rating of change
The retrospective global rating of change is widely used to assess the MID and responsiveness to change for patient-reported outcome measures (8). During the follow-up survey, participants rated how their menstrual pain had changed from the previous day on a 7-point scale with options of “Much worse”, “Somewhat worse”, “Slightly worse”, “The same”, “Slightly better”, “Somewhat better”, and “Much better”. Responses were recoded into numerical values from −3 (“Much worse”) to 3 (“Much better”).
Data analysis
MIDs and responsiveness to change were estimated using distribution-based and anchor-based approaches, as these offer complementary insights. Distribution-based approaches rely on statistical characteristics of the data, providing a more efficient measure of change by accounting for variability within the data and the statistical distribution. Anchor-based approaches use an external criterion (anchor) to interpret the change score, ensuring that changes in scores are meaningful and relevant to patients' experiences. There were no missing data. Data were analyzed using SAS software (Version 4.3, SAS Institute Inc., Cary, North Carolina, USA).
MID estimation
We estimated the MIDs using a triangulation approach that combined distribution- and anchor-based methods, as recommended in prior research (7, 23–26). Distribution-based techniques, grounded in statistical properties, offer objectivity, efficiency, and a safeguard against measurement error; however, they may overlook changes that are meaningful to patients or clinicians (26). In contrast, anchor-based methods use clinically relevant external criteria, enhancing interpretability (26). By integrating both approaches, we leveraged their complementary strengths to ensure greater robustness and clinical relevance of our findings.
Distribution-based approaches
We used effect size and the standard error of measurement (SEM) to derive distribution-based estimates. Effect sizes of 0.2, 0.35, and 0.5 standard deviations (SD) of baseline pain severity were calculated. According to Cohen, an effect size of 0.2 is considered small, and 0.5 is moderate (27). Research suggests that score differences greater than 0.5 SD, considered a medium effect size, are likely to exceed the minimally important difference (24). Thus, we chose 0.35 SD as the MID estimate, representing a midpoint between the lower (0.2 SD) and upper (0.5 SD) boundaries of the MID estimates (24, 28).
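As a minimal sketch of the effect-size calculation, the function below multiplies a baseline SD by the three multipliers named above. The function name is an illustrative choice; the example SD of 2.58 is the baseline SD for worst pain given in the Results.

```python
def effect_size_thresholds(baseline_sd, multipliers=(0.2, 0.35, 0.5)):
    """Return {multiplier: NRS points} for a given baseline standard deviation."""
    return {m: round(m * baseline_sd, 2) for m in multipliers}

# Baseline SD for worst pain as reported in the Results (2.58).
worst_pain = effect_size_thresholds(2.58)
```

With this input, the 0.35 SD threshold comes out at about 0.90 points, in line with the value reported for worst pain.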
SEM was calculated using baseline pain severity (24). SEM, defined as “the variation in the scores due to the unreliability of the scale or measure used,” was computed as the standard deviation multiplied by the square root of 1 minus the reliability coefficient (29, 30). Typically, Cronbach's α is used as the reliability coefficient for a multi-item scale. However, since the NRS for pain severity is a single-item scale, we used the test-retest reliability coefficient (31). We calculated this coefficient for the subgroup of participants who reported “The same” on the retrospective global rating of change, indicating stable menstrual pain between baseline and follow-up. There is no universal standard for determining how many SEMs correspond to the MID. One SEM has been suggested as the lower MID boundary, as a change smaller than one SEM is likely due to measurement error rather than a genuine change. Studies have shown that 1 SEM aligns with MID values defined using the anchor-based approach (24, 32, 33). As a result, 1 SEM is often used as a benchmark for identifying MIDs (24, 32, 33). Two SEMs, on the other hand, have been suggested as the upper boundary of MIDs, as 1.96 SEM or 2 SEM can represent the margin of error of a 95% confidence interval (CI) (24, 34, 35). Therefore, we used 1 SEM to estimate the lower bound and 2 SEM to estimate the upper bound of the MID.
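The SEM formula described above (SD multiplied by the square root of 1 minus reliability) can be sketched as follows. The input values are hypothetical, chosen only to make the arithmetic easy to follow; the study's test-retest coefficients are not given in the text.

```python
import math

def standard_error_of_measurement(baseline_sd, reliability):
    """SEM = SD * sqrt(1 - reliability), with reliability between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return baseline_sd * math.sqrt(1.0 - reliability)

# Hypothetical inputs: SD = 2.5 and test-retest reliability = 0.84.
sem = standard_error_of_measurement(2.5, 0.84)
lower_bound = sem        # 1 SEM, used as the lower MID bound
upper_bound = 2 * sem    # 2 SEM, used as the upper MID bound
```

Because the reliability enters under a square root, a lower test-retest coefficient yields a noticeably larger SEM for the same SD.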
Anchor-based approaches
Anchor-based approaches rely on an external reference (anchor) to interpret differences. These differences can be assessed cross-sectionally, by comparing clinically defined groups at a single time point, or longitudinally, by evaluating changes in scores within the same group over time (36). In this study, we used both cross-sectional and longitudinal anchor-based approaches. Anchors were selected based on their conceptual relevance to menstrual pain severity and their established use in the literature. We evaluated the correlation between an anchor and the NRS measure. A Pearson correlation of at least 0.3 indicated an appropriate anchor measure for estimating an MID (8, 24). In distinct anchor subgroups, we excluded subgroups with fewer than 10 samples from MID estimations, as estimates based on such small numbers would be unreliable (23). For each anchor-based estimate, we calculated 95% confidence intervals (CIs) using bootstrap procedures (37).
Cross-Sectional Anchor-Based Analysis. NRS measures were mapped onto a clinically meaningful cross-sectional anchor for between-person analysis, the DSI scale. Pearson correlations between NRS and DSI at baseline were 0.64 for worst pain, 0.53 for least pain, 0.59 for average pain, and 0.67 for current pain (p's < 0.0001), supporting the use of DSI as an appropriate anchor. The MID for DSI was reported to be 0.3 point on a 1–5 scale (19). We examined the difference in NRS that corresponds to 0.3 point MID in DSI using linear regression analysis. The linearity assumption was confirmed by scatter plots.
Longitudinal Anchor-Based Approaches. Longitudinal anchor-based approaches address within-individual change (24). Changes in the NRS from baseline to follow-up were mapped onto DSI change score (the prospective anchor) and retrospective global rating of change score (the retrospective anchor).
For prospective anchor-based analysis, the within-person change of DSI was calculated by subtracting an individual's follow-up DSI score from their baseline DSI score. The within-person change of NRS was calculated similarly. A negative change score indicates worsened pain, while a positive change score indicates improved pain. As a 0.3 point change in DSI score has been previously demonstrated to be the MID (19), we calculated the within-person change in NRS that corresponds to 0.3 point change in DSI. Pearson correlations between the NRS change scores and DSI change scores were 0.48 for worst pain, 0.23 for least pain, 0.53 for average pain, and 0.40 for current pain. Since the correlation between the NRS change for least pain and DSI change scores was 0.23 (i.e., less than 0.3), DSI was not a proper anchor when estimating MID for the NRS for the least pain.
For retrospective anchor-based analysis, the global rating of change collected at follow-up was used as the between-group anchor. This is the most widely used anchor, focusing on whether an individual has experienced no change, improvement, or worsening (38). Pearson correlations between NRS change scores and the retrospective global rating of change were 0.32 for worst pain, 0.35 for least pain, 0.38 for average pain, and 0.35 for current pain (p < 0.0001), supporting its appropriateness as an anchor in this context. NRS change scores corresponding to a one-category shift (e.g., from “The same” to “Slightly better”) were used as between-group MID estimates (23, 24).
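A one-category shift can be sketched as the difference in mean NRS change between two adjacent global-rating categories. This reading is an assumption, since the text does not give the exact computation. The sketch also applies the minimum subgroup size of 10 stated earlier, and all data are made up.

```python
from statistics import mean

MIN_SUBGROUP_SIZE = 10  # subgroups smaller than this were excluded in the study

def category_shift_mid(changes_by_category, reference="The same",
                       target="Slightly better"):
    """Difference in mean NRS change between two adjacent anchor categories.

    Returns None when either subgroup is too small to give a stable estimate.
    """
    ref = changes_by_category.get(reference, [])
    tgt = changes_by_category.get(target, [])
    if len(ref) < MIN_SUBGROUP_SIZE or len(tgt) < MIN_SUBGROUP_SIZE:
        return None
    return mean(tgt) - mean(ref)

# Made-up change scores (baseline minus follow-up; positive = improved).
changes = {
    "The same": [0] * 5 + [1] * 5,
    "Slightly better": [1] * 5 + [2] * 5,
    "Somewhat better": [3, 2, 4],  # too few participants
}
mid_slightly_better = category_shift_mid(changes)
mid_too_small = category_shift_mid(changes, "Slightly better", "Somewhat better")
```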
To synthesize MID estimates across different anchors, we calculated a correlation-weighted average, assigning greater weight to anchors more strongly correlated with the target measure (39).
MID Estimation Methods Reconciliation. As MIDs estimated by different methods can differ, the final recommended MIDs were derived from considering various distribution- and anchor-based methods (25). Distribution-based estimates were used to set approximate bounds of MID estimates. The MIDs should not be notably lower than a 0.2 effect size and one SEM to ensure they are more than a trivial difference and exceed the measurement error. At the same time, the MID estimates should not be notably higher than a 0.5 effect size or 2 SEMs to ensure that the difference is minimally important rather than moderately or substantially important (24).
Evaluation of responsiveness
The global rating of change was used to assess NRS's responsiveness to change. It served as an anchor to identify individuals who experienced changes from baseline to follow-up. Participants were categorized as “Improved,” “No Change,” or “Worsened” based on global rating of change. Both within-group and between-group responsiveness were evaluated.
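The recoding and grouping of the global rating of change can be written as a small lookup. The labels and the −3 to 3 endpoints come from the text; the intermediate codes are the natural fill-in, and the function name is illustrative.

```python
GLOBAL_CHANGE_CODES = {
    "Much worse": -3,
    "Somewhat worse": -2,
    "Slightly worse": -1,
    "The same": 0,
    "Slightly better": 1,
    "Somewhat better": 2,
    "Much better": 3,
}

def change_group(response):
    """Collapse a 7-point global rating into Improved / No Change / Worsened."""
    code = GLOBAL_CHANGE_CODES[response]
    if code > 0:
        return "Improved"
    if code < 0:
        return "Worsened"
    return "No Change"

groups = [change_group(r) for r in ("Much worse", "The same", "Slightly better")]
```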
Within-group responsiveness
Standardized response means (SRMs) were used to assess within-group responsiveness, reflecting the degree of change over time for each pain severity measure (i.e., worst pain, least pain, average pain, and current pain) within the “Improved,” “No Change,” and “Worsened” groups. The SRM was calculated by dividing the mean change score by the standard deviation of the change scores (40). Additionally, 95% CIs for the SRMs were generated using the bootstrapping procedure (37). According to established criteria, an absolute SRM of at least 0.3 indicates responsiveness (8, 41). Specifically, absolute SRMs between 0.3 and 0.5 indicate small responsiveness, values between 0.5 and 0.8 indicate moderate responsiveness, and values of 0.8 or higher reflect large responsiveness (40).
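The SRM and its bootstrap interval might be computed along the following lines. This sketch assumes a simple percentile bootstrap; the text cites a bootstrap procedure but does not say which variant was used. The change scores, seed, and number of resamples are hypothetical.

```python
import random
from statistics import mean, stdev

def srm(changes):
    """Standardized response mean: mean change divided by SD of the changes."""
    return mean(changes) / stdev(changes)

def bootstrap_srm_ci(changes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the SRM (assumed variant, see lead-in)."""
    if len(set(changes)) < 2:
        raise ValueError("change scores must vary to compute an SRM")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_resamples:
        resample = rng.choices(changes, k=len(changes))
        if len(set(resample)) > 1:  # skip resamples with zero SD
            stats.append(srm(resample))
    stats.sort()
    k = round(n_resamples * alpha / 2)
    return stats[k], stats[n_resamples - 1 - k]

# Made-up change scores (baseline minus follow-up) for an "Improved" group.
improved_changes = [2, 1, 3, 0, 2, 1, 4, 2, 1, 2]
point_estimate = srm(improved_changes)
ci_low, ci_high = bootstrap_srm_ci(improved_changes)
```

An interval that excludes zero would be read as statistically significant responsiveness, which is how the Results interpret the CIs.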
Between-group responsiveness
To assess between-group responsiveness, changes in scale scores were compared across the “Improved”, “No Change”, and “Worsened” groups defined by the global rating of change. An omnibus analysis of variance (ANOVA) was conducted to compare mean changes among these three groups. Post hoc pairwise comparisons were performed using the Tukey-Kramer test to compare (1) the “Improved” and “No Change” groups, (2) the “Worsened” and “No Change” groups, and (3) the “Improved” and “Worsened” groups, with the family-wise Type I error rate controlled at 0.05.
Receiver operating characteristic (ROC) curve analysis was also performed to quantify the ability of each measure to detect improvement. The area under the curve (AUC) represents the probability of accurately distinguishing between individuals who improved and those who did not (42, 43). For each pain severity measure, the AUC was calculated using the global rating of change as the anchor, dichotomized into “Improved” (“Slightly better,” “Somewhat better”, “Much better”) and “Not Improved” (“Much worse,” “Somewhat worse,” “Slightly worse,” or “The same”). AUC values range from 0.5 to 1.0. An AUC of greater than 0.5 would be considered meaningful, as an AUC of 0.5 represents a performance equivalent to random guessing (44).
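The probabilistic interpretation of the AUC given above can be made concrete with a pairwise (Mann-Whitney style) calculation. This sketch assumes the NRS change score (baseline minus follow-up, so positive means improvement) is the quantity being evaluated; the scores are made up.

```python
def auc(improved_changes, not_improved_changes):
    """Probability that a random improved participant has a larger NRS change
    than a random not-improved participant; ties count as one half."""
    wins = 0.0
    for x in improved_changes:
        for y in not_improved_changes:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(improved_changes) * len(not_improved_changes))

# Made-up change scores for illustration.
perfect = auc([2, 3], [0, 1])     # every improved score exceeds every other score
mixed = auc([1, 2], [1, 3])       # overlapping groups
```

Under this formulation, a value of 0.5 means the change score separates the two groups no better than chance.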
To examine the comparative responsiveness of measures with different references (i.e., current menstrual pain, average menstrual pain, worst menstrual pain, and least menstrual pain) in detecting improvement, we statistically compared AUC values across these references.
Results
Study sample
Table 1 summarizes sample characteristics. Among 100 participants, the mean age was 29.3 years (SD = 6.9). Most (80%) reported using pain medication for dysmenorrhea. At baseline, the mean scores for worst, least, average in the last 24 h, and current menstrual pain were 6.21 (SD = 2.58), 3.00 (SD = 2.45), 4.94 (SD = 2.44), and 4.33 (SD = 2.92), respectively.
MID results
Distribution-based estimates
As shown in Table 2, an effect size of 0.35 corresponded to a change of 0.90, 0.87, 0.85, and 1.02 points, respectively, for NRS measuring worst pain, least pain, average pain, and current pain. Based on 0.2 and 0.5 effect size estimates, the MID bounds should range from 0.5 to 1.5.
One SEM ranged from 0.72 to 1.57 points (lower bound of MID) and two SEM ranged from 1.44 to 3.14 (upper bound) for the NRS measuring menstrual pain severity. Specifically, 1 SEM equated to a 0.72-point change for the worst pain, a 1.03-point change for the least pain, a 1.02-point change for average pain, and a 1.57-point change for current pain.
Anchor-based estimates
As seen in Table 2, cross-sectional analyses indicate that each 0.3-point difference in the baseline DSI score corresponded to a 0.46-point change in the worst pain, a 0.36-point change in the least pain, a 0.40-point change in the average pain, and a 0.54-point change in the current pain.
Based on the prospective anchor-based analysis, each 0.3-point within-person change in DSI from baseline to follow-up (i.e., MID for the DSI) corresponded to a 0.39-point within-person change in the worst pain, a 0.15-point change in the least pain, a 0.41-point change in the average pain, and a 0.37-point change in the current pain (Table 2).
Table 2 also shows the results of the retrospective anchor-based analysis using a global rating of change as the between-group change anchor. For worsened pain, the MID estimates ranged from −0.11 to 0.31 for one category of global change. For improved pain, the MID estimates ranged from 0.33 to 0.93 for one category of global change.
Triangulating across anchor-based estimates, the correlation-weighted MID was 0.51 for worst pain, 0.42 for least pain, 0.41 for average pain, and 0.51 for current pain.
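The correlation weighting can be sketched as a weighted mean in which each anchor's estimate is weighted by the absolute value of its correlation with the NRS. That weighting scheme is an assumption about the cited method. The example uses only the two DSI-based estimates for worst pain given in the text, so it is not expected to reproduce the reported 0.51, which also draws on the retrospective estimates in Table 2.

```python
def correlation_weighted_mid(estimates):
    """Weighted mean of MID estimates; `estimates` holds (mid, correlation) pairs.

    Assumption: weights are proportional to the absolute correlation.
    """
    total_weight = sum(abs(r) for _mid, r in estimates)
    if total_weight == 0:
        raise ValueError("at least one anchor must have a non-zero correlation")
    return sum(abs(r) * mid for mid, r in estimates) / total_weight

# Worst pain, DSI anchors only: cross-sectional (0.46, r = 0.64)
# and prospective (0.39, r = 0.48), as given in the text.
worst_pain_dsi_only = correlation_weighted_mid([(0.46, 0.64), (0.39, 0.48)])
```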
Summary of MID estimates across approaches
Table 2 and Figure 1 illustrate the MID estimates derived from both distribution- and anchor-based approaches. Using 1 SEM and the 0.2 effect size (whichever is higher) as the approximate lower bound, 2 SEM and the 0.5 effect size (whichever is lower) as the approximate upper bound, and considering the anchor-based estimates, the MID was within the range of 0.7–1.3 for worst pain, 1.0–1.2 for least pain, and 1.0–1.2 for average pain, and was about 1.5 for current pain. In summary, the MID estimates for the NRS of menstrual pain severity were close to 1 point.
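The bounding rule can be checked against the figures reported above. The sketch below uses the baseline SDs and 1 SEM values given in the Results; the dictionary and function names are illustrative.

```python
# Values reported in the Results for each pain reference.
BASELINE_SD = {"worst": 2.58, "least": 2.45, "average": 2.44, "current": 2.92}
ONE_SEM = {"worst": 0.72, "least": 1.03, "average": 1.02, "current": 1.57}

def mid_bounds(sd, sem):
    """Lower bound: larger of 1 SEM and 0.2 SD. Upper bound: smaller of 2 SEM and 0.5 SD."""
    lower = max(sem, 0.2 * sd)
    upper = min(2 * sem, 0.5 * sd)
    return lower, upper

bounds = {ref: mid_bounds(BASELINE_SD[ref], ONE_SEM[ref]) for ref in BASELINE_SD}
```

For worst pain this gives roughly 0.72 to 1.29, consistent with the 0.7–1.3 range above. For current pain the lower bound (1.57) exceeds the upper bound (about 1.46), which may be why a single value of about 1.5 is reported rather than a range.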
Figure 1. Minimally important difference estimates using different approaches. MID, Minimally Important Difference; ES, Effect Size; DSI, Dysmenorrhea Symptom Interference; Retro, Retrospective; SEM, One Standard Error of Measurement. The small subgroup for the retrospective change anchor (n ranged from 10 to 30) likely contributed to the wide confidence intervals.
Responsiveness results
Within-group responsiveness
Table 3 presents the SRMs for the NRS of menstrual pain severity for the improved and worsened groups. For the improved group, SRMs suggested small to moderate and statistically significant responsiveness, ranging from 0.44 to 0.61 across NRS measures with different references. In contrast, the NRS demonstrated minimal responsiveness in the worsened group, with SRMs ranging from −0.36 to −0.24. The 95% CIs included the null value of zero, suggesting the results for the worsened group were not statistically significant.
Between-group responsiveness
As seen in Table 4, results were statistically significant for all four NRS measures when comparing the “Improved” and “Worsened” groups. When comparing the “Improved” and “No Change” groups, results were statistically significant for the “worst pain” and “least pain” measures. However, when comparing the “No Change” and “Worsened” groups, results were not statistically significant. These findings further support the responsiveness of NRS in detecting menstrual pain improvement, but not in detecting menstrual pain worsening.
AUC analysis demonstrated that pain severity measures effectively captured responsiveness to improvement, with values ranging from 0.66 to 0.70 (Figure 2). No statistically significant differences were found in AUC among the NRS measures with different references (worst, least, average, current).
Figure 2. Area under the curve comparison for detecting pain improvement. ROC, Receiver Operating Characteristic; p > 0.05 for comparison.
Discussion
In this study, we triangulated various approaches to estimate MIDs and assess responsiveness to change for NRS measuring menstrual pain severity. We found that the MID estimates were close to 1 point on a 0–10 scale of menstrual pain severity and the 0–10 NRS was responsive in detecting menstrual pain improvement.
We estimated the MIDs by triangulating distribution-based and anchor-based methods, leveraging their complementary strengths. As shown in Table 2, anchor-based estimates were consistently smaller than those derived from distribution-based methods. To account for potential measurement error, we set the lower bound using distribution-based estimates and selected values larger than the anchor-based estimates. Our MID estimate is similar to the SEM of the NRS for menstrual pain severity reported in a Brazilian study, which was 0.97 (13). However, our MID estimates were smaller than some reported in the literature. For example, several studies on other chronic pain conditions (e.g., musculoskeletal pain, neuropathy) showed MIDs of about 2 points on a 0–10 scale (11, 12, 45). A study of patients with moderate-to-severe surgically confirmed endometriosis suggested a 4-point change as the clinically important difference for “worst pain in the previous 24 h” (15). Two studies outside the United States suggested that an approximately 3-point change represents a clinically important difference for menstrual pain or pelvic pain (13, 14). These differences may be due to variations in pain conditions, study sample characteristics such as initial pain severity, and methodologies, further suggesting that MID estimates can be context-dependent. MID estimates from other chronic pain populations may not apply to menstrual pain. Some previous studies had baseline pain entry requirements, resulting in higher mean baseline pain than in our study population (11, 14, 15, 45). For individuals with high pain severity, a larger change may be needed to be minimally important. Methodological differences also exist between our study and previous ones. The endometriosis study grouped “very much improved,” “much improved,” and “minimally improved” into one category (15), whereas we treated these as distinct categories for MID estimation.
A Brazilian study used 2.77 SEM as the estimate for the clinically important difference in menstrual pain, while we used 2 SEM as the upper limit for our MID estimates (13). Another study across several countries reported a similar 0.5 SD (i.e., approximately 1.5 points) aligning with our findings (14). However, unlike our approach, they did not use 0.5 SD as the upper bound for the MID estimate. Their estimates may be clinically meaningful, but exceed the threshold for minimal importance, indicating moderate or substantial importance. Compared to the endometriosis study where participants were followed up at 3 and 6 months (15), we followed up with participants 24 h after the initial assessment to reduce recall bias for the retrospective global rating of change. Importantly, evidence-based reports using a 0-to-10 NRS to compare pain treatments have defined small, moderate, and large differences as 0.5, 1.0, and 2.0 points (46). Notably, the majority of our MID estimates in Figure 1 fell into the 0.5–1.0 point range.
Regarding responsiveness, we found that the 0–10 NRS of pain severity was responsive in detecting pain improvement, consistent with the previous literature on other chronic pain conditions (16). As shown in Figure 2, the AUCs were in a range similar to that reported in other studies using retrospective global ratings of change (47–49), in which AUC values are typically lower than in studies evaluating diagnostic tests for disease detection. While the NRS was responsive to pain improvement, it was less so for pain worsening. In our study, pain change scores for the improvement group had moderate effect sizes measured by SRM, with magnitudes above the MID. For the worsening group, the NRS showed minimal to small effect sizes, with magnitudes below the MID. The larger SRMs for improvement compared with worsening might be due to the typical improvement of menstrual pain during a menstrual period, with or without treatment. Interestingly, interventional studies on other chronic pain conditions also suggest that self-report pain measures are more responsive to pain improvement than to worsening (15, 47). Additionally, scales for symptoms other than pain, such as fatigue, depression, and anxiety, have also proven better at detecting improvement than worsening (48, 50, 51). Individuals undergoing treatment might expect improvement, making them more attuned to positive changes.
When comparing the NRS with different references (worst, least, average, and current pain), we found that MID estimates were largely consistent. The MID for current pain was slightly larger, likely because the current pain rating had a lower test-retest reliability coefficient, which is used in the SEM calculation. We did not find any statistically significant differences in responsiveness to change across the different references. This finding differed from that of a study of patients with chronic pain undergoing a pain management program, in which current pain was more responsive in detecting pain improvement than least, average, and worst pain measures (52). Menstrual pain is episodic and can vary within a day and across days. Findings from other chronic pain conditions may not be applicable to menstrual pain. In addition, our sample size may have limited our ability to detect statistically significant differences in responsiveness across the different references.
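The two responsiveness statistics discussed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code. The SRM is computed as the mean change divided by the standard deviation of change, and the AUC is computed as the probability that a randomly chosen participant who reported improvement has a larger change score than one who did not, with ties counted as one half. The function names and the change scores (baseline minus follow-up, so positive values mean less pain) are hypothetical.

```python
from statistics import mean, stdev


def standardized_response_mean(changes: list[float]) -> float:
    """SRM = mean change / SD of change (needs at least two scores)."""
    return mean(changes) / stdev(changes)


def auc_from_change_scores(improved: list[float], not_improved: list[float]) -> float:
    """Probability that an improved participant has the larger change score.

    Equivalent to the area under the ROC curve; ties count as 0.5.
    """
    if not improved or not not_improved:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for a in improved:
        for b in not_improved:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(improved) * len(not_improved))


# Hypothetical change scores for illustration only; not data from the study.
improved_group = [3, 2, 1, 2, 0, 4]        # anchor: pain rated as better
not_improved_group = [0, 1, -1, 0, 2, -2]  # anchor: pain rated as same or worse

srm_improved = standardized_response_mean(improved_group)
auc = auc_from_change_scores(improved_group, not_improved_group)
print(f"SRM (improved group): {srm_improved:.2f}")
print(f"AUC: {auc:.2f}")
```

An AUC of 0.5 would mean the change scores do not separate the two anchor groups at all, and 1.0 would mean perfect separation, which is the scale on which the AUC values reported here should be read.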
Our study has several strengths. We used multiple approaches to estimate MID and assess responsiveness, enhancing the robustness of our findings. Additionally, we employed a 24-hour retrospective recall, minimizing recall bias and providing a more reliable assessment of pain changes over a short time frame.
We acknowledge several limitations. First, our analysis was based on data from a single study cohort. Although the literature supports the legitimacy of estimating MID and responsiveness in observational studies and real-world settings (53), further research across multiple cohorts and within clinical trial contexts is needed to validate these findings. Second, in some distinct anchor categories, the sample size was small. We excluded subgroups with fewer than 10 participants from MID estimations, as estimates based on such small numbers would be unreliable (23). Third, our sample likely included participants who used a range of treatment approaches as well as those who did not use any. We did not assess specific interventions for menstrual pain because of the wide variety of pharmacological and non-pharmacological strategies people use for menstrual pain and the variability in their effectiveness across individuals (54, 55). While our findings offer real-world evidence on the NRS's performance, future research should evaluate its responsiveness and MID estimates in different settings to enhance generalizability and clinical relevance. Fourth, at baseline, participants were on days 1–3 of their menstrual cycle. Because of the limited sample size, we did not conduct stratified analyses by cycle day, which could help determine whether MID and responsiveness estimates vary by timing within the cycle. Fifth, although we triangulated distribution- and anchor-based approaches in this study, we did not incorporate qualitative data, such as asking participants what kinds of changes in menstrual pain they perceive as meaningful, which could offer valuable, person-centered insights (56). It is possible that a larger change is needed for participants to perceive it as truly meaningful, and further investigation is warranted.
This study has implications for future research. Additional studies with larger and more diverse samples and in clinical trials are needed to validate our findings and improve their generalizability. Our MID estimates were derived from remotely administered surveys but may be cautiously considered for clinical contexts, with appropriate validation. Comparing 0–10 NRS with other outcome measures and scales, such as the Visual Analog Scale and the pain interference scale, would also be informative. More granular assessments of menstrual pain severity—such as multiple daily ratings using ecological momentary assessment—particularly during the days leading up to and at the onset of menstruation, may better capture pain exacerbation and enhance the evaluation of MID and responsiveness. Additionally, incorporating qualitative data from participants could help clarify what constitutes a minimally important difference from a person-centered perspective.
The findings of this study have implications for clinical practice. Identifying an MID of approximately one point on a 0–10 numerical rating scale provides clinicians with a clear, quantifiable threshold to gauge meaningful improvement in a patient's menstrual pain. This benchmark can be used to assess the efficacy of interventions and inform treatment adjustments. Given that the NRS demonstrated responsiveness to menstrual pain improvement, clinicians can use this tool to track patient progress over time, enabling timely modifications to management strategies.
Conclusion
As the NRS takes integer values between 0 and 10, the MID for menstrual pain severity can be rounded up to 1 point for practical use. The NRS is responsive in detecting menstrual pain improvement. These findings can inform the design and interpretation of studies testing interventions for menstrual pain and guide clinicians in interpreting the magnitude of treatment effects.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Ethics statement
The studies involving humans were approved by Indiana University Institutional Review Board. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants' legal guardians/next of kin in accordance with the national legislation and institutional requirements.
Author contributions
CC: Software, Resources, Funding acquisition, Writing – review & editing, Formal analysis, Writing – original draft, Data curation, Methodology, Visualization, Supervision, Project administration, Conceptualization, Validation, Investigation. JW: Methodology, Writing – review & editing, Writing – original draft, Visualization, Investigation, Formal analysis, Project administration, Validation. CL: Investigation, Writing – review & editing. JP: Investigation, Writing – review & editing. HA: Investigation, Writing – review & editing. LL: Writing – review & editing, Investigation. KK: Investigation, Writing – review & editing, Conceptualization, Methodology.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. Dr. Chen was supported by NIH Grant Numbers KL2 TR002530 and UL1 TR002529 (PI: A. Shekhar) from the National Center for Advancing Translational Sciences of the National Institutes of Health, the EMPOWER Grant (PI: Chen) from Indiana University–Purdue University Indianapolis, and R01HD110994 (PI: Chen) from the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health during the conduct of this study and the preparation of this manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
1. Iacovides S, Avidon I, Baker FC. What we know about primary dysmenorrhea today: a critical review. Hum Reprod Update. (2015) 21(6):762–78. doi: 10.1093/humupd/dmv039
2. Armour M, Parry K, Manohar N, Holmes K, Ferfolja T, Curry C, et al. The prevalence and academic impact of dysmenorrhea in 21,573 young women: a systematic review and meta-analysis. J Women’s Health. (2019) 28(8):1161–71. doi: 10.1089/jwh.2018.7615
3. Li R, Li B, Kreher DA, Benjamin AR, Gubbels A, Smith SM. Association between dysmenorrhea and chronic pain: a systematic review and meta-analysis of population-based studies. Am J Obstet Gynecol. (2020) 223(3):350–71. doi: 10.1016/j.ajog.2020.03.002
4. Vincent K, Warnaby C, Stagg CJ, Moore J, Kennedy S, Tracey I. Dysmenorrhoea is associated with central changes in otherwise healthy women. Pain. (2011) 152(9):1966–75. doi: 10.1016/j.pain.2011.03.029
5. Deodato M, Grosso G, Drago A, Martini M, Dudine E, Murena L, et al. Efficacy of manual therapy and pelvic floor exercises for pain reduction in primary dysmenorrhea: a prospective observational study. J Bodyw Mov Ther. (2023) 36:185–91. doi: 10.1016/j.jbmt.2023.07.002
6. Abdelrahman AY, El-Kosery SM, Abbassy AH, Botla AM. Effect of aquatic exercise versus aerobic exercise on primary dysmenorrhea and quality of life in adolescent females: a randomized controlled trial. Physiother Res Int. (2024) 29(3):e2095. doi: 10.1002/pri.2095
7. Guyatt G, Osoba D, Wu AW, Wyrwich KW, Norman GR. Methods to explain the clinical significance of health status measures. Mayo Clinic Proc. (2002) 77:371–83. doi: 10.4065/77.4.371
8. Revicki D, Hays RD, Cella D, Sloan J. Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. J Clin Epidemiol. (2008) 61(2):102–9. doi: 10.1016/j.jclinepi.2007.03.012
9. Chen CX, Kwekkeboom KL, Ward SE. Self-report pain and symptom measures for primary dysmenorrhoea: a critical review. Eur J Pain. (2015) 19(3):377–91. doi: 10.1002/ejp.556
10. Stjernberg-Salmela S, Karjalainen T, Juurakko J, Toivonen P, Waris E, Taimela S, et al. Minimal important difference and patient acceptable symptom state for the numerical rating scale (NRS) for pain and the patient-rated wrist/hand evaluation (PRWHE) for patients with osteoarthritis at the base of thumb. BMC Med Res Methodol. (2022) 22(1):127. doi: 10.1186/s12874-022-01600-1
11. Farrar JT, Young JP Jr., LaMoreaux L, Werth JL, Poole MR. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain. (2001) 94(2):149–58. doi: 10.1016/s0304-3959(01)00349-9
12. Salaffi F, Stancati A, Silvestri CA, Ciapetti A, Grassi W. Minimal clinically important changes in chronic musculoskeletal pain intensity measured on a numerical rating scale. Eur J Pain. (2004) 8(4):283–91. doi: 10.1016/j.ejpain.2003.09.004
13. de Arruda GT, Driusso P, Rodrigues JC, de Godoy AG, Avila MA. Numerical rating scale for dysmenorrhea-related pain: a clinimetric study. Gynecol Endocrinol. (2022) 38(8):661–5. doi: 10.1080/09513590.2022.2099831
14. Nguyen AM, Arbuckle R, Korver T, Chen F, Taylor B, Turnbull A, et al. Psychometric validation of the dysmenorrhea daily diary (DysDD): a patient-reported outcome for dysmenorrhea. Qual Life Res. (2017) 26(8):2041–55. doi: 10.1007/s11136-017-1562-0
15. Pokrzywinski RM, Soliman AM, Snabes MC, Chen J, Taylor HS, Coyne KS. Responsiveness and thresholds for clinically meaningful changes in worst pain numerical rating scale for dysmenorrhea and nonmenstrual pelvic pain in women with moderate to severe endometriosis. Fertil Steril. (2021) 115(2):423–30. doi: 10.1016/j.fertnstert.2020.07.013
16. Ferreira-Valente MA, Pais-Ribeiro JL, Jensen MP. Validity of four pain intensity rating scales. Pain. (2011) 152(10):2399–404. doi: 10.1016/j.pain.2011.07.005
17. Jukic AM, Weinberg CR, Baird DD, Hornsby PP, Wilcox AJ. Measuring menstrual discomfort: a comparison of interview and diary data. Epidemiology. (2008) 19(6):846–50. doi: 10.1097/EDE.0b013e318187ac9e
18. Iacovides S, Baker FC, Avidon I. The 24-h progression of menstrual pain in women with primary dysmenorrhea when given diclofenac potassium: a randomized, double-blinded, placebo-controlled crossover study. Arch Gynecol Obstet. (2014) 289(5):993–1002. doi: 10.1007/s00404-013-3073-8
19. Chen CX, Murphy T, Ofner S, Yahng L, Krombach P, LaPradd M, et al. Development and testing of the dysmenorrhea symptom interference (DSI) scale. West J Nurs Res. (2021) 43(4):364–73. doi: 10.1177/0193945920942252
20. Kennedy I. Sample size determination in test-retest and Cronbach alpha reliability estimates. Br J Contemp Educ. (2022) 2(1):17–29. doi: 10.52589/BJCE-FY266HK9
21. de Arruda GT, Driusso P, de Godoy AG, de Sousa AP, Avila MA. Measurement properties of patient-reported outcome measures for women with dysmenorrhea: a systematic review. J Clin Nurs. (2024) 33(11):4167–83. doi: 10.1111/jocn.17293
22. Mantovan SGM, de Arruda GT, Da Roza T, da Silva BI, Avila MA, da Luz SCT. Translation, cross-cultural adaptation, and measurement properties of the dysmenorrhea symptom interference (DSI) scale-Brazilian version. Braz J Phys Ther. (2024) 28(3):101065. doi: 10.1016/j.bjpt.2024.101065
23. Yost KJ, Eton DT, Garcia SF, Cella D. Minimally important differences were estimated for six patient-reported outcomes measurement information system-cancer scales in advanced-stage cancer patients. J Clin Epidemiol. (2011) 64(5):507–16. doi: 10.1016/j.jclinepi.2010.11.018
24. Chen CX, Kroenke K, Stump TE, Kean J, Carpenter JS, Krebs EE, et al. Estimating minimally important differences for the PROMIS pain interference scales: results from 3 randomized clinical trials. Pain. (2018) 159(4):775–82. doi: 10.1097/j.pain.0000000000001121
25. Dworkin RH, Turk DC, Wyrwich KW, Beaton D, Cleeland CS, Farrar JT, et al. Interpreting the clinical importance of treatment outcomes in chronic pain clinical trials: IMMPACT recommendations. J Pain. (2008) 9(2):105–21. doi: 10.1016/j.jpain.2007.09.005
26. de Vet HC, Terwee CB, Ostelo RW, Beckerman H, Knol DL, Bouter LM. Minimal changes in health status questionnaires: distinction between minimally detectable change and minimally important change. Health Qual Life Outcomes. (2006) 4:54. doi: 10.1186/1477-7525-4-54
27. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Lawrence Erlbaum Associates (1988).
28. Kroenke K, Stump TE, Chen CX, Kean J, Bair MJ, Damush TM, et al. Minimally important differences and severity thresholds are estimated for the PROMIS depression scales from three randomized clinical trials. J Affect Disord. (2020) 266:100–8. doi: 10.1016/j.jad.2020.01.101
29. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika. (1951) 16:297–334. doi: 10.1007/BF02310555
30. Tighe J, McManus IC, Dewhurst NG, Chis L, Mucklow J. The standard error of measurement is a more appropriate measure of quality for postgraduate medical assessments than is reliability: an analysis of MRCP(UK) examinations. BMC Med Educ. (2010) 10:40. doi: 10.1186/1472-6920-10-40
31. Harvill LM. Standard error of measurement. Educ Meas: Issues Pract. (1991) 10(2):33–41. doi: 10.1111/j.1745-3992.1991.tb00195.x
32. Wyrwich K, Nienaber N, Tierney W, Wolinsky F. Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med Care. (1999) 37(5):469–78. doi: 10.1097/00005650-199905000-00006
33. Wyrwich KW, Tierney WM, Wolinsky FD. Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. J Clin Epidemiol. (1999) 52(9):861–73. doi: 10.1016/S0895-4356(99)00071-2
34. Wyrwich KW. Minimal important difference thresholds and the standard error of measurement: is there a connection? J Biopharm Stat. (2007) 14(1):97–110. doi: 10.1081/BIP-120028508
35. Franceschini M, Boffa A, Pignotti E, Andriolo L, Zaffagnini S, Filardo G. The minimal clinically important difference changes greatly based on the different calculation methods. Am J Sports Med. (2023) 51(4):1067–73. doi: 10.1177/03635465231152484
36. Ousmen A, Touraine C, Deliu N, Cottone F, Bonnetain F, Efficace F, et al. Distribution- and anchor-based methods to determine the minimally important difference on patient-reported outcome questionnaires in oncology: a structured review. Health Qual Life Outcomes. (2018) 16(1):228. doi: 10.1186/s12955-018-1055-z
37. Puth MT, Neuhäuser M, Ruxton GD. On the variety of methods for calculating confidence intervals by bootstrapping. J Anim Ecol. (2015) 84(4):892–7. doi: 10.1111/1365-2656.12382
38. Devji T, Carrasco-Labra A, Qasim A, Phillips M, Johnston BC, Devasenapathy N, et al. Evaluating the credibility of anchor based estimates of minimal important differences for patient reported outcomes: instrument development and reliability study. Br Med J. (2020) 369:m1714. doi: 10.1136/bmj.m1714
39. Trigg A, Griffiths P. Triangulation of multiple meaningful change thresholds for patient-reported outcome scores. Qual Life Res. (2021) 30(10):2755–64. doi: 10.1007/s11136-021-02957-4
40. Middel B, van Sonderen E. Statistical significant change versus relevant or important change in (quasi) experimental design: some conceptual and methodological problems in estimating magnitude of intervention-related change in health services research. Int J Integr Care. (2002) 2:e15. doi: 10.5334/ijic.65
41. Askew RL, Cook KF, Revicki DA, Cella D, Amtmann D. Clinical validity of PROMIS (®) pain interference and pain behavior in diverse clinical populations. J Clin Epidemiol. (2016) 73:103. doi: 10.1016/j.jclinepi.2015.08.035
42. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. (1988) 44(3):837–45. doi: 10.2307/2531595
43. Kroenke K, Theobald D, Wu J, Tu W, Krebs EE. Comparative responsiveness of pain measures in cancer patients. J Pain. (2012) 13(8):764–72. doi: 10.1016/j.jpain.2012.05.004
44. Nahm FS. Receiver operating characteristic curve: overview and practical use for clinicians. Korean J Anesthesiol. (2022) 75(1):25–36. doi: 10.4097/kja.21209
45. Suzuki H, Aono S, Inoue S, Imajo Y, Nishida N, Funaba M, et al. Clinically significant changes in pain along the pain intensity numerical rating scale in patients with chronic low back pain. PLoS One. (2020) 15(3):e0229228. doi: 10.1371/journal.pone.0229228
46. McDonagh MS, Selph SS, Buckley DI, Holmes RS, Mauer K, Ramirez S, et al. AHRQ Comparative effectiveness reviews. In: Agency for Healthcare Research and Quality, editors. Nonopioid Pharmacologic Treatments for Chronic Pain. Rockville (MD): Agency for Healthcare Research and Quality (US) (2020). p. 9–34.
47. Chen CX, Kroenke K, Stump T, Kean J, Krebs EE, Bair MJ, et al. Comparative responsiveness of the PROMIS pain interference short forms with legacy pain measures: results from three randomized clinical trials. J Pain. (2019) 20(6):664–75. doi: 10.1016/j.jpain.2018.11.010
48. Kroenke K, Stump TE, Chen CX, Kean J, Damush TM, Bair MJ, et al. Responsiveness of PROMIS and patient health questionnaire (PHQ) depression scales in three clinical trials. Health Qual Life Outcomes. (2021) 19(1):41. doi: 10.1186/s12955-021-01674-3
49. Kean J, Monahan PO, Kroenke K, Wu J, Yu Z, Stump TE, et al. Comparative responsiveness of the PROMIS pain interference short forms, brief pain inventory, PEG, and SF-36 bodily pain subscale. Med Care. (2016) 54(4):414–21. doi: 10.1097/mlr.0000000000000497
50. Mosher CE, Secinti E, Johns SA, Kroenke K, Rogers LQ. Comparative responsiveness and minimally important difference of fatigue symptom inventory (FSI) scales and the FSI-3 in trials with cancer survivors. J Patient Rep Outcomes. (2022) 6(1):82. doi: 10.1186/s41687-022-00488-1
51. Kroenke K, Baye F, Lourens SG. Comparative responsiveness and minimally important difference of common anxiety measures. Med Care. (2019) 57(11):890–7. doi: 10.1097/mlr.0000000000001185
52. Chien CW, Bagraith KS, Khan A, Deen M, Strong J. Comparative responsiveness of verbal and numerical rating scales to measure pain intensity in patients with chronic pain. J Pain. (2013) 14(12):1653–62. doi: 10.1016/j.jpain.2013.08.006
53. Revicki DA, Cella D, Hays RD, Sloan JA, Lenderking WR, Aaronson NK. Responsiveness and minimal important differences for patient reported outcomes. Health Qual Life Outcomes. (2006) 4(1):1–5. doi: 10.1186/1477-7525-4-70
54. Chen CX, Carpenter JS, LaPradd M, Ofner S, Fortenberry JD. Perceived ineffectiveness of pharmacological treatments for dysmenorrhea. J Womens Health (Larchmt). (2020) 30(9):1334–43. doi: 10.1089/jwh.2020.8581
55. Chen CX, Kwekkeboom KL, Ward SE. Beliefs about dysmenorrhea and their relationship to self-management. Res Nurs Health. (2016) 39(4):263–76. doi: 10.1002/nur.21726
Keywords: dysmenorrhea, pelvic pain, psychometrics, patient reported outcome measures, minimally important clinical difference
Citation: Chen CX, Wu J, Lee C, Park J, Ahn H, Lin L and Kroenke K (2025) Minimally important difference and responsiveness to change for numerical rating scale of menstrual pain severity: a psychometric study. Front. Pain Res. 6:1655464. doi: 10.3389/fpain.2025.1655464
Received: 27 June 2025; Revised: 15 November 2025;
Accepted: 24 November 2025;
Published: 15 December 2025.
Edited by:
Karin Meissner, Hochschule Coburg, Germany
Reviewed by:
Manuela Deodato, University of Trieste, Italy
Matthew J. Kmiecik, 23andMe Research Institute, United States
Copyright: © 2025 Chen, Wu, Lee, Park, Ahn, Lin and Kroenke. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Chen X. Chen, cxchen@arizona.edu
Chiyoung Lee1