Repeated measures of implementation variables

It is commonly acknowledged that implementation work is long-term and contextual in nature and often takes years to accomplish. Repeated measures are needed to study the trajectory of implementation variables over time. To be useful in typical practice settings, measures that are relevant, sensitive, consequential, and practical are needed to inform planning and action. If implementation independent variables and implementation dependent variables are to contribute to a science of implementation, then measures that meet these criteria must be established. This exploratory review was undertaken to "see what is being done" to evaluate implementation variables and processes repeatedly in situations where achieving outcomes was the goal (i.e., more likely to be consequential). In the review, no judgement was made about the adequacy of a measure (e.g., its psychometric properties). The search process resulted in 32 articles that met the criteria for a repeated measure of an implementation variable. Twenty-three different implementation variables were the subject of repeated measures. The broad spectrum of implementation variables identified in the review included innovation fidelity, sustainability, organization change, and scaling, along with training, implementation teams, and implementation fidelity. Given the long-term complexities involved in providing implementation supports to achieve the full and effective use of innovations, repeated measurements of relevant variables are needed to promote a more complete understanding of implementation processes and outcomes. Longitudinal studies employing repeated measures that are relevant, sensitive, consequential, and practical should become common if the complexities involved in implementation are to be understood.


Introduction
Measurement of implementation variables in practice has been a challenge because of the complexities in human service environments, the novelties encountered in different domains (e.g., education, child welfare, global public health, pharmacy), and the ongoing development of implementation as a profession and as a science.
Greenhalgh et al. (1) conducted an extensive review of the diffusion and dissemination literature. They reflected a commonly held view when they concluded: "Context and 'confounders' lie at the very heart of the diffusion, dissemination, and implementation of complex innovations. They are not extraneous to the object of study; they are an integral part of it. The multiple (and often unpredictable) interactions that arise in particular contexts and settings are precisely what determine the success or failure of a dissemination initiative." For a science of implementation to develop, measures of implementation-specific independent and dependent variables must be established and used in multiple studies. Those variables and measures must be usable across the "multiple (and often unpredictable)" situations Greenhalgh et al. described.
Implementation is viewed by many as a process that takes time and planned activities at multiple levels so that innovations can be used fully and effectively and at scale (2). Yet, studies labeled as "implementation science" predominantly use unique measures and one-time assessments of something of interest to the investigator (3,4). Currently, avid readers of the "implementation science" literature are hard-pressed to find a measure of an implementation-specific independent or dependent variable. Even when one is found, one data point at one point in time is a poor fit with the complexity of implementation as described in the literature. For example, Allen et al. (4) reviewed the literature related to the "inner setting" of organizations as defined by the Consolidated Framework for Implementation Research (CFIR). Consistent with previous findings from a review and synthesis of the implementation evaluation literature (3), Allen et al. found only one measure that was used in more than one study and noted that the definitions of constructs with the same name varied widely across the measures.
Repeated measures of multiple variables are needed to match the complexity of the practice, organization, and system environments in which changes must occur to support the full and effective uses of innovations in practice. Researchers have documented the years typically required to accomplish implementation goals (5, 6) even when skilled implementation teams are available (7). To advance a science of implementation, repeated measures and methods must be well suited to cope with research in applied settings where there are too few cases, too many variables, and too little control over multi-level variables that may impact outcomes (8,9). Implementation specialists and researchers who are doing the work of implementation and studying the results over time continually deal with complexity and confounders to accomplish their implementation practice and science aims (10). In their description of implementation practice and science, Fixsen et al. (10, chapter 16) proposed criteria for "action evaluation" measures that can be used to inform action planning and monitor progress toward full and effective use of innovations. Action evaluation measures are: (1) relevant and include items that are indicators of key leverage points for improving practices, organization routines, and system functioning, (2) sensitive to changes in capacity to perform with scores that increase as capacity is developed and decrease when setbacks occur, (3) consequential in that the items are important to the respondents and users and scores inform prompt action planning; repeated assessments each year monitor progress of action planning as capacity develops and outcomes are produced, and (4) practical with modest time required to learn how to administer assessments with fidelity to the protocol, and modest time required of staff to respond to rate the items or prepare for an observation visit.
While the lack of assessment of psychometric properties has been cited as a deficiency (11-13), what is missing from nearly all of the existing implementation-related measures is a test of consequential validity (14): evidence that the variable under study, and the information generated by the measure of the variable, is highly related to using an innovation with fidelity and to producing intended outcomes to benefit a population of recipients. Given that implementation practice and science are mission-driven (15), consequential validity is an essential test of any measure, an approach that favors external validity over internal validity (16,17). Galea (18), working in a health context, stated the problem and the solution clearly: "A consequentialist approach is centrally concerned with maximizing desired outcomes, and a consequentialist epidemiology would be centrally concerned with improving health outcomes. We would be much more concerned with maximizing the good that can be achieved by our studies and by our approaches than we are by our approaches themselves. A consequentialist epidemiology inducts new trainees not around canonical learning but rather around our goals. Our purpose would be defined around health optimization and disease reduction, with our methods as tools, convenient only insofar as they help us get there. Therefore, our papers would emphasize our outcomes with the intention of identifying how we may improve them."
By thinking of "our methods as tools, convenient only insofar as they help us get there," psychometric properties may be the last concern, not the first (and too often, only) question to be answered. The consequential validity question is "so what?" Once there is a measure of a variable, it is incumbent on the researcher (the measure developer) to provide data that demonstrate how knowing that information "helps us get there." Once a measure of a variable has demonstrated consequential validity, then it is worth investing in establishing its psychometric properties to fine-tune the measure. It is worth it because the variable matters and the measure detects its presence and strength.
This exploratory review was undertaken to "see what is being done" to measure implementation variables and processes in situations where achieving outcomes was the goal (i.e., more likely to be consequential). The interest is in measures that are relevant, sensitive, consequential, and practical. In particular, given the long-term and contextual nature of implementation work that often takes years to accomplish, the search is for studies that have used repeated measures to study the trajectory of implementation variables over time. For this review, a measure that has been used more than once in a study is an indication that the measure is relevant to the variable under study, sensitive to change in the variable from one data point to the next, consequential with respect to informing planning for change, and practical by virtue of being able to be used two or more times to study a variable.

Materials and methods
The review was conducted within the Active Implementation Research Network (AIRN) EndNotes database. The AIRN EndNotes database contained 3,950 references (as of March 20, 2021) that pertain to implementation, with a bias toward implementation evaluation and quantitative data articles. The oldest reference relates to detecting and evaluating the core components of independent variables (19). The most recent article describes over 10 years of work to scale up innovations in a large state system (20).
The AIRN EndNotes database was initiated in 2003 by entering citations from the boxes of paper copies of articles accumulated by the authors since 1971, the year of the first implementation failure experienced by the first author (21). Beginning in 2003, AIRN EndNotes was expanded with references produced from literature searches conducted through university library services (3). Since 2006, articles routinely have been added through Google Scholar searches. Weekly lists of articles identified with the implementation-related search terms are scanned, and relevant citations, abstracts (when available), and PDFs (when available) are downloaded into AIRN EndNotes. For inclusion in the database, articles reporting quantitative data are favored over qualitative studies or opinion pieces. Reflecting the universal relevance of implementation factors, the database includes a wide variety of articles from multiple fields and many points of view. About two-thirds of the articles in AIRN EndNotes were published in 2000-2021.
The majority of articles in AIRN EndNotes published since 2000 include the Abstract and/or a PDF, and the full text of about half of all the articles has been reviewed by the authors and their colleagues. The reviewer of an article enters information in the "Notes" section of EndNotes regarding concepts and terms that relate to the evidence-based Active Implementation Frameworks as they are defined, operationalized, evaluated, and revised (3, 7, 15, 22-27). Given the lack of clarity in the implementation field, the Notes provide common concepts and common language that facilitate searches of the AIRN EndNotes database.
For this study, the AIRN EndNotes database was searched for articles that used repeated measures of one or more implementation variables. Using the search function in EndNotes, "Any Field" (i.e., title, abstract, keywords, notes) was searched using the word "measure" combined with the term "repeated," "longitudinal," "pattern," or "stepped wedge." The search returned 260 references. Because searches of the literature were less systematic in the pre-internet days, references published prior to the year 2000 were eliminated, leaving 213 references. The title and abstract of each of the 213 articles were reviewed, and those with apparent repeated measures of an implementation variable were retained (n = 58).
The full text of the remaining 58 references was reviewed. For the full text review, "repeated" was defined as two or more uses and "measure" was defined as the same method (observation, record review, survey, systematic interview) with the same content used at Time 1, Time 2, etc. No judgement was made about the adequacy of the measure or time frames. Thus, psychometric properties of a measure were not considered in the review. "Implementation" was defined as any specific support (e.g., training, coaching, leadership, organization changes) for the full and effective use of an identified innovation.
The full text review eliminated 26 articles. The reasons for elimination are provided in Table 1. For example, 13 articles were eliminated because the repeated measure concerned an intervention and not an implementation variable and 7 were eliminated because the same measure was not repeated from one time period to the next.
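The screening funnel described above can be expressed as a simple sequence of filters. The sketch below uses invented placeholder records and field names (`year`, `title_abstract_ok`, `full_text_ok`); it illustrates only the funnel arithmetic (260 to 213 to 58 to 32) and is not the actual AIRN EndNotes data or search procedure.

```python
# Hypothetical sketch of the screening funnel; records and field names
# are invented for illustration and do not represent the real database.

def screen(records):
    """Apply the three screening steps and return the count at each stage."""
    stage0 = records                                          # raw search hits
    stage1 = [r for r in stage0 if r["year"] >= 2000]         # drop pre-2000 references
    stage2 = [r for r in stage1 if r["title_abstract_ok"]]    # title/abstract review
    stage3 = [r for r in stage2 if r["full_text_ok"]]         # full-text review
    return [len(s) for s in (stage0, stage1, stage2, stage3)]

# Toy data shaped to mirror the review's reported counts (not real records).
records = (
    [{"year": 1995, "title_abstract_ok": False, "full_text_ok": False}] * 47
    + [{"year": 2005, "title_abstract_ok": False, "full_text_ok": False}] * 155
    + [{"year": 2010, "title_abstract_ok": True, "full_text_ok": False}] * 26
    + [{"year": 2015, "title_abstract_ok": True, "full_text_ok": True}] * 32
)

counts = screen(records)
print(counts)  # [260, 213, 58, 32]
```

Each stage keeps a strict subset of the previous stage, which matches the review's report: 47 pre-2000 references removed, 155 excluded at title/abstract, and 26 excluded at full text.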

Results
The search process resulted in 32 articles that met the criteria for a "repeated" "measure" of "implementation" variables: in 17 articles two or more variables were measured and in 15 articles one variable was measured. Fourteen of the articles were published in 2000-2010 and 18 were published in 2011-2019.
As noted in Table 2, 23 different implementation variables were the subject of repeated measures. In Table 2 the Implementation Variable names are grouped using the Active Implementation Frameworks as a guide (10). The broad spectrum of implementation variables included innovation fidelity (assessed in 17 articles), sustainability (assessed in 8 articles), organization change (assessed in 6 articles), and scaling (assessed in 5 articles). Training, implementation teams, and implementation fidelity were the subject of 2 articles each. Table 3 details the individual articles, the assessments they reported, and the implementation variables that were studied.

Discussion
It is heartening to see the breadth of implementation-specific variables that have been measured two or more times in one or more studies. Given the long-term complexities involved in providing implementation supports to achieve the full and effective use of innovations, repeated measurements of relevant variables are needed to promote a more complete understanding of implementation processes and outcomes. Yet, this exploratory review found few examples in the literature.
It is discouraging to see so few articles reporting repeated measures. The review found only 32 articles among the 3,950 mostly implementation evaluation articles in the database, an indicator of what must be done to advance the field. Implementation practice and science would be well served by investing in using and improving the measures identified in this review. The measures already have been developed and used in practice and appear to be relevant (they assess the presence and strength of an implementation-specific variable), sensitive (results showed change from one administration to the next), and practical (able to be administered repeatedly in practice). The field would benefit by using these measures in a variety of studies to establish more fully their consequential validity (does the variable being assessed impact the use and effectiveness of innovations?). Meeting the criteria for action evaluations is a good start for the development of any measure. As found in this study, there are good examples of repeated measures of implementation-specific variables in complex settings. Szulanski and Jensen (43) studied innovation fidelity and outcomes for over 3,500 franchises, each with 16 data points spanning 20 years, a total of 56,000 fidelity assessments; the results showed detrimental outcomes associated with lower fidelity in the early years and improved outcomes associated with lower fidelity after year 17. McIntosh et al. (35) studied innovation fidelity in 5,331 schools for 5 years, a total of 26,655 fidelity assessments that allowed the researchers to detect patterns in achieving and sustaining fidelity of the use of an evidence-based program. For 10 years, Fixsen and Blase (41) studied innovation fidelity every six months for practitioners in 41 residential treatment units, a total of 820 fidelity assessments that detected positive trends among new hires as the implementation supports in the organization matured. Datta et al. (32) collected data for two years with 41 data points to track the effectiveness of continual attempts to produce improved outcomes for neonates admitted to the neonatal intensive care unit.
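The scale of these datasets follows directly from multiplying units by assessment waves. As a quick check of the totals cited above (the Fixsen and Blase figure assumes 20 semi-annual waves over 10 years, consistent with "every six months for 10 years"):

```python
# Verify the repeated-measures totals cited above: units x assessment waves.
studies = {
    "Szulanski & Jensen (43)": (3500, 16),  # franchises x data points over 20 years
    "McIntosh et al. (35)":    (5331, 5),   # schools x annual assessments
    "Fixsen & Blase (41)":     (41, 20),    # units x semi-annual waves over 10 years
}

totals = {name: units * waves for name, (units, waves) in studies.items()}
for name, total in totals.items():
    print(f"{name}: {total} assessments")
```

The products reproduce the figures in the text: 56,000, 26,655, and 820 assessments respectively.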
Innovation fidelity also has been assessed at an organization level. McGovern et al. (45) developed the Dual Diagnosis Capability in Addiction Treatment (DDCAT) to assess the dual diagnosis (substance abuse and mental health) capability of addiction treatment services. DDCAT items assess seven dimensions of dual diagnosis capability, the seventh being Training. Organization dual diagnosis treatment capacity was measured at baseline and at 9-month follow-up in a cohort of 16 addiction treatment programs (32 data points); the study found that assessment, feedback, training, and implementation support were most effective in changing organization capacity. The DDCAT has been used in other studies by different authors to assess capacity (33,47).
In these and other examples cited in this article, the measures of implementation variables are meaningful (relevant) and are repeated (practical) so that trends (sensitive) can be detected and corrected (if needed). Consequential validity was reported in these examples and in other articles (e.g., 43,48,49) and requires further study.
Innovation fidelity (n = 17) was the most frequent repeated measure. Innovation fidelity always is specific to the innovation under consideration. Implementation fidelity, on the other hand, refers to implementation-specific variables being used as intended. A science of implementation and assessments of implementation fidelity are intended to be universal. For example, the drivers best practices assessment (DBPA; 59, 60) measures the presence and strength of the implementation drivers (10, 15, 26, 27) that relate to (a) competency (selection, training, coaching, fidelity), (b) organization (facilitative administration, decision support data system, system intervention), and (c) leadership (technical, adaptive). As shown in Table 2, over half of the measures reported in the articles (n = 30) assessed one or more variables related to the implementation drivers. The DBPA has been used to assess implementation fidelity in a variety of settings and organizations, demonstrating a strong association with intended uses of evidence-based programs (61-64). As action measures are used in practice, the statistical (psychometric) properties can be assessed (61, 65).
These longitudinal studies are not typical, but they should be. One-time, before-and-after, or short-term assessments may be interesting but add little to the science of implementation. To do something once or even a few times is interesting. To be able to do something repeatedly, with useful outcomes and documented improvements over decades, will produce socially significant benefits for whole populations (66). Data on the processes of implementation over time are badly needed.
There is much to be done to establish a science of implementation that has useful and reliable measures of implementation-specific independent (if…) and dependent (then…) variables. Implementation theory (67-69) can become the source of predictions (if…then) that can be tested in practice. In this way, like any science, a science of implementation can be cumulative and "crowdsourced" globally as theory-based predictions are tested and theory itself is improved or discarded.

Article | Repeated measure | Implementation variable
Strand et al. (28) | Each of the 6 sites selected an implementation team to carry out the initiative. Measures developed for this project included a key informant interview to assist in agency selection and an Organizational Readiness Assessment (ORA). The ORA was used across sites every 6 months. The eight ORA factors included three that aligned with the Organization driver, two that aligned with the Competency driver, two that aligned with the Leadership driver, and one stand-alone factor, Attitude Toward Evidence-based Treatment, that consisted of one item. | Competency Drivers; Organization Drivers; Leadership Drivers; Attitude toward EBPs

Panzano et al. (29) | A longitudinal study designed to collect a range of interview, survey, and implementation outcome data in 91 agencies and relate the data from earlier stages to later stages. At 9-month intervals Panzano and colleagues followed a group of 91 agencies that had committed to and were funded to use one of several evidence-based programs in a state mental health system. The 91 agencies engaged in Exploration and Installation activities but 44 (48%) never used a selected program (i.e., did not reach the Initial Implementation Stage).

Hoekstra et al. (38) | Evidence-informed physical activity promotion program in Dutch rehabilitation care. Fidelity scores were calculated based on annual surveys filled in by involved professionals (n ≈ 70). Fidelity scores of 17 organizations at three different time points. Three trajectories were identified: "stable high fidelity" (n = 9), "moderate and improving fidelity" (n = 6), and "unstable fidelity" (n = 2). | Innovation fidelity

Chinman et al. (39) | A 2-year implementation support intervention comparing 15 Boys and Girls Club sites implementing CHOICE (control group), a five-session evidence-based alcohol and drug prevention program, with 14 similar sites implementing CHOICE supported by GTO (intervention group). Fidelity (adherence, quality of delivery, dosage) was assessed at all sites by observer ratings and attendance logs. Proximal outcomes of the youth participants (aged 10-14), attitudes and intentions regarding cigarette, alcohol, and marijuana use, were assessed via survey at baseline, 3, and 6 months. | Innovation fidelity

Rahman et al. (40) | In the first 3 months, functional water seals were detected in 33% (14/42); qualitative investigations determined that households concurrently used their own latrines with broken water seals in parallel with those provided by the trial. In consultation with the households, the authors closed pre-existing latrines without water seals, increased the CHWs' visit frequency to encourage correct maintenance of latrines with water seals, and discouraged water-seal removal or breakage. At the sixth assessment, 86% of households in the sanitation-only arm, 92% in the combined WSH arm, and 93% in the combined WSH and Nutrition arm had latrines with functional water seals. | Innovation fidelity
Fixsen and Blase (41) | Innovation fidelity for practitioners in 41 units assessed every six months for 10 years. Practitioners employed for 2 years or more remained at high fidelity each year even with turnover in staffing. Over 10 years, repeated measures noted substantial improvements for practitioners in the newly hired 0-6 month group and the 7-12 month group at each data point. The fidelity scores for these less experienced practitioners increased over 10 years from around 3 to over 5 on a 7-point scale. | Innovation fidelity

In the setting of an intervention to increase preventive service delivery (PSD), the authors assessed practice capacity for change by rating motivation to change and instrumental ability to change on a one to four scale. After combining these ratings into a single score, random effects models tested its association with change in PSD rates from baseline to immediately after intervention completion and 12 months later.

Limitations
In the current study, the AIRN EndNotes database provided a convenience sample for the search that was conducted. Thus, the results of the search offer an indication regarding repeated measures of implementation variables. An exhaustive search of all available sources may produce a different view of the field.

Data availability statement
The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author/s.

Ethics statement
Ethical review and approval was not required for this study in accordance with the local legislation and institutional requirements.

Author contributions
All authors contributed to the article and approved the submitted version.

Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Ryan Jackson et al. (55) | Measuring implementation capacity at every level of the system for full and effective use of a practice that benefits all students is critical to alignment and cohesion of implementation efforts. Over 40 months, capacity was measured every 6 months using the State Capacity Assessment (SCA).

Fixsen et al. (70) | Assessed the number of attempted replications of an evidence-based program over 10 years. Proximity discriminated early failures from successes (the group homes closer to the training staff got more personal, on-site observation and feedback). Given this, the focus shifted to developing Teaching-Family Sites instead of individual group homes. Longer-term data showed that this had a large impact on sustainability (about 17% of the individual homes lasted 6+ years while over 80% of the group homes associated with a Teaching-Family Site lasted 6+ years). | Scaling; Sustainability

Massatti et al. (57) | IDARP is a longitudinal study with data gathered at approximately 9-month intervals. This analysis uses data gathered from the first three contact points. To collect information at each interval, a trained two-person team conducted confidential semi-structured interviews with multiple key informants. For organizations that discontinued their chosen evidence-based program, researchers asked key informants during the open-ended portion of the interview to provide the primary reasons for de-adoption.