Comparing the Psychometric Properties of Two Physical Activity Self-Efficacy Instruments in Urban, Adolescent Girls: Validity, Measurement Invariance, and Reliability

Aims: This study compared the psychometric properties of two self-efficacy instruments related to physical activity. Factorial validity, cross-group and longitudinal invariance, and composite reliability were examined. Methods: Secondary analysis was conducted on data from a group randomized controlled trial investigating the effect of a 17-week intervention on increasing moderate to vigorous physical activity among 5th–8th grade girls (N = 1,012). Participants completed a 6-item Physical Activity Self-Efficacy Scale (PASE) and a 7-item Self-Efficacy for Exercise Behaviors Scale (SEEB) at baseline and post-intervention. Confirmatory factor analyses for intervention and control groups were conducted with Mplus Version 7.4 using robust weighted least squares estimation. Model fit was evaluated with the chi-square index, comparative fit index, and root mean square error of approximation. Composite reliability for latent factors with ordinal indicators was computed from Mplus output using SAS 9.3. Results: Mean age of the girls was 12.2 years (SD = 0.96). One-third of the girls were obese. Girls represented a diverse sample with over 50% indicating black race and an additional 19% identifying as mixed or other race. Both instruments demonstrated configural invariance for simultaneous analysis of cross-group and longitudinal invariance based on alternative fit indices. However, simultaneous metric invariance was not met for the PASE or the SEEB instruments. Partial metric invariance for the simultaneous analysis was achieved for the PASE with one factor loading identified as non-invariant. Partial metric invariance was not met for the SEEB. Longitudinal scalar invariance was achieved for both instruments in the control group but not the intervention group. Composite reliability for the PASE ranged from 0.772 to 0.842. Reliability for the SEEB ranged from 0.719 to 0.800 indicating higher reliability for the PASE. 
Reliability was more stable over time in the control group for both instruments. Conclusions: Results suggest that the intervention influenced how girls responded to indicator items. Neither of the instruments achieved simultaneous metric invariance making it difficult to assess mean differences in PA self-efficacy between groups.


INTRODUCTION
Despite the benefits of physical activity (PA), fewer than 25% of adolescents meet recommended guidelines (Fakhouri et al., 2014; Kann et al., 2016) calling for 60 min or more per day of at least moderate-intensity PA (U.S. Department of Health and Human Services, 2008). Compared to boys, girls attain less PA and have greater declines in the behavior throughout adolescence (Dumith et al., 2011). Increased understanding of factors underlying this occurrence among adolescent girls is urgently needed, particularly given that most interventions have not increased PA in this population (Camacho-Minano et al., 2011).
While many psychosocial factors have been theorized to increase PA among adolescents, self-efficacy is an important correlate and determinant of PA (Craggs et al., 2011; Bauman et al., 2012) and mediator of PA intervention effects (Lubans et al., 2008). However, some researchers have reported contradictory findings regarding the relationship between self-efficacy and adolescent PA, suggesting that inadequate and varied measurement of the concept may explain the inconsistencies (Dewar et al., 2013; Plotnikoff et al., 2013).
Establishing multi-group and longitudinal invariance of PA self-efficacy instruments in intervention studies is necessary to demonstrate that the same construct is measured over time with the same metric (Widaman et al., 2010). Unfortunately, psychometric assessment of PA self-efficacy instruments is rarely reported in the literature (Brown et al., 2009). This study aims to fill this gap by comparing the factorial validity, measurement invariance, and reliability of two PA self-efficacy instruments used in a large-scale study to test a PA intervention with urban, adolescent girls. The results may contribute to better understanding of PA self-efficacy and its role in fostering PA among girls.
The concept of PA self-efficacy has its origins within social cognitive theory (Bandura, 1986). Bandura (1997, p. 2) defines self-efficacy as 'the belief in one's capabilities to organize and execute courses of action required to produce given attainments.' When this definition is applied to PA, self-efficacy is defined as a belief in one's capability to participate in PA and to choose PA over existing barriers (Voskuil and Robbins, 2015). PA self-efficacy has been incorporated into several health behavior models used to develop theory-based interventions and explain PA in adolescents.
While several studies have included PA self-efficacy as a key construct in theory-based interventions with youth (Bauman et al., 2012), psychometric assessment of PA self-efficacy instruments has been insufficient. Brown et al. (2009) examined reporting of PA self-efficacy instrument validity and reliability in 15 studies. They noted that while the majority of studies included acceptable internal consistency with Cronbach's alpha (n = 12), fewer than half reported reliability over time (n = 7), only two reported acceptable factor analyses, and none reported criterion validity. Furthermore, psychometric assessment of PA self-efficacy instruments utilized with specific populations, such as girls, has been limited (Dewar et al., 2013).
This study examines the potential for an intervention to alter the way in which participants understand and respond to a PA self-efficacy instrument. In the Trial of Activity for Adolescent Girls (TAAG), Lytle et al. (2009) reported that girls in the intervention group had lower self-efficacy scores at the end of the study compared to girls in the control group. These authors hypothesized that exposure to the intervention likely heightened girls' awareness of their difficulties related to PA. Dunton et al. (2007) also reported declines in scores for PA self-efficacy among an intervention group of adolescent girls. Other researchers have noted similar findings for self-efficacy among youth, reporting lower PA self-efficacy after exposure to an intervention (Haerens et al., 2008; Bergh et al., 2012). Reporting mean differences in PA self-efficacy between an intervention and control group may be inaccurate if researchers assume measurement invariance without confirming it through invariance testing (Dishman et al., 2010).
Few studies have demonstrated support for the factorial validity and measurement invariance of PA self-efficacy instruments (Motl et al., 2000; Dishman et al., 2002, 2010; Roesch et al., 2013) with only the TAAG study demonstrating satisfactory cross-group and longitudinal invariance between intervention and control groups (Dishman et al., 2010). Roesch et al. (2013) established longitudinal invariance of a PA self-efficacy instrument measured in adolescents; however, the analysis did not separate the intervention and control groups. Additional investigation of longitudinal invariance of self-efficacy measures in intervention studies is warranted to better understand changes in the concept over time, influences by intervention effects, and effect on PA among adolescent girls.
Another concern regarding measurement of PA self-efficacy is that researchers often adapt established instruments without conducting psychometric analyses to confirm that their changes have not affected the measurement properties (Johnson et al., 2011; Bergh et al., 2012; Dewar et al., 2013). Deleting items, changing item wording, and altering response choices may have a significant impact on the reliability and validity of these instruments and can change the meaning of the underlying concept. For example, Sherwood et al. (2004) adapted a PA self-efficacy instrument for use with 8- to 10-year-old girls by changing the main stem of item questions from "how sure are you" to "how hard do you think it would be." The authors point out the modified items may have more accurately reflected perceived behavioral control than self-efficacy.
This study also aims to improve upon current reliability analysis for self-efficacy measures. Although frequently reported in psychometric studies (Brown et al., 2009), Cronbach's alpha may underestimate the true reliability for scales with a limited number of items (Furr and Bacharach, 2014). Furthermore, alpha assumes that items are tau-equivalent, which is often not the case (Thurber and Bonynge, 2011). Assessing composite reliability via confirmatory factor analysis (CFA) may provide better support for internal consistency (Raykov, 2004) and be more accurate for multi-dimensional instruments than alpha (Barbaranelli et al., 2015). A specialized method for assessing composite reliability is also most appropriate for latent factors measured by ordinal Likert items with few response options (Yang and Green, 2011) of the sort typically used to assess self-efficacy among adolescents.
The purpose of this study was to compare the psychometric properties of two PA self-efficacy instruments used with urban 5th-8th grade girls in the "Girls on the Move" group randomized controlled trial (RCT; Robbins et al., 2013). The specific aims were to examine: (1) factorial validity; (2) multi-group and longitudinal invariance; and (3) composite reliability of the self-efficacy instruments in the group RCT's control and intervention groups.

Design
The psychometric properties of two PA self-efficacy instruments were examined using secondary data from the first 2 years of the "Girls on the Move" group RCT. The group RCT was conducted to examine the effect of a 17-week multi-component intervention on increasing moderate-to-vigorous PA (MVPA) among racially diverse, underserved 5th-8th grade girls (Robbins et al., 2013). The group RCT included 24 urban schools in the Midwestern United States over three intervention years from 2012 to 2015. At baseline and again at the end of the 17-week intervention, girls completed an iPad-delivered survey that included the PA self-efficacy instruments. We used data collected during intervention years one and two of the group RCT to fulfill our aims.
Frontiers in Psychology | www.frontiersin.org

Sample and Setting
Sample
A total of 1,012 girls (mean age = 12.2 years; SD = 0.96) participated during the first two intervention years of the group RCT. Inclusion criteria for participants were: (1) 5th-7th grade girls (8th grade girls in schools with only 7th and 8th grades); (2) able to participate in a PA club 3 days a week after school; (3) anticipated availability to complete 9-month post-intervention follow-up measures; and (4) able to read, understand, and speak English. Girls were excluded if they had a health condition that prevented safe PA or were involved in after-school sports or a community program that included PA. Girls represented a diverse population with 526 (52.0%) blacks, 256 (25.3%) whites, and 133 (13.1%) mixed race with 108 (81.2%) of these girls selecting black as part of a mixed race. No significant differences between groups were found with the exception of race (χ² = 6.385, p = 0.01) with more black girls in the control group. Table 1 includes additional sample characteristics.

Setting
Data for this study were collected in 16 schools. Eight schools, half of which served as controls, were involved in each of the two intervention years. School-level data indicated that the majority of girls in each school were black and of low socioeconomic status (SES), as determined by participation in the free or reduced-price lunch program. Schools were randomly assigned to receive either the intervention or control condition after baseline data collection. All school administrators, parents/guardians, and participants agreed to this randomization procedure.

Demographics
Data on each girl's age, grade, ethnicity, race, and participation in a free or reduced-price lunch program were collected via the consent form completed by girls' parents/guardians.

Body Mass Index (BMI)
Body mass index (BMI) was included for use in describing the sample. Each girl's measured weight and height were used to calculate BMI. Weight was measured to the nearest 0.1 kg using a foot-to-foot bioelectric impedance scale (Tanita Corporation, Tokyo, Japan). Height without shoes was measured to the nearest 0.1 cm using a Shorr Board (www.weighandmeasure.com). BMI was calculated as weight in kilograms divided by height in meters squared (kg/m²). BMI percentiles for age were calculated using the 2000 Centers for Disease Control (CDC) interpretation of BMI for children and teens (Centers for Disease Control and Prevention [CDC], 2015). Weight was classified as: (1) underweight (<5th percentile); (2) healthy weight (5th percentile to <85th percentile); (3) overweight (≥85th to <95th percentile); and (4) obese (≥95th percentile).
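The BMI formula and the four percentile-based weight categories described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, and the BMI-for-age percentile is assumed to have been looked up separately from the CDC growth charts, which are not reproduced here.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index in kg/m^2 from weight (kg) and height (cm)."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def weight_category(bmi_percentile: float) -> str:
    """Map a BMI-for-age percentile to the four categories used in the text.

    The percentile itself is an input here; deriving it requires the CDC
    age- and sex-specific reference data (assumed to be available elsewhere).
    """
    if bmi_percentile < 5:
        return "underweight"
    if bmi_percentile < 85:
        return "healthy weight"
    if bmi_percentile < 95:
        return "overweight"
    return "obese"


if __name__ == "__main__":
    # Hypothetical girl: 50.0 kg and 160.0 cm, at the 60th percentile.
    print(round(bmi(50.0, 160.0), 2), weight_category(60.0))
```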

PA Self-Efficacy
Physical activity self-efficacy was measured using two instruments. The first was developed by Saunders et al. (1997) as a 17-item scale with three factors: support-seeking, barriers, and positive alternatives. Additional psychometric testing using CFA resulted in a unidimensional 8-item instrument that previously demonstrated multi-group and longitudinal invariance (Motl et al., 2000; Dishman et al., 2002, 2010). The revised instrument included items from each of the three factors identified by Saunders et al. (1997) with five response options ranging from (1) disagree a lot to (5) agree a lot that replaced the dichotomous yes/no used initially.
This instrument was reduced to 6 items for use in the group RCT. Two social support items were excluded: (1) "I can ask my parent or other adult to do physically active things with me"; and (2) "I can ask my best friend to be physically active with me during my free time on most days." The response choices were reduced from five to four to avoid a neutral response option: (0) disagree a lot to (3) agree a lot. Previous research suggests that eliminating a neutral response and offering four response choices may be optimal when surveying youth (Borgers et al., 2004). A sample item is "I can be physically active in my free time on most days even when I am busy." We refer to the 6-item scale as PA Self-Efficacy (PASE).
The second instrument was developed as a 12-item Self-Efficacy for Exercise Behaviors Scale for use with adults (Sallis et al., 1988). Exploratory factor analysis (EFA) resulted in two factors, a 5-item resisting-relapse factor and a 7-item making-time-for-exercise factor. This instrument was further modified to a 10-item scale for use with adolescents and demonstrated adequate predictive validity (Wilson et al., 2002) and reliability (Wilson et al., 2008). Neither factorial validity nor measurement invariance testing of the instrument was found in the peer-reviewed literature. However, in an unpublished study, CFA did not support a unidimensional scale and showed inadequate fit to the data (Lawman et al., 2011). One item, "How sure are you that you can stick to your exercise program when you are alone and no one is watching you?" was found to be non-invariant between boys and girls and was deleted resulting in a 9-item scale.
In the group RCT, this instrument was revised to include 7 items with four response choices ranging from (0) not at all sure to (3) very sure. Two items were removed to reduce response burden and increase the relevance of items for adolescent girls: (1) "How sure are you that you can stick to your exercising when you have guests staying in your home?" and (2) "How sure are you that you can stick to exercising even when you have limited amounts of time?" A sample item from the scale is "How sure are you that you can stick to your exercise program even when your friends want to hang out?" We refer to the 7-item scale as Self-Efficacy for Exercise Behaviors (SEEB).

Recruitment
The Michigan State University Institutional Review Board and school administrators provided approval to conduct the group RCT. Data collectors visited each school and community center to share information about the study with girls. Recruitment packets with study information, assent and consent forms, and an eligibility screening tool were provided to girls interested in participating. Girls were asked to share the packets with their parents or guardians and return completed packets to the researchers at their school within 2 days.

Data Collection
Eligible girls with signed consent and assent forms completed an iPad-delivered survey, including the PASE and SEEB, at baseline and after the 17-week intervention. Trained research assistants measured height and weight behind privacy screens. Details of the group RCT procedures have been reported elsewhere (Robbins et al., 2013).

Data Analysis
Data analysis was performed using Stata 14 (StataCorp, 2015), Mplus 7.4 (Muthén and Muthén, 2015), and SAS 9.3 (SAS Institute Inc., 2011). Stata was used to calculate descriptive statistics of the sample, review characteristics of the PASE and SEEB items, assess missing data, and assess non-independence of the data. We used t-tests and chi-square tests to check for baseline differences between intervention and control groups.
The self-efficacy items for both instruments underwent a missing data analysis. Dummy variables for each of the self-efficacy items were created as dependent variables and a series of logistic regressions were performed to identify predictors of missingness. Although none of the demographic variables were found to predict missingness, we still included age, BMI, race, ethnicity, grade, school, pubertal status, and treatment group in the imputation model. All of the self-efficacy items were used as model predictors. The multiple imputation procedure with Stata 14 was used to impute the missing data using a single newly created data set. This decision was based on the fact that both instruments had <1% missing data at baseline and <10% missing data post-intervention, and this proportion was not likely to result in biased results (Dong and Peng, 2013). Each imputed value was a random draw from the conditional distribution of the variable being imputed given the observed values of the imputation predictors. Mean scores before and after imputation for all of the self-efficacy items were comparable with no significant differences found.
The potential for a clustering effect existed due to the group RCT multi-level structure with girls nested in schools. We computed item-specific intra-class correlation coefficients (ICCs) for the PASE and SEEB at both time points to assess non-independence of the data and ensure that single-level CFA was appropriate. Brown (2015) states that ICC values below 0.05 likely indicate that a multi-level CFA model may not be warranted. However, Musca et al. (2011) caution that even with ICC values as low as 0.01, Type I error rates can be greater than 5%. Of the 26 ICC values, the majority were close to zero. Only three items had ICCs > 0.01. The highest ICC value was 0.013 for one of the SEEB items at baseline. Given the low ICCs for both instruments at both time points and small number of schools in the sample, we decided to conduct invariance testing using single-level CFA. We doubt that a multilevel CFA is computationally feasible with so little school-level variance in the indicators and that trustworthy parameter estimates could be obtained from a school-level covariance matrix representing data from only 16 schools.
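To make the ICC screening step concrete, the sketch below computes a one-way ANOVA intra-class correlation for a single item with girls nested in schools. It is an illustration under stated assumptions: the source does not say which ICC estimator was used in Stata, the function name `icc_oneway` is hypothetical, and the ordinal 0-3 responses are treated as continuous scores for simplicity.

```python
def icc_oneway(groups):
    """One-way ANOVA ICC(1) for clustered data.

    `groups` is a list of lists, one inner list of item scores per cluster
    (here, per school). Unequal cluster sizes are handled with the usual
    adjusted average cluster size n0.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted average cluster size for unbalanced designs.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


if __name__ == "__main__":
    # Hypothetical item responses (0-3) from three small schools.
    schools = [[2, 3, 2, 1], [3, 2, 2, 3], [1, 2, 3, 2]]
    print(icc_oneway(schools))
```

An estimate near zero (or slightly negative, which this estimator permits) would be read the same way as in the text: little school-level variance in that item.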
We ran CFA models using Mplus to determine factorial validity and measurement invariance of the PASE and SEEB instruments. Parameters were estimated using weighted least squares with mean and variance adjustment (WLSMV) with delta parameterization in which data are fitted to a polychoric correlation matrix. This estimation method is recommended for ordinal indicators with fewer than five response choices and is also robust to skewness and kurtosis of items (Flora and Curran, 2004; Flora et al., 2012; Brown, 2015). We scaled each latent factor by fixing the factor loading for the first indicator to 1. We chose referent indicators by selecting items with the greatest variability and satisfactory standardized parameter estimates (Johnson et al., 2009).
Our invariance testing began with an assessment of separate single group models for the intervention and control groups at each time point for both instruments. Analysis then proceeded to cross-group measurement invariance between the intervention and control groups as well as longitudinal invariance for each group from baseline to post-intervention. The last step involved setting simultaneous cross-group and longitudinal constraints on parameters. Invariance analysis was conducted following guidelines from Muthén and Muthén (2015) for ordinal data using WLSMV estimation and included: (1) configural invariance: non-referent factor loadings and all thresholds free for both groups/time points, scale factors fixed at one for both groups/time points, factor means fixed at zero for both groups/time points, and factor variances free for both groups/time points; (2) metric invariance: non-referent factor loadings set equal for both groups/time points, scale factors fixed at one for the control group/at baseline and free for the intervention group/at post-intervention, factor means fixed at zero for the control group/at baseline and free for the intervention group/at post-intervention, first threshold of each indicator set equal for both groups/time points as well as the second threshold of the referent indicator to identify the latent factor mean, and factor variances free for both groups/time points; and (3) scalar invariance: non-referent factor loadings and all thresholds set equal for both groups/time points, scale factors fixed at one for the control group/at baseline and free for the intervention group/at post-intervention, factor means fixed at zero for the control group/at baseline and free for the intervention group/at post-intervention, and factor variances free for both groups/time points.
Because invariant factor loadings are vital for construct validity (Brown, 2015), partial metric invariance was not undertaken with the separate cross-group and longitudinal invariance models. Instead, if the metric model resulted in non-invariant factor loadings, we re-specified the model. However, we pursued partial metric invariance for the final simultaneous invariance models to determine which factor loadings were non-invariant. We explored partial scalar (threshold) invariance by examining threshold differences between groups or time points as well as suggested modification indices (MIs). We used guidelines from Coertjens et al. (2012) to explore partial threshold invariance as well as Dimitrov (2010) who suggested that partial invariance might be satisfactory if <20% of parameters are non-invariant. We used the model chi-square test to evaluate initial fit in single group models as well as measurement invariance models. Because this test can be sensitive to large sample sizes, we also used alternative fit indices for model evaluation, including the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). Per recommendations from Kline (2016) and Brown (2015), we used the following guidelines to assess model fit: CFI ≥ 0.95; RMSEA ≤ 0.05 for close fit, ≤0.08 for approximate fit, >0.08 to <0.10 for marginal fit, and ≥0.10 for poor fit, and evaluation of lower- and upper-bound RMSEA 90% confidence intervals. Many researchers using structural equation models or CFA ignore significant chi-square tests as evidence against a model, citing reasons such as the sensitivity of the test to large samples (Ropovik, 2015). Brown (2015), Ropovik (2015), and Kline (2016) argue against blithely accepting models solely on the basis of other global fit indices when the chi-square is significant, advising careful assessment of other evidence regarding model tenability first.
Therefore, when the chi-square test was significant, we used recommendations from these authors to identify localized areas of strain by closely evaluating the correlation residuals and MIs. For this study, when the majority of correlation residuals were <0.10 (Kline, 2016), MIs were small, and alternative fit indices indicated acceptable fit, measurement invariance testing continued because these models were considered plausible (Byrne et al., 1989; Raykov et al., 2012).
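The global fit guidelines above can be expressed as a small helper. This is a sketch rather than the authors' procedure: the function names are hypothetical, and it reads the marginal band as RMSEA >0.08 to <0.10 with poor fit at ≥0.10 (the conventional cutoffs, assumed here). It covers only the CFI and RMSEA point estimates; the chi-square test, RMSEA confidence intervals, correlation residuals, and MIs still require separate inspection.

```python
def rmsea_category(rmsea: float) -> str:
    """Classify an RMSEA point estimate using the cutoffs assumed above."""
    if rmsea <= 0.05:
        return "close"
    if rmsea <= 0.08:
        return "approximate"
    if rmsea < 0.10:
        return "marginal"
    return "poor"


def meets_fit_guidelines(cfi: float, rmsea: float) -> bool:
    """True when CFI >= 0.95 and RMSEA indicates close or approximate fit."""
    return cfi >= 0.95 and rmsea <= 0.08


if __name__ == "__main__":
    # Fit values reported in the Results for the PASE control group.
    for label, cfi, rmsea in [("baseline", 0.984, 0.072),
                              ("post-intervention", 0.978, 0.083)]:
        print(label, rmsea_category(rmsea), meets_fit_guidelines(cfi, rmsea))
```

Applied to the PASE control-group values reported in the Results, the helper labels baseline fit as approximate and post-intervention fit as marginal, in line with the authors' description.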
After evaluating single group models, we evaluated measurement invariance via chi-square difference testing between baseline and nested models. A corrected chi-square difference test was used because differences between WLSMV chi-square values are not distributed as chi-square (Brown, 2015). We evaluated RMSEA and CFI fit indices at each step of invariance testing, along with change in CFI (ΔCFI). Cheung and Rensvold (2002) recommend treating a decrease in CFI of 0.01 or more between nested models (ΔCFI ≤ −0.01) as potentially indicating non-invariance. However, Meade et al. (2008) suggested that a decrease in CFI of more than 0.002 (ΔCFI < −0.002) indicates non-invariance. We interpreted changes in CFI with caution because these guidelines were based on simulation studies using maximum likelihood estimation with normally distributed data and have not been evaluated with WLSMV estimation and ordinal data.
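As an illustration of how the two ΔCFI guidelines can disagree, the sketch below computes ΔCFI as the CFI of the more constrained model minus the CFI of the less constrained model and checks it against both cutoffs. The function names are hypothetical, and the cutoffs are read as decreases in CFI of at least 0.01 (Cheung and Rensvold) and of more than 0.002 (Meade et al.), which is an interpretation of the text rather than a quotation of those authors.

```python
def delta_cfi(cfi_constrained: float, cfi_less_constrained: float) -> float:
    """Change in CFI when invariance constraints are added (3-decimal values)."""
    return round(cfi_constrained - cfi_less_constrained, 3)


def noninvariance_flags(d_cfi: float) -> dict:
    """Flag possible non-invariance under each guideline as interpreted above."""
    return {
        "cheung_rensvold_2002": d_cfi <= -0.01,
        "meade_2008": d_cfi < -0.002,
    }


if __name__ == "__main__":
    # CFI values reported later in the Results for the 5-item SEEB at
    # post-intervention: configural 0.986, metric 0.992, scalar 0.988.
    metric_change = delta_cfi(0.992, 0.986)
    scalar_change = delta_cfi(0.988, 0.992)
    print(metric_change, noninvariance_flags(metric_change))
    print(scalar_change, noninvariance_flags(scalar_change))
```

With the reported values, the scalar step gives ΔCFI = −0.004, which the stricter Meade et al. guideline would flag and the Cheung and Rensvold guideline would not; this is one reason the changes in CFI were interpreted cautiously alongside the corrected chi-square difference test.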
Cronbach's alpha assumes tau-equivalence (equal factor loadings across items within a scale), often underestimates reliability for scales with few items, and is inappropriate for ordinal Likert scale data (Yang and Green, 2011; Furr and Bacharach, 2014). Composite reliability of the instruments was therefore estimated via Green and Yang's (2009) non-linear structural equation modeling (SEM) reliability coefficient, which is based on a parallel-forms definition of reliability. This coefficient is designed for latent factors measured by ordinal indicators (Green and Yang, 2009; Yang and Green, 2015). Kelley and Pornprasertmanit (2016) called this coefficient "categorical omega" to differentiate it from the more commonly used omega coefficient, which uses linear SEM parameter estimates to compute reliability and assumes that the indicators are continuous. Simulations show that Green and Yang's coefficient performs as well as or better than either omega or Cronbach's alpha in a variety of conditions (Green and Yang, 2009; Yang and Green, 2011; Yang and Green, 2015). Computations occurred in three steps: (1) estimation of thresholds and polychoric correlations; (2) fitting the CFA model to the polychoric correlation matrix using WLSMV; and (3) inputting factor loadings and thresholds into the equation using a SAS program to calculate the reliability coefficient. Steps one and two were conducted in Mplus. Results were transferred to SAS to carry out step 3.
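For orientation, the sketch below computes the more commonly used linear omega coefficient that the text contrasts with categorical omega. It is explicitly not Green and Yang's coefficient, which additionally uses the item thresholds and bivariate normal probabilities and was computed by the authors in SAS. The function name and the loadings are hypothetical; the sketch assumes a single factor, standardized loadings, and uncorrelated residuals.

```python
def linear_omega(std_loadings):
    """Linear (continuous-indicator) omega for a one-factor model.

    Assumes standardized loadings and uncorrelated residuals, so each
    residual variance is 1 - loading**2. This is NOT the categorical omega
    of Green and Yang (2009); it is shown only for comparison.
    """
    common = sum(std_loadings) ** 2
    residual = sum(1.0 - l ** 2 for l in std_loadings)
    return common / (common + residual)


if __name__ == "__main__":
    # Hypothetical standardized loadings for a 6-item scale.
    print(linear_omega([0.70, 0.65, 0.72, 0.60, 0.68, 0.75]))
```

Unlike Cronbach's alpha, this coefficient does not require equal loadings across items, but it still treats the four-category responses as continuous, which is the limitation that motivates the categorical version used in this study.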

Descriptive Statistics of Items
We ran item-level analyses for each indicator of the PASE and SEEB instruments at baseline and post-intervention for the full sample and separately in intervention and control groups. Table 2 includes item descriptions for the PASE and SEEB. Descriptive statistics and polychoric correlations of items can be found in the Supplementary Material.
Overall, girls primarily selected agree a little or agree a lot responses for the PASE items, yielding skewed distributions and ceiling effects. The exception was the item, "I can be active in my free time on most days even when I am busy" in which girls' responses had greater variability. Mean inter-item polychoric correlations for the full sample were 0.44 (minimum-maximum: 0.38-0.58) and 0.52 (minimum-maximum: 0.46-0.63) at baseline and post-intervention, respectively. Mean scores for the PASE for the full sample were 2.20 (SD = 0.59) at baseline and 2.17 (SD = 0.59) post-intervention. Compared to the PASE, the SEEB items had lower mean scores and were less skewed with girls being more likely to endorse the 0 and 1 response options (i.e., not at all sure or not very sure). However, one item, "How sure are you that you can stick to participating in activities that include exercise?" had over 50% of girls endorsing the highest response option of very sure at baseline. This item had the highest mean score of the SEEB items with limited variance and marked skewness and kurtosis. Mean inter-item polychoric correlations for the full sample were 0.39 (minimum-maximum: 0.30-0.51) and 0.41 (minimum-maximum: 0.29-0.57) at baseline and post-intervention, respectively. For the full sample, the SEEB mean score was 1.90 (SD = 0.59) at baseline and 1.82 (SD = 0.60) at post-intervention.

Cross-Group Invariance
Our examination of cross-group measurement invariance began with single group, cross-sectional CFA models for the intervention and control groups at baseline and post-intervention using pre-specified fit criteria. Configural, metric, and scalar invariance tests were conducted following this analysis. Figures 1, 2 show the hypothesized path diagrams for the PASE and SEEB. Parameter estimates, including factor loadings, thresholds, and r-square values for each instrument, are summarized in the first author's dissertation (Voskuil, 2016). Model results are presented in Table 3.

Physical Activity Self-Efficacy Scale (PASE)
CFA models for the intervention group demonstrated an excellent fit to the data at baseline (χ² = 8.091, df = 9, p = 0.525, RMSEA = 0.000, CFI = 1.000) and post-intervention (χ² = 10.660, df = 9, p = 0.300, RMSEA = 0.019, CFI = 0.999). Based on RMSEA and CFI values, model fit for the control group was acceptable at baseline (χ² = 32.503, df = 9, p < 0.001, RMSEA = 0.072, CFI = 0.984) and marginal at post-intervention (χ² = 39.893, df = 9, p < 0.001, RMSEA = 0.083, CFI = 0.978). Given the significant chi-square and higher RMSEA values in the control group, we looked for areas of strain at both time points. All residual correlations at baseline were <0.10, but at post-intervention one residual correlation >0.10 was noted between items 3 and 4 (−0.106) in the control group. Suggested MIs at both time points were low in value, not substantively justifiable, and not indicated in the intervention group. Therefore, cross-group invariance testing continued because the models appeared plausible. Others have suggested that baseline models may not need to entirely meet pre-determined fit criteria if the model appears reasonable (Byrne et al., 1989; Raykov et al., 2012; Bowen and Masa, 2015).
Testing for post-intervention scalar invariance proceeded in the same manner. We began by freeing the third threshold for item 2. This action still resulted in scalar non-invariance (Δχ² = 19.969, df = 10, p = 0.030, RMSEA = 0.049, CFI = 0.990, ΔCFI = −0.002). MIs and EPC values indicated that the second threshold (going from disagree a little to agree a little) for item 6 should also be freed. Doing so improved model fit and supported partial scalar invariance (Δχ² = 13.456, df = 9, p = 0.143, RMSEA = 0.046, CFI = 0.991, ΔCFI = −0.001). Despite the significant chi-square value for this model, all residual correlations were <0.10 with the exception of the previously mentioned residual correlation between items 3 and 4 of −0.126. Alternative fit indices demonstrated good fit to the data.
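The degrees of freedom reported for these single-group models can be checked with a short calculation. The sketch assumes a one-factor model fitted to polychoric correlations in which all thresholds are freely estimated (so they contribute no degrees of freedom), the referent loading is fixed to 1, and the factor variance is free; the function name is hypothetical.

```python
def one_factor_df(n_items: int, n_groups: int = 1,
                  n_residual_covariances: int = 0) -> int:
    """Degrees of freedom for a one-factor CFA with ordinal indicators.

    Per group: n_items * (n_items - 1) / 2 polychoric correlations are fitted
    with (n_items - 1) free loadings plus one factor variance, and any added
    residual covariances use one degree of freedom each. Thresholds are
    assumed to be saturated.
    """
    correlations = n_items * (n_items - 1) // 2
    free_parameters = n_items + n_residual_covariances
    return n_groups * (correlations - free_parameters)


if __name__ == "__main__":
    print(one_factor_df(6))              # 6-item PASE, single group
    print(one_factor_df(5, n_groups=2))  # 5-item SEEB, two-group configural
```

For the 6-item PASE this gives 15 − 6 = 9, matching the df reported above, and for a two-group configural model of the 5-item SEEB it gives 2 × (10 − 5) = 10, matching the df reported for that model later in the Results.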
FIGURE 1 | Physical Activity Self-Efficacy Scale (PASE) hypothesized path diagram. Figure includes the latent factor, underlying latent response variables indicated by an asterisk (*), and observed indicators. Small solid arrows denote disturbance and error variances for the latent response variables and observed indicators, respectively.

Self-Efficacy for Exercise Behaviors Scale (SEEB)
Initial models for the intervention group Consistent areas of strain across groups and time points indicated the need to add a residual covariance between items 1 and 2. Adding a residual covariance for these two items was theoretically justifiable given the connection to family responsibilities for each of these items. Another area of strain in the control group at baseline and post-intervention was a residual correlation between items 2 and 6. However, this residual correlation was not present in the intervention group at either time point so we chose not to add this to the model.
Each item was tested one at a time to determine which factor loadings were non-invariant at baseline. When the factor loading for item 3, "How sure are you that you can stick to exercising when you're feeling lazy," was unconstrained, metric invariance was achieved. For this study, partial metric invariance was not considered acceptable because full metric invariance was assumed to be a fundamental requirement for determining that the same construct is being measured across groups and over time (Coertjens et al., 2012).
We then tested a 5-item model after dropping items 2 and 3 as both demonstrated residual correlations >0.10 with other items. Both of these items also consistently had the lowest factor loadings and R² values across models, particularly at post-intervention for both groups. Dropping item 2 also eliminated the need for the residual covariance between items 1 and 2. All further analyses of models for this instrument were conducted using this 5-item model.
Fit indices for the post-intervention configural model provided some support for equal form across groups but it did not fit the data as well as the baseline model: χ² = 47.181, df = 10, p < 0.001, RMSEA = 0.086, CFI = 0.986. All residual correlations were <0.10. A modification index of 19.795 suggested a residual covariance between items 1 and 4, but this was neither theoretically justified nor indicated at baseline, so we did not make this change. Because the RMSEA approached acceptable fit and the CFI was acceptable, metric and scalar invariance were also assessed at post-intervention. The metric model indicated invariant factor loadings when compared to the configural model (Δχ² = 2.184, df = 4, p = 0.702, RMSEA = 0.056, CFI = 0.992, ΔCFI = 0.006). However, scalar invariance of thresholds was not met (Δχ² = 18.974, df = 9, p = 0.025, RMSEA = 0.052, CFI = 0.988, ΔCFI = −0.004). All residual correlations were <0.10 except that between items 1 and 4 in the control group (0.115).
We investigated SEEB partial scalar invariance using the same process described for the PASE. MIs and EPC values pointed to the thresholds for item 4, "How sure are you that you can stick to participating in activities that include exercise," as potentially non-invariant. The third threshold (going from somewhat sure to very sure) had the largest difference between groups. Partial scalar invariance was met by freeing it: Δχ² = 5.650, df = 8, p = 0.686, RMSEA = 0.041, CFI = 0.993, ΔCFI = 0.001. All residual correlations were <0.10 except between items 1 and 4 in the control group (0.116). One of 15 thresholds was non-invariant (6.7%), with 4 of 5 items still having fully invariant thresholds.
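The bookkeeping behind the "one of 15 thresholds" statement can be sketched as follows. The item numbers follow from dropping items 2 and 3 of the 7-item SEEB, and three thresholds per item follows from 15 thresholds across 5 items; the variable names are illustrative assumptions rather than part of the original analysis.

```python
# Items retained in the 5-item SEEB model (items 2 and 3 were dropped).
ITEMS = [1, 4, 5, 6, 7]
# 15 thresholds across 5 items implies three thresholds per item.
THRESHOLDS_PER_ITEM = 3

# (item, threshold) pairs freed across groups: third threshold of item 4.
freed = {(4, 3)}

total = len(ITEMS) * THRESHOLDS_PER_ITEM
share_freed = len(freed) / total
freed_items = {item for item, _ in freed}
fully_invariant_items = [i for i in ITEMS if i not in freed_items]

print(f"{len(freed)} of {total} thresholds freed ({share_freed:.1%})")
print(f"{len(fully_invariant_items)} of {len(ITEMS)} items fully invariant")
```

Tracking freed parameters as explicit (item, threshold) pairs keeps the reported proportion tied to the model specification if further thresholds are released.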

Longitudinal Invariance
Longitudinal invariance was assessed for each instrument. Error covariances for each item across time points were built into the model to account for expected method effects for repeated measures (Brown, 2015). Figures 3 and 4 show the hypothesized longitudinal measurement models for the PASE and SEEB. Longitudinal invariance results are presented in Table 4. Parameter estimates for the CFA models are provided in the first author's dissertation (Voskuil, 2016).

Self-Efficacy for Exercise Behaviors Scale (SEEB)
Similar to the PASE longitudinal results, the configural models for both groups resulted in significant model chi-square values. Areas of strain were assessed in both groups at both time points.
For the control group, one correlated residual ≥0.10 was noted: item 4 at baseline with item 5 at post-intervention (−0.102). The intervention group also had one residual correlation ≥0.10 as follows: item 4 at baseline with item 6 at post-intervention (0.106). Alternative fit indices demonstrated adequate fit to the data with RMSEA values <0.08 and CFI values >0.95, so we proceeded to test metric models without freeing any additional parameters.

Simultaneous Cross-Group and Longitudinal Invariance
For each instrument, we then tested for simultaneous cross-group, longitudinal invariance by constraining parameters in multiple ways. For example, to test metric invariance, factor loadings at baseline and post-intervention were constrained to be equal in the control group model, as were intervention group loadings. Results for both instruments are presented in Table 5.
Parameter estimates from the PASE and SEEB CFA models are reported in the Supplementary Material.
Although a partial metric invariance solution was considered unacceptable for this study, we attempted to identify if there was a single non-invariant factor loading by freeing each loading one at a time. This procedure showed that when item 6, "I can be active in my free time on most days," was freed, partial metric invariance was met (χ² = 170.224, df = 112, p < 0.001, Δχ² = 24.308, df = 18, p = 0.145, RMSEA = 0.032, CFI = 0.991, ΔCFI = −0.001). For this model, all of the residual correlations for the intervention group were <0.10. In the control group, four residual correlations ≥0.10 were found: (1) item 1 at baseline with item 6 at baseline (−0.100); (2) item 6 at baseline with item 3 at post-intervention (0.114); (3) item 3 at post-intervention with item 4 at post-intervention; and (4) item 3 at post-intervention with item 6 at post-intervention.
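The one-at-a-time search described above can be sketched as a simple loop. This is only an outline under stated assumptions: the function name is hypothetical, the lookup table stands in for refitting the model in SEM software, and only the item 6 p-value (0.145) comes from the text. The remaining p-values are invented placeholders, not results from this study.

```python
from typing import Callable, Dict, List


def items_restoring_invariance(
    items: List[int],
    diff_test_p: Callable[[int], float],
    alpha: float = 0.05,
) -> List[int]:
    """Free one loading at a time and keep the items whose release gives a
    non-significant chi-square difference test against the configural model."""
    return [item for item in items if diff_test_p(item) > alpha]


# Stand-in for refitting the model with one loading freed. Only the value
# for item 6 is reported in the text; all others are made-up placeholders.
placeholder_p: Dict[int, float] = {
    1: 0.01, 2: 0.01, 3: 0.01, 4: 0.01, 5: 0.01, 6: 0.145,
}

candidates = items_restoring_invariance(
    list(placeholder_p), placeholder_p.__getitem__
)
print(candidates)
```

In practice the callable would wrap a model fit and a difference test, and the significance level (0.05 here) is itself an assumption that would need to match the study's criteria.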

Self-Efficacy for Exercise Behaviors Scale (SEEB)
The configural model demonstrated adequate fit based on alternative fit indices (χ² = 138.445, df = 66, p < 0.001, RMSEA = 0.047, CFI = 0.984). Areas of strain included a residual correlation of −0.102 between item 4 at baseline and item 5 at post-intervention in the control group and a residual correlation of 0.106 between item 7 at baseline and item 6 at post-intervention. All other residual correlations were <0.10. Similar to the PASE, the metric model indicated that factor loadings varied across group and time simultaneously (χ² = 163.702, df = 76, p < 0.001, Δχ² = 48.136, df = 18, p < 0.001, RMSEA = 0.048, CFI = 0.981, ΔCFI = −0.006). We assessed each item, but found that achieving partial metric invariance would require freeing multiple non-invariant factor loadings.

Physical Activity Self-Efficacy Scale (PASE)
Reliability for the PASE invariance models ranged from 0.772 to 0.842. For the control group, reliability was quite consistent from baseline to post-intervention with minor differences in coefficients (Δ range: −0.001 to 0.003). On the other hand, the intervention group had larger longitudinal changes in reliability, all indicating increases over time (Δ range: 0.042 to 0.059).

Self-Efficacy for Exercise Behaviors Scale (SEEB)
For the SEEB invariance models, reliability coefficients ranged from 0.719 to 0.800. Similar to the PASE, the reliability coefficients were more stable in the control group than the intervention group over time. From baseline to post-intervention, changes in coefficients for the control group (Δ range: 0.009 to 0.018) were smaller than those for the intervention group (Δ range: 0.035 to 0.042). Reliability coefficients increased longitudinally. Overall, reliability estimates for the SEEB were slightly lower than those for the PASE. Reliability coefficients for invariance models are summarized in Table 6.
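The composite reliability values reported here were computed for ordinal indicators from Mplus output, and that computation is not reproduced in this section. As a simplified stand-in, the sketch below computes the familiar linear composite reliability (omega) for a one-factor model from standardized loadings, ω = (Σλ)² / ((Σλ)² + Σ(1 − λ²)), assuming uncorrelated residuals. The loadings are made-up placeholders rather than estimates from this study, and the formula ignores thresholds, so it will not reproduce the coefficients in Table 6.

```python
from typing import Sequence


def composite_reliability(loadings: Sequence[float]) -> float:
    """Linear composite reliability (omega) for a single factor, using
    standardized loadings and assuming uncorrelated residuals."""
    if not loadings:
        raise ValueError("at least one loading is required")
    if any(abs(lam) >= 1.0 for lam in loadings):
        raise ValueError("standardized loadings must lie strictly within (-1, 1)")
    common = sum(loadings) ** 2                        # variance due to the factor
    residual = sum(1.0 - lam ** 2 for lam in loadings)  # summed residual variances
    return common / (common + residual)


# Hypothetical standardized loadings for a 6-item scale (not study estimates).
hypothetical_loadings = [0.60, 0.65, 0.70, 0.55, 0.62, 0.58]
omega = composite_reliability(hypothetical_loadings)
print(round(omega, 3))
```

Under this formula, higher or more uniform loadings raise the coefficient, which is one way a change in loadings between time points could appear as a change in reliability.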

Table note: Results computed using standardized factor loadings and thresholds; Time 1, baseline; Time 2, post-intervention; PASE, Physical Activity Self-Efficacy Scale; SEEB, Self-Efficacy for Exercise Behaviors Scale.

DISCUSSION
We investigated the factorial validity, cross-group and longitudinal invariance, and composite reliability of two PA self-efficacy instruments used in the "Girls on the Move" group RCT. Both instruments demonstrated configural invariance for simultaneous analysis of cross-group and longitudinal invariance based on alternative fit indices. However, simultaneous metric invariance was not met for the PASE or the SEEB instruments. Partial metric invariance for the simultaneous analysis was achieved for the PASE with one factor loading identified as non-invariant. Partial metric invariance was not met for the SEEB. Longitudinal scalar invariance was achieved for both instruments in the control group but not the intervention group. Reliability was more stable over time in the control group for both instruments and higher for the PASE than the SEEB. Our findings regarding the measurement of PA self-efficacy are important for advancing the science of adolescent PA research, particularly because invariance testing for psychosocial constructs proposed to influence PA is not routinely conducted. These findings also point out the importance of investigating measurement invariance prior to making mean comparisons between groups for constructs included in intervention studies.
The PASE single factor model supported the hypothesized unidimensional factor structure. These findings are similar to those reported by earlier researchers with an 8-item instrument (Motl et al., 2000; Dishman et al., 2002, 2010). In addition, Steele et al. (2013) reported a unidimensional factor structure with the same 6-item instrument used in the current study. The fact that the intervention group demonstrated better fit at both time periods than the control group was not anticipated as comparability of groups should be expected in a group RCT. While significant differences emerged in the racial composition between groups, with significantly more black girls in the control group, Dishman et al. (2010) reported racial invariance for the 8-item version of this instrument with black and white girls of similar age.
The hypothesized SEEB factor structure did not fit the data well and required several model re-specifications, including deletion of two items. This instrument was first created and tested among adults resulting in a 12-item two-factor instrument, including a 5-item resisting-relapse factor and a 7-item making-time-for-exercise factor (Sallis et al., 1988). In subsequent psychometric studies that adapted the current SEEB, researchers mention the use of a resisting-relapse factor, but utilized items from both factors to measure PA self-efficacy among adolescents (Lawman et al., 2011; Peterson et al., 2013).
The PASE exhibited equal factor loadings between intervention and control groups at baseline and post-intervention but only partial scalar invariance. Longitudinal metric invariance was achieved for both groups separately with the control group also demonstrating scalar invariance. However, the factor loadings were not fully invariant when we simultaneously tested invariance across groups and time, indicating non-equivalent measurement of PA self-efficacy.
Partial metric invariance was investigated for the PASE for the simultaneous invariance analysis with one item found to be non-invariant: "I can be active in my free time on most days." This finding has important measurement implications for the concept because deleting this item may better reflect the conceptual definition of PA self-efficacy. This item is the only one that does not include overcoming a barrier to PA or having the needed skills to participate in PA, which are theorized to be two dimensions of the concept (Voskuil and Robbins, 2015). Bandura (2004) stresses that self-efficacy should be assessed in the context of the challenges related to completing a particular behavior in order to maintain conceptual precision.
The modified 5-item SEEB exhibited cross-group metric and scalar invariance at baseline. Post-intervention results demonstrated metric invariance but only partial scalar invariance. Factor loadings were invariant over time in both groups. The modified SEEB demonstrated complete scalar invariance for the control group but partial scalar invariance in the intervention group. Similar to the PASE, the factor loadings varied in the simultaneous analysis. Unlike the PASE, partial metric invariance was not met. Freeing a single loading was insufficient to achieve invariance, implying that at least two of the five items had non-invariant factor loadings.
While neither instrument achieved scalar invariance for the simultaneous analysis, both measures demonstrated longitudinal scalar invariance for the control group but only partial scalar invariance for the intervention group. Additionally, reliability coefficients demonstrated less stability over time in the intervention group compared to the control group. These findings, along with the absence of equal factor loadings for the simultaneous invariance analysis, offer support for the theory that the intervention itself may influence how girls respond to the self-efficacy items and imply that the same concept is not being measured in the same way across groups and time.
Evaluation of group differences for adolescent girls' PA self-efficacy should be interpreted cautiously due to the possibility of confounding from non-equivalent measurement. For example, if self-efficacy mean scores for girls in the intervention group were significantly higher or lower compared to the control group, these differences could be related to systematic response bias rather than an intervention effect on self-efficacy. Likewise, any conclusions regarding the mediational effects of PA self-efficacy in the "Girls on the Move" intervention study may be misleading in the absence of equivalent measurement across groups and time.
Our results underscore the importance of assessing the psychometric properties of adapted instruments rather than assuming that revised versions will be equally reliable and valid as the original. The PASE was created by deleting two items that were closely related to social support for PA, specifically questions about parental support and friend support, from an existing 8-item PA self-efficacy instrument. Steele et al. (2013) reported lower factor loadings for these two items in a sample of 6th-8th grade youth: 0.267 for parental support; 0.444 for friend support. Dewar et al. (2013) point out that the original 8-item instrument also had some lower factor loadings in earlier psychometric studies. Motl et al. (2000) reported factor loadings ranging from 0.390 to 0.610 indicating the possibility that some items may have been weakly related to the self-efficacy latent construct.
The absence of simultaneous metric invariance for the PASE may have been a consequence of these changes. In the original psychometric development study, three factors were described: support seeking, barriers, and positive alternatives (Saunders et al., 1997). The 8-item version of this instrument was unidimensional when items from all three factors were included. Deleting the items of parental and friend support, which represented the support seeking factor, may have led to this lack of invariance. Reducing response burden for adolescent girls is certainly a worthy endeavor, but may jeopardize reliability and validity.
The SEEB may be more useful than the PASE for assessing a participant's ability to actually adhere to a physical activity regimen in an ongoing intervention. An awareness of issues interfering with adherence at certain time points during an intervention may allow interventionists to tailor their approach to specifically address the identified needs. In contrast, when no structured PA program exists, such as during the pre- or post-intervention time periods, the PASE may be more appropriate and lead to increased accuracy of participants' responses. Thus the use of the SEEB may have been a poor fit for girls in this study given that they were enrolled in the group RCT only if they did not meet national guidelines for PA. Asking girls questions about sticking to an exercise program if none exists may have contributed to the non-invariance of the instrument in this study. Items in the PASE specifically asked about PA which is not limited to a structured exercise regimen.
Interestingly, the SEEB items demonstrated more variability in responses and were less skewed compared to the PASE items. The PASE items resulted in ceiling effects with the majority of girls choosing "agree a lot" for most items. Bandura (2006) recommended increasing the difficulty level for endorsing items when the majority of the sample selects the highest efficacy category. For the PASE, how the item stem is phrased and what response choices are provided could use revision. Given that the SEEB achieved better distribution of responses using the wording "how sure are you," changing the item stem for the PASE from "I can..." to "How sure are you that you can..." could be examined in a future study. Bandura's (1997, p. 43) instruction to include 'degrees of assurance' using words such as how certain, how confident, or how sure when rating indicator items supports this recommendation. Additionally, having girls rate their PA self-efficacy on a scale of 1-10, as recommended by Bandura, may decrease ceiling effects.

Strengths and Limitations
A major strength of this study was the simultaneous cross-group and longitudinal assessment of invariance of an intervention and control group from a large, group RCT. Very few studies have conducted this level of invariance analysis for self-efficacy instruments between intervention and control groups in adolescent PA intervention research.
An added strength was the use of an appropriate estimator for ordinal data (i.e., WLSMV estimation). A frequently encountered issue in psychometric studies has been the application of the maximum likelihood estimator to Likert-type scales, particularly when ≤4 response options are used to assess a latent construct (Flora and Curran, 2004). This approach leads to biased parameter estimates such as lower factor loadings and elevated standard errors compared to parameter estimates using WLSMV estimation or other appropriate estimators for handling non-normal data (Brown, 2015).
This study involved rigorous invariance testing using recent recommendations by Kline (2016) in which chi-square tests and chi-square difference testing were not simply disregarded as inflated based on sample size, but rather used as a guide to evaluate potential areas of strain in a model. Relying only on changes in CFI between models to establish levels of invariance may not provide sufficient evidence of measurement invariance. While Cheung and Rensvold (2002) recommend treating a decrease in CFI of no more than 0.01 (i.e., ΔCFI ≥ −0.01) as support for invariance, their simulation relied on normally distributed data and maximum likelihood estimation. Additional simulations based on WLSMV estimation are needed to confirm these recommendations when using ordinal indicators (Dimitrov, 2010; Brown, 2015).
A limitation of this study was our use of single regression imputation for missing data. Although overall the amount of missing data was small, this approach may have reduced variability in the data, particularly at post-intervention when missing data increased due to attrition. While this study contributes to understanding factorial validity of self-efficacy measures, it did not address other aspects of validity such as convergent, discriminant, and criterion-related validity. Another limitation is that the sample included only urban, adolescent girls, and findings cannot be generalized to other populations, such as boys. Finally, while partial invariance was assessed in this study, the process for doing so with WLSMV estimation has not been fully investigated and remains an ongoing issue in SEM research (Dimitrov, 2010; Brown, 2015).
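The point that a change in CFI and a chi-square difference test can disagree may be illustrated with values already reported for the post-intervention SEEB cross-group models. The sketch below is a minimal illustration, not the procedure used in the study: the function name and the 0.05 significance level are assumptions, and the −0.01 cutoff follows the Cheung and Rensvold (2002) guideline mentioned above.

```python
from typing import Dict


def invariance_evidence(
    p_diff: float,
    delta_cfi: float,
    alpha: float = 0.05,
    cfi_cutoff: float = -0.01,
) -> Dict[str, bool]:
    """Summarize two common criteria for a nested invariance comparison."""
    return {
        # Non-significant difference test: constraints do not worsen fit.
        "chi_square_diff_ok": p_diff > alpha,
        # CFI drops by no more than 0.01 relative to the less constrained model.
        "delta_cfi_ok": delta_cfi >= cfi_cutoff,
    }


# p-values and CFI changes reported for the post-intervention SEEB models.
metric = invariance_evidence(p_diff=0.702, delta_cfi=0.006)
scalar = invariance_evidence(p_diff=0.025, delta_cfi=-0.004)

print("metric:", metric)
print("scalar:", scalar)
```

For the scalar comparison the two criteria disagree: the CFI change is within the cutoff while the difference test is significant, which is the situation in which the chi-square result was used to look for areas of strain.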

CONCLUSION
Our results provide important information regarding the factorial validity, measurement invariance, and reliability of two PA self-efficacy instruments. This study indicated that neither of the adapted instruments achieved full metric invariance implying that these instruments did not measure the same concept equally between groups at both time points. These findings offer some support for the notion that participation in a PA intervention changes girls' perceptions about a psychological construct such as self-efficacy. Perhaps as girls participated in various components of the intervention (Robbins et al., 2013), their enhanced understanding of the challenges related to PA may have altered the meaning of the construct over time and changed how they responded to items.
Several implications for future research arise from this study. We encourage investigators to conduct invariance analysis when adapting instruments that have previously been confirmed as valid and reliable. Essentially an adapted measure is an entirely different measure, and these alterations can potentially undermine the psychometric properties of an instrument. Assessing simultaneous group and longitudinal invariance in intervention studies involving girls could help to clarify whether decreases in self-efficacy have actually occurred or can be attributed to measurement problems.
Qualitative research using focus groups of adolescent girls may be one way to revise items in PA self-efficacy instruments. This approach may help ensure that items accurately reflect current challenges to PA in this population, particularly given technological advances over the last decade. The items in the instruments used for this study did not ask about use of computers, cell phones, iPads, or other devices that may interfere with an adolescent's capability to be physically active. Other researchers have pointed out this deficit and have revised items on instruments to reflect current technology use among adolescents (Dewar et al., 2013).
Adequate psychometric evaluation of scales used to measure psychosocial constructs, such as self-efficacy, has been identified as a significant gap in the literature (Brown et al., 2009). This study contributes to an increased understanding of the psychometric properties of PA self-efficacy instruments. Continued assessment of PA self-efficacy instruments will help ensure that researchers measure this concept appropriately, providing a solid foundation for the science of adolescent PA research. Effort in this area is critical for furthering understanding of the role of PA self-efficacy in health behavior and how the concept might influence or be influenced by interventions designed for adolescent girls.

ETHICS STATEMENT
This study and the group randomized controlled trial were conducted in accordance with the recommendations of the Michigan State University Institutional Review Board (MSU IRB). In accordance with the Declaration of Helsinki, all parents/guardians and adolescents provided written informed consent and assent, respectively. The MSU IRB approved the protocol.

AUTHOR CONTRIBUTIONS
VV, SP, and LR contributed to the study conception and design. VV conducted the data analysis and drafted the manuscript. SP and LR edited and provided critical revisions to the manuscript. SP offered assistance with data analysis and interpretation. LR was responsible for data acquisition.

FUNDING
This study, which includes content originally published in the first author's dissertation (Voskuil, 2016), was partially funded by a Michigan State University Graduate School Dissertation Completion Fellowship and the Jonas Nurse Leaders Scholars Program. Data for this study were obtained from the Girls on the Move intervention group RCT, which was funded by a 5-year grant (2011-2016; no-cost extension 2016-2017), R01HL109101, $3,657,706, from the National Heart, Lung, and Blood Institute (NHLBI) at the National Institutes of Health (NIH); PI: LR, Michigan State University College of Nursing. The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of NIH.

ACKNOWLEDGMENTS
The following research team members are acknowledged for their work on the Girls on the Move intervention group RCT: Stacey Wesolek, who served as Project Manager throughout the intervention phase; Kelly Bourne, Current Project Manager and former Measurement Coordinator; and Patrice Patrick-Banks (year 1 of the intervention) and Ann Kostin-McGill (years 2 and 3 of the intervention), Intervention Coordinators. The authors are also thankful for the Michigan State University undergraduate and graduate kinesiology and nursing students who assisted with data collection for the group RCT. Most importantly, the authors are grateful for the girls who provided the data for the current study.

SUPPLEMENTARY MATERIAL
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2017.01301/full#supplementary-material