Rasch Analysis of the Japanese Version of Teacher Efficacy for Inclusive Practices Scale: Scale Unidimensionality

This study aimed to examine the construct validity of the Japanese version of the Teacher Efficacy for Inclusive Practices (TEIP) scale. The sample consisted of 250 teachers in Japan. Rasch analysis was used to examine the psychometric properties of the scale. Results did not support the 18-item Japanese version of the TEIP scale as a unidimensional scale for measuring TEIP. However, they do support the final 14-item Japanese version of the TEIP scale as a unidimensional scale for measuring TEIP. Four items were removed from the original 18-item scale (items 12, 8, 5, and 3) for violation of the local independency assumption. No item with differential item functioning (DIF) was detected. Only one item (item 18) was rescored to solve a threshold disorder. Further studies with different samples are warranted to confirm the study findings.


INTRODUCTION
The past 30 years have seen increasingly rapid advances toward inclusive education in educational policies and systems reinforced by international policy documents (e.g., UNESCO, 1994;United Nations, 2006;United Nations General Assembly, 2015). Inclusive education can broadly be defined as including all children in mainstream classrooms regardless of their gender, their being from ethnic or linguistic minorities, or having disabilities (UNESCO, 2009). Many countries have adopted new educational strategies toward inclusive education, and Japan is no exception. A drastic change occurred in 2007 when the traditional special education (tokushukyoiku) system was replaced by the current special needs education (tokubetsushienkyoiku) system, in which children with special educational needs can officially receive appropriate support even in regular classes (Central Council for Education, 2005). Because of rapid political change toward inclusive education, Japanese teachers' readiness to implement inclusive education has been questioned by several researchers. For instance, surveys conducted on Japanese regular classroom teachers by Ueno and Nakamura (2011) have shown that the teachers expressed anxiety and difficulty about including children with disabilities under the current support system, even though they agreed on the concept of inclusion. Similarly, Fujii (2014) studied the teachers' awareness of keywords related to inclusive education (e.g., "reasonable accommodation" and "Convention on the Rights of Persons with Disabilities") and suggested that the teachers' knowledge level regarding inclusive education is relatively low and thus they need more in-service training.
Several studies on inclusive education have been conducted which report the global trend on inclusion. Investigating inclusive education from the teachers' perspective has been one of the issues of greatest interest because they play an important role in implementing inclusive education (Avramidis and Norwich, 2002;de Boer et al., 2011). It has conclusively been shown that teachers' self-efficacy for inclusive practices is one of the crucial factors associated with teachers' attitudes toward inclusive education and their intention to include children with disabilities in mainstream classrooms (Savolainen et al., 2012;Sharma et al., 2018;Yada et al., 2018). The concept of selfefficacy was introduced by Bandura (1977), who defined it as a person's belief in his/her capability to plan and execute specific performances that produce expected outcomes (Bandura, 1997). Teachers' self-efficacy is specific to the teaching profession, and it has been found to be related not only to student achievement (Tschannen-Moran et al., 1998) but also teachers' outcomes such as willingness to apply new teaching strategies (Allinder, 1994), and job-related stress and burnout (Betoret, 2006). Previous research has demonstrated that teachers' self-efficacy beliefs have context-specific and goal-oriented characteristics (Wyatt, 2014), and therefore they have been studied in several teaching domains (e.g., math, language, and physical education).
Teachers' self-efficacy has been investigated in relation to many teaching contexts; however, there was no specific instrument that assesses teachers' efficacy beliefs in implementing inclusive education . Therefore, Sharma et al. (2012) developed a scale named the Teacher Efficacy for Inclusive Practices (TEIP). The scale consists of 18 items, which can be divided into three sub-scales: (1) efficacy to use inclusive instruction, (2) efficacy in managing behavior, and (3) efficacy in collaboration . The scale has been translated and used in many countries such as China, Finland, Japan, Saudi Arabia, and South Africa, and the reliability and validity of translated versions have been demonstrated (Savolainen et al., 2012;Malinen et al., 2013;Yada et al., 2018;Alnahdi, 2019b). Although there are a number of international studies focusing on teachers' self-efficacy in inclusive practices, data about Japanese teachers' self-efficacy are limited (Yada and Savolainen, 2017). There is the pressing need to develop a valid and reliable instrument that can measure Japanese teachers' self-efficacy in implementing inclusive education.
Although high reliability (Cronbach's alpha = 0.93) and good construct validity (good fit to a theoretical model of confirmatory factor analysis) were addressed for the Japanese version of the TEIP scale (Yada and Savolainen, 2017;Yada et al., 2018), these analyses were conducted according to classical test theory (CTT). In CTT, the focus is on the observed score as a whole and not on each item, such as in an item response theory like Rasch analysis (De Ayala, 2013). Rasch analysis has the benefit of providing different statistical parameters for each item. This will allow one to understand the difficulty of each item "specific perceived ability." In the context of inclusive education, understanding the item difficulty would help to target specific teaching skills for improvement (Lai et al., 2016). The item location in Rasch analysis represents how difficult (difficult to endorse) the respective item's "specific perceived ability" was perceived by participants, which would afford us the opportunity to pay more attention to these aspects in future training programs for teachers rather than conducting an overall assessment of teachers' perceived self-efficacy as a whole. In addition, Rasch analysis enables us to fit the observed data to a unidimensional model, to examine the scoring structure that is being used and find the best scoring structure based on the observed data, to provide misfit persons and misfit items to be removed, and to examine measurement invariance, and it enables transformation of ordinal data into interval data (Tennant and Conaghan, 2007a;Bond and Fox, 2015). In addition, it offers the benefit of a person-item map for easy visual comparison of item difficulty and participants' ability (Lee et al., 2010;Cappelleri et al., 2014), help to scan the scale targeting at a glance, and provide the Wright map of item-person relationship (Cheng et al., 2009). In sum, "Rasch analysis is a powerful tool for evaluating construct validity" (Baghaei, 2008b(Baghaei, , p. 1146) that allows us to apply a unified approach to examine several measurement properties (Tennant and Conaghan, 2007a). Other reasons and advantages of using Rasch models have been documented in numerous studies (see, e.g., Andrich, 1995;Fisher, 1995Fisher, , 1996Linacre, 1996;Smith, 2000;Andrich and Marais, 2019).
Therefore, this study aims to examine the construct validity of the Japanese version of the TEIP scale using Rasch analysis. Since the total score of the TEIP scale (e.g., Ahsan et al., 2012;Miesera et al., 2019) is often used, it is important to check the scale's unidimensionality, as "the use of the total score as a summary of a person's value on the variable implies that persons can be compared by their total scores, or the estimates of their true scores, and this implies a unidimensional construct" (Andrich and Marais, 2019, p. 50). According to Park et al. (2016) and Lai et al. (2016) the TEIP is a unidimensional scale, and the use of the Rasch model analysis would help to examine the unidimensionality (Wright and Stone, 1999) of the Japanese version of the TEIP.

Sample and Instrument
The sample consisted of 250 Japanese teachers who were working in primary or secondary level (grade 1-12) schools. The data were collected using convenience sampling technique, in which schools and teachers who agreed to engage in the study were included in the sample. The sample was selected from Tokyo metropolis and eight prefectures, including Chiba, Fukui, Kagoshima, Kochi, Miyazaki, Saitama, and Yamaguchi. A sample of 250 would be needed for calibration of definitive items with 99% confidence (Linacre, 1994). The sample included 45% males and 55% females. As regards age, 36% were between 22 and 35 years old, 35% were between 35 and 50 years old, and around 29% were older than 60 years. The rate of gender and age distribution were close to those of overall Japanese teacher population (MEXT (Ministry of Education, Culture, Sports, Science and Technology), 2017).
The Japanese version of the TEIP scale was utilized to assess teachers' self-efficacy beliefs in implementing inclusive education. Bold = indicator of unidimensionality; a = the lower limit of 95% for the binomial proportion confidence intervals, M = models, A = unidimensional, B = multi-dimensional, EMB = efficacy in managing behavior, EII = efficacy in inclusive instruction, EC = efficacy in collaboration, four items that showed local independency indictors composite together, * = after removing two items, the remaining two items (item 3 with item 9 and item 5 with item 6) that showed indicators of local independence are combined. The TEIP scale consists of 18 items with 6 items allocated to each sub-scale. The participants were asked to evaluate to what extent they agree/disagree with the statements on a six-point Likert scale ranging from "1 = strongly disagree" to "6 = strongly agree." The translation of the Japanese and confirmatory factor analysis of the three factor model was confirmed in previous studies (Yada and Savolainen, 2017;Yada et al., 2018). A good level of reliability was indicated (Cronbach's alpha = 0.93), and a good fit to the hypothesized three-factor model via confirmatory factor analysis was determined for the Japanese version of the TEIP scale (Yada and Savolainen, 2017;Yada et al., 2018).

Rasch Analysis
In this study, the Rasch analysis steps were based on Tennant and Conaghan's recommendations on conducting and reporting the results of a Rasch analysis study (Tennant and Conaghan, 2007a). The Rasch Unidimensional Measurement Model (RUMM2030) software (Andrich et al., 2010) was used for analysis in this study. The default model in RUMM2030, the partial credit model (Masters, 1982), is used in this study, as it is recommended with significant likelihood ratio test (Tennant et al., 2011;Vincent et al., 2015), and that means that the thresholds were estimated for each item in this type of model (Andrich and Marais, 2019). While in the rating scale model, threshold discriminations are equal across all items (Andrich, 1978).
In this analysis, we were looking for non-significant itemtrait interaction chi-square as an indicator for overall fit. Another indicator would be to have normally distributed residuals with item residual mean close to zero, and a standard deviation close to 1 (Alnahdi, 2018). Threshold map and item characteristic curve (ICC) were checked for items with disordered thresholds, and any item that showed disordered thresholds was rescored to combine adjacent categories to solve this disorder (Tennant et al., 2004;Tennant and Conaghan, 2007a). We considered items with item-fit residual outside the range ±2.5 with significant p value to be misfit items to be removed (Tennant and Conaghan, 2007a). We considered persons outside the range of ±2.5 personfit residual as misfit persons, to be removed from the sample (Tennant and Conaghan, 2007a). Local dependency was checked by looking for high correlation between the item residuals (Andrich and Marais, 2019). Due to the assumption that in a unidimensional scale, item residuals would not show high correlation after extracting the latent variable (self-efficacy), we considered all items with residual correlation of 0.20 above the average as an indicated violation of the local dependency assumption (Hissbach et al., 2011;Makransky and Bilenberg, 2014;Christensen et al., 2017;Lundgren-Nilsson et al., 2019). This is an important step, as it has been argued that the traditional fit statistics used in Rasch analysis could be insensitive to violations of unidimensionality under certain circumstances (Hattie, 1985;Smith, 2002;Hagquist et al., 2009).
A principal components analysis for the residuals was conducted to examine the scale unidimensionality (Smith, 2002;Tennant and Conaghan, 2007a;Andrich and Marais, 2019). Based on this analysis we got two sets of items: items that loaded positively on the first component and items that loaded negatively on the first component. Then, two ability estimates were calibrated based on these two sets of items. Next, t-tests were conducted to examine whether the two ability estimates were statistically significant. A 5% or less significant test would be considered acceptable or the lower limit of 95% for the binomial proportion confidence intervals at 5% level or less (Smith, 2002;Tennant and Conaghan, 2007a;Alnahdi, 2018).
We examined Differential item functioning (DIF) for items to ensure that items function similarly for both gender and regardless of participants' age (Tennant and Pallant, 2007b). We examined internal consistency looking for the value of the person separation index (PSI) >0.7 as a good indicator (Tennant and Conaghan, 2007a). Finally, we transformed the raw scores to interval scores using the formula: "Y = M + (S × logit score). S = range of interval-level scale [(60; for a 0-60 scale)] divided by the actual range of logit scores, and M = (minimum score of interval-level scale) -(minimum logit score × S)" (Alnahdi, 2018, p. 355). This step made the interpretation of the scores much easier because any change in one unit would have the same weight across scale (Alnahdi, 2018).

RESULTS
In the first analysis, we examined the 18-items scale fit to the Rasch model. Table 1 shows that chi-square for item-trait interaction was non-significant [χ 2 78.05 (72) = p > 0.05], which was a good indicator. However, the results did not support the scale unidimensionality as 14% of t-tests were significant in comparison with the recommended limit of 5% (see Table 1).
Next, in the second run, 18 misfit persons with residuals outside the ±2.5 range were removed. As shown in Table 1, the unidimensionality issue was still not solved by this modification, and the percentage of significant t-tests decreased only to 13.8%, which was still far above the recommended 5%. Next, we examined the threshold map looking for items with disordered threshold. Only item 18 was found to have threshold disorder (see Figure 1).
In the next run, item 18 was rescored to improve the threshold order, and instead of the previous score of 012345, it became 001234. This new score improved the threshold order (see Figure 2). However, there was no improvement with regards to the unidimensionality test. Next, we examined local dependency of items by reviewing residuals correlation of items, looking for a value of 0.20 or higher. Four items were found to have that level of correlation with other items. Next, to overcome local dependency, the four items whose residuals showed high correlations with other item residuals were combined into four super items (Marais and Andrich, 2008) (item 8 with item 7, item 12 with item 13, item 5 with item 6, item 3 with item 9) to check whether this improved scale unidimensionality. This is recommended as a solution, for "when a set of items are locally dependent they can be bundled into polytomous super-items" (Baghaei, 2008a(Baghaei, , p. 1105, and it has been applied in several studies (such as Marais and Andrich, 2008;Brogårdh et al., 2013;Peter et al., 2014;Medvedev et al., 2016;Milinis et al., 2017;Finger et al., 2019). However, this step did somewhat improve the percentage of significant t-tests in the unidimensionality test, 12.9%, though this is still far from the recommended 5%.
In the next four runs, we removed these items one by one and examined improvement in the unidimensionality test. In the last run, after removing the four items, the unidimensionality was supported by finding that the lower limit of the 95% CI for the binominal test was less than 5% (4.46%). The removed items were from all three subscales: item 8 was from efficacy in managing behavior; item 5 was from efficacy in inclusive instruction subscale; and items 3 and 12 were from efficacy in collaboration subscale. Since the scale showed unidimensionality and had no items with high residual correlations, the assumption of the local independence of items was fulfilled (Tennant and Conaghan, 2007a) in the final 14-item Japanese TEIP scale.
In addition, as an alternative to removing the four items, three extra runs were conducted separately for each subscale to see whether the scale would show better fit as a multidimensional scale with three unidimensional subscales. The results showed that only the efficacy in collaboration subscale fit the Rasch model and the unidimensionality test supported a unidimensional subscale. However, the results did not support the unidimensionality of the other two subscales. Therefore, we continued with the 14 items as a unidimensional scale in further analysis.
After we ensured the unidimensionality of the 14-item scale, a DIF analysis was conducted to ensure the 14 items function similarly regardless of the sample age or gender. The results indicated that no item showed indicators of DIF. Figure 3 shows an example of item 9 that functioned similarly regardless of participants' age or gender. Table 2 shows item parameters, and items were sorted based on location, from most to least difficult to comply with the lowest location value. Table 3 shows the scale with the new scoring applied. One item (18) was rescored as 001234, and the remaining 13 items were not changed and they were 012345. Table 4 shows the transformation of raw scores to interval scores, which will make it easier to interpret any differences in interval scores, because any change in one unit has the same weight across the scale (Alnahdi, 2018).
In addition, we examined the internal consistency of the 14item scale, and the person separation index (PSI) was 0.91 which indicated a high level of internal consistency as the adequate value is 0.70 (Tennant and Conaghan, 2007a). Figure 4 shows the person-item threshold plot with good spread of item thresholds covering the partisan threshold as a good indicator for targeting.

DISCUSSION
This study aimed to examine the construct validity of the Japanese version of TEIP using Rasch analysis. The Rasch analysis of the 18-items scale did not support a unidimensional scale. This result is not consistent with the result of Park et al. (2016) via confirmatory factor analysis that the 18-item TEIP is a unidimensional scale. However, Lincacre argued that "there is no way of knowing from factor analysis alone whether each factor is a dimension or a slice of a shared dimension" (Linacre, 1998, p. 603). This result is consistent with the findings of another Rasch study of the Arabic version of the scale (Alnahdi, 2019a) that did not support the unidimensionality of the 18-item TEIP. This shows the importance of more combined studies with data from different countries to examine the crosscultural psychometric properties of the scale.
Modifications were made to improve the fit for the Rasch model; removing misfit persons, rescoring one item (18) for threshold disorder, and removing four items for violating the local independency assumption. After these changes took place, the 14-item scale did fit the Rasch model and support a unidimensional scale.
The four items that were removed were items 3, 5, 8, and 12. If we look at these items, which have high correlated residuals, to understand the source of the local dependency issue, we see that the residual of item 12 "I can collaborate with other professionals (e.g., itinerant teachers or speech pathologists) in designing educational plans for students with disabilities" was correlated with residual from item 13 "I am able to work jointly with other professionals and staff (e.g., aides, other teachers) to teach students with disabilities in the classroom." In these two items, 12 and 13, it appears that there is a level of repetition of the collaborative idea that might cause a local dependency issue of items, as more redundant items could increase the dependence in the data (Marais and Andrich, 2008). A similar observation was made with item 8 "I can control disruptive behavior in the classroom" as its residual was correlated highly with the residual from item 7 "I am confident in my ability to prevent disruptive behavior in the classroom before it occurs," and for residuals for items 5 and 6, and items 3 and 9. Moreover, it is important that we deal with items with local dependency as "dependency among items can inflate reliability and give a false impression of the precision and quality of the test" (Baghaei, 2008a(Baghaei, , p. 1105. Hattie (1985) believes that "the most critical and fundamental assumption of the latent trait models is that of local independence" (p. 151), and Wang et al. (2005) also discussed the misspecification that can be a result of violating the local item independence assumption.
In addition, before we removed the four items, we tested the super-item solution, which did not solve the problem of local dependency in the 18-item TEIP scale unidimensionality. Therefore, we continued by removing four items and proposed the 14-item scale with as a unidimensional measure, as the "violations of the unidimensionality requirement influence person measurement" (Smith, 2002, p. 205). In line with this, Clark and Watson discussed the issues related to scales construct validity, and they stated that "in selecting scale items, the goal is unidimensionality rather than internal consistency" (p. 306,  Clark and Watson, 1995). Two of the four items removed in this study, items 8 and 12, were also removed from the Arabic version of the scale to reach a unidimensional scale (Alnahdi, 2019a) for the same reason, the violation of the local item independence assumption. We found that item 7 "I am confident in my ability to prevent disruptive behavior in the classroom before it occurs" was the most difficult item to be endorsed by participants, followed by items 16, 17, and 10. These results are consistent with those of Yoshitoshi (2014) who administered the TEIP scale to 59 high school teachers in Japan. He indicated that Japanese teachers reported more negative responses to items 7 and 17 and concluded that teachers did not know of concrete strategies to work with children who show difficult behavior. Further, item 16 is related to knowledge about laws and policies of inclusive education and this was the second difficult question for the Japanese teachers to endorse. This finding is somewhat in line with that of a previous study indicating that Japanese teachers showed less awareness of the keywords related to inclusive education than those related to special needs education (Fujii, 2014). These findings have important implications for developing in-and pre-service teacher training in Japan, where teachers can acquire necessary skills to deal with students' problematic behavior and knowledge regarding inclusive educational laws and policies. The easiest item to be endorsed was item 14 followed by items 6, 13, and 11. Items 11 and 14 were among the consistently weakest indicators for self-efficacy for this sample and for the Saudi sample studied earlier (Alnahdi, 2019a). Similarly, Chao et al. (2018) found item 11 to be convenient for endorsement in Hong Kong ("I am able to get children to follow classroom rules").
The final 14-item scale that fit the Rasch model was different from the 13-item scale that fit the Rasch model with a sample from Saudi Arabia (Alnahdi, 2019a). This shows that items are perceived somewhat differently in different populations for different reasons. Therefore, it would be recommended for future studies to have combined data from different countries to examine the unidimensionality of the TEIP scale for different populations. Also, this will allow one to examine if all items function similarly in different countries. It will also allow examining the best scoring structure of the scale, as this scale has been used with different scoring structures in different studies with 5, 6, and 9-point Likert scales (Lai et al., 2016;Park et al., 2016;Yada et al., 2018;Alnahdi, 2019a).

IMPLICATIONS
As mentioned above, developing inclusive education is on the global educational agenda to realize an inclusive society (United Nations General Assembly, 2015). Thus, there are increasing needs to examine continuously whether the new inclusive educational policies and systems are operated well in practice. One way for that is to measure teachers' self-efficacy for inclusive practices, which reveals the teachers' perspectives on inclusive education. Our findings confirmed that the 14-item Japanese version of the TEIP scale is valid to assess Japanese teachers' self-efficacy for inclusive practices and can be used not only by researchers but also by government and municipal administrators. In addition, this finding will help by providing researchers with a clear order of the items in the scale, for items with strength indicator for teachers' efficacy, and this would be helpful in cases where researchers would need to use only a few items as part of their studies. So, items with highest value on location can be chosen. Understanding tasks (items) order would help to arrange tasks that included in TEIP according to the difficulty level from the teachers' point of view, which is important in designing curricula and training programs for teachers to focus more on the difficult tasks. In addition, it will be helpful to measure improvements as a result of intervention studies. Transformation tables of scores help researchers to easily interpret differences in scores and determine how significant a change is in some score after receiving an intervention, for example. The findings of this study would support researches to calculate a total score on the 14-item Japanese TEIP to represent a teacher's efficacy to work in inclusive education. Finally, further research is required to confirm the findings of this study.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to AY, ado.eika@gmail.com.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The patients/participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.